Depth Estimation with an Event Camera (Cross-Modality Autoencoder + PoE)

This project studies monocular depth estimation from event camera streams (events → depth).
Event cameras output asynchronous per-pixel brightness changes (“events”) rather than intensity frames, which makes standard frame-based vision pipelines difficult to apply. Meanwhile, dense and reliable depth supervision is scarce. To address both challenges, we build a cross-modality autoencoder that aligns events and depth in a shared latent space, enabling training on both paired and unpaired data via weak supervision. We also include a teacher–student distillation component to generate dense pseudo-labels when ground truth is incomplete.

TL;DR (What we built)

Two modality-specific VAEs (events VAE + depth VAE) mapped into a shared, geometry-aware latent space with ~4× compression.
Product-of-Experts (PoE) fusion combines the event and depth encoder posteriors when both are available, and falls back to the unimodal posterior when one modality is missing.
Event branch uses a weighted NLL-style loss + learnable log-variance to respect event sparsity.
Depth branch leverages a Marigold-compatible pretrained VAE for stable depth encoding/decoding.
Designed as a front end for latent-space diffusion (U-Net denoiser operates in the aligned latent space).
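
The weighted NLL-style event loss mentioned above could look roughly like the sketch below. It assumes a Gaussian likelihood with a single shared log-variance and a fixed up-weighting of non-zero event bins. The function name, the weight value, and the flat-list representation are illustrative assumptions, not the project's actual implementation, where the log-variance would typically be a learnable PyTorch parameter.

```python
import math
from typing import Sequence


def weighted_gaussian_nll(
    target: Sequence[float],
    recon: Sequence[float],
    logvar: float,
    active_weight: float = 5.0,  # hypothetical weight for bins that contain events
) -> float:
    """Weighted Gaussian negative log-likelihood over flattened event bins.

    Bins with events (non-zero target) get a larger weight so the many empty
    bins in a sparse event tensor do not dominate the loss.
    """
    if len(target) == 0 or len(target) != len(recon):
        raise ValueError("target and recon must be non-empty and the same length")
    total = 0.0
    weight_sum = 0.0
    for x, x_hat in zip(target, recon):
        w = active_weight if x != 0.0 else 1.0
        # Gaussian NLL up to an additive constant: 0.5 * (log s^2 + (x - x_hat)^2 / s^2)
        nll = 0.5 * (logvar + (x - x_hat) ** 2 / math.exp(logvar))
        total += w * nll
        weight_sum += w
    return total / weight_sum
```

Because the log-variance appears both as a penalty term and as a divisor of the squared error, a learnable value lets the model trade off the two rather than relying on a hand-tuned reconstruction scale.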

Highlights (Depth Prediction)

Synthetic. Top to bottom: event input → predicted depth → ground truth.

Real (DSEC). Top to bottom: event input → predicted depth → ground truth.

Key challenges we target

Input size restriction
Pixel-space diffusion often assumes ~256×256 inputs, but real event datasets can be larger (e.g., DSEC depth at 640×480).
→ We diffuse in latent space (compressed grids).
Data scarcity / incomplete depth labels
Paired event–depth data is limited and depth maps often contain NaNs / invalid regions.
→ We train with PoE + weak supervision on paired and unimodal samples, and use distillation for denser supervision.
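
To make the input-size argument concrete, the sketch below computes the latent grid for a given frame size. It assumes the "~4×" figure refers to a per-side spatial downsampling factor; if it instead refers to total area, the factor per side would be 2. The helper name is hypothetical.

```python
from typing import Tuple


def latent_grid(height: int, width: int, factor: int = 4) -> Tuple[int, int]:
    """Spatial size of the latent grid for a given per-side downsampling factor.

    Uses ceiling division, i.e. it assumes inputs are padded up to a multiple
    of the factor before encoding.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (-(-height // factor), -(-width // factor))


# A 640x480 DSEC frame (height 480, width 640) under the assumed factor of 4:
print(latent_grid(480, 640, factor=4))  # → (120, 160)
```

Under this assumption the denoiser sees 120 × 160 = 19,200 spatial positions, fewer than the 65,536 positions of a 256×256 pixel-space input.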

Method overview

Cross-modality autoencoder (shared latent)

Event encoder/decoder: learns to reconstruct sparse event tensors.
Depth encoder/decoder: uses a pretrained VAE (compatible with latent diffusion depth pipelines).
PoE fusion: merges encoder posteriors into a single latent distribution when both modalities exist.

Pipeline sketch

Product-of-Experts fusion (PoE)

When both modalities are present, PoE combines the Gaussian posteriors by adding their precisions (inverse variances). When one modality is missing, it defaults to the posterior of the available modality.
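
As a minimal sketch of this fusion rule, the function below fuses two per-dimension Gaussian posteriors given as (mean, log-variance) pairs. It assumes diagonal Gaussians and no additional prior expert, which matches the fallback behaviour described above. The function name and the scalar representation are illustrative; a real implementation would operate on latent tensors.

```python
import math
from typing import Optional, Tuple

Gaussian = Tuple[float, float]  # (mean, log-variance) for one latent dimension


def poe_fuse(event: Optional[Gaussian], depth: Optional[Gaussian]) -> Gaussian:
    """Fuse the available Gaussian posteriors by adding precisions."""
    experts = [g for g in (event, depth) if g is not None]
    if not experts:
        raise ValueError("at least one modality posterior is required")
    if len(experts) == 1:
        # Missing modality: fall back to the available posterior unchanged.
        return experts[0]
    precisions = [math.exp(-logvar) for _, logvar in experts]
    total_precision = sum(precisions)
    # The fused mean is the precision-weighted average of the expert means.
    mean = sum(mu * p for (mu, _), p in zip(experts, precisions)) / total_precision
    # The fused variance is the inverse of the summed precision.
    logvar = -math.log(total_precision)
    return mean, logvar
```

For two experts with equal variance, the fused mean is the midpoint of the two means and the fused variance is half of either, so the more confident expert dominates whenever the variances differ.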

PoE Reconstruction Results

Reconstruction quality reflects how well the PoE framework learns a shared latent representation of events and depth.

MVSEC Reconstructions

Event reconstruction

Depth reconstruction

DSEC Reconstructions

Event reconstruction

Depth reconstruction

Teacher–student distillation (dense pseudo-depth)

We use a pretrained RGB → depth model as the teacher to generate dense pseudo-labels, aligned to the available ground-truth depth where possible. This provides denser supervision when ground truth is sparse or contains NaNs.
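
One common way to align a teacher's prediction to sparse ground truth is a least-squares scale-and-shift fit over the valid pixels. The sketch below assumes that interpretation of "aligned", and assumes the teacher output is affinely related to metric depth; both are assumptions, and the function names are hypothetical. Pixels are passed as flattened lists.

```python
import math
from typing import List, Sequence, Tuple


def _is_valid(d: float) -> bool:
    # Treat NaN, inf, and non-positive depth as invalid ground truth.
    return math.isfinite(d) and d > 0.0


def fit_scale_shift(pseudo: Sequence[float], target: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (s, t) minimising sum((s * p + t - d)^2) over valid pixels."""
    pairs = [(p, d) for p, d in zip(pseudo, target) if _is_valid(d)]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid ground-truth pixels")
    mean_p = sum(p for p, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    if var_p == 0.0:
        raise ValueError("teacher prediction is constant on the valid pixels")
    cov = sum((p - mean_p) * (d - mean_d) for p, d in pairs)
    scale = cov / var_p
    shift = mean_d - scale * mean_p
    return scale, shift


def densify(pseudo: Sequence[float], target: Sequence[float]) -> List[float]:
    """Keep valid ground truth; fill invalid pixels with the aligned teacher depth."""
    scale, shift = fit_scale_shift(pseudo, target)
    return [d if _is_valid(d) else scale * p + shift for p, d in zip(pseudo, target)]
```

Calling `densify(teacher_depth, ground_truth)` returns a map with no invalid pixels that agrees with the ground truth wherever it exists.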

Datasets

Synthetic CARLA: dense depth + controllable scenes (simulated event streams).
MVSEC (real): event + depth sequences in indoor/outdoor settings.
DSEC (real): driving sequences at higher resolution; depth is derived from LiDAR-based disparity maps.

Depth Prediction (Latent-space Diffusion)

After learning a shared latent space for events and depth, we use a U-Net denoiser to perform latent-space generation and decode the predicted latent back to a depth map. This addresses:

Input-size constraints of pixel-space diffusion (operate on compressed latents)
Alignment issues by denoising in a modality-aligned latent space
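
The README does not state which diffusion formulation is used. Purely as an illustration, the sketch below shows the DDPM-style forward noising step that a latent denoiser is commonly trained to invert, applied to a flattened latent vector. The linear beta schedule, its endpoints, and all function names are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def linear_alpha_bar(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bar = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(z0: Sequence[float], alpha_bar_t: float, rng: random.Random) -> Tuple[List[float], List[float]]:
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    z_t = [signal_scale * z + noise_scale * e for z, e in zip(z0, noise)]
    # The denoiser would be trained to predict `noise` from (z_t, t, conditioning).
    return z_t, noise
```

In this setup `z0` would be the depth latent and the event latent would serve as conditioning, which is where the shared latent space matters: both live on the same compressed grid.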

Current results on synthetic and real datasets

Synthetic

Real (DSEC)

Reproducibility

Random seeds are fixed for Python / NumPy / PyTorch.
Hydra config logging: each run records the full resolved configuration (model, data, losses, training).
Checkpointing: the best validation checkpoint is saved via PyTorch Lightning.
Experiment tracking: runs are logged to Weights & Biases when enabled.
Metric scripts: evaluation and plotting scripts are stored alongside training code.
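
Seed fixing for the three libraries listed above can be sketched as follows. The helper name is hypothetical and the repository's own code may differ; PyTorch Lightning also ships a `seed_everything` utility that covers the same ground. NumPy and PyTorch are seeded only if they are installed, so the snippet runs in a bare Python environment too.

```python
import importlib.util
import random


def fix_seeds(seed: int) -> None:
    """Seed Python's, NumPy's, and PyTorch's random number generators."""
    random.seed(seed)
    if importlib.util.find_spec("numpy") is not None:
        import numpy as np

        np.random.seed(seed)  # seeds NumPy's legacy global RNG
    if importlib.util.find_spec("torch") is not None:
        import torch

        torch.manual_seed(seed)  # seeds the PyTorch RNGs on all devices


fix_seeds(42)
```

Fixed seeds make runs repeatable on the same software and hardware stack, but some GPU kernels are non-deterministic, so bitwise-identical results are not guaranteed.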

Hardware: NVIDIA A100 GPU with 80GB VRAM.

Installation

Requirements

Python 3.9+ (recommended 3.10)
CUDA-capable GPU recommended
PyTorch + PyTorch Lightning
Hydra
Weights & Biases (optional)

Dataset paths

Dataset root paths are configured via Hydra (e.g., configs/data/*.yaml) rather than hard-coded in scripts.
