This project studies monocular depth estimation from event camera streams (events → depth).
Event cameras output asynchronous per-pixel brightness changes (“events”) rather than intensity frames, so standard frame-based vision pipelines do not apply directly. Meanwhile, dense and reliable depth supervision is scarce. To address these challenges, we build a cross-modality autoencoder that aligns events and depth in a shared latent space, which enables weakly supervised training on both paired and unpaired data. We also include a teacher–student distillation component that generates dense pseudo-labels when ground truth is incomplete.
- Two modality-specific VAEs (events VAE + depth VAE) mapped into a shared, geometry-aware latent space with ~4× compression.
- Product-of-Experts (PoE) fusion combines event/depth encoder posteriors when both are available; falls back to unimodal when one modality is missing.
- Event branch uses a weighted NLL-style loss + learnable log-variance to respect event sparsity.
- Depth branch leverages a Marigold-compatible pretrained VAE for stable depth encoding/decoding.
- Designed as a front end for latent-space diffusion (U-Net denoiser operates in the aligned latent space).
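The weighted NLL-style loss with a learnable log-variance mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual loss: the Gaussian form, the function name `weighted_event_nll`, and the `active_weight` value are assumptions, and the constant `0.5 * log(2π)` term is dropped. Pixels that contain events are up-weighted so the large number of empty pixels does not dominate the average.

```python
import math
from typing import Sequence


def weighted_event_nll(
    target: Sequence[float],
    recon: Sequence[float],
    log_var: float,
    active_weight: float = 5.0,  # hypothetical weight for pixels with events
) -> float:
    """Weighted mean Gaussian NLL with a single (learnable) log-variance."""
    if len(target) == 0 or len(target) != len(recon):
        raise ValueError("target and recon must be non-empty and equally long")
    total, weight_sum = 0.0, 0.0
    for x, x_hat in zip(target, recon):
        # Per-pixel NLL: squared error scaled by the precision, plus log-variance.
        nll = 0.5 * (math.exp(-log_var) * (x - x_hat) ** 2 + log_var)
        # Up-weight pixels that contain events; empty pixels get weight 1.
        w = active_weight if x != 0 else 1.0
        total += w * nll
        weight_sum += w
    return total / weight_sum


# One active pixel out of four, reconstructed as 0.5 instead of 1.0.
loss = weighted_event_nll([0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.5], log_var=0.0)
```

In a training loop the same expression would be applied to tensors, with `log_var` registered as a trainable parameter.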
Synthetic
Top to bottom: event input → predicted depth → ground truth.

Real (DSEC)
Top to bottom: event input → predicted depth → ground truth.

- Input size restriction: Pixel-space diffusion often assumes ~256×256 inputs, but real event datasets can be larger (e.g., DSEC depth at 640×480).
  → We diffuse in latent space (compressed grids).
- Data scarcity / incomplete depth labels: Paired event–depth data is limited, and depth maps often contain NaNs / invalid regions.
  → We train with PoE + weak supervision on paired and unimodal samples, and use distillation for denser supervision.
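To make the input-size point concrete, the sketch below computes the latent grid for a given input resolution. It assumes that the ~4× compression is applied per spatial side and that inputs are padded up to a multiple of the factor; both are assumptions for illustration, and `latent_grid` is a hypothetical helper.

```python
import math
from typing import Tuple


def latent_grid(height: int, width: int, factor: int = 4) -> Tuple[int, int]:
    """Latent (height, width) after padding the input to a multiple of `factor`."""
    if height <= 0 or width <= 0 or factor <= 0:
        raise ValueError("height, width and factor must be positive")
    return math.ceil(height / factor), math.ceil(width / factor)


# DSEC depth maps are 640x480 (width x height).
dsec_latent = latent_grid(height=480, width=640)  # (120, 160)
```

Under these assumptions the denoiser sees a 160×120 grid for DSEC rather than the full 640×480 image.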
- Event encoder/decoder: learns to reconstruct sparse event tensors.
- Depth encoder/decoder: uses a pretrained VAE (compatible with latent diffusion depth pipelines).
- PoE fusion: merges encoder posteriors into a single latent distribution when both modalities exist.
When both modalities are present, PoE combines Gaussian posteriors by adding precisions; when missing, it defaults to the available modality posterior.
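A minimal sketch of this fusion rule for a single latent dimension is shown below. It uses plain Python floats for clarity; in the model the same arithmetic would run element-wise on tensors. The name `poe_fuse` and the `(mean, log-variance)` tuple layout are assumptions, and no prior expert is included so that a single available modality is returned unchanged.

```python
import math
from typing import Optional, Sequence, Tuple

Gaussian = Tuple[float, float]  # (mean, log-variance) of one latent dimension


def poe_fuse(experts: Sequence[Optional[Gaussian]]) -> Gaussian:
    """Fuse the available Gaussian posteriors by adding their precisions."""
    available = [e for e in experts if e is not None]
    if not available:
        raise ValueError("at least one modality posterior is required")
    # Precision is the inverse variance: 1 / exp(log_var).
    precisions = [math.exp(-log_var) for _, log_var in available]
    total_precision = sum(precisions)
    # The fused mean is the precision-weighted average of the expert means.
    mean = sum(p * mu for p, (mu, _) in zip(precisions, available)) / total_precision
    # The fused variance is the inverse of the summed precision.
    return mean, -math.log(total_precision)


events_posterior = (0.0, 0.0)  # mean 0, variance 1
depth_posterior = (2.0, 0.0)   # mean 2, variance 1

fused_mean, fused_log_var = poe_fuse([events_posterior, depth_posterior])
unimodal_mean, unimodal_log_var = poe_fuse([events_posterior, None])
```

With two unit-variance experts, the fused mean lies halfway between them and the fused variance is halved; with one expert missing, the result equals the remaining posterior.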
Reconstruction quality reflects how well the PoE framework learns a shared latent representation of events and depth.
We use a pretrained RGB → depth model to generate dense pseudo-labels aligned to available depth where possible. This helps supervision when ground truth is sparse or contains NaNs.
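One common way to align a pseudo-label to sparse ground truth is a least-squares scale-and-shift fit on the valid pixels, followed by filling the invalid pixels with the aligned prediction. The sketch below illustrates that idea on flattened depth values; it is an assumption about how the alignment could be done, not a description of the project's distillation code, and `align_pseudo_depth` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def align_pseudo_depth(pseudo: Sequence[float], gt: Sequence[float]) -> List[float]:
    """Fit gt ≈ scale * pseudo + shift on valid pixels, then fill invalid ones."""
    # Valid pixels are those where ground truth is neither NaN nor infinite.
    pairs = [(p, g) for p, g in zip(pseudo, gt) if math.isfinite(g)]
    if len(pairs) < 2:
        return list(pseudo)  # not enough ground truth to align against
    n = len(pairs)
    mean_p = sum(p for p, _ in pairs) / n
    mean_g = sum(g for _, g in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in pairs)
    # Fall back to a pure shift if the pseudo-labels are constant on valid pixels.
    scale = cov / var_p if var_p > 0 else 1.0
    shift = mean_g - scale * mean_p
    # Keep ground truth where it exists; use the aligned pseudo-label elsewhere.
    return [g if math.isfinite(g) else scale * p + shift for p, g in zip(pseudo, gt)]


nan = float("nan")
dense = align_pseudo_depth([1.0, 2.0, 3.0, 4.0], [3.0, nan, 7.0, nan])
```

Here the two valid pixels give a scale of 2 and a shift of 1, so the missing values are filled with 5.0 and 9.0.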
- Synthetic CARLA: dense depth + controllable scenes (simulated event streams).
- MVSEC (real): event + depth sequences in indoor/outdoor settings.
- DSEC (real): driving sequences at larger resolution; depth derived from LiDAR-based disparity ground truth.
After learning a shared latent space for events and depth, we use a U-Net denoiser to perform latent-space generation and decode the predicted latent back to a depth map. This addresses:
- Input-size constraints of pixel-space diffusion (operate on compressed latents)
- Alignment issues by denoising in a modality-aligned latent space
- Random seeds are fixed for Python / NumPy / PyTorch.
- Hydra config logging: each run records the full resolved configuration (model, data, losses, training).
- Checkpointing: the best validation checkpoint is saved via PyTorch Lightning.
- Experiment tracking: runs are logged to Weights & Biases when enabled.
- Metric scripts: evaluation and plotting scripts are stored alongside training code.
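The seed fixing listed above can be sketched as below. This is an illustrative helper, not the repository's code: the name `seed_all` is hypothetical, and NumPy and PyTorch are imported lazily so the function also runs where they are not installed. PyTorch Lightning's `seed_everything` utility serves the same purpose.

```python
import random


def seed_all(seed: int) -> None:
    """Seed Python, and NumPy / PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not available in this environment
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch not available in this environment


seed_all(1234)
first_draw = random.random()
seed_all(1234)
second_draw = random.random()  # same value as first_draw
```

Fixed seeds make runs repeatable on the same hardware and software stack, although some GPU kernels can remain nondeterministic.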
Hardware: NVIDIA A100 GPU with 80 GB VRAM.
- Python 3.9+ (recommended 3.10)
- CUDA-capable GPU recommended
- PyTorch + PyTorch Lightning
- Hydra
- Weights & Biases (optional)
Dataset root paths are configured via Hydra (e.g., configs/data/*.yaml) rather than hard-coded in scripts.



