RogerDai2026/Event-Diffusion

Depth Estimation with an Event Camera (Cross-Modality Autoencoder + PoE)

This project studies monocular depth estimation from event camera streams (events → depth).
Event cameras output asynchronous per-pixel brightness changes (“events”) rather than intensity frames, which makes standard frame-based vision pipelines hard to apply directly. At the same time, dense and reliable depth supervision is scarce. To address both problems, we build a cross-modality autoencoder that aligns events and depth in a shared latent space, which enables weakly supervised training on both paired and unpaired data. We also include a teacher–student distillation component that generates dense pseudo-labels when ground truth is incomplete.


TL;DR (What we built)

  • Two modality-specific VAEs (events VAE + depth VAE) mapped into a shared, geometry-aware latent space with ~4× compression.
  • Product-of-Experts (PoE) fusion combines event/depth encoder posteriors when both are available; falls back to unimodal when one modality is missing.
  • Event branch uses a weighted NLL-style loss + learnable log-variance to respect event sparsity.
  • Depth branch leverages a Marigold-compatible pretrained VAE for stable depth encoding/decoding.
  • Designed as a front end for latent-space diffusion (U-Net denoiser operates in the aligned latent space).
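As a rough illustration of the event-branch loss mentioned above, here is a minimal NumPy sketch of a weighted Gaussian NLL with a log-variance term. The function name `weighted_event_nll`, the `active_weight` value, and the choice to up-weight non-zero event pixels are assumptions made for illustration; the loss used in this repository may differ. In training, `logvar` would be a trainable parameter rather than a fixed number.

```python
import numpy as np

def weighted_event_nll(target, recon, logvar, active_weight=10.0):
    """Weighted Gaussian negative log-likelihood (constant term dropped).

    target, recon: event tensors of the same shape.
    logvar: scalar or broadcastable array (learnable in a real model).
    active_weight: hypothetical weight for pixels that contain events.
    """
    target = np.asarray(target, dtype=float)
    recon = np.asarray(recon, dtype=float)
    # Per-pixel NLL: 0.5 * (log sigma^2 + (x - x_hat)^2 / sigma^2)
    nll = 0.5 * (logvar + (target - recon) ** 2 * np.exp(-logvar))
    # Up-weight the sparse non-zero entries so empty pixels do not dominate.
    weights = np.where(target != 0, active_weight, 1.0)
    return float(np.sum(weights * nll) / np.sum(weights))
```

The log-variance term lets the model trade off reconstruction error against predicted uncertainty, while the weights keep the mostly empty event tensor from pulling the reconstruction toward all zeros.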

Highlights (Depth Prediction)

Synthetic. Top to bottom: event input → predicted depth → ground truth.
[Figure: synthetic_depth_pred]

Real (DSEC). Top to bottom: event input → predicted depth → ground truth.
[Figure: real_depth_pred]


Key challenges we target

  1. Input size restriction
    Pixel-space diffusion often assumes ~256×256 inputs, but real event datasets can be larger (e.g., DSEC depth at 640×480).
    → We diffuse in latent space (compressed grids).

  2. Data scarcity / incomplete depth labels
    Paired event–depth data is limited and depth maps often contain NaNs / invalid regions.
    → We train with PoE + weak supervision on paired and unimodal samples, and use distillation for denser supervision.


Method overview

Cross-modality autoencoder (shared latent)

  • Event encoder/decoder: learns to reconstruct sparse event tensors.
  • Depth encoder/decoder: uses a pretrained VAE (compatible with latent diffusion depth pipelines).
  • PoE fusion: merges encoder posteriors into a single latent distribution when both modalities exist.

Pipeline sketch

[Figure: pipeline]

Product-of-Experts fusion (PoE)

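When both modalities are present, PoE combines the two Gaussian encoder posteriors by adding their precisions (inverse variances); the fused mean is the precision-weighted average of the individual means. When one modality is missing, the fused posterior is simply the posterior of the available modality.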

[Figure: poe]
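The fusion rule can be sketched in a few lines of NumPy. This is a minimal sketch assuming diagonal Gaussian posteriors parameterized by mean and log-variance; the function name `poe_fuse` and the `None` convention for a missing modality are illustrative and not taken from the repository. Some PoE variants also add a unit-variance prior as an extra expert, which is omitted here.

```python
import numpy as np

def poe_fuse(mus, logvars):
    """Fuse diagonal Gaussian posteriors by summing precisions.

    mus, logvars: lists of same-shaped arrays, one entry per modality.
    Use None for a missing modality.
    """
    experts = [(np.asarray(m, dtype=float), np.asarray(lv, dtype=float))
               for m, lv in zip(mus, logvars) if m is not None]
    if not experts:
        raise ValueError("at least one modality must be present")
    # Precision is the inverse variance: 1 / sigma^2 = exp(-logvar).
    precisions = [np.exp(-lv) for _, lv in experts]
    total_precision = sum(precisions)
    # Fused mean is the precision-weighted average of the expert means.
    mu = sum(m * p for (m, _), p in zip(experts, precisions)) / total_precision
    logvar = -np.log(total_precision)
    return mu, logvar

# Paired sample: both posteriors contribute.
mu, logvar = poe_fuse([np.zeros(2), np.full(2, 2.0)], [np.zeros(2), np.zeros(2)])
# Unimodal sample: the depth posterior is missing, so the event posterior is returned.
mu_e, logvar_e = poe_fuse([np.ones(2), None], [np.zeros(2), None])
```

With two unit-variance experts at means 0 and 2, the fused mean is 1 and the fused variance is 0.5, so the fused posterior is tighter than either input.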

PoE Reconstruction Results

Reconstruction quality reflects how well the PoE framework learns a shared latent representation of events and depth.

MVSEC Reconstructions

Event reconstruction
[Figure: mvsec_event]

Depth reconstruction
[Figure: mvsec_depth]

DSEC Reconstructions

Event reconstruction
[Figure: dsec_fintune_event]

Depth reconstruction
[Figure: dsec_depth]


Teacher–student distillation (dense pseudo-depth)

We use a pretrained RGB → depth model (the teacher) to generate dense pseudo-labels, which are aligned to the available ground-truth depth where possible. This provides denser supervision when ground truth is sparse or contains NaNs.

[Figure: distillation]
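One common way to align a teacher prediction with sparse ground truth is a least-squares scale-and-shift fit on the valid pixels. The sketch below assumes that approach and assumes invalid pixels are marked as NaN or inf; the function name `align_pseudo_depth` is hypothetical, and the alignment used in this repository may differ.

```python
import numpy as np

def align_pseudo_depth(pseudo, gt):
    """Fit scale/shift on valid ground-truth pixels, then densify.

    pseudo: dense teacher prediction (H, W).
    gt: ground-truth depth (H, W) with NaN/inf at invalid pixels.
    Returns (dense_depth, valid_mask).
    """
    pseudo = np.asarray(pseudo, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = np.isfinite(gt)
    if valid.sum() < 2:
        # Too few anchors to fit two parameters; keep the teacher output.
        return pseudo.copy(), valid
    # Solve min over (scale, shift) of || scale * pseudo + shift - gt ||^2.
    A = np.stack([pseudo[valid], np.ones(int(valid.sum()))], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, gt[valid], rcond=None)
    aligned = scale * pseudo + shift
    # Keep ground truth where it exists; fill holes with aligned pseudo-depth.
    dense = np.where(valid, gt, aligned)
    return dense, valid

pseudo = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[3.0, np.nan], [7.0, 9.0]])  # consistent with gt = 2 * pseudo + 1
dense, valid = align_pseudo_depth(pseudo, gt)
```

Keeping the original ground truth at valid pixels means the pseudo-labels only fill the gaps, which limits how much teacher error enters the supervision signal.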

Datasets

  • Synthetic CARLA: dense depth + controllable scenes (simulated event streams).
  • MVSEC (real): event + depth sequences in indoor/outdoor settings.
  • DSEC (real): driving sequences at higher resolution; depth is derived from LiDAR-based disparity ground truth.

Depth Prediction (Latent-space Diffusion)

After learning a shared latent space for events and depth, we use a U-Net denoiser to perform latent-space generation and then decode the predicted latent back to a depth map. This addresses two issues:

  • Input-size constraints of pixel-space diffusion: the denoiser operates on compressed latents.
  • Cross-modal alignment: denoising happens in a latent space that is already aligned across modalities.

Current results on synthetic and real datasets

Synthetic: [Figure: synthetic_depth_pred]
Real (DSEC): [Figure: real_depth_pred]

Reproducibility

  • Random seeds are fixed for Python / NumPy / PyTorch.
  • Hydra config logging: each run records the full resolved configuration (model, data, losses, training).
  • Checkpointing: the best validation checkpoint is saved via PyTorch Lightning.
  • Experiment tracking: runs are logged to Weights & Biases when enabled.
  • Metric scripts: evaluation and plotting scripts are stored alongside training code.
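For reference, seed fixing across the three libraries usually looks like the sketch below. The helper name `fix_seeds` and the default seed are illustrative and not taken from this repository; NumPy and PyTorch are seeded only if they are installed. PyTorch Lightning also ships a `seed_everything` utility that covers the same ground.

```python
import random

def fix_seeds(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.
    try:
        import torch
        torch.manual_seed(seed)  # Seeds the CPU and all CUDA generators.
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.

fix_seeds(0)
first = random.random()
fix_seeds(0)
second = random.random()
```

Fixed seeds make data shuffling and weight initialization repeatable, although some GPU kernels can still be non-deterministic unless deterministic algorithms are enabled separately.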

Hardware: NVIDIA A100 GPU with 80GB VRAM.


Installation

Requirements

  • Python 3.9+ (recommended 3.10)
  • CUDA-capable GPU recommended
  • PyTorch + PyTorch Lightning
  • Hydra
  • Weights & Biases (optional)

Dataset paths

Dataset root paths are configured via Hydra (e.g., configs/data/*.yaml) rather than hard-coded in scripts.
