Ruijie Zhu,
Chuxin Wang,
Ziyang Song,
Li Liu,
Tianzhu Zhang,
Yongdong Zhang,
University of Science and Technology of China
arXiv 2024
Within a unified framework, our method ScaleDepth achieves accurate metric depth estimation both indoors and outdoors, without preset depth ranges or model finetuning. Left: the input RGB image and corresponding depth prediction. Right: comparison of model parameters and performance. With fewer parameters overall, our model ScaleDepth-NK significantly outperforms state-of-the-art methods under the same experimental settings.
Without any finetuning, our model can generalize to scenes with different scales and accurately estimate depth from indoors to outdoors.
The overall architecture of the proposed ScaleDepth. We design bin queries to predict the relative depth distribution and scale queries to predict the scene scale. During training, we preset text prompts containing 28 scene categories as input to the frozen CLIP text encoder. We then calculate the similarity between the updated scale queries and the text embeddings, and use the scene category as auxiliary supervision. During inference, only a single image is required to obtain the relative depth and scene scale, from which a metric depth map is synthesized.
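The final synthesis step described above can be illustrated with a minimal numpy sketch (this is not the authors' code; the function name, bin layout, and shapes are illustrative assumptions): the per-pixel softmax over relative-depth bins is reduced to an expected relative depth in [0, 1], which is then multiplied by the predicted scene scale to obtain metric depth.

```python
import numpy as np

def synthesize_metric_depth(bin_probs, scale):
    """Hedged sketch of combining a relative depth distribution with a scene scale.

    bin_probs: (n_bins, H, W) softmax over relative-depth bins in [0, 1].
    scale: scalar scene scale in meters (e.g. ~10 m indoors, ~80 m outdoors).
    """
    n_bins = bin_probs.shape[0]
    # Assume bin centers uniformly spaced over the normalized [0, 1] depth range.
    centers = (np.arange(n_bins) + 0.5) / n_bins           # (n_bins,)
    # Expectation over bins gives the per-pixel relative depth.
    relative = np.tensordot(centers, bin_probs, axes=1)    # (H, W)
    # Scaling by the predicted scene scale yields the metric depth map.
    return scale * relative

# Toy example: random logits for 64 bins on a 2x2 image.
rng = np.random.default_rng(0)
logits = rng.standard_normal((64, 2, 2))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
depth = synthesize_metric_depth(probs, scale=10.0)
print(depth.shape)  # (2, 2)
```

Because the relative depth is bounded in [0, 1], the predicted scene scale directly bounds the metric depth range, which is what lets one model cover both indoor and outdoor scenes.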
Please refer to get_started.md for installation and dataset_prepare.md for dataset preparation.
You may also need to install these packages:
pip install "mmdet>=3.0.0rc4"
pip install open_clip_torch
pip install future tensorboard
pip install -r requirements/albu.txt
Then download the checkpoint of text embeddings from Google Drive and place it in the projects/ScaleDepth/pretrained_weights folder.
We provide train.md and inference.md with instructions for training and inference.
# ScaleDepth-N
bash tools/dist_train.sh projects/ScaleDepth/configs/ScaleDepth/scaledepth_clip_NYU_480x480.py 4
# ScaleDepth-K
bash tools/dist_train.sh projects/ScaleDepth/configs/ScaleDepth/scaledepth_clip_KITTI_352x1120.py 4
# ScaleDepth-NK
bash tools/dist_train.sh projects/ScaleDepth/configs/ScaleDepth/scaledepth_clip_NYU_KITTI_352x512.py 4

# ScaleDepth-N
python tools/test.py projects/ScaleDepth/configs/ScaleDepth/scaledepth_clip_NYU_480x480.py work_dirs/scaledepth_clip_NYU_KITTI_352x512/iter_40000.pth
# ScaleDepth-K
python tools/test.py projects/ScaleDepth/configs/ScaleDepth/scaledepth_clip_KITTI_352x1120.py work_dirs/scaledepth_clip_NYU_KITTI_352x512/iter_40000.pth
# ScaleDepth-NK
python tools/test.py projects/ScaleDepth/configs/ScaleDepth/scaledepth_clip_NYU_KITTI_352x512.py work_dirs/scaledepth_clip_NYU_KITTI_352x512/iter_40000.pth

| Method | Backbone | Train Iters | Results | Config | Checkpoint | GPUs |
|---|---|---|---|---|---|---|
| ScaleDepth-NK | CLIP(ConvNext-Large) | 40000 | log | config | iter_40000.pth | 4 RTX 3090 |
If you find our work useful and use the codebase or models in your research, please cite it as follows.
@ARTICLE{zhu2024scale,
title={ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation},
author={Zhu, Ruijie and Wang, Chuxin and Song, Ziyang and Liu, Li and Zhang, Tianzhu and Zhang, Yongdong},
journal={arXiv preprint arXiv:2407.08187},
year={2024}
}
We thank Jianfeng He and Jiacheng Deng for their thoughtful and valuable suggestions. We thank the authors of BinsFormer and ZoeDepth for their code.


