CODERS: Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images

Chuanrui Zhang * · Yonggen Ling * · Minglei Lu · Minghan Qin · Haoqian Wang †

ECCV 2024

Paper | Project Page | Pretrained Models | Dataset

Installation

⚠️ Important Note: We strongly recommend using the exact versions of Python, PyTorch, MMCV, and mmdet3d specified below. Other versions have not been tested and are likely to cause compilation failures or runtime errors due to library incompatibilities. Local CUDA versions from 11.1 to 11.8 have been tested and work with the installation commands below without modification.

a. Create a conda virtual environment and activate it.

conda create --name coders -y python=3.8
conda activate coders

b. Install PyTorch and torchvision.

# Please ensure the PyTorch version matches your CUDA environment
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

c. Install mmcv, mmdet, mmsegmentation, and mmdet3d.

pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install mmdet==2.24.1
pip install mmsegmentation==0.20.2

git clone https://github.com/open-mmlab/mmdetection3d.git
cd mmdetection3d
git checkout v0.17.1 
pip install -e .
cd ..

d. Install other requirements.

pip install trimesh==4.0.0
pip install einops
pip install setuptools==58.0.4
pip install yapf==0.40.1

pip install wandb
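
Because the pinned versions matter, it can help to confirm them after installation. The following is a minimal sketch and not part of the repository: check_pins and PINNED are hypothetical names, the distribution names and versions are copied from the pip commands above, and it assumes the installed torch and torchvision wheels report their version with the +cu111 suffix.

```python
from importlib import metadata

# Versions pinned by the installation steps above (distribution name -> version).
PINNED = {
    "torch": "1.9.0+cu111",
    "torchvision": "0.10.0+cu111",
    "mmcv-full": "1.4.0",
    "mmdet": "2.24.1",
    "mmsegmentation": "0.20.2",
    "trimesh": "4.0.0",
}


def check_pins(pinned=PINNED, lookup=metadata.version):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # distribution is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems
```

Calling check_pins() inside the coders environment returns an empty dict when every pinned package matches; otherwise each entry shows the expected and the installed version.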

Acquiring Datasets

SS3D Dataset

We use the SS3D dataset for training and evaluation. You can download the dataset files from our Hugging Face repository.

Please download object_embeddings.npy, ss3d_val.zip, and all three parts of the training set (ss3d_train.zip.001, ss3d_train.zip.002, ss3d_train.zip.003). Merge the training parts into a single archive, extract both zip files, and place everything in the data/ss3d/ directory.

cat ss3d_train.zip.* > ss3d_train.zip && unzip ss3d_train.zip
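
On systems without cat and unzip, the same merge-and-extract step can be sketched with the Python standard library. This is an illustrative equivalent, not a script shipped with the repository; merge_and_extract is a hypothetical helper, and it assumes the parts are plain byte-wise splits named with three-digit suffixes, as the cat command above implies.

```python
import shutil
import zipfile
from pathlib import Path


def merge_and_extract(parts_dir, stem="ss3d_train.zip", out_dir=None):
    """Concatenate <stem>.001, <stem>.002, ... into <stem> and extract it."""
    parts_dir = Path(parts_dir)
    out_dir = Path(out_dir) if out_dir else parts_dir
    # Lexicographic sort puts .001, .002, .003 in the right order.
    parts = sorted(parts_dir.glob(stem + ".[0-9][0-9][0-9]"))
    if not parts:
        raise FileNotFoundError(f"no parts matching {stem}.NNN in {parts_dir}")
    merged = parts_dir / stem
    with open(merged, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)  # streams, so large parts are fine
    with zipfile.ZipFile(merged) as archive:
        archive.extractall(out_dir)
    return merged
```

For example, merge_and_extract("data/ss3d") would rebuild and unpack the training archive in place, provided the three parts were downloaded into that directory.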

Data Structure

After extraction, your directory structure should look like this (ignoring the debug files):

data/ss3d/
├── ss3d_train/
├── ss3d_val/
├── object_embeddings.npy
├── ss3d_infos_train.pkl
└── ss3d_infos_val.pkl
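
Before launching training or evaluation, the layout above can be checked programmatically. The sketch below is a hypothetical helper (missing_entries is not a function from this repository); the expected names are copied from the tree above.

```python
from pathlib import Path

# Expected entries under data/ss3d/, taken from the tree above.
EXPECTED_DIRS = ["ss3d_train", "ss3d_val"]
EXPECTED_FILES = [
    "object_embeddings.npy",
    "ss3d_infos_train.pkl",
    "ss3d_infos_val.pkl",
]


def missing_entries(root, dirs=EXPECTED_DIRS, files=EXPECTED_FILES):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in dirs if not (root / d).is_dir()]
    missing += [f for f in files if not (root / f).is_file()]
    return missing
```

An empty list from missing_entries("data/ss3d") means every expected entry is present; the .pkl files will be reported as missing until they are downloaded or generated as described in the next section.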

Data Preparation (Annotation Generation)

You can either use the pre-processed .pkl annotation files directly, or generate them from scratch using our provided data converter scripts:

# Generate full dataset annotation files (ss3d_infos_train.pkl & ss3d_infos_val.pkl)
python tools/data_converter/ss3d_converter.py

For rapid environment verification or code debugging, we also provide a script to generate a small-scale version of the dataset (ss3d_infos_train_debug.pkl & ss3d_infos_val_debug.pkl):

# Generate small-scale debug annotation files
python tools/data_converter/ss3d_converter_debug.py

Running the Code

Pre-trained Models

Before running inference or training, please download all the required pre-trained weights from our Hugging Face repository and place them in the ./ckpts/ directory.

Directory Structure

After downloading, your ./ckpts/ directory should be organized exactly like this:

ckpts/
├── coders_sdf/
│   ├── settings.yaml
│   └── weights.pt
├── coders_ss3d_convnext.pth
├── coders_dinov2_convnext.pth
└── dinov2_vitb14_reg4_pretrain.npz

Weight Descriptions

coders_ss3d_convnext.pth: The main pre-trained CODERS [convnext] model weights, used for inference and evaluation on the SS3D dataset.
coders_dinov2_convnext.pth: The main pre-trained CODERS [dinov2] model weights, used for inference and evaluation on the SS3D dataset.
dinov2_vitb14_reg4_pretrain.npz: Pre-trained weights for the DINOv2 backbone. Because our framework is pinned to PyTorch 1.9.0 (< 2.0), the official DINOv2 checkpoints, which target PyTorch 2.0+, are incompatible. We converted them to .npz format for stable loading during the pre-training initialization phase.
coders_sdf/: The pre-trained DeepSDF model directory. It contains settings.yaml and the core weights.pt.

Please ensure the entire coders_sdf/ folder is placed at the path shown above so the implicit reconstruction module can load it correctly.

Evaluation

Quick Start: Single-mode Inference

You can easily run the inference test using our provided shell script:

sh infer.sh

Inside infer.sh: This script provides a ready-to-use pipeline. It passes default arguments including the configuration, model checkpoints, SDF path, and reconstruction resolution. Most importantly, it allows you to specify the input stereo images (image_left and image_right) along with their corresponding camera parameters (interocular_distance and cam_intrinsic):

CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config "$CONFIG" \
    --checkpoint "$CHECKPOINT" \
    --sdf_path "$SDF_PATH" \
    --resolution "$RESOLUTION" \
    --image_left "$IMAGE_LEFT" \
    --image_right "$IMAGE_RIGHT" \
    --interocular_distance "$INTEROCULAR_DISTANCE" \
    --cam_intrinsic "$CAM_INTRINSIC"

⚠️ Important Note: Our model is trained specifically on data with fixed camera resolutions and baseline distances. For testing, please use images and parameters consistent with the provided examples in the assets/ directory. Directly testing on images from other stereo camera setups has not been verified and will likely require fine-tuning the model on a custom dataset.
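
To illustrate why the baseline and intrinsics cannot be swapped freely, recall the textbook relation for a rectified stereo pair: depth = focal length × baseline / disparity. The sketch below is generic stereo geometry and does not describe how CODERS computes depth internally; depth_from_disparity is a hypothetical helper, and the focal length is assumed to be given in pixels (the fx entry of a standard 3x3 intrinsic matrix).

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres of a point seen with the given disparity (rectified pair)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


# The same 50-pixel disparity corresponds to different depths under different rigs
# (illustrative numbers, not the SS3D camera parameters).
depth_a = depth_from_disparity(focal_px=1000.0, baseline_m=0.10, disparity_px=50.0)
depth_b = depth_from_disparity(focal_px=1000.0, baseline_m=0.05, disparity_px=50.0)
```

Halving the baseline halves the depth implied by a given disparity, so image evidence learned under one rig maps to different metric geometry under another.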

The rendered 3D meshes (.obj files), original input images, and 2D visualization results will be stored under inference_results/single_mode/ with corresponding timestamps.

Quantitative Evaluation on SS3D

To evaluate the model's performance on the SS3D dataset and calculate metrics, you can run our evaluation script:

sh metric.sh

Inside metric.sh, we use tools/dist_test.sh to perform the testing. By default, it runs on a single GPU using the ConvNeXt backbone:

CUDA_VISIBLE_DEVICES=0 tools/dist_test.sh projects/configs/coders/coders_ss3d_convnext_uvd.py ./ckpts/coders_ss3d_convnext.pth 1 28501 --eval bbox --vis 5

Command Breakdown:

1: Specifies the number of GPUs used for testing.
28501: The port assigned for the distributed testing process.
--eval bbox: Instructs the script to compute evaluation metrics for 3D bounding boxes.
--vis 5: Enables visualization during testing (e.g., saving visualization results for qualitative analysis).

Evaluation Outputs: After the evaluation is complete, the quantitative results will be saved as a CSV file in a timestamped directory. For example: inference_results/coders_ss3d_convnext_uvd/20260324_200837/metrics_output.csv

This file contains detailed per-class and overall mean metrics. The key headers in the report correspond directly to the metrics reported in our paper (refer to projects/mmdet3d_plugin/models/detectors/coders.py for implementation details):

tp55 & tp55_acc: Accuracy where the rotation error is < 5° and the translation error is < 5cm.
tp52 & tp52_acc: Accuracy where the rotation error is < 5° and the translation error is < 2cm.
iou25_acc, iou50_acc, iou75_acc: 3D Intersection over Union (IoU) accuracies at thresholds of 0.25, 0.50, and 0.75 respectively.
error_rate: Error rate of the label classifier.
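
For readers who want to reproduce a threshold-based pose accuracy such as tp55 or tp52 on their own predictions, here is a minimal sketch. It is not the code in coders.py, which remains the reference: the function names are hypothetical, rotations are assumed to be 3x3 matrices given as nested lists, translations are assumed to be in metres, and object symmetries are ignored.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_gt^T @ R_pred) equals the element-wise product summed over all entries.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres between two translations in metres."""
    return 100.0 * math.dist(t_pred, t_gt)


def pose_accuracy(samples, max_deg, max_cm):
    """Fraction of (R_pred, t_pred, R_gt, t_gt) samples within both thresholds."""
    if not samples:
        return 0.0
    hits = sum(
        1
        for R_pred, t_pred, R_gt, t_gt in samples
        if rotation_error_deg(R_pred, R_gt) < max_deg
        and translation_error_cm(t_pred, t_gt) < max_cm
    )
    return hits / len(samples)
```

Under these assumptions, pose_accuracy(samples, 5, 5) corresponds to the 5°, 5cm criterion and pose_accuracy(samples, 5, 2) to the 5°, 2cm criterion.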

Training

To train CODERS from scratch, run the following command using our script:

sh train.sh

Inside train.sh, we use tools/dist_train.sh for distributed training. You can specify the visible GPUs, the configuration file, the number of GPUs, and the working directory.

By default, the script trains the model with the DINOv2 backbone using 6 GPUs:

# For DINOv2 backbone
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 tools/dist_train.sh projects/configs/coders/coders_ss3d_dinov2_uvd.py 6 --work-dir work_dirs/coders_ss3d_dinov2_uvd

(Optional) If you want to train using the ConvNeXt backbone instead, you can uncomment the corresponding line in the script:

# For ConvNeXt backbone
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 tools/dist_train.sh projects/configs/coders/coders_ss3d_convnext_uvd.py 6 --work-dir work_dirs/coders_ss3d_convnext_uvd

Note: Please adjust CUDA_VISIBLE_DEVICES and the GPU count parameter (e.g., 6) based on your available hardware resources.
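
Since CUDA_VISIBLE_DEVICES and the GPU count must agree, one option is to derive both from a single list of GPU ids. The helper below is a hypothetical sketch and not part of train.sh; it only assembles the same command line shown above and does not launch anything.

```python
import shlex


def build_train_command(gpu_ids, config, work_dir, launcher="tools/dist_train.sh"):
    """Return (env, argv) with the GPU count derived from the id list."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [launcher, config, str(len(gpu_ids)), "--work-dir", work_dir]
    return env, argv


def as_shell_line(env, argv):
    """Render the command as a single shell line for copying into train.sh."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


env, argv = build_train_command(
    gpu_ids=[0, 1],
    config="projects/configs/coders/coders_ss3d_dinov2_uvd.py",
    work_dir="work_dirs/coders_ss3d_dinov2_uvd",
)
line = as_shell_line(env, argv)
```

With two GPU ids, line holds the DINOv2 training command from above with CUDA_VISIBLE_DEVICES=0,1 and a GPU count of 2.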

BibTeX

If you find our work helpful, please consider citing:

@inproceedings{zhang2024category,
  title={Category-level object detection, pose estimation and reconstruction from stereo images},
  author={Zhang, Chuanrui and Ling, Yonggen and Lu, Minglei and Qin, Minghan and Wang, Haoqian},
  booktitle={European Conference on Computer Vision},
  pages={332--349},
  year={2024},
  organization={Springer}
}

Acknowledgements

This project is heavily inspired by and built upon several excellent open-source codebases. We would like to specifically thank the authors and maintainers of PETR and MMDetection3D for their foundational transformer architectures and 3D detection frameworks. Furthermore, we have incorporated concepts, code snippets, and pre-trained models from DINOv2, Facebook Research's DeepSDF, and maurock's DeepSDF implementation.

Many thanks to all these projects for their outstanding contributions to the open-source community!
