[Demo video: DynamicVerse.mp4]
- Table of Contents
- Overview
- Key Features
- Project Structure
- Installation
- Quick Start
- Processing Pipeline
- Output Directory Structure
- Evaluation
- Notes
- Acknowledgements
- License
- Contributing
- Citation
DynamicVerse is an integrated framework for dynamic scene understanding and 4D reconstruction, combining advanced visual models such as Sa2VA, Qwen-VL, DAM, CameraBench, CoTracker, and UniDepth to achieve end-to-end processing from video to 4D scenes.
- 🎬 Dynamic Scene Analysis: Supports video keyframe extraction and motion-aware analysis
- 🔍 Multimodal Understanding: Integrates vision-language models for scene description and object recognition
- 🎯 Dense Segmentation: Precise object segmentation and tracking based on Sa2VA
- 📊 4D Reconstruction: Complete pipeline from video to 4D scene reconstruction
```
DynamicVerse/
├── dynamicBA/           # 4D scene reconstruction module
│   ├── unimatch/        # Optical flow and depth estimation
│   ├── dataset_prepare/ # Data preprocessing tools
│   └── config/          # Configuration files
├── data/                # Dataset directory
├── scripts/             # Preprocessing scripts
├── dynamicgen/          # Pipeline execution
│   └── scripts/         # DynamicGen pipeline
├── Sa2VA/               # Vision-language multimodal model
├── CoTracker/           # Point tracking model
├── UniDepth/            # Monocular depth estimation
└── ...
```
```bash
git clone --recurse-submodules https://github.com/Dynamics-X/DynamicVerse.git
cd DynamicVerse
conda create -n dynamicverse python=3.10
conda activate dynamicverse
bash scripts/install.sh
```

Download the pre-trained weights:

```bash
bash scripts/download_weights.sh
```

This script will automatically download the following models:
- CoTracker3 (for motion tracking)
- UniDepth (for depth estimation)
- Sa2VA-8B (multimodal understanding model)
- Qwen2.5-VL-72B-Instruct (vision-language model, optional)
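If the download script fails (for example, behind a firewall), the checkpoints can also be pulled manually with `huggingface_hub`. The snippet below is only a sketch: the Hugging Face repo ids and the local target directories are assumptions, so check `scripts/download_weights.sh` for the ids and paths the pipeline actually uses.

```python
# Manual checkpoint download (sketch only; repo ids and target directories
# are assumptions -- see scripts/download_weights.sh for the real ones).
from huggingface_hub import snapshot_download

checkpoints = {
    "Qwen/Qwen2.5-VL-72B-Instruct": "checkpoints/Qwen2.5-VL-72B-Instruct",  # optional
    "ByteDance/Sa2VA-8B": "checkpoints/Sa2VA-8B",
    "facebook/cotracker3": "checkpoints/cotracker3",
}

for repo_id, local_dir in checkpoints.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"downloaded {repo_id} -> {local_dir}")
```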
Run the complete geometric scene pipeline:

```bash
cd dynamicgen
bash scripts/run_pipeline_demo.sh '' -all
```

This script executes the following steps:
- Keyframe Extraction: Motion-aware video keyframe extraction
- Scene Analysis: Multimodal analysis using Qwen and Sa2VA
- Segmentation Processing: Generate object masks and organize output
- 4D Reconstruction (Optional): Complete 4D scene reconstruction using dynamicBA
Qwen2.5-VL can be used in two ways:
For API service usage:
- Set API Key: set the environment variable when running the scripts:

  ```bash
  export DASHSCOPE_API_KEY=your_api_key
  ```

  Or set it directly in `dynamicgen/scripts/run_pipeline_demo.sh`.
- Modify API Configuration: edit `dynamicgen/stage1_qwen.py`:

  ```python
  client = OpenAI(
      api_key=api_key,  # Use API key from environment variable
      base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # API service address
  )
  model = "qvq-max-latest"  # Or other Qwen models
  ```
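To sanity-check the API configuration before running the whole pipeline, a standalone call can be made with the same client settings. The snippet below is an illustrative sketch rather than repository code; it assumes `DASHSCOPE_API_KEY` is exported as above and uses the standard OpenAI-compatible chat format (streaming is used because some reasoning models on this endpoint only return streamed output).

```python
# Standalone sanity check for the DashScope compatible-mode setup
# (illustrative sketch, not part of the repository).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qvq-max-latest",  # or another Qwen vision-language model
    messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
    stream=True,  # some reasoning models on this endpoint only support streaming
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()
```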
For local deployment, modify `dynamicgen/stage1_qwen.py` to point to the local service:

```python
client = OpenAI(
    base_url="http://127.0.0.1:22002/v1",  # Local service address
    api_key="none"  # Not needed for local service
)
# Specify model name
model = "Qwen/Qwen2.5-VL-72B-Instruct"
```

Install dependencies:
```bash
pip install accelerate
pip install qwen-vl-utils==0.0.14
uv pip install -U vllm  # Requires vllm>=0.11.0
```

Start the local service:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model <ckpt_path> \
    --served-model-name Qwen/Qwen2.5-VL-72B-Instruct \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data \
    --enable-expert-parallel \
    --host 0.0.0.0 \
    --port 22002 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.70 \
    --quantization fp8 \
    --distributed-executor-backend mp
```

For detailed deployment instructions, refer to Qwen-VL.
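Once the server is up, listing the served models through the same OpenAI-compatible client is a quick way to confirm that `stage1_qwen.py` will find the expected model name. A minimal sketch, assuming the host and port from the launch command above:

```python
# Quick health check for the local vLLM OpenAI-compatible server
# (a sketch; assumes the host/port used in the launch command above).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:22002/v1", api_key="none")

# The listed id should match the --served-model-name passed to vLLM.
for model in client.models.list().data:
    print(model.id)  # expect: Qwen/Qwen2.5-VL-72B-Instruct
```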
Place videos or image sequences in the data/ directory
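If your source material is a video but you prefer to provide an image sequence, a short OpenCV loop is enough to dump frames before running the extraction step below. This is an illustrative sketch, not repository code; the zero-padded `.jpg` naming mirrors the `rgb/` convention shown in the output structure below, and the example paths are placeholders, so check the preprocessing scripts for the exact layout they expect.

```python
# Illustrative sketch: convert a video into a numbered image sequence.
# The 5-digit .jpg naming follows the rgb/ convention documented below;
# paths here are placeholders, not the layout required by the scripts.
import os
import cv2

def video_to_frames(video_path: str, out_dir: str) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        count += 1
        cv2.imwrite(os.path.join(out_dir, f"{count:05d}.jpg"), frame)
    cap.release()
    return count

if __name__ == "__main__":
    n = video_to_frames("data/demo_scene.mp4", "data/demo_scene_frames")
    print(f"wrote {n} frames")
```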
Keyframe Extraction:

```bash
python motion_aware_key_frame_extract.py \
    --input_root <input_path> \
    --output_root <output_path> \
    --flow_model 'unimatch'
```

Scene Analysis and Segmentation:

```bash
python batch_process_qwen_pipeline.py \
    <dataset_path> \
    <output_path> \
    --base_frame_dir <base_frame_dir> \
    --key_frame_dir <key_frame_dir>
```

4D Reconstruction:

```bash
cd dynamicBA
python ./dynamicBA/run.py \
    --config ./dynamicBA/config/config.yaml \
    --experiment_name base \
    --opt_intrinsics \
    --workdir <workdir>
```
After processing, the following directory structure is generated:

```
data/
├── key_frames/                       # Keyframe extraction results
│   └── <dataset_name>/               # Dataset name
│       └── <scene_id>/               # Scene ID
│           ├── frame_*.jpg
│           └── keyframe_info.json
└── demo/                             # Processed scene data
    └── <scene_id>/                   # Scene ID directory
        ├── videos/                   # Original video files
        │   └── <scene_id>.mp4
        ├── rgb/                      # Extracted RGB frames
        │   ├── 00001.jpg
        │   ├── 00002.jpg
        │   └── ...
        ├── analysis/                 # Scene analysis results
        │   └── dynamic_objects_<scene_id>.json  # Dynamic object detection results
        ├── qwen/                     # Qwen model outputs
        │   └── Annotations/          # Segmentation annotations
        │       ├── frame_00000.png
        │       ├── frame_00001.png
        │       └── ...
        ├── segmentation/             # Sa2VA segmentation results
        │   ├── frames/               # Frame-level segmentation results
        │   │   ├── original/         # Original frames
        │   │   ├── masks/            # Segmentation masks
        │   │   ├── overlay/          # Overlay visualizations
        │   │   └── segmented/        # Segmented images
        │   ├── videos/               # Segmentation videos
        │   │   ├── original.mp4      # Original video
        │   │   ├── masks.mp4         # Mask video
        │   │   ├── overlay.mp4       # Overlay video
        │   │   └── segmented.mp4     # Segmented video
        │   ├── instance_labels.json  # Instance label information
        │   └── result_summary.json   # Segmentation result summary
        ├── dynamicBA/                # 4D reconstruction results (optional)
        │   ├── pose.npz              # Camera intrinsics and extrinsics
        │   ├── depth/                # Depth maps
        │   └── flow/                 # Optical flow data
        └── processing_log_<scene_id>.log  # Processing log
```
- dynamic_objects_*.json: Contains detected dynamic object information, including position, category, and tracking ID
- instance_labels.json: Label mapping for each instance, used for multi-object segmentation
- result_summary.json: Segmentation result statistics, including frame count, object count, etc.
- processing_log_*.log: Detailed processing log for debugging
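Since the per-scene outputs are plain `.npz` and `.json` files, they can be inspected directly in Python. A minimal sketch (not repository code; the example path follows the layout above, and the keys inside `pose.npz` are simply listed rather than assumed):

```python
# Minimal sketch for inspecting per-scene outputs (example path only).
import json
import numpy as np

scene_dir = "data/demo/scene_0001"  # substitute a real <scene_id> directory

# pose.npz: camera intrinsics/extrinsics; list whatever arrays it contains.
poses = np.load(f"{scene_dir}/dynamicBA/pose.npz")
for key in poses.files:
    print(key, poses[key].shape)

# Instance label mapping and segmentation summary from Sa2VA.
with open(f"{scene_dir}/segmentation/instance_labels.json") as f:
    print(json.load(f))
with open(f"{scene_dir}/segmentation/result_summary.json") as f:
    print(json.load(f))
```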
We provide preprocessed datasets to reproduce Tables 1 and 2 in the main paper. You can download the preprocessed data used for the quantitative results:
```bash
cd data
gdown https://drive.google.com/uc?id=1V1WIRvnJCJStL63rluwNZMPI2Gq4-yQy -O preprocessed.zip
unzip preprocessed.zip
```

We provide evaluation scripts for pose and depth metrics:

```bash
bash ./scripts/eval.sh
```

- Storage Space: Pre-trained models require approximately 100 GB of storage
- Memory Requirements: Sa2VA-8B requires at least 32 GB of VRAM; Qwen2.5-VL requires even more
- Data Formats: Supports common video formats and image sequences
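Before launching the heavier models it can be useful to confirm that the visible GPUs meet these requirements. A minimal sketch, assuming a CUDA-enabled PyTorch install:

```python
# Report total VRAM per visible GPU (sketch; assumes PyTorch with CUDA).
import torch

if not torch.cuda.is_available():
    print("No CUDA device visible.")
else:
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
        # Sa2VA-8B needs roughly 32 GB on a single GPU per the note above.
```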
Our code is based on the following awesome repositories: Sa2VA, Qwen-VL, DAM, CameraBench, CoTracker, UniDepth, and UniMatch.
This project is built upon multiple open-source projects. Please refer to the license requirements of each submodule.
Issues and Pull Requests are welcome. Before submitting code, please ensure that:
- Your code follows the project style guidelines
- All test cases pass
- Relevant documentation is updated
If you find our work useful in your research, please consider giving a star ⭐ and citing the following paper 📝.
```bibtex
@misc{wen2025dynamicverse,
    title={DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling},
    author={Kairun Wen and Yuzhi Huang and Runyu Chen and Hui Zheng and Yunlong Lin and Panwang Pan and Chenxin Li and Wenyan Cong and Jian Zhang and Junbin Lu and Chenguo Lin and Dilin Wang and Zhicheng Yan and Hongyu Xu and Justin Theiss and Yue Huang and Xinghao Ding and Rakesh Ranjan and Zhiwen Fan},
    year={2025},
    eprint={2512.03000},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.03000},
}
```