[Demo video: DynamicVerse.mp4]
- Table of Contents
- Overview
- Key Features
- Project Structure
- Installation
- Quick Start
- Processing Pipeline
- Output Directory Structure
- Evaluation
- Notes
- Acknowledgements
- License
- Contributing
- Citation
DynamicVerse is an integrated framework for dynamic scene understanding and 4D reconstruction, combining advanced visual models such as Sa2VA, Qwen-VL, DAM, CameraBench, CoTracker, and UniDepth to achieve end-to-end processing from video to 4D scenes.
- 🎬 Dynamic Scene Analysis: Supports video keyframe extraction and motion-aware analysis
- 🔍 Multimodal Understanding: Integrates vision-language models for scene description and object recognition
- 🎯 Dense Segmentation: Precise object segmentation and tracking based on Sa2VA
- 📊 4D Reconstruction: Complete pipeline from video to 4D scene reconstruction
```
DynamicVerse/
├── dynamicBA/           # 4D scene reconstruction module
│   ├── unimatch/        # Optical flow and depth estimation
│   ├── dataset_prepare/ # Data preprocessing tools
│   └── config/          # Configuration files
├── data/                # Dataset directory
├── scripts/             # Preprocessing scripts
├── dynamicgen/          # Pipeline execution
│   └── scripts/         # DynamicGen pipeline
├── Sa2VA/               # Vision-language multimodal model
├── CoTracker/           # Point tracking model
├── UniDepth/            # Monocular depth estimation
└── ...
```
```bash
git clone --recurse-submodules https://github.com/Dynamics-X/DynamicVerse.git
cd DynamicVerse
conda create -n dynamicverse python=3.10
conda activate dynamicverse
bash scripts/install.sh
```

Download the pre-trained weights:

```bash
bash scripts/download_weights.sh
```

This script will automatically download the following models:
- CoTracker3 (for motion tracking)
- UniDepth (for depth estimation)
- Sa2VA-8B (multimodal understanding model)
- Qwen2.5-VL-72B-Instruct (vision-language model, optional)
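If the download script fails (for example, behind a firewall), the checkpoints can also be pulled manually with `huggingface_hub`. The snippet below is only a sketch: the Hugging Face repo ids and the local target directories are assumptions, so check `scripts/download_weights.sh` for the ids and paths the pipeline actually uses.

```python
# Manual checkpoint download (sketch only; repo ids and target directories
# are assumptions -- see scripts/download_weights.sh for the real ones).
from huggingface_hub import snapshot_download

checkpoints = {
    "Qwen/Qwen2.5-VL-72B-Instruct": "checkpoints/Qwen2.5-VL-72B-Instruct",  # optional
    "ByteDance/Sa2VA-8B": "checkpoints/Sa2VA-8B",
    "facebook/cotracker3": "checkpoints/cotracker3",
}

for repo_id, local_dir in checkpoints.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"downloaded {repo_id} -> {local_dir}")
```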
Run the complete geometric scene pipeline:

```bash
cd dynamicgen
bash scripts/run_pipeline_demo.sh '' -all
```

This script executes the following steps:
- Keyframe Extraction: Motion-aware video keyframe extraction
- Scene Analysis: Multimodal analysis using Qwen and Sa2VA
- Segmentation Processing: Generate object masks and organize output
- 4D Reconstruction (Optional): Complete 4D scene reconstruction using dynamicBA
Qwen2.5-VL can be used in two ways:
For API service usage:
- Set API Key: set the environment variable when running the scripts:

  ```bash
  export DASHSCOPE_API_KEY=your_api_key
  ```

  Or set it directly in `dynamicgen/scripts/run_pipeline_demo.sh`.
- Modify API Configuration: edit `dynamicgen/stage1_qwen.py`:

  ```python
  client = OpenAI(
      api_key=api_key,  # Use API key from environment variable
      base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # API service address
  )
  model = "qvq-max-latest"  # Or other Qwen models
  ```
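To sanity-check the API configuration before running the whole pipeline, a standalone call can be made with the same client settings. The snippet below is an illustrative sketch rather than repository code; it assumes `DASHSCOPE_API_KEY` is exported as above and uses the standard OpenAI-compatible chat format (streaming is used because some reasoning models on this endpoint only return streamed output).

```python
# Standalone sanity check for the DashScope compatible-mode setup
# (illustrative sketch, not part of the repository).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qvq-max-latest",  # or another Qwen vision-language model
    messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
    stream=True,  # some reasoning models on this endpoint only support streaming
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()
```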
For local deployment, modify `dynamicgen/stage1_qwen.py` to point to the local service:

```python
client = OpenAI(
    base_url="http://127.0.0.1:22002/v1",  # Local service address
    api_key="none"  # Not needed for local service
)
# Specify model name
model = "Qwen/Qwen2.5-VL-72B-Instruct"
```

Install dependencies:
```bash
pip install accelerate
pip install qwen-vl-utils==0.0.14
uv pip install -U vllm  # Requires vllm>=0.11.0
```

Start the local service:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model <ckpt_path> \
    --served-model-name Qwen/Qwen2.5-VL-72B-Instruct \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data \
    --enable-expert-parallel \
    --host 0.0.0.0 \
    --port 22002 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.70 \
    --quantization fp8 \
    --distributed-executor-backend mp
```

For detailed deployment instructions, refer to Qwen-VL.
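Once the server is up, listing the served models through the same OpenAI-compatible client is a quick way to confirm that `stage1_qwen.py` will find the expected model name. A minimal sketch, assuming the host and port from the launch command above:

```python
# Quick health check for the local vLLM OpenAI-compatible server
# (a sketch; assumes the host/port used in the launch command above).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:22002/v1", api_key="none")

# The listed id should match the --served-model-name passed to vLLM.
for model in client.models.list().data:
    print(model.id)  # expect: Qwen/Qwen2.5-VL-72B-Instruct
```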
Place videos or image sequences in the data/ directory
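If your source material is a video but you prefer to provide an image sequence, a short OpenCV loop is enough to dump frames before running the extraction step below. This is an illustrative sketch, not repository code; the zero-padded `.jpg` naming mirrors the `rgb/` convention shown in the output structure below, and the example paths are placeholders, so check the preprocessing scripts for the exact layout they expect.

```python
# Illustrative sketch: convert a video into a numbered image sequence.
# The 5-digit .jpg naming follows the rgb/ convention documented below;
# paths here are placeholders, not the layout required by the scripts.
import os
import cv2

def video_to_frames(video_path: str, out_dir: str) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        count += 1
        cv2.imwrite(os.path.join(out_dir, f"{count:05d}.jpg"), frame)
    cap.release()
    return count

if __name__ == "__main__":
    n = video_to_frames("data/demo_scene.mp4", "data/demo_scene_frames")
    print(f"wrote {n} frames")
```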
Keyframe Extraction:

```bash
python motion_aware_key_frame_extract.py \
    --input_root <input_path> \
    --output_root <output_path> \
    --flow_model 'unimatch'
```

Scene Analysis and Segmentation:

```bash
python batch_process_qwen_pipeline.py \
    <dataset_path> \
    <output_path> \
    --base_frame_dir <base_frame_dir> \
    --key_frame_dir <key_frame_dir>
```

4D Reconstruction:

```bash
cd dynamicBA
python ./dynamicBA/run.py \
    --config ./dynamicBA/config/config.yaml \
    --experiment_name base \
    --opt_intrinsics \
    --workdir <workdir>
```
After processing, the following directory structure is generated:

```
data/
├── key_frames/                       # Keyframe extraction results
│   └── <dataset_name>/               # Dataset name
│       └── <scene_id>/               # Scene ID
│           ├── frame_*.jpg
│           └── keyframe_info.json
└── demo/                             # Processed scene data
    └── <scene_id>/                   # Scene ID directory
        ├── videos/                   # Original video files
        │   └── <scene_id>.mp4
        ├── rgb/                      # Extracted RGB frames
        │   ├── 00001.jpg
        │   ├── 00002.jpg
        │   └── ...
        ├── analysis/                 # Scene analysis results
        │   └── dynamic_objects_<scene_id>.json  # Dynamic object detection results
        ├── qwen/                     # Qwen model outputs
        │   └── Annotations/          # Segmentation annotations
        │       ├── frame_00000.png
        │       ├── frame_00001.png
        │       └── ...
        ├── segmentation/             # Sa2VA segmentation results
        │   ├── frames/               # Frame-level segmentation results
        │   │   ├── original/         # Original frames
        │   │   ├── masks/            # Segmentation masks
        │   │   ├── overlay/          # Overlay visualizations
        │   │   └── segmented/        # Segmented images
        │   ├── videos/               # Segmentation videos
        │   │   ├── original.mp4      # Original video
        │   │   ├── masks.mp4         # Mask video
        │   │   ├── overlay.mp4       # Overlay video
        │   │   └── segmented.mp4     # Segmented video
        │   ├── instance_labels.json  # Instance label information
        │   └── result_summary.json   # Segmentation result summary
        ├── dynamicBA/                # 4D reconstruction results (optional)
        │   ├── pose.npz              # Camera intrinsics and extrinsics
        │   ├── depth/                # Depth maps
        │   └── flow/                 # Optical flow data
        └── processing_log_<scene_id>.log  # Processing log
```
- dynamic_objects_*.json: Contains detected dynamic object information, including position, category, and tracking ID
- instance_labels.json: Label mapping for each instance, used for multi-object segmentation
- result_summary.json: Segmentation result statistics, including frame count, object count, etc.
- processing_log_*.log: Detailed processing log for debugging
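Since the per-scene outputs are plain `.npz` and `.json` files, they can be inspected directly in Python. A minimal sketch (not repository code; the example path follows the layout above, and the keys inside `pose.npz` are simply listed rather than assumed):

```python
# Minimal sketch for inspecting per-scene outputs (example path only).
import json
import numpy as np

scene_dir = "data/demo/scene_0001"  # substitute a real <scene_id> directory

# pose.npz: camera intrinsics/extrinsics; list whatever arrays it contains.
poses = np.load(f"{scene_dir}/dynamicBA/pose.npz")
for key in poses.files:
    print(key, poses[key].shape)

# Instance label mapping and segmentation summary from Sa2VA.
with open(f"{scene_dir}/segmentation/instance_labels.json") as f:
    print(json.load(f))
with open(f"{scene_dir}/segmentation/result_summary.json") as f:
    print(json.load(f))
```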
We provide preprocessed datasets to reproduce Tables 1 and 2 in the main paper. You can download the preprocessed data used for the quantitative results:
```bash
cd data
gdown https://drive.google.com/uc?id=1V1WIRvnJCJStL63rluwNZMPI2Gq4-yQy -O preprocessed.zip
unzip preprocessed.zip
```

We provide evaluation scripts for pose and depth metrics:

```bash
bash ./scripts/eval.sh
```

- Storage Space: Pre-trained models require approximately 100 GB of storage
- Memory Requirements: Sa2VA-8B requires at least 32 GB of VRAM; Qwen2.5-VL requires even more
- Data Formats: Supports common video formats and image sequences
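Before launching the heavier models it can be useful to confirm that the visible GPUs meet these requirements. A minimal sketch, assuming a CUDA-enabled PyTorch install:

```python
# Report total VRAM per visible GPU (sketch; assumes PyTorch with CUDA).
import torch

if not torch.cuda.is_available():
    print("No CUDA device visible.")
else:
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
        # Sa2VA-8B needs roughly 32 GB on a single GPU per the note above.
```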
Our code is based on the following awesome repositories: Sa2VA, Qwen-VL, DAM, CameraBench, CoTracker, UniDepth, and UniMatch.
This project is built upon multiple open-source projects. Please refer to the license requirements of each submodule.
Issues and Pull Requests are welcome. Before submitting code, please ensure that:
- Your code follows the project style guidelines
- All test cases pass
- Relevant documentation is updated
If you find our work useful in your research, please consider giving a star ⭐ and citing the following paper 📝.
```bibtex
@misc{wen2025dynamicverse,
    title={DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling},
    author={Kairun Wen and Yuzhi Huang and Runyu Chen and Hui Zheng and Yunlong Lin and Panwang Pan and Chenxin Li and Wenyan Cong and Jian Zhang and Junbin Lu and Chenguo Lin and Dilin Wang and Zhicheng Yan and Hongyu Xu and Justin Theiss and Yue Huang and Xinghao Ding and Rakesh Ranjan and Zhiwen Fan},
    year={2025},
    eprint={2512.03000},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.03000},
}
```