Yutong Wang1, Haiyu Zhang3,2, Tianfan Xue4,2, Yu Qiao2, Yaohui Wang2*, Chang Xu1*, Xinyuan Chen2*
1USYD, 2Shanghai AI Laboratory, 3BUAA, 4CUHK
VDOT is an efficient, unified video creation model that achieves high-quality results in just 4 denoising steps. By employing computational optimal transport (OT) within the distillation process, VDOT ensures training stability and improves both training and inference efficiency. VDOT unifies a wide range of capabilities, including Reference-to-Video (R2V), Video-to-Video (V2V), Masked Video Editing (MV2V), and arbitrary composite tasks, matching the versatility of VACE at a significantly reduced inference cost.
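As background on the OT pairing idea (a toy sketch under our assumptions, not the paper's implementation): optimal transport matches each noise sample to a data sample so that the total transport cost is minimal. For a tiny batch this can even be brute-forced over permutations:

```python
import itertools

def ot_pairing(noises, samples):
    """Brute-force optimal transport assignment (squared-distance cost)
    between two equally sized 1-D batches; illustrative only."""
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(samples))):
        cost = sum((noises[i] - samples[j]) ** 2 for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm

# Each noise value is matched to its nearest data value overall:
print(ot_pairing([0.0, 1.0, 2.0], [2.1, 0.1, 1.2]))  # (1, 2, 0)
```

Real implementations operate on high-dimensional latents and use solvers far faster than this factorial search, but the pairing objective is the same.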
- Mar 15, 2026: 🔥Released code for model training, inference, and gradio demos.
- Mar 14, 2026: 🔥VDOT-14B is now available on HuggingFace.
- Mar 14, 2026: 🔥UVCBench is now available on HuggingFace.
- Feb 21, 2026: VDOT is accepted by CVPR 2026.
- Dec 7, 2025: We propose VDOT, a 4-step unified video creation model based on VACE.
The codebase was tested with Python 3.10.13, CUDA version 12.4, and PyTorch >= 2.5.1.
git clone https://github.com/hhhh1138/VDOT.git && cd VDOT
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124 # If PyTorch is not installed.
pip install -r requirements.txt
pip install wan@git+https://github.com/Wan-Video/Wan2.1

For preprocessing tools, such as extracting the source video and mask video for classic video creation tasks (e.g., V2V and MV2V), we recommend the annotator tools in the VACE repository.
More complicated creation tasks, such as video character replacement and video try-on, typically involve more complex input conditions: a restricted pose source video, a mask video for the target area, and images of the person or garments to be replaced. For details on the processing methods, please refer to our scripts folder.
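As an illustration of how these conditions fit together, the following sketch bundles the inputs for a video try-on task. The dictionary keys mirror the inference CLI flags used below; this is not an actual VDOT Python API.

```python
# Hypothetical condition bundle for a video try-on task; the field names
# mirror the inference CLI flags, not a real VDOT Python interface.
tryon_conditions = {
    "src_video": "pose_source.mp4",        # restricted pose source video
    "src_mask": "target_area_mask.mp4",    # mask video for the target area
    "src_ref_images": ["person.png", "garment.png"],  # references to insert
}

print(sorted(tryon_conditions))  # ['src_mask', 'src_ref_images', 'src_video']
```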
We recommend organizing local directories as follows:
VDOT
├── ...
├── benchmarks
│ ├── VACE-Benchmark
│ └── UVCBench
├── models
│ ├── VACE-Annotators
│ ├── VACE-Wan2.1-14B
│   └── VDOT ### (download from [huggingface](https://huggingface.co/yutongwang1012/VDOT))
│       ├── google
│       ├── vdot-weights
│       │   └── vdot_14b.pt
│       ├── models_t5_umt5-xxl-enc-bf16.pth
│       └── Wan2.1_VAE.pth
├── inference
└── training
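Before running inference, it can help to confirm the checkpoints landed where the tree above expects them. A minimal check, using the paths from the README tree (run from the repository root):

```python
from pathlib import Path

# Expected checkpoint paths, taken from the directory tree above.
REQUIRED = [
    "models/VDOT/vdot-weights/vdot_14b.pt",
    "models/VDOT/models_t5_umt5-xxl-enc-bf16.pth",
    "models/VDOT/Wan2.1_VAE.pth",
]

def missing_files(root="."):
    """Return the required files that are absent under `root`."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).exists()]

print(missing_files())  # [] when all weights are in place
```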
In VDOT, users can generate videos based on any combination of input conditions in just four denoising steps.
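The four denoising steps presumably correspond to a sparse subset of the teacher's timestep schedule. As a toy illustration (an assumption on our part, not the released scheduler), selecting 4 evenly spaced timesteps from a 1000-step training schedule might look like:

```python
def select_timesteps(num_train_steps: int = 1000, num_inference_steps: int = 4):
    """Pick evenly spaced timesteps, descending from noise toward data.
    Illustrative only; the actual VDOT schedule may differ."""
    stride = num_train_steps // num_inference_steps
    return [num_train_steps - 1 - i * stride for i in range(num_inference_steps)]

print(select_timesteps())  # [999, 749, 499, 249]
```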
# See the commands in `run_vdot.sh`; we recommend using 4 GPUs for VDOT-14B inference.
torchrun --nproc_per_node=4 --nnodes=1 inference/vace_wan_inference.py \
--dit_fsdp \
--t5_fsdp \
--ulysses_size 4 \
--ring_size 1 \
--size 480p \
--sample_guide_scale 1 \
--sample_steps 4 \
--src_video example_video_1.mp4 \
--src_mask example_video_2.mp4 \
--src_ref_images example_image_1.png,example_image_2.png \
--prompt "xxx"

# Launch the gradio demo
torchrun --nproc_per_node=4 --nnodes=1 inference/vdot_gradio.py

For training, run:

cd training
bash train_vdot.sh

We are grateful for the following awesome projects: VACE, Wan, and Self-Forcing.
@article{wang2025vdot,
title={VDOT: Efficient Unified Video Creation via Optimal Transport Distillation},
author={Wang, Yutong and Zhang, Haiyu and Xue, Tianfan and Qiao, Yu and Wang, Yaohui and Xu, Chang and Chen, Xinyuan},
journal={arXiv preprint arXiv:2512.06802},
year={2025}
}