Yutong Wang1, Haiyu Zhang3,2, Tianfan Xue4,2, Yu Qiao2, Yaohui Wang2*, Chang Xu1*, Xinyuan Chen2*
1USYD, 2Shanghai AI Laboratory, 3BUAA, 4CUHK
VDOT is an efficient, unified video creation model that achieves high-quality results in just 4 denoising steps. By employing computational optimal transport (OT) within the distillation process, VDOT ensures training stability and improves both training and inference efficiency. VDOT unifies a wide range of capabilities, including Reference-to-Video (R2V), Video-to-Video (V2V), Masked Video Editing (MV2V), and arbitrary composite tasks, matching the versatility of VACE at a significantly reduced inference cost.
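As background on the OT pairing idea (a toy sketch under our assumptions, not the paper's implementation): optimal transport matches each noise sample to a data sample so that the total transport cost is minimal. For a tiny batch this can even be brute-forced over permutations:

```python
import itertools

def ot_pairing(noises, samples):
    """Brute-force optimal transport assignment (squared-distance cost)
    between two equally sized 1-D batches; illustrative only."""
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(samples))):
        cost = sum((noises[i] - samples[j]) ** 2 for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm

# Each noise value is matched to its nearest data value overall:
print(ot_pairing([0.0, 1.0, 2.0], [2.1, 0.1, 1.2]))  # (1, 2, 0)
```

Real implementations operate on high-dimensional latents and use solvers far faster than this factorial search, but the pairing objective is the same.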
- Mar 15, 2026: 🔥Released code for model training, inference, and gradio demos.
- Mar 14, 2026: 🔥VDOT-14B is now available on HuggingFace.
- Mar 14, 2026: 🔥UVCBench is now available on HuggingFace.
- Feb 21, 2026: VDOT is accepted by CVPR 2026.
- Dec 7, 2025: We propose VDOT, a 4-step unified video creation model based on VACE.
The codebase was tested with Python 3.10.13, CUDA version 12.4, and PyTorch >= 2.5.1.
git clone https://github.com/hhhh1138/VDOT.git && cd VDOT
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124 # If PyTorch is not installed.
pip install -r requirements.txt
pip install wan@git+https://github.com/Wan-Video/Wan2.1

For preprocessing tools, such as extracting the source video and mask video for classic video creation tasks (e.g., V2V and MV2V), we recommend the annotator tools in the VACE repository.
More complicated creation tasks, such as video character replacement and video try-on, typically involve more complex input conditions: a restricted pose source video, a mask video for the target area, and images of the person or garments to be replaced. For details on the processing methods, please refer to our scripts folder.
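As an illustration of how these conditions fit together, the following sketch bundles the inputs for a video try-on task. The dictionary keys mirror the inference CLI flags used below; this is not an actual VDOT Python API.

```python
# Hypothetical condition bundle for a video try-on task; the field names
# mirror the inference CLI flags, not a real VDOT Python interface.
tryon_conditions = {
    "src_video": "pose_source.mp4",        # restricted pose source video
    "src_mask": "target_area_mask.mp4",    # mask video for the target area
    "src_ref_images": ["person.png", "garment.png"],  # references to insert
}

print(sorted(tryon_conditions))  # ['src_mask', 'src_ref_images', 'src_video']
```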
We recommend organizing local directories as follows:
VDOT
├── ...
├── benchmarks
│ ├── VACE-Benchmark
│ └── UVCBench
├── models
│ ├── VACE-Annotators
│ ├── VACE-Wan2.1-14B
│   └── VDOT ### (download from [huggingface](https://huggingface.co/yutongwang1012/VDOT))
│       ├── google
│       ├── vdot-weights
│       │   └── vdot_14b.pt
│       ├── models_t5_umt5-xxl-enc-bf16.pth
│       └── Wan2.1_VAE.pth
├── inference
└── training
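Before running inference, it can help to confirm the checkpoints landed where the tree above expects them. A minimal check, using the paths from the README tree (run from the repository root):

```python
from pathlib import Path

# Expected checkpoint paths, taken from the directory tree above.
REQUIRED = [
    "models/VDOT/vdot-weights/vdot_14b.pt",
    "models/VDOT/models_t5_umt5-xxl-enc-bf16.pth",
    "models/VDOT/Wan2.1_VAE.pth",
]

def missing_files(root="."):
    """Return the required files that are absent under `root`."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).exists()]

print(missing_files())  # [] when all weights are in place
```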
In VDOT, users can generate videos based on any combination of input conditions in just four denoising steps.
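The four denoising steps presumably correspond to a sparse subset of the teacher's timestep schedule. As a toy illustration (an assumption on our part, not the released scheduler), selecting 4 evenly spaced timesteps from a 1000-step training schedule might look like:

```python
def select_timesteps(num_train_steps: int = 1000, num_inference_steps: int = 4):
    """Pick evenly spaced timesteps, descending from noise toward data.
    Illustrative only; the actual VDOT schedule may differ."""
    stride = num_train_steps // num_inference_steps
    return [num_train_steps - 1 - i * stride for i in range(num_inference_steps)]

print(select_timesteps())  # [999, 749, 499, 249]
```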
# See the commands in `run_vdot.sh`; we recommend using 4 GPUs for VDOT-14B inference.
torchrun --nproc_per_node=4 --nnodes=1 inference/vace_wan_inference.py \
--dit_fsdp \
--t5_fsdp \
--ulysses_size 4 \
--ring_size 1 \
--size 480p \
--sample_guide_scale 1 \
--sample_steps 4 \
--src_video example_video_1.mp4 \
--src_mask example_video_2.mp4 \
--src_ref_images example_image_1.png,example_image_2.png \
--prompt "xxx"

# Launch the gradio demo
torchrun --nproc_per_node=4 --nnodes=1 inference/vdot_gradio.py

For training, run:

cd training
bash train_vdot.sh

We are grateful for the following awesome projects: VACE, Wan, and Self-Forcing.
@article{wang2025vdot,
title={VDOT: Efficient Unified Video Creation via Optimal Transport Distillation},
author={Wang, Yutong and Zhang, Haiyu and Xue, Tianfan and Qiao, Yu and Wang, Yaohui and Xu, Chang and Chen, Xinyuan},
journal={arXiv preprint arXiv:2512.06802},
year={2025}
}