Paper | Project Page | Model
- 2025.7.16 Released model weights and inference code. Have a try!
- 2025.6.10 Released training and evaluation code.
- Clone the repo into a local folder.
```bash
git clone https://github.com/Hoar012/TDC-Video.git
cd TDC-Video
```

- Install packages.

```bash
conda create -n tdc python=3.10 -y
conda activate tdc
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

Pretrained model weights are available on Hugging Face.
- TDC-Qwen2-7B
- TDC-Llama3_2-3B
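The weights can be fetched with the Hugging Face CLI, for example (the repo id below matches the evaluation commands later in this README; the local directory is an arbitrary illustrative choice):

```bash
# Example download; the --local-dir path is an arbitrary illustrative choice.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Hoar012/TDC-Qwen2-7B --local-dir ./checkpoints/TDC-Qwen2-7B
```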
To try the model, run the inference entry point:

```bash
python main.py
```

- Prepare training data
- Stage 1: Image-Text Alignment: LLaVA-OneVision-Single
- Stage 2: Video Instruction Tuning: Stage2 data
- Stage 3: Audio-Video Instruction Tuning: Stage3 data
We also provide the processed videos and audio for stage 3 training: Processed data.
- Start training
Modify the `PATH_TO_JSON` and `PATH_TO_FOLDER` arguments in the training scripts to the folder where you saved the data.

```bash
PATH_TO_JSON=""
PATH_TO_FOLDER=""
```
Train your own model
- Stage 1: Image-Text Alignment
```bash
sh scripts/stage1/train_image_qwen.sh
```
Modify `PREV_STAGE_CHECKPOINT` in the training scripts to your first-stage model path, and change `image_token_len` and `query_num_list` in `config.json` to 144.
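A minimal sketch of that config edit (the checkpoint path is hypothetical, and it assumes `query_num_list` holds a single-element list):

```bash
# Sketch only: the config.json path is hypothetical, and query_num_list
# is assumed to be a single-element list.
python -c "
import json
path = 'checkpoints/stage1-qwen/config.json'  # replace with your checkpoint
cfg = json.load(open(path))
cfg['image_token_len'] = 144
cfg['query_num_list'] = [144]
json.dump(cfg, open(path, 'w'), indent=2)
"
```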
- Stage 2: Video Instruction Tuning
```bash
sh scripts/stage2/train_video_qwen.sh
```
- Stage 3: Audio-Video Instruction Tuning
```bash
# LoRA training
sh scripts/stage3/train_video_audio_qwen_lora.sh
```
Evaluation

```bash
# Evaluate on MLVU
torchrun --nproc_per_node=8 ./eval/eval_mlvu.py --model_path Hoar012/TDC-Qwen2-7B --model_name cambrian_qwen --version qwen --data_path eval/MLVU

# Evaluate on Music-AVQA
torchrun --nproc_per_node=8 ./eval/eval_musicQA.py --model_path Hoar012/TDC-Qwen2-7B --model_name cambrian_qwen --version qwen --data_path eval/Music-AVQA --test_file eval/Music-AVQA/avqa-test.json
```

For more detailed instructions on evaluation, please refer to the evaluation guide.
```bibtex
@misc{hao2025multimodallongvideomodeling,
  title={Multimodal Long Video Modeling Based on Temporal Dynamic Context},
  author={Haoran Hao and Jiaming Han and Yiyuan Zhang and Xiangyu Yue},
  year={2025},
  eprint={2504.10443},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.10443},
}
```
This repository is built upon LLaVA, LongVU, and StoryTeller.
