
Multimodal Long Video Modeling Based on Temporal Dynamic Context

News

  • 2025.7.16: Released the model weights and inference code. Have a try!
  • 2025.6.10: Released the training and evaluation code.

📋 Contents

  • Framework of Temporal Dynamic Context Compression
  • Install
  • Models
  • Demo
  • Training
  • Evaluation
  • BibTeX
  • Acknowledgement

Framework of Temporal Dynamic Context Compression

Architecture of Our Multimodal Video Encoder. We first extract features for each second of the video, including both visual tokens and the corresponding audio tokens. The first frame is selected as the static frame, and a Q-Former performs Temporal Dynamic Context compression based on the relationship between the static frame and the subsequent frames, producing K compressed tokens per frame. The final video representation consists of all static-frame tokens plus the compressed multimodal video context.
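
To make the compression step concrete, below is a minimal PyTorch sketch of the idea. The class name, dimensions, and the exact way the static frame conditions the compression are illustrative assumptions, not the released implementation: learnable queries cross-attend over each second's multimodal tokens together with the static-frame tokens, yielding K compressed tokens per second.

import torch
import torch.nn as nn

class TDCCompressor(nn.Module):
    """Sketch of Temporal Dynamic Context compression (hypothetical names/shapes)."""
    def __init__(self, dim=1024, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, static_tokens, frame_tokens):
        # static_tokens: (B, S, D) tokens of the first (static) frame
        # frame_tokens:  (B, T, N, D) visual + audio tokens for each second
        B, T, N, D = frame_tokens.shape
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, K, D)
        compressed = []
        for t in range(T):
            # Condition compression on the static frame by letting the
            # queries attend over both token sets at once.
            kv = torch.cat([static_tokens, frame_tokens[:, t]], dim=1)
            ctx, _ = self.cross_attn(q, kv, kv)               # (B, K, D)
            compressed.append(self.proj(ctx))
        context = torch.cat(compressed, dim=1)                # (B, T*K, D)
        # Final video representation: static-frame tokens + compressed context
        return torch.cat([static_tokens, context], dim=1)

# Toy shapes: an 8-second clip with 160 multimodal tokens per second is
# compressed to 16 tokens per second, appended to 144 static-frame tokens.
model = TDCCompressor()
static = torch.randn(2, 144, 1024)
frames = torch.randn(2, 8, 160, 1024)
print(model(static, frames).shape)  # torch.Size([2, 272, 1024])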

Install

  1. Clone the repo into a local folder.
git clone https://github.com/Hoar012/TDC-Video.git
cd TDC-Video
  2. Install packages.
conda create -n tdc python=3.10 -y
conda activate tdc
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
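
As a quick sanity check before running anything, the snippet below verifies that PyTorch sees a GPU and that flash-attn imports cleanly (a generic check, assuming a recent flash-attn build that exposes __version__; not part of this repo):

import torch
import flash_attn

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)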

Models

Pretrained model weights are available on Hugging Face.

  • TDC-Qwen2-7B
  • TDC-Llama3_2-3B

Demo

python main.py

Training

  1. Prepare training data

We also provide the processed videos and audio for stage 3 training: Processed data.

  2. Start training

Set the PATH_TO_JSON and PATH_TO_FOLDER variables in the training scripts to the locations where you saved the annotation JSON and the video/audio folder.

PATH_TO_JSON=""
PATH_TO_FOLDER=""
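
Before launching a long run, a short check like the one below can confirm that both paths resolve. The paths and the assumption that the annotation file is a LLaVA-style JSON list are hypothetical, not taken from this repo:

import json
import os

PATH_TO_JSON = "/data/tdc/stage3/annotations.json"  # hypothetical path
PATH_TO_FOLDER = "/data/tdc/stage3/media"           # hypothetical path

with open(PATH_TO_JSON) as f:
    samples = json.load(f)  # assumes a JSON list of training samples
print(len(samples), "training samples")

assert os.path.isdir(PATH_TO_FOLDER), "media folder not found"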

Train your own model

  • Stage 1: Image-Text Alignment
sh scripts/stage1/train_image_qwen.sh

Modify PREV_STAGE_CHECKPOINT in the training scripts to point to your stage 1 model path.

Change image_token_len and query_num_list in config.json to 144 (see the sketch after this list).

  • Stage 2: Video Instruction Tuning
sh scripts/stage2/train_video_qwen.sh
  • Stage 3: Audio-Video Instruction Tuning
# LoRA training
sh scripts/stage3/train_video_audio_qwen_lora.sh
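
As referenced in the stage 1 notes above, the config.json change can be applied with a few lines of Python (a sketch; treating query_num_list as a one-element list is our assumption):

import json

with open("config.json") as f:
    cfg = json.load(f)

cfg["image_token_len"] = 144
cfg["query_num_list"] = [144]  # assumption: one-element list; use 144 directly if the field is scalar

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)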

Evaluation

Evaluation on General Video Understanding

torchrun --nproc_per_node=8 ./eval/eval_mlvu.py --model_path Hoar012/TDC-Qwen2-7B --model_name cambrian_qwen --version qwen --data_path eval/MLVU

Evaluation on Audio-Visual Comprehension

torchrun --nproc_per_node=8 ./eval/eval_musicQA.py --model_path Hoar012/TDC-Qwen2-7B --model_name cambrian_qwen --version qwen --data_path eval/Music-AVQA --test_file eval/Music-AVQA/avqa-test.json

For more detailed instructions on evaluation, please refer to the evaluation guide.

BibTeX

@misc{hao2025multimodallongvideomodeling,
  title={Multimodal Long Video Modeling Based on Temporal Dynamic Context},
  author={Haoran Hao and Jiaming Han and Yiyuan Zhang and Xiangyu Yue},
  year={2025},
  eprint={2504.10443},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.10443}
}

Acknowledgement

This repository is built upon LLaVA, LongVU, and StoryTeller.
