¹AMD  ²University of Rochester
- [2025.6] Released the Hour-LLaVA training code & Hour-LLaVA models.
- [2025.6] Released the VideoMarathon dataset.
VideoMarathon is a large-scale long-video instruction-following dataset with a total duration of approximately 9,700 hours, comprising 3.3 million QA pairs across 22 task categories.
The dataset spans 22 diverse tasks across six fundamental topics: temporality, spatiality, object, action, scene, and event. These tasks require both short-form (yellow tag) and long-form (red tag) video comprehension.

- Data Source: The dataset spans diverse video source domains.
- Question Type: The dataset features a wide range of question types for long-form video-language modeling.
- Video Duration: The dataset consists of long videos ranging from three minutes to one hour.
- Event Counting: The dataset includes complex video content, as reflected by the number of events per video.
Compared with existing video instruction-following datasets, VideoMarathon features a significantly longer average video length, a broader duration range, and a larger number of QA pairs.

Powered by memory augmentation, we propose Hour-LLaVA, an efficient video-language model capable of modeling hour-long videos at 1 FPS. It comprises three key modules: a video encoder, a memory augmentation module (i.e., MemAug), and an LLM decoder.
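The rough data flow is: frames sampled at 1 FPS are tokenized by the video encoder, MemAug condenses them into a compact memory that the question can attend over, and the LLM decoder generates the answer. The sketch below is purely illustrative; the module internals, names, and dimensions are assumptions rather than the released implementation.

```python
# Illustrative data flow of Hour-LLaVA's three modules (NOT the released code).
# Module internals, names, and dimensions below are assumptions for clarity.
import torch
import torch.nn as nn


class MemAugSketch(nn.Module):
    """Toy memory-augmentation stand-in: compresses per-frame visual tokens into
    a compact memory that question tokens can attend over."""

    def __init__(self, dim: int = 1024, mem_tokens: int = 256, heads: int = 8):
        super().__init__()
        self.mem_queries = nn.Parameter(torch.randn(mem_tokens, dim) * 0.02)
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)  # video -> memory
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)   # question -> memory

    def forward(self, frame_tokens: torch.Tensor, query_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T, D) tokens from the video encoder (frames sampled at 1 FPS)
        # query_tokens: (B, L, D) embedded prompt/question tokens
        mem = self.mem_queries.unsqueeze(0).expand(frame_tokens.size(0), -1, -1)
        mem, _ = self.write(mem, frame_tokens, frame_tokens)  # condense the long video
        ctx, _ = self.read(query_tokens, mem, mem)             # retrieve question-relevant context
        return ctx


# Pipeline sketch: video encoder -> MemAug -> LLM decoder.
frame_tokens = torch.randn(1, 3600, 1024)   # stand-in for 1 h of video at 1 FPS
query_tokens = torch.randn(1, 32, 1024)     # stand-in for embedded prompt tokens
ctx = MemAugSketch()(frame_tokens, query_tokens)
print(ctx.shape)  # (1, 32, 1024) -> passed to the LLM decoder with the prompt
```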

- Python 3.10
- PyTorch == 2.1.2
- torchvision == 0.16.1
- transformers == 4.46.0
- flash-attention == 2.6.3
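A quick way to sanity-check an environment against these pins (the distribution names, e.g. `flash-attn` for flash-attention, are assumptions; adjust to your setup):

```python
# Check that installed packages match the versions pinned above.
# Distribution names (e.g. "flash-attn") are assumptions.
from importlib.metadata import version, PackageNotFoundError

expected = {
    "torch": "2.1.2",
    "torchvision": "0.16.1",
    "transformers": "4.46.0",
    "flash-attn": "2.6.3",
}

for name, want in expected.items():
    try:
        got = version(name)
    except PackageNotFoundError:
        got = "not installed"
    flag = "OK" if got == want else "MISMATCH"
    print(f"{name:<14} expected {want:<8} found {got:<14} {flag}")
```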
- Clone this repository:

  ```bash
  git clone https://github.com/jylins/hourllava.git
  cd hourllava
  ```

- Create a conda virtual environment and activate it:

  ```bash
  conda create -n hourllava python=3.10 -y
  conda activate hourllava
  ```

- Install the training package of Hour-LLaVA:

  ```bash
  pip install --upgrade pip
  pip install -e ".[train]"
  ```

- Install other requirements:

  ```bash
  git clone -b v0.11.1 https://github.com/jylins/streaming
  cd streaming
  pip install -e .
  ```
Due to copyright restrictions, we are unable to release the raw videos directly. Please follow the instructions below to download the videos from their respective sources.
For raw videos used in the VideoMarathon dataset:
- Panda-70M: Please download the videos using the URLs provided in the VideoMarathon dataset.
- Ego4D: Download the videos directly from the official website: [link].
- MovieChat-1K: Please download the videos from the official HuggingFace repository: Enxin/MovieChat-1K_train.
- ActivityNet: Please download the videos from this HuggingFace repository: lmms-lab/ActivityNetQA.
- YouCook2: Please download the videos from this HuggingFace repository: lmms-lab/YouCook2.
In addition, please download LLaVA-Video-178K and LLaVA-OneVision by following the official instructions provided in their respective HuggingFace repositories.
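For the HuggingFace-hosted sources listed above, one convenient option is `huggingface_hub.snapshot_download`; the repo IDs come from the list above, while the local target directories are illustrative only:

```python
# Example: pull the HuggingFace-hosted video sources.
# Repo IDs are from the list above; local_dir paths are illustrative choices,
# and some repos ship videos as archives that may still need extracting.
from huggingface_hub import snapshot_download

datasets = {
    "Enxin/MovieChat-1K_train": "data/VideoMarathon/videos/moviechat",
    "lmms-lab/ActivityNetQA": "data/VideoMarathon/videos/activitynet",
    "lmms-lab/YouCook2": "data/VideoMarathon/videos/youcook2",
}

for repo_id, local_dir in datasets.items():
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
```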
For training efficiency, up to 5 QA pairs from the same VideoMarathon video are grouped as a multi-turn conversation per training sample. Please download the processed instruction-following data from videomarathon.json.
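The released videomarathon.json is already grouped this way. For reference, the grouping amounts to something like the sketch below; the LLaVA-style `conversations` schema shown is an assumption, not necessarily the file's exact layout:

```python
# Sketch of merging up to 5 QA pairs per video into one multi-turn sample.
# The "conversations" schema and the <video> placeholder token are assumptions.
import json
from collections import defaultdict

MAX_TURNS = 5  # QA pairs per training sample, as described above


def group_qa_pairs(qa_pairs):
    """qa_pairs: list of dicts like {"video": path, "question": q, "answer": a}."""
    by_video = defaultdict(list)
    for qa in qa_pairs:
        by_video[qa["video"]].append(qa)

    samples = []
    for video, pairs in by_video.items():
        for start in range(0, len(pairs), MAX_TURNS):
            chunk = pairs[start:start + MAX_TURNS]
            conversations = []
            for i, qa in enumerate(chunk):
                prefix = "<video>\n" if i == 0 else ""  # video token on the first turn only
                conversations.append({"from": "human", "value": prefix + qa["question"]})
                conversations.append({"from": "gpt", "value": qa["answer"]})
            samples.append({"video": video, "conversations": conversations})
    return samples


demo = [{"video": "v1.mp4", "question": f"Q{i}?", "answer": f"A{i}."} for i in range(7)]
print(json.dumps(group_qa_pairs(demo), indent=2))  # 7 QA pairs -> 2 samples (5 + 2 turns)
```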
To accelerate the training process, we recommend pre-extracting video features from the VideoMarathon dataset.
For Hour-LLaVA-3B, run:

```bash
bash scripts/embedding/hourllava_qwen25_3b_emb.sh
```

For Hour-LLaVA-7B, run:

```bash
bash scripts/embedding/hourllava_qwen2_7b_emb.sh
```
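Conceptually, feature pre-extraction samples frames at about 1 FPS, runs them through the vision encoder, and caches the resulting tokens next to the videos. The snippet below is only a sketch: the SigLIP checkpoint stands in for the model's actual vision encoder, and the `.pt` output format is an assumption, not what the scripts above necessarily produce.

```python
# Sketch only: cache vision features for one video at ~1 FPS.
# The SigLIP checkpoint and .pt output format are assumptions.
import torch
from torchvision.io import read_video
from transformers import AutoImageProcessor, SiglipVisionModel

CKPT = "google/siglip-so400m-patch14-384"  # assumed stand-in vision encoder

processor = AutoImageProcessor.from_pretrained(CKPT)
encoder = SiglipVisionModel.from_pretrained(CKPT).eval()


@torch.no_grad()
def extract_features(video_path: str, out_path: str, fps: int = 1):
    # Note: read_video decodes the whole file; real scripts would stream
    # hour-long videos in chunks instead of loading them into memory at once.
    frames, _, info = read_video(video_path, pts_unit="sec", output_format="THWC")
    stride = max(1, round(info["video_fps"] / fps))       # keep ~1 frame per second
    frames = frames[::stride]                             # (T, H, W, C) uint8
    feats = []
    for chunk in frames.split(32):                        # encode in small batches
        inputs = processor(images=list(chunk.numpy()), return_tensors="pt")
        feats.append(encoder(**inputs).last_hidden_state)  # (B, patches, dim)
    torch.save(torch.cat(feats), out_path)                # cached visual tokens


extract_features("data/VideoMarathon/videos/panda/example.mp4",
                 "data/VideoMarathon/features/panda/example.pt")
```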
To enable the streaming data loader, we transform the raw dataset into the format supported by the mosaicml/streaming repo:

- Image-Language Pretraining (Stage 1):

  ```bash
  bash scripts/sharding/scripts/sharding_llava_ov_si_onlyimg.sh
  ```

- Video-Language Adaptation (Stage 2):

  ```bash
  bash scripts/sharding/scripts/sharding_llavavideo_text.sh
  bash scripts/sharding/scripts/sharding_llavavideo_si.sh
  bash scripts/sharding/scripts/sharding_llavavideo_mi.sh
  bash scripts/sharding/scripts/sharding_llavavideo_video.sh
  ```

- Video Instruction Tuning (Stage 3):

  ```bash
  bash scripts/sharding/scripts/sharding_videomarathon_video.sh
  bash scripts/sharding/scripts/merge_hourllava_s3.sh
  ```

After the steps above, the data directory should be organized as follows:

```
data
├── VideoMarathon
│   ├── videos
│   │   ├── activitynet
│   │   ├── ego4d
│   │   ├── moviechat
│   │   ├── panda
│   │   └── youcook2
│   ├── features
│   │   ├── activitynet
│   │   ├── ego4d
│   │   ├── moviechat
│   │   ├── panda
│   │   └── youcook2
│   ├── jsons
│   │   └── videomarathon.json
│   └── sharding
│       ├── llava_ov_si_onlyimg_part0
│       ├── ...
│       ├── llavavideo_text_part0
│       ├── ...
│       ├── llavavideo_si_part0
│       ├── ...
│       ├── llavavideo_mi_part0
│       ├── ...
│       ├── llavavideo_video_part0
│       ├── ...
│       ├── videomarathon_video_part0
│       ├── ...
│       └── hourllava_s3
│           ├── 0
│           └── 1
├── LLaVA-Video-178K
└── LLaVA-OneVision
```
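The `*_part*` and `hourllava_s3` directories above hold MDS shards produced by the sharding scripts. For intuition, writing and reading such shards with mosaicml/streaming looks roughly like this (the column schema is an assumption, not the scripts' exact layout):

```python
# Sketch: write samples into an MDS shard readable by mosaicml/streaming.
# The column schema and JSON fields below are assumptions for illustration.
import json
from streaming import MDSWriter, StreamingDataset

columns = {"id": "str", "video": "str", "conversations": "str"}

with MDSWriter(out="data/VideoMarathon/sharding/example_part0", columns=columns) as writer:
    with open("data/VideoMarathon/jsons/videomarathon.json") as f:
        for sample in json.load(f):
            writer.write({
                "id": str(sample.get("id", "")),
                "video": sample.get("video", ""),
                "conversations": json.dumps(sample["conversations"]),
            })

# The training code can then stream the shards, e.g.:
dataset = StreamingDataset(local="data/VideoMarathon/sharding/example_part0", shuffle=False)
print(len(dataset), dataset[0]["id"])
```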
To train Hour-LLaVA-7B for image-language pretraining, run:

```bash
bash scripts/hourllava/qwen2_7b/s1_image_language_pretraining.sh
```

To train Hour-LLaVA-3B for image-language pretraining, run:

```bash
bash scripts/hourllava/qwen25_3b/s1_image_language_pretraining.sh
```

To train Hour-LLaVA-7B for video-language adaptation, run:

```bash
bash scripts/hourllava/qwen2_7b/s2_video_language_adaptation.sh
```

To train Hour-LLaVA-3B for video-language adaptation, run:

```bash
bash scripts/hourllava/qwen25_3b/s2_video_language_adaptation.sh
```

To train Hour-LLaVA-7B for video instruction tuning, run:

```bash
bash scripts/hourllava/qwen2_7b/s3_video_instruction_tuning.sh
```

To train Hour-LLaVA-3B for video instruction tuning, run:

```bash
bash scripts/hourllava/qwen25_3b/s3_video_instruction_tuning.sh
```

This project builds upon the following open-source frameworks:
- LLaVA-NeXT: Open Large Multimodal Models
- flash-attention: The Official Implementation of FlashAttention and FlashAttention-2.
```bibtex
@article{lin2025unleashing,
  title={Unleashing Hour-Scale Video Training for Long Video-Language Understanding},
  author={Lin, Jingyang and Wu, Jialian and Sun, Ximeng and Wang, Ze and Liu, Jiang and Su, Yusheng and Yu, Xiaodong and Chen, Hao and Luo, Jiebo and Liu, Zicheng and others},
  journal={arXiv preprint arXiv:2506.05332},
  year={2025}
}
```