DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes

Zhende Song · Chenchen Wang · Jiamu Sheng · Chi Zhang · Shengji Tang · Jiayuan Fan✦ · Tao Chen

(✦ Corresponding Author)

From Fudan University and Tencent PCG
Please follow the instructions below to install the required packages. Our training process is mainly based on LLaMA-VID, and our short-video evaluation process is mainly based on the quantitative_evaluation protocol from Video-ChatGPT.
Clone this repository:

```shell
git clone https://github.com/Deaddawn/DreamFrame-code.git
```

Install packages (tested on A100 and RTX 3090 with CUDA 11.8; we recommend sticking to the package versions we provide, as changes in the versions of diffusers and transformers may lead to certain issues):

```shell
conda create -n DreamFrame python=3.10 -y
conda activate DreamFrame
cd DreamFrame
pip install -r requirements.txt
```

The data generation process of DreamFrame consists of three stages: (1) Movie Plot Generation, (2) Style Immobilization, and (3) Video Instruction Data Generation.
We adopt a story-expansion strategy that incrementally generates frame descriptions through three levels, and we provide example prompts for each level. Use any LLM (we use GPT-4) to generate the frame descriptions and organize them into a JSON file like this: story_js
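The three-level expansion can be sketched as follows. This is only an illustrative sketch: `call_llm` is a stub standing in for whichever LLM API you use (the authors use GPT-4), and the prompts and JSON layout here are placeholders, not the repository's actual format.

```python
import json

def call_llm(prompt):
    # Stub for an LLM call; replace with a real API request.
    # Here it just fabricates two child descriptions per input.
    return [f"{prompt} -> detail {i}" for i in range(2)]

def expand_story(synopsis, n_levels=3):
    # Level 1: a movie-level synopsis; each further level expands
    # every description into finer-grained ones, ending with
    # per-frame descriptions.
    level = [synopsis]
    for _ in range(n_levels - 1):
        level = [d for item in level for d in call_llm(item)]
    return level

frame_descriptions = expand_story("A detective chases a thief through Paris")
story = {"story_id": 0, "frames": frame_descriptions}
print(json.dumps(story, indent=2))
```

With the two-way stub above, three levels yield four leaf frame descriptions; a real LLM would of course branch more widely and write richer text.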
Style Immobilization learns a style embedding that can be used to generate style-consistent key frames. Learning the embedding requires a style-related keyword and a set of style-related images. The keyword can be obtained from stage one. For the style-related images, we simply generate them with sdxl-1.0-base from the detailed style description (you can find an example in the prompt we provide).
Here is an example of training a style embedding with the keyword "Dramatic":

```shell
cd StyleImmobilization
python style_embedding.py --style_keyword Dramatic --image_path ./style
```

The learned style embedding will be saved in the "Embeddings" folder. Training should only take 5–10 minutes (tested on an A100).
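Conceptually, style immobilization follows the textual-inversion recipe: only the embedding of the new style token is optimized, while the rest of the model stays frozen. The toy sketch below illustrates just that optimization pattern; the real diffusion loss is replaced by a stand-in quadratic loss so the sketch runs on its own, and all names and numbers are illustrative.

```python
import random

random.seed(0)
dim = 8
# The only trainable parameters: the new style-token embedding.
style_embedding = [random.gauss(0.0, 1.0) for _ in range(dim)]
# Stand-in for the training signal the frozen model would provide.
target = [1.0] * dim

lr = 0.1
for _ in range(200):
    # Gradient of the toy loss sum_i (e_i - t_i)^2 w.r.t. e.
    grads = [2.0 * (e - t) for e, t in zip(style_embedding, target)]
    style_embedding = [e - lr * g for e, g in zip(style_embedding, grads)]

# style_embedding has converged toward `target`; in the real pipeline
# the converged vector is what gets saved to the Embeddings folder.
```

The key design point mirrored here is that freezing everything except one embedding keeps training cheap, which is why learning a style takes only minutes on a single GPU.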
After training a style embedding, you can generate style-consistent keyframes from the aforementioned JSON file like this:
```shell
cd StyleImmobilization
python generate.py --js_path ./json/story_info_0.json --embed_path ./Embeddings/story_0_Dramatic.pt --keyword Dramatic --save_path ./save_path
```

We provide our baseline model and the model trained on our generated dataset. For more detailed information, refer to LLaMA-VID-model. Please follow LLaMA-VID to prepare the necessary settings, and feel free to use our provided checkpoint.
| Type | Max Token | Base LLM | Finetuning Data | Finetuning schedule | Download |
|---|---|---|---|---|---|
| Base Model | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct | full_ft-1e | ckpt |
| DreamFrame-7B | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + DreamFrameQA | full_ft-1e | ckpt |
The data generated by our pipeline consists of key frame images with their corresponding QAs and dialogues. You can download it here: DreamFrame-Data.

We follow MVBench, Video-Bench, and TempCompass to conduct evaluations.
We would like to thank the following repos for their great work:
- Our model is trained based on LLaMA-VID.
- We build our pipeline based on textual-inversion.
