The dataset and related resources are organized as follows:
```
VBench-2.0_human_anomaly_root/
├── dataset/              # Main directory for the image dataset
│   ├── all_images.zip    # Compressed file containing all images
│   ├── face_train.jsonl  # Face training set annotations
│   ├── face_test.jsonl   # Face testing set annotations
│   ├── hand_train.jsonl  # Hand training set annotations
│   ├── hand_test.jsonl   # Hand testing set annotations
│   ├── human_train.jsonl # Human training set annotations
│   └── human_test.jsonl  # Human testing set annotations
├── src_video/            # Raw video data used to construct the dataset
│   ├── CogVideo/         # Videos generated by CogVideoX-1.5
│   ├── CogVideo-1.0/     # Videos generated by CogVideoX-1.0
│   └── vad_real/         # Real-world videos
```
Each JSONL file contains annotations for the corresponding images. Each line in the JSONL file represents a single sample in the following format:

```
["image_filename.jpg", label, score]
```

- `image_filename.jpg`: the base name of the image file.
- `label`: the ground-truth label (e.g., class or pose information).
- `score`: confidence score (unused).
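Each line is a standalone JSON array, so the files can be read with the standard library. A minimal sketch (the helper name `load_annotations` is ours, not part of the repo):

```python
import json

def load_annotations(jsonl_path):
    """Parse an annotation JSONL file into (filename, label, score) tuples."""
    samples = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            filename, label, score = json.loads(line)
            samples.append((filename, label, score))
    return samples
```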
- We first use YOLO-World as the open-vocabulary detector to detect the human in each frame, then detect the face and hand from the cropped human image. The detection threshold is set to 0.1.
- To meet the square input size requirement of SimMIM, we expand the bounding boxes into squares.
- The cropped human, face, and hand images are saved to separate folders for the three anomaly detectors.
- We show the data processing pipeline below.
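The square-expansion step can be sketched as follows: grow the shorter side of the detected box to match the longer one, keeping the box centered, and clip to the image borders. This is an illustrative sketch, not the repo's exact implementation (the helper name `expand_to_square` is ours):

```python
def expand_to_square(x1, y1, x2, y2, img_w, img_h):
    """Expand a bounding box to a square centered on the original box,
    clipped to the image borders (SimMIM expects square inputs)."""
    side = max(x2 - x1, y2 - y1)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    nx1 = int(max(0, cx - side / 2.0))
    ny1 = int(max(0, cy - side / 2.0))
    nx2 = int(min(img_w, nx1 + side))
    ny2 = int(min(img_h, ny1 + side))
    return nx1, ny1, nx2, ny2
```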
- We utilize the pre-trained weights of SimMIM and fine-tune the whole network with an additional anomaly head that consists of an MLP layer.
- For the anomaly score, 0 means the image is normal and 1 means it is abnormal; we train with the binary cross-entropy loss.
- The human, face, and hand anomaly detectors are trained separately; the higher the score, the more likely the image is an anomaly.
- We show the training pipeline below.
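The training objective above is the standard binary cross-entropy between the predicted anomaly score and the 0/1 label. As a dependency-free sketch of the per-sample loss (the actual training code uses the framework's built-in BCE):

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy for one anomaly score in (0, 1).
    target is 0 (normal) or 1 (abnormal)."""
    p = min(max(pred, eps), 1.0 - eps)  # clamp for numerical stability
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))
```

A confident correct prediction yields a small loss, a confident wrong one a large loss, which is exactly what pushes the MLP head toward calibrated anomaly scores.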
- Given a generated video, we do frame-wise anomaly detection.
- First, in each frame we detect and crop the human, face, and hands with YOLO-World, as described in the data pre-processing above.
- The detected crops are then fed into the corresponding anomaly detectors, each with its own anomaly threshold (i.e., 0.45, 0.3, and 0.32 for human, face, and hands, respectively).
- For each human, the score is 0 (abnormal) if any of the three models predicts an anomaly, and 1 (normal) otherwise.
- For each frame, we average the scores of all detected humans (sometimes there is more than one person in the frame).
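The aggregation rule above can be sketched in a few lines. The thresholds come from the text; the function names and the convention for frames with no detected humans are our assumptions, not the repo's exact code:

```python
THRESHOLDS = {"human": 0.45, "face": 0.30, "hand": 0.32}

def human_score(scores):
    """scores: dict mapping detector name -> anomaly score for one person.
    Returns 0 (abnormal) if any detector exceeds its threshold, else 1 (normal)."""
    for name, s in scores.items():
        if s > THRESHOLDS[name]:
            return 0
    return 1

def frame_score(people):
    """Average the per-human scores over all people detected in the frame."""
    if not people:
        return 1.0  # assumption: a frame with no detected humans counts as normal
    return sum(human_score(p) for p in people) / len(people)
```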
- Follow the VBench-2.0 environment set-up.
- Download Pre-trained Models:
  - Download the pre-trained SimMIM model from Google Drive
  - Download the pre-trained YOLO-World model from Google Drive
  - Put the models into `pretrain/`, or use the following commands:
    ```
    gdown https://drive.google.com/uc?id=1dJn6GYkwMIcoP3zqOEyW1_iQfpBi8UOw -O pretrain/
    gdown https://drive.google.com/uc?id=1qo-K1kum7yiEwIlN1TWDvABXX6qriUen -O pretrain/
    ```
- Download Dataset:
  - Download the dataset files from Google Drive or via:
    ```
    git clone https://huggingface.co/datasets/Vchitect/VBench-2.0_human_anomaly
    ```
  - If the large files do not download, use the following commands (you need to configure Git LFS in advance; otherwise, we suggest using the Google Drive link above):
    ```
    cd "VBench-2.0_human_anomaly"
    git lfs install
    git lfs pull
    ```
- Extract Images:
  - Unzip the `opensource.zip` file:
    ```
    cd "VBench-2.0_human_anomaly"
    zip -s 0 --out merged.zip "opensource.zip"
    unzip merged.zip
    mv opensource/* ./
    rm -rf opensource
    rm merged.zip
    rm opensource.*
    ```
- Run Training:
  - Execute the training code (we take face-detector training as an example; for the human and hand detectors, change the corresponding names):
    ```
    torchrun --master_port 15690 main_finetune.py \
        --cfg 'configs/vit_base__800ep/simmim_finetune__vit_base__img224__800ep.yaml' \
        --train-path "VBench-2.0_human_anomaly/dataset/face_train.jsonl" \
        --val-path "VBench-2.0_human_anomaly/dataset/face_test.jsonl" \
        --pretrained 'pretrain/simmim_pretrain__vit_base__img224__800ep.pth' \
        --batch-size 128 \
        --output "checkpoint/face"
    ```
- Run Inference:
  - Note that you need to train all three detectors before running the inference code (by default, we train for 30 epochs):
    ```
    python inference.py \
        --cfg 'configs/vit_base__800ep/simmim_finetune__vit_base__img224__800ep.yaml' \
        --detector_config 'third_party/YOLO-World/yolo_world_v2_xl_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py' \
        --detector_weights 'pretrain/yolo_world_v2_xl_obj365v1_goldg_cc3mlite_pretrain-5daf1395.pth' \
        --human_model 'checkpoint/human/ckpt29.pth' \
        --face_model 'checkpoint/face/ckpt29.pth' \
        --hand_model 'checkpoint/hand/ckpt29.pth'
    ```
- Ensure that all paths in the JSONL files correctly point to the extracted images.
- Modify the configuration file (`config.yaml` or similar) if you wish to customize training parameters.
- We show some results of our anomaly detector below.
If you find our repo useful for your research, please consider citing our paper:
@article{zheng2025vbench2,
title={{VBench-2.0}: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness},
author={Zheng, Dian and Huang, Ziqi and Liu, Hongbo and Zou, Kai and He, Yinan and Zhang, Fan and Zhang, Yuanhan and He, Jingwen and Zheng, Wei-Shi and Qiao, Yu and Liu, Ziwei},
journal={arXiv preprint arXiv:2503.21755},
year={2025}
}
@InProceedings{huang2023vbench,
title={{VBench}: Comprehensive Benchmark Suite for Video Generative Models},
author={Huang, Ziqi and He, Yinan and Yu, Jiashuo and Zhang, Fan and Si, Chenyang and Jiang, Yuming and Zhang, Yuanhan and Wu, Tianxing and Jin, Qingyang and Chanpaisit, Nattapol and Wang, Yaohui and Chen, Xinyuan and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}
@article{huang2025vbench++,
title={{VBench++}: Comprehensive and Versatile Benchmark Suite for Video Generative Models},
author={Huang, Ziqi and Zhang, Fan and Xu, Xiaojie and He, Yinan and Yu, Jiashuo and Dong, Ziyue and Ma, Qianli and Chanpaisit, Nattapol and Si, Chenyang and Jiang, Yuming and Wang, Yaohui and Chen, Xinyuan and Chen, Ying-Cong and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
doi={10.1109/TPAMI.2025.3633890}
}
VBench-2.0 is currently maintained by Dian Zheng, Ziqi Huang and Kai Zou.
This project wouldn't be possible without the following open-sourced repositories: YOLO-World, SimMIM
This project is released under the MIT License. You are free to use, modify, and distribute the dataset and code for academic and commercial purposes, provided you include proper attribution.




