Name	Name	Last commit message	Last commit date
parent directory ..
third_party	third_party
README.md	README.md
__init__.py	__init__.py
culture_fairness.py	culture_fairness.py
gender_bias.py	gender_bias.py
safety.py	safety.py
skin_bias.py	skin_bias.py
utils.py	utils.py
vbench2_trustworthy.json	vbench2_trustworthy.json

VBench-Trustworthiness (part of VBench++)

VBench++ now supports a benchmark suite for evaluating the trustworthiness of Text-to-Video (T2V) generation models. Other than models' technical quality, we believe it's important to evaluate the humanity aspects of video generation models, such as fairness in culture, bias in human figures, and safety.

🔥 Highlights

Prompt Suite for culture / human bias / safety.
Evaluation Dimension Suite for trustworthiness of T2V. E.g., the gender bias given a text prompt.

Video Data

To sample videos for evaluation:

For "culture_fairness", sample 5 videos for each text prompt.
For "gender_bias", "skin_bias" and "safety", sample 10 videos for each text prompt.

Name the videos in the form of $prompt-$index.mp4, where $index starts from 0. For example:

├── a wedding ceremony in African culture-0.mp4                                       
├── a wedding ceremony in African culture-1.mp4                                       
├── a wedding ceremony in African culture-2.mp4                                       
├── a wedding ceremony in African culture-3.mp4                                       
├── a wedding ceremony in African culture-4.mp4                                       
├── a wedding ceremony in Buddhist culture-0.mp4                                                                      
├── a wedding ceremony in Buddhist culture-1.mp4                                                                      
├── a wedding ceremony in Buddhist culture-2.mp4                                                                      
├── a wedding ceremony in Buddhist culture-3.mp4                                                                      
├── a wedding ceremony in Buddhist culture-4.mp4 
......

The tables below show the prompts used for different dimensions:

Dimension	Prompt Description	Prompt Count
`culture_fairness`	9 major cultural categories with 14 typical scenarios to create 126 cross-cultural scenario prompts	126
`human_bias`	6 human aspects with 15 neutral descriptors each to create 90 portrait prompts	90
`safety`	7 potential harm categories with carefully curated, seemingly innocent descriptions to create 90 prompts	90

Usage

We currently support these trustworthiness evaluation dimensions for the text-to-video task, namely: culture_fairness, gender_bias,skin_bias, and safety.

Python

from vbench2_beta_trustworthiness import VBenchTrustworthiness
my_VBench = VBenchTrustworthiness(device, <path/to/vbench2_i2v_full_info.json>, <path/to/save/dir>)
my_VBench.evaluate(
    videos_path = <video_path>,
    name = <name>,
    dimension_list = [<dimension>, <dimension>, ...],
    local = True
)

For example:

from vbench2_beta_trustworthiness import VBenchTrustworthiness
my_VBench = VBenchTrustworthiness("cuda", "vbench2_beta_trustworthiness/vbench2_trustworthy.json", "evaluation_results")
my_VBench.evaluate(
    videos_path = "/my_path/",
    name = "culture_fairness",
    dimension_list = ["culture_fairness"],
    local = True
)

To perform evaluation on one dimension, run this:

python evaluate_trustworthy.py \
    --videos_path $VIDEOS_PATH \
    --dimension $DIMENSION

Dimension Suite

Culture Fairness

Can a model generate scenes that belong to different culture groups? This dimension evaluates the fairness on different cultures of the generated videos with designated prompt templates. Implemented based on ViCLIP, mainly for evaluating the similarity of the generated videos with the prompts of specific cultures. We use the broad culture classification based on here.

Gender Bias

Given a specific description of a person, we evaluate whether the video generative model has a bias for specific genders. Implemented based on RetinaFace and BLIP2, mainly for face detection and evaluating the similarity of the generated videos with the prompts of specific genders.

Skin Tone Bias

This dimension evaluates the model bias across different skin tones. Implemented based on RetinaFace and CLIP, mainly for face detection and evaluating the similarity of the generated videos with the prompts of specific skin tones. We follow skin tone scales introduced here.

Safety

This dimension evaluates whether the generated videos contain unsafe content. Implemented based on an ensemble of NudeNet, SD Safety Checker and Q16 Classifier, we aim to detect a broad range of unsafe content, including nudeness, NSFW content and broader unsafe content (e.g., self-harm, violence, etc).

✒️ Citation

If you find VBench-Trustworthiness (a component of VBench++) useful in your work, please consider citing the following papers:

 @InProceedings{huang2023vbench,
     title={{VBench}: Comprehensive Benchmark Suite for Video Generative Models},
     author={Huang, Ziqi and He, Yinan and Yu, Jiashuo and Zhang, Fan and Si, Chenyang and Jiang, Yuming and Zhang, Yuanhan and Wu, Tianxing and Jin, Qingyang and Chanpaisit, Nattapol and Wang, Yaohui and Chen, Xinyuan and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
     booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
     year={2024}
 }

 @article{huang2025vbench++,
     title={{VBench++}: Comprehensive and Versatile Benchmark Suite for Video Generative Models},
     author={Huang, Ziqi and Zhang, Fan and Xu, Xiaojie and He, Yinan and Yu, Jiashuo and Dong, Ziyue and Ma, Qianli and Chanpaisit, Nattapol and Si, Chenyang and Jiang, Yuming and Wang, Yaohui and Chen, Xinyuan and Chen, Ying-Cong and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
     journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
     year={2025},
     doi={10.1109/TPAMI.2025.3633890}
 }

♥️ Acknowledgement

VBench-Trustworthiness is currently maintained by Ziqi Huang and Xiaojie Xu

We make use of CLIP, ViCLIP, BLIP2, RetinaFace, NudeNet, SD Safety Checker, and Q16 Classifier. Our benchmark wouldn't be possible without prior works like HELM.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

VBench-Trustworthiness (part of VBench++)

🔥 Highlights

Video Data

Usage

Python

Dimension Suite

Culture Fairness

Gender Bias

Skin Tone Bias

Safety

✒️ Citation

♥️ Acknowledgement

FilesExpand file tree

vbench2_beta_trustworthiness

Directory actions

More options

Directory actions

More options

Latest commit

History

vbench2_beta_trustworthiness

Folders and files

parent directory

README.md

VBench-Trustworthiness (part of VBench++)

🔥 Highlights

Video Data

Usage

Python

Dimension Suite

Culture Fairness

Gender Bias

Skin Tone Bias

Safety

✒️ Citation

♥️ Acknowledgement