🎉 Accepted to NeurIPS'24 Datasets and Benchmarks Track 🎉
We propose a comprehensive and extensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with non-trivial definitions. We also propose a novel annotation framework that formulates annotation as a multi-agent Visual Question Answering (VQA) task: 3 different VLMs "debate" about whether a given image/video is harmful, with an in-context learning strategy incorporated into the debating process.
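The debate-then-vote idea can be sketched in plain Python. This is only an illustrative skeleton, not the actual framework (which builds on Autogen with GPT-4V agents): the `Annotator` class and its `argue` method are hypothetical stand-ins for real VLM calls, and the stub below always argues a fixed stance instead of reasoning over the image.

```python
# Minimal sketch of a debate-style annotation loop. `Annotator.argue` is a
# hypothetical stand-in for a VLM call; a real agent would be prompted with
# the image, the in-context examples, and the debate transcript so far.
from dataclasses import dataclass


@dataclass
class Annotator:
    name: str
    stance: bool  # the verdict this stub always argues for

    def argue(self, image_path, transcript):
        return {
            "agent": self.name,
            "harmful": self.stance,
            "reason": f"{self.name} saw {len(transcript)} prior arguments",
        }


def debate(image_path, annotators, rounds=2):
    """Run `rounds` of debate, then settle by majority vote on the last round."""
    transcript = []
    for _ in range(rounds):
        for annotator in annotators:
            transcript.append(annotator.argue(image_path, transcript))
    final_votes = [turn["harmful"] for turn in transcript[-len(annotators):]]
    return sum(final_votes) > len(final_votes) // 2, transcript


verdict, log = debate(
    "example.jpg",
    [Annotator("affirmative", True),
     Annotator("negative", False),
     Annotator("judge", True)],
)
```

With two debaters and a judge as above, the majority of the final round decides the label; the transcript is kept so later rounds can react to earlier arguments.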
The entire dataset is publicly available here.
Under the shared folder, there are:
```
dataset_10000_1000
|-- croissant-vhd11k.json         # metadata of VHD11K
|-- harmful_image_10000_ann.json  # annotation file of harmful images of VHD11K
|                                 #   (image name, harmful type, arguments, ...)
|-- harmful_images_10000.zip      # 10,000 harmful images of VHD11K
|-- image_urls.csv                # URLs of images of VHD11K
|-- harmful_video_1000_ann.json   # annotation file of harmful videos of VHD11K
|                                 #   (video name, harmful type, arguments, ...)
|-- harmful_videos_1000.zip       # 1,000 harmful videos of VHD11K
|-- video_urls.csv                # URLs of videos of VHD11K
|-- ICL_samples.zip               # in-context learning samples used by the annotators
    |-- ICL_images                # in-context learning images
    |-- ICL_videos_frames         # frames of each in-context learning video
```
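A quick way to inspect the annotation files after unzipping is sketched below. The exact top-level layout and field names are assumptions inferred from the comments above (image name, harmful type, arguments, ...); check the keys of your copy before relying on them.

```python
# Peek at an annotation file. The schema is NOT guaranteed — this only
# handles the two common layouts (a list of entries, or a dict keyed by
# image/video name) and prints a sample entry for inspection.
import json


def summarize(path):
    with open(path) as f:
        ann = json.load(f)
    entries = ann if isinstance(ann, list) else list(ann.values())
    print(f"{len(entries)} annotations; first entry:")
    print(json.dumps(entries[0], indent=2)[:500])
    return entries
```

Running `summarize("harmful_image_10000_ann.json")` prints the entry count and a truncated sample record, which is usually enough to confirm the field names.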
Evaluation and experimental results demonstrate that:
- the annotations produced by our novel annotation framework align closely with human annotations, ensuring the reliability of VHD11K;
- our full-spectrum harmful dataset exposes the inability of existing harmful content detection methods to detect the full range of harmful content, and improves the performance of existing harmfulness recognition methods;
- our dataset outperforms the baseline dataset, SMID, as evidenced by the larger improvement it brings to harmfulness recognition methods.
```shell
git clone https://github.com/thisismingggg/autogen.git
cd autogen
conda env create -f pyautogen_env.yaml
```
The debating structure requires access to OpenAI VLMs. Please create a config file under `autogen/` with the following information, replacing the placeholder with your OpenAI API key:
```json
{
    "model": "gpt-4-vision-preview",
    "api_key": "<your OpenAI API key>"
}
```

Please refer to the file `OAI_CONFIG_LIST_sample` and the manual from Autogen for further details.
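One way to create this file from the command line is sketched below. The filename `OAI_CONFIG_LIST` is an assumption based on the repo's `OAI_CONFIG_LIST_sample`, and `sk-REPLACE_ME` is a placeholder you must substitute with a real key.

```shell
# Write the Autogen config file; filename and key below are illustrative.
cat > OAI_CONFIG_LIST <<'EOF'
{
    "model": "gpt-4-vision-preview",
    "api_key": "sk-REPLACE_ME"
}
EOF
```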
Download ICL_samples.zip from here.
These checkpoints are only used for reproducing the benchmarking results in the paper; they are not necessary for annotating images and videos.
| Model | Checkpoint | Model | Checkpoint |
|---|---|---|---|
| Q16 | Pretrained checkpoint from GitHub | InstructBLIP | Vicuna-7B (v1.3) from GitHub |
| HOD | Pretrained checkpoint from GitHub | CogVLM | cogvlm-chat-v1.1 from HF |
| NudeNet | Pretrained checkpoint from GitHub | GPT-4V | API service from OpenAI |
| Hive AI | API service from Hive AI Visual Moderation | LLaVA-NeXT | LLaVA-NeXT-8b for images & LLaVA-NeXT-Video-7B-DPO for videos |
- Modify the necessary arguments in `annotator/scripts/annotateImage.sh`, i.e. `config`, `imageRoot`, `path2ImageICL`, `path2AnnFile`, `path2LogFile`.
- Put the ICL samples into the `path2ImageICL` directory.
- Run the following code:

```shell
cd annotator
sh scripts/annotateImage.sh
```
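For reference, the variables in `annotateImage.sh` might be set along these lines. Every value below is a hypothetical placeholder for your local setup, not the script's defaults.

```shell
# Hypothetical example values — adapt every path to your machine.
config=OAI_CONFIG_LIST                       # the OpenAI config file described earlier
imageRoot=/data/VHD11K/harmful_images_10000  # unzipped images to annotate
path2ImageICL=/data/VHD11K/ICL_images        # unzipped in-context learning samples
path2AnnFile=./harmful_image_10000_ann.json  # where annotations are written
path2LogFile=./annotateImage.log             # where the debate log is written
```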
- Modify the necessary arguments in `annotator/scripts/annotateVideo.sh`, i.e. `config`, `frameRoot`, `videoRoot`, `path2VideoICL`, `path2AnnFile`, `path2LogFile`.
- Put the ICL samples into the `path2VideoICL` directory.
- Run the following code:

```shell
cd annotator
sh scripts/annotateVideo.sh
```
```bibtex
@inproceedings{yeh2024t2vs,
  author    = {Chen Yeh and You-Ming Chang and Wei-Chen Chiu and Ning Yu},
  booktitle = {Advances in Neural Information Processing Systems},
  title     = {T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition},
  year      = {2024}
}
```
This project is built upon the giant shoulders of Autogen. Great thanks to them!


