🎉 Accepted to NeurIPS'24 Datasets and Benchmarks Track 🎉
We propose a comprehensive and extensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with non-trivial definitions. We also propose a novel annotation framework that formulates annotation as a multi-agent Visual Question Answering (VQA) task: 3 different VLMs "debate" about whether a given image/video is harmful, with an in-context learning strategy incorporated into the debating process.
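The debate-then-vote idea can be sketched in plain Python. This is only an illustrative skeleton, not the actual framework (which builds on Autogen with GPT-4V agents): the `Annotator` class and its `argue` method are hypothetical stand-ins for real VLM calls, and the stub below always argues a fixed stance instead of reasoning over the image.

```python
# Minimal sketch of a debate-style annotation loop. `Annotator.argue` is a
# hypothetical stand-in for a VLM call; a real agent would be prompted with
# the image, the in-context examples, and the debate transcript so far.
from dataclasses import dataclass


@dataclass
class Annotator:
    name: str
    stance: bool  # the verdict this stub always argues for

    def argue(self, image_path, transcript):
        return {
            "agent": self.name,
            "harmful": self.stance,
            "reason": f"{self.name} saw {len(transcript)} prior arguments",
        }


def debate(image_path, annotators, rounds=2):
    """Run `rounds` of debate, then settle by majority vote on the last round."""
    transcript = []
    for _ in range(rounds):
        for annotator in annotators:
            transcript.append(annotator.argue(image_path, transcript))
    final_votes = [turn["harmful"] for turn in transcript[-len(annotators):]]
    return sum(final_votes) > len(final_votes) // 2, transcript


verdict, log = debate(
    "example.jpg",
    [Annotator("affirmative", True),
     Annotator("negative", False),
     Annotator("judge", True)],
)
```

With two debaters and a judge as above, the majority of the final round decides the label; the transcript is kept so later rounds can react to earlier arguments.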
The entire dataset is publicly available here.
Under the shared folder, there are:
```
dataset_10000_1000
|-- croissant-vhd11k.json         # metadata of VHD11K
|-- harmful_image_10000_ann.json  # annotation file of harmful images of VHD11K
|                                 #   (image name, harmful type, arguments, ...)
|-- harmful_images_10000.zip      # 10,000 harmful images of VHD11K
|-- image_urls.csv                # URLs of images of VHD11K
|-- harmful_video_1000_ann.json   # annotation file of harmful videos of VHD11K
|                                 #   (video name, harmful type, arguments, ...)
|-- harmful_videos_1000.zip       # 1,000 harmful videos of VHD11K
|-- video_urls.csv                # URLs of videos of VHD11K
|-- ICL_samples.zip               # in-context learning samples used by the annotators
    |-- ICL_images                # in-context learning images
    |-- ICL_videos_frames         # frames of each in-context learning video
```
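A quick way to inspect the annotation files after unzipping is sketched below. The exact top-level layout and field names are assumptions inferred from the comments above (image name, harmful type, arguments, ...); check the keys of your copy before relying on them.

```python
# Peek at an annotation file. The schema is NOT guaranteed — this only
# handles the two common layouts (a list of entries, or a dict keyed by
# image/video name) and prints a sample entry for inspection.
import json


def summarize(path):
    with open(path) as f:
        ann = json.load(f)
    entries = ann if isinstance(ann, list) else list(ann.values())
    print(f"{len(entries)} annotations; first entry:")
    print(json.dumps(entries[0], indent=2)[:500])
    return entries
```

Running `summarize("harmful_image_10000_ann.json")` prints the entry count and a truncated sample record, which is usually enough to confirm the field names.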
Evaluation and experimental results demonstrate that:
- the annotations produced by our novel annotation framework align closely with human annotations, ensuring the reliability of VHD11K;
- our full-spectrum harmful dataset exposes the inability of existing harmful content detection methods to detect the full range of harmful content, and improves the performance of existing harmfulness recognition methods;
- our dataset outperforms the baseline dataset, SMID, as evidenced by the larger improvement it brings to harmfulness recognition methods.
```shell
git clone https://github.com/thisismingggg/autogen.git
cd autogen
conda env create -f pyautogen_env.yaml
```
The debating structure requires access to OpenAI VLMs. Please create a config file under `autogen/` with the following information, replacing the placeholder with your OpenAI API key:
```json
{
    "model": "gpt-4-vision-preview",
    "api_key": "<your OpenAI API key>"
}
```

Please refer to the file `OAI_CONFIG_LIST_sample` and the manual from Autogen for further details.
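One way to create this file from the command line is sketched below. The filename `OAI_CONFIG_LIST` is an assumption based on the repo's `OAI_CONFIG_LIST_sample`, and `sk-REPLACE_ME` is a placeholder you must substitute with a real key.

```shell
# Write the Autogen config file; filename and key below are illustrative.
cat > OAI_CONFIG_LIST <<'EOF'
{
    "model": "gpt-4-vision-preview",
    "api_key": "sk-REPLACE_ME"
}
EOF
```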
Download ICL_samples.zip from here.
These checkpoints are only used for reproducing the benchmarking results in the paper; they are not necessary for annotating images and videos.
| Model | Checkpoint | Model | Checkpoint |
|---|---|---|---|
| Q16 | Pretrained checkpoint from GitHub | InstructBLIP | Vicuna-7B (v1.3) from GitHub |
| HOD | Pretrained checkpoint from GitHub | CogVLM | cogvlm-chat-v1.1 from HF |
| NudeNet | Pretrained checkpoint from GitHub | GPT-4V | API service from OpenAI |
| Hive AI | API service from Hive AI Visual Moderation | LLaVA-NeXT | LLaVA-NeXT-8b for images & LLaVA-NeXT-Video-7B-DPO for videos |
- Modify the necessary arguments in `annotator/scripts/annotateImage.sh`, i.e. `config`, `imageRoot`, `path2ImageICL`, `path2AnnFile`, `path2LogFile`.
- Put the ICL samples into the `path2ImageICL` directory.
- Run the following code:

```shell
cd annotator
sh scripts/annotateImage.sh
```
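For reference, the variables in `annotateImage.sh` might be set along these lines. Every value below is a hypothetical placeholder for your local setup, not the script's defaults.

```shell
# Hypothetical example values — adapt every path to your machine.
config=OAI_CONFIG_LIST                       # the OpenAI config file described earlier
imageRoot=/data/VHD11K/harmful_images_10000  # unzipped images to annotate
path2ImageICL=/data/VHD11K/ICL_images        # unzipped in-context learning samples
path2AnnFile=./harmful_image_10000_ann.json  # where annotations are written
path2LogFile=./annotateImage.log             # where the debate log is written
```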
- Modify the necessary arguments in `annotator/scripts/annotateVideo.sh`, i.e. `config`, `frameRoot`, `videoRoot`, `path2VideoICL`, `path2AnnFile`, `path2LogFile`.
- Put the ICL samples into the `path2VideoICL` directory.
- Run the following code:

```shell
cd annotator
sh scripts/annotateVideo.sh
```
```bibtex
@inproceedings{yeh2024t2vs,
  author    = {Chen Yeh and You-Ming Chang and Wei-Chen Chiu and Ning Yu},
  booktitle = {Advances in Neural Information Processing Systems},
  title     = {T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition},
  year      = {2024}
}
```
This project is built upon the giant shoulders of Autogen. Great thanks to them!


