T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition


Chen Yeh*, You-Ming Chang*, Wei-Chen Chiu, Ning Yu

🎉 Accepted to NeurIPS'24 Datasets and Benchmarks Track 🎉


💡 Overview


We propose a comprehensive and extensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with non-trivial definitions. We also propose a novel annotation framework that formulates the annotation process as a multi-agent Visual Question Answering (VQA) task: 3 different VLMs "debate" whether the given image/video is harmful, with an in-context learning strategy incorporated into the debating process.
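The debate-style annotation flow above can be sketched as follows. This is a minimal, self-contained illustration: the agent behavior is a stub, and the class names, round count, and transcript format are illustrative assumptions, not the exact implementation (the real agents are VLM-backed via AutoGen).

```python
# Minimal sketch of the multi-agent "debate" annotation loop.
# Debater, debate(), and the transcript format are illustrative
# assumptions; the actual framework uses VLM-backed AutoGen agents.
from dataclasses import dataclass

@dataclass
class Debater:
    name: str
    stance: str  # e.g. "harmful" or "harmless"

    def argue(self, item, transcript, icl_examples):
        # A real implementation would query a VLM with the image/video,
        # the transcript so far, and the in-context learning examples.
        return f"{self.name} argues the content is {self.stance}"

def debate(item, debaters, rounds=2):
    """Collect each debater's argument over a fixed number of rounds."""
    transcript = []
    for _ in range(rounds):
        for d in debaters:
            transcript.append((d.name, d.argue(item, transcript, [])))
    # A third VLM (the judge) would read the transcript and decide.
    return transcript

transcript = debate("image_001.jpg",
                    [Debater("Affirmative", "harmful"),
                     Debater("Negative", "harmless")])
```

In the actual framework, a third VLM consumes the transcript and produces the final harmful/harmless label together with the winning arguments.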

📚 VHD11K: Our Proposed Multimodal Dataset for Visual Harmfulness Recognition

The entire dataset is publicly available here.

Under the shared folder, there are:

dataset_10000_1000
|--croissant-vhd11k.json            # metadata of VHD11K
|--harmful_image_10000_ann.json     # annotation file of harmful images of VHD11K
                                      (image name, harmful type, arguments, ...)
|--harmful_images_10000.zip         # 10000 harmful images of VHD11K
|--image_urls.csv                   # URLs of images of VHD11K
|--harmful_video_1000_ann.json      # annotation file of harmful videos of VHD11K
                                      (video name, harmful type, arguments, ...)
|--harmful_videos_1000.zip          # 1000 harmful videos of VHD11K
|--video_urls.csv                   # URLs of videos of VHD11K
|--ICL_samples.zip                  # in-context learning samples used by the annotators
    |--ICL_images                   # in-context learning images
    |--ICL_videos_frames            # frames of each in-context learning video
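A typical first step is grouping the annotated images by harmful category. The sketch below uses an inline sample record; the field names ("image", "harmfulType", "arguments") are illustrative guesses based on the description above — check harmful_image_10000_ann.json for the actual schema.

```python
# Sketch of consuming the image annotation file. The field names are
# assumptions; inspect harmful_image_10000_ann.json for the real schema.
import json

# Inline stand-in for json.load(open("harmful_image_10000_ann.json")).
sample = json.loads("""
[
  {"image": "000001.jpg",
   "harmfulType": "violence",
   "arguments": ["depicts physical assault"]}
]
""")

harmful_by_type = {}
for record in sample:
    harmful_by_type.setdefault(record["harmfulType"], []).append(record["image"])
```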

📊 Evaluation


Evaluation and experimental results demonstrate that:

  1. annotations produced by our novel annotation framework align closely with human annotations, ensuring the reliability of VHD11K;
  2. our full-spectrum harmful dataset exposes the inability of existing harmful content detection methods to detect the full range of harmful content, and improves the performance of existing harmfulness recognition methods;
  3. our dataset outperforms the baseline dataset, SMID, as evidenced by the larger improvement it brings to harmfulness recognition methods.

🚧 Prerequisites

Environment Installation

git clone https://github.com/thisismingggg/autogen.git
cd autogen
conda env create -f pyautogen_env.yaml

Add in OpenAI API key

The debating structure requires access to OpenAI VLMs. Please create a config file under autogen/ with the following content, replacing the placeholder with your OpenAI API key:

{
    "model": "gpt-4-vision-preview",
    "api_key": "<your OpenAI API key>"
}

Please refer to the file OAI_CONFIG_LIST_sample and the AutoGen manual for further details.

Download In-context Learning Samples

Download ICL_samples.zip from here.

Baseline checkpoints

These checkpoints are only used for reproducing the benchmarking results in the paper; they are not necessary for annotating images and videos.

Model           Checkpoint
Q16             Pretrained checkpoint from GitHub
HOD             Pretrained checkpoint from GitHub
NudeNet         Pretrained checkpoint from GitHub
Hive AI         API service from Hive AI Visual Moderation
InstructBLIP    Vicuna-7B (v1.3) from GitHub
CogVLM          cogvlm-chat-v1.1 from HF
GPT-4V          API service from OpenAI
LLaVA-NeXT      LLaVA-NeXT-8b for images & LLaVA-NeXT-Video-7B-DPO for videos

✏️ Annotate images or videos

Annotate images

  1. Modify the necessary arguments in annotator/scripts/annotateImage.sh
    • i.e. config, imageRoot, path2ImageICL, path2AnnFile, path2LogFile
  2. Put the ICL samples into the path2ImageICL directory.
  3. Run the following code:
cd annotator
sh scripts/annotateImage.sh

Annotate videos

  1. Modify the necessary arguments in annotator/scripts/annotateVideo.sh
    • i.e. config, frameRoot, videoRoot, path2VideoICL, path2AnnFile, path2LogFile
  2. Put the ICL samples into the path2VideoICL directory.
  3. Run the following code:
cd annotator
sh scripts/annotateVideo.sh

✒️ Citation

@inproceedings{yeh2024t2vs,
 author={Chen Yeh and You-Ming Chang and Wei-Chen Chiu and Ning Yu},
 booktitle = {Advances in Neural Information Processing Systems},
 title={T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition},
 year = {2024}
}

🙌 Acknowledgement

This project is built upon the giant shoulders of AutoGen. Great thanks to them!

About

Official implementation of "T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition".

License

CC-BY-4.0 (see LICENSE) and MIT (see LICENSE-CODE).