
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

UnifiedReward Team

Paper PDF · Project Page · Hugging Face Spaces

🔥 News

😊 We are actively gathering feedback from the community to improve our benchmark. We welcome your input and encourage you to watch this repository for updates!

📝 To add your own model to the leaderboard, please send an email to Yibin Wang; we will help with the evaluation and update the leaderboard.

Please leave us a star ⭐ if you find our benchmark helpful.

  • [2025/9] 🔥🔥 Lumina-DiMOO, OmniGen2, Infinity, X-Omni, OneCAT, Echo-4o, and MMaDA are added to all 🏅Leaderboards.

  • [2025/9] 🔥🔥 Seedream-4.0, Nano Banana, GPT-4o, Qwen-Image, and FLUX-Kontext-[Max/Pro] are added to all 🏅Leaderboards.

  • [2025/9] 🔥🔥 We release UniGenBench 🏅Leaderboard (Chinese), 🏅Leaderboard (English Long) and 🏅Leaderboard (Chinese Long). We will continue to update them regularly. The test prompts are provided in ./data.

  • [2025/9] 🔥🔥 We release all images generated by the T2I models evaluated in UniGenBench on UniGenBench-Eval-Images. Feel free to use whichever evaluation model is most convenient for assessing and comparing your models.

  • [2025/8] 🔥🔥 We release paper, project page, and UniGenBench 🏅Leaderboard (English).

Introduction

We propose UniGenBench, a unified and versatile benchmark for image generation that integrates diverse prompt themes with a comprehensive suite of fine-grained evaluation criteria.


✨ Highlights:

  • Comprehensive and Fine-grained Evaluation: covering 10 primary dimensions and 27 sub-dimensions, enabling systematic and fine-grained assessment of diverse model capabilities.

  • Rich Prompt Theme Coverage: organized into 5 primary themes and 20 sub-themes, comprehensively spanning both realistic and imaginative generation scenarios.

  • Efficient yet Comprehensive: unlike other benchmarks, UniGenBench requires only 600 prompts, with each prompt targeting 1–10 specific testpoints, ensuring both coverage and efficiency.

  • Streamlined MLLM Evaluation: each testpoint of a prompt is accompanied by a detailed description explaining how the testpoint is reflected in the prompt, helping the MLLM conduct precise evaluations.

  • Bilingual and Length-variant Prompt Support: providing both English and Chinese test prompts in short and long forms, together with evaluation pipelines for both languages, thus enabling fair and broad cross-lingual benchmarking.

  • Reliable Evaluation Model for Offline Assessment: To facilitate community use, we train a robust evaluation model that supports offline assessment of T2I model outputs.


📑 Prompt Introduction

Each prompt in our benchmark is recorded as a row in a .csv file, together with structured annotations for evaluation.

  • index: The prompt ID (matches the promptID used in generated image filenames)
  • prompt: The full English prompt to be tested
  • sub_dims: A JSON-encoded field that organizes rich metadata, including:
    • Primary / Secondary Categories – prompt theme (e.g., Creative Divergence → Imaginative Thinking)
    • Subjects – the main entities involved in the prompt (e.g., Animal)
    • Sentence Structure – the linguistic form of the prompt (e.g., Descriptive)
    • Testpoints – key aspects to evaluate (e.g., Style, World Knowledge, Attribute - Quantity)
    • Testpoint Description – evaluation cues extracted from the prompt (e.g., classical ink painting, Egyptian pyramids, two pandas)
Category      | File                          | Description
English Short | data/test_prompts_en.csv      | 600 short English prompts
English Long  | data/test_prompts_en_long.csv | Long-form English prompts
Chinese Short | data/test_prompts_zh.csv      | 600 short Chinese prompts
Chinese Long  | data/test_prompts_zh_long.csv | Long-form Chinese prompts
Training      | data/train_prompt.txt         | Training prompts
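
For reference, a minimal Python sketch for loading a prompt file and decoding its sub_dims metadata (the key names inside sub_dims follow the field list above and may differ slightly in the released files):

import csv
import json

# Minimal sketch: iterate the English short prompts and inspect their
# structured annotations. Key names inside sub_dims are assumptions
# based on the field list above.
with open("data/test_prompts_en.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        meta = json.loads(row["sub_dims"])  # JSON-encoded metadata
        print(row["index"], row["prompt"][:60])
        print("  testpoints:", meta.get("Testpoints"))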

🚀 Inference

We provide reference code for multi-node inference based on FLUX.1-dev.

# English Prompt
bash inference/flux_en_dist_infer.sh

# Chinese Prompt
bash inference/flux_zh_dist_infer.sh

For each test prompt, 4 images are generated and stored in the following folder structure:

output_directory/
  ├── 0_0.png
  ├── 0_1.png
  ├── 0_2.png
  ├── 0_3.png
  ├── 1_0.png
  ├── 1_1.png
  ...

The file naming follows the pattern {promptID}_{imageID}.png.
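
For a single-GPU setup, the generate-and-save loop looks roughly as follows. This is a sketch using diffusers' FluxPipeline, not the repo's script: the provided scripts additionally shard prompts across nodes, and sampling parameters here are left at their defaults.

import csv
import os
import torch
from diffusers import FluxPipeline

# Single-GPU sketch of the inference loop; the repo's distributed
# scripts shard this work across nodes.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

os.makedirs("output_directory", exist_ok=True)
with open("data/test_prompts_en.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for image_id in range(4):  # 4 images per prompt
            image = pipe(row["prompt"]).images[0]
            image.save(f"output_directory/{row['index']}_{image_id}.png")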

📂 Expected Image Directory Structure

The evaluation scripts expect generated images organized as follows:

eval_data/
  ├── en/
  │   └── FLUX.1-dev/          # --model name
  │       ├── 0_0.png
  │       ├── 0_1.png
  │       ├── ...
  │       └── 599_3.png
  ├── en_long/
  │   └── FLUX.1-dev/
  ├── zh/
  │   └── FLUX.1-dev/
  └── zh_long/
      └── FLUX.1-dev/

File naming: {promptID}_{imageID}.png (4 images per prompt by default).

You can customize the base directory via --eval_data_dir, images per prompt via --images_per_prompt, and file extension via --image_suffix.
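
Before launching an evaluation, it can help to confirm the folder is complete. Below is a hypothetical checker (not part of the repo) whose defaults mirror the flags described above:

import os

# Hypothetical helper (not part of the repo): verify that a model's
# folder contains every expected {promptID}_{imageID} file before
# starting evaluation.
def check_images(eval_data_dir="eval_data", category="en",
                 model="FLUX.1-dev", num_prompts=600,
                 images_per_prompt=4, image_suffix=".png"):
    folder = os.path.join(eval_data_dir, category, model)
    missing = [
        f"{p}_{i}{image_suffix}"
        for p in range(num_prompts)
        for i in range(images_per_prompt)
        if not os.path.exists(os.path.join(folder, f"{p}_{i}{image_suffix}"))
    ]
    print(f"{folder}: {len(missing)} missing file(s)")
    return missing

check_images()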

✨ Evaluation with Gemini 2.5 Pro

We use gemini-2.5-pro (GA, June 17, 2025) via an OpenAI-compatible API.
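
For orientation, a single judging request through such an endpoint looks roughly like this. This is a sketch using the openai Python SDK and the GEMINI_API_KEY / GEMINI_BASE_URL variables set in the next step; the actual judge prompts and score parsing live in the eval scripts, and the question text here is illustrative.

import base64
import os
from openai import OpenAI

# Illustrative only: one OpenAI-compatible call to gemini-2.5-pro.
# The real judge prompts and score parsing are in the eval scripts.
client = OpenAI(api_key=os.environ["GEMINI_API_KEY"],
                base_url=os.environ["GEMINI_BASE_URL"])

with open("eval_data/en/FLUX.1-dev/0_0.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": [
        {"type": "text",
         "text": "Is the testpoint 'two pandas' satisfied? Answer yes/no."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]}],
)
print(resp.choices[0].message.content)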

1. Evaluation

# Set API credentials (or pass via --api_key / --base_url)
export GEMINI_API_KEY="sk-xxxxxxx"
export GEMINI_BASE_URL="https://..."

# Evaluate English & Chinese short prompts
bash eval/eval_gemini.sh --model FLUX.1-dev --categories en zh

# Evaluate all categories (en, en_long, zh, zh_long)
bash eval/eval_gemini.sh --model FLUX.1-dev --categories all

# Resume from previous progress
bash eval/eval_gemini.sh --model FLUX.1-dev --categories en --resume

Available categories: en (English short), en_long (English long), zh (Chinese short), zh_long (Chinese long), all.

Run bash eval/eval_gemini.sh -h for all options (--num_processes, --images_per_prompt, etc.).

2. Output

After evaluation, for each category:

  • Scores across all dimensions are printed to the console
  • A detailed CSV results file is saved: ./results/{model}_{category}.csv
  • A JSON score summary is saved: ./results/{model}_{category}.json
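
To inspect results programmatically, a minimal sketch (the exact key layout inside the JSON summary is an assumption):

import json

# Minimal sketch: load a per-category score summary and list its
# entries. The exact key layout of the JSON file is an assumption.
with open("./results/FLUX.1-dev_en.json", encoding="utf-8") as f:
    summary = json.load(f)
for dimension, score in summary.items():
    print(f"{dimension}: {score}")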

3. Re-calculate Scores

python eval/src/calculate_score.py --result_csv ./results/FLUX.1-dev_en.csv --json_path ./results/FLUX.1-dev_en.json

✨ Evaluation with UniGenBench-EvalModel

1. Deploy vLLM Server

Install dependencies:

pip install "vllm>=0.11.0" qwen-vl-utils==0.0.14

Start server:

# UniGenBench-EvalModel-qwen-72b-v1
vllm serve CodeGoat24/UniGenBench-EvalModel-qwen-72b-v1 \
    --host localhost --port 8080 \
    --served-model-name QwenVL \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --limit-mm-per-prompt.image 2

# UniGenBench-EvalModel-qwen3vl-32b-v1 (recommended, supports 8 GPUs)
vllm serve CodeGoat24/UniGenBench-EvalModel-qwen3vl-32b-v1 \
    --host localhost --port 8080 \
    --served-model-name QwenVL \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --limit-mm-per-prompt.image 2
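
Since vLLM exposes an OpenAI-compatible API, the judge model started above can also be queried directly, e.g. as a quick smoke test. This is a sketch with an illustrative question; the real prompt templates live in the eval scripts.

import base64
from openai import OpenAI

# Quick smoke test against the vLLM server started above.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8080/v1")

with open("eval_data/en/FLUX.1-dev/0_0.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="QwenVL",  # matches --served-model-name
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Does this image satisfy the prompt?"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]}],
)
print(resp.choices[0].message.content)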

2. Evaluation

# Evaluate English & Chinese short prompts
bash eval/eval_vllm.sh --model FLUX.1-dev --categories en zh

# Evaluate all categories
bash eval/eval_vllm.sh --model FLUX.1-dev --categories all

# Custom server URL and resume
bash eval/eval_vllm.sh --model FLUX.1-dev --categories en_long zh_long \
    --api_url http://gpu-server:8080 --resume

Run bash eval/eval_vllm.sh -h for all options.

3. Output

Same as the Gemini evaluation: results are saved to ./results/{model}_{category}.csv and ./results/{model}_{category}.json.

4. Re-calculate Scores

python eval/src/calculate_score.py --result_csv ./results/FLUX.1-dev_en.csv --json_path ./results/FLUX.1-dev_en.json

📧 Contact

If you have any comments or questions, please open an issue or contact Yibin Wang.

⭐ Citation

@article{UniGenBench++,
  title={UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Bu, Jiazi and Zhou, Yujie and Xin, Yi and He, Junjun and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and others},
  journal={arXiv preprint arXiv:2510.18701},
  year={2025}
}

@article{Pref-GRPO&UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}

🏅 Evaluation Leaderboards

English Short Prompt Evaluation


English Long Prompt Evaluation


Chinese Short Prompt Evaluation


Chinese Long Prompt Evaluation

