
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

UnifiedReward Team

Paper PDF · Project Page · Hugging Face Spaces

🔥 News

😊 We are actively gathering feedback from the community to improve our benchmark. We welcome your input and encourage you to watch this repository for updates!

📝 To add your own model to the leaderboard, please send an email to Yibin Wang; we will help with the evaluation and update the leaderboard.

Please leave us a star ⭐ if you find our benchmark helpful.

  • [2025/9] 🔥🔥 Lumina-DiMOO, OmniGen2, Infinity, X-Omni, OneCAT, Echo-4o, and MMaDA are added to all 🏅Leaderboards.

  • [2025/9] 🔥🔥 Seedream-4.0, Nano Banana, GPT-4o, Qwen-Image, and FLUX-Kontext-[Max/Pro] are added to all 🏅Leaderboards.

  • [2025/9] 🔥🔥 We release UniGenBench 🏅Leaderboard (Chinese), 🏅Leaderboard (English Long) and 🏅Leaderboard (Chinese Long). We will continue to update them regularly. The test prompts are provided in ./data.

  • [2025/9] 🔥🔥 We release all images generated by the T2I models evaluated in UniGenBench on UniGenBench-Eval-Images. Feel free to use whichever evaluation model is most convenient for assessing and comparing your models.

  • [2025/8] 🔥🔥 We release paper, project page, and UniGenBench 🏅Leaderboard (English).

Introduction

We propose UniGenBench, a unified and versatile benchmark for image generation that integrates diverse prompt themes with a comprehensive suite of fine-grained evaluation criteria.


✨ Highlights:

  • Comprehensive and Fine-grained Evaluation: covering 10 primary dimensions and 27 sub-dimensions, enabling systematic and fine-grained assessment of diverse model capabilities.

  • Rich Prompt Theme Coverage: organized into 5 primary themes and 20 sub-themes, comprehensively spanning both realistic and imaginative generation scenarios.

  • Efficient yet Comprehensive: unlike other benchmarks, UniGenBench requires only 600 prompts, with each prompt targeting 1–10 specific testpoints, ensuring both coverage and efficiency.

  • Streamlined MLLM Evaluation: each testpoint of a prompt is accompanied by a detailed description explaining how the testpoint is reflected in the prompt, helping the MLLM conduct precise evaluations.

  • Bilingual and Length-variant Prompt Support: providing both English and Chinese test prompts in short and long forms, together with evaluation pipelines for both languages, thus enabling fair and broad cross-lingual benchmarking.

  • Reliable Evaluation Model for Offline Assessment: To facilitate community use, we train a robust evaluation model that supports offline assessment of T2I model outputs.


📑 Prompt Introduction

Each prompt in our benchmark is recorded as a row in a .csv file, together with structured annotations for evaluation.

  • index: The prompt ID (matches the promptID used in generated image filenames)
  • prompt: The full English prompt to be tested
  • sub_dims: A JSON-encoded field that organizes rich metadata, including:
    • Primary / Secondary Categories – prompt theme (e.g., Creative Divergence → Imaginative Thinking)
    • Subjects – the main entities involved in the prompt (e.g., Animal)
    • Sentence Structure – the linguistic form of the prompt (e.g., Descriptive)
    • Testpoints – key aspects to evaluate (e.g., Style, World Knowledge, Attribute - Quantity)
    • Testpoint Description – evaluation cues extracted from the prompt (e.g., classical ink painting, Egyptian pyramids, two pandas)
Category      | File                          | Description
English Short | data/test_prompts_en.csv      | 600 short English prompts
English Long  | data/test_prompts_en_long.csv | Long-form English prompts
Chinese Short | data/test_prompts_zh.csv      | 600 short Chinese prompts
Chinese Long  | data/test_prompts_zh_long.csv | Long-form Chinese prompts
Training      | data/train_prompt.txt         | Training prompts
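
For reference, a minimal Python sketch for loading a prompt file and decoding its sub_dims metadata (the key names inside sub_dims follow the field list above and may differ slightly in the released files):

import csv
import json

# Minimal sketch: iterate the English short prompts and inspect their
# structured annotations. Key names inside sub_dims are assumptions
# based on the field list above.
with open("data/test_prompts_en.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        meta = json.loads(row["sub_dims"])  # JSON-encoded metadata
        print(row["index"], row["prompt"][:60])
        print("  testpoints:", meta.get("Testpoints"))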

🚀 Inference

We provide reference code for multi-node inference based on FLUX.1-dev.

# English Prompt
bash inference/flux_en_dist_infer.sh

# Chinese Prompt
bash inference/flux_zh_dist_infer.sh

For each test prompt, 4 images are generated and stored in the following folder structure:

output_directory/
  ├── 0_0.png
  ├── 0_1.png
  ├── 0_2.png
  ├── 0_3.png
  ├── 1_0.png
  ├── 1_1.png
  ...

The file naming follows the pattern {promptID}_{imageID}.png.
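
For a single-GPU setup, the generate-and-save loop looks roughly as follows. This is a sketch using diffusers' FluxPipeline, not the repo's script: the provided scripts additionally shard prompts across nodes, and sampling parameters here are left at their defaults.

import csv
import os
import torch
from diffusers import FluxPipeline

# Single-GPU sketch of the inference loop; the repo's distributed
# scripts shard this work across nodes.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

os.makedirs("output_directory", exist_ok=True)
with open("data/test_prompts_en.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for image_id in range(4):  # 4 images per prompt
            image = pipe(row["prompt"]).images[0]
            image.save(f"output_directory/{row['index']}_{image_id}.png")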

📂 Expected Image Directory Structure

The evaluation scripts expect generated images organized as follows:

eval_data/
  ├── en/
  │   └── FLUX.1-dev/          # --model name
  │       ├── 0_0.png
  │       ├── 0_1.png
  │       ├── ...
  │       └── 599_3.png
  ├── en_long/
  │   └── FLUX.1-dev/
  ├── zh/
  │   └── FLUX.1-dev/
  └── zh_long/
      └── FLUX.1-dev/

File naming: {promptID}_{imageID}.png (4 images per prompt by default).

You can customize the base directory via --eval_data_dir, images per prompt via --images_per_prompt, and file extension via --image_suffix.
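
Before launching an evaluation, it can help to confirm the folder is complete. Below is a hypothetical checker (not part of the repo) whose defaults mirror the flags described above:

import os

# Hypothetical helper (not part of the repo): verify that a model's
# folder contains every expected {promptID}_{imageID} file before
# starting evaluation.
def check_images(eval_data_dir="eval_data", category="en",
                 model="FLUX.1-dev", num_prompts=600,
                 images_per_prompt=4, image_suffix=".png"):
    folder = os.path.join(eval_data_dir, category, model)
    missing = [
        f"{p}_{i}{image_suffix}"
        for p in range(num_prompts)
        for i in range(images_per_prompt)
        if not os.path.exists(os.path.join(folder, f"{p}_{i}{image_suffix}"))
    ]
    print(f"{folder}: {len(missing)} missing file(s)")
    return missing

check_images()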

✨ Evaluation with Gemini 2.5 Pro

We use gemini-2.5-pro (GA, June 17, 2025) via an OpenAI-compatible API.
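
For orientation, a single judging request through such an endpoint looks roughly like this. This is a sketch using the openai Python SDK and the GEMINI_API_KEY / GEMINI_BASE_URL variables set in the next step; the actual judge prompts and score parsing live in the eval scripts, and the question text here is illustrative.

import base64
import os
from openai import OpenAI

# Illustrative only: one OpenAI-compatible call to gemini-2.5-pro.
# The real judge prompts and score parsing are in the eval scripts.
client = OpenAI(api_key=os.environ["GEMINI_API_KEY"],
                base_url=os.environ["GEMINI_BASE_URL"])

with open("eval_data/en/FLUX.1-dev/0_0.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": [
        {"type": "text",
         "text": "Is the testpoint 'two pandas' satisfied? Answer yes/no."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]}],
)
print(resp.choices[0].message.content)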

1. Evaluation

# Set API credentials (or pass via --api_key / --base_url)
export GEMINI_API_KEY="sk-xxxxxxx"
export GEMINI_BASE_URL="https://..."

# Evaluate English & Chinese short prompts
bash eval/eval_gemini.sh --model FLUX.1-dev --categories en zh

# Evaluate all categories (en, en_long, zh, zh_long)
bash eval/eval_gemini.sh --model FLUX.1-dev --categories all

# Resume from previous progress
bash eval/eval_gemini.sh --model FLUX.1-dev --categories en --resume

Available categories: en (English short), en_long (English long), zh (Chinese short), zh_long (Chinese long), all.

Run bash eval/eval_gemini.sh -h for all options (--num_processes, --images_per_prompt, etc.).

2. Output

After evaluation, for each category:

  • Scores across all dimensions are printed to the console
  • A detailed CSV results file is saved: ./results/{model}_{category}.csv
  • A JSON score summary is saved: ./results/{model}_{category}.json
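
To inspect results programmatically, a minimal sketch (the exact key layout inside the JSON summary is an assumption):

import json

# Minimal sketch: load a per-category score summary and list its
# entries. The exact key layout of the JSON file is an assumption.
with open("./results/FLUX.1-dev_en.json", encoding="utf-8") as f:
    summary = json.load(f)
for dimension, score in summary.items():
    print(f"{dimension}: {score}")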

3. Re-calculate Scores

python eval/src/calculate_score.py --result_csv ./results/FLUX.1-dev_en.csv --json_path ./results/FLUX.1-dev_en.json

✨ Evaluation with UniGenBench-EvalModel

1. Deploy vLLM Server

Install dependencies:

pip install "vllm>=0.11.0" qwen-vl-utils==0.0.14

Start server:

# UniGenBench-EvalModel-qwen-72b-v1
vllm serve CodeGoat24/UniGenBench-EvalModel-qwen-72b-v1 \
    --host localhost --port 8080 \
    --served-model-name QwenVL \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --limit-mm-per-prompt.image 2

# UniGenBench-EvalModel-qwen3vl-32b-v1 (recommended, supports 8 GPUs)
vllm serve CodeGoat24/UniGenBench-EvalModel-qwen3vl-32b-v1 \
    --host localhost --port 8080 \
    --served-model-name QwenVL \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --limit-mm-per-prompt.image 2
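
Since vLLM exposes an OpenAI-compatible API, the judge model started above can also be queried directly, e.g. as a quick smoke test. This is a sketch with an illustrative question; the real prompt templates live in the eval scripts.

import base64
from openai import OpenAI

# Quick smoke test against the vLLM server started above.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8080/v1")

with open("eval_data/en/FLUX.1-dev/0_0.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="QwenVL",  # matches --served-model-name
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Does this image satisfy the prompt?"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]}],
)
print(resp.choices[0].message.content)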

2. Evaluation

# Evaluate English & Chinese short prompts
bash eval/eval_vllm.sh --model FLUX.1-dev --categories en zh

# Evaluate all categories
bash eval/eval_vllm.sh --model FLUX.1-dev --categories all

# Custom server URL and resume
bash eval/eval_vllm.sh --model FLUX.1-dev --categories en_long zh_long \
    --api_url http://gpu-server:8080 --resume

Run bash eval/eval_vllm.sh -h for all options.

3. Output

Same as the Gemini evaluation: results are saved to ./results/{model}_{category}.csv and ./results/{model}_{category}.json.

4. Re-calculate Scores

python eval/src/calculate_score.py --result_csv ./results/FLUX.1-dev_en.csv --json_path ./results/FLUX.1-dev_en.json

📧 Contact

If you have any comments or questions, please open an issue or contact Yibin Wang.

⭐ Citation

@article{UniGenBench++,
  title={UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Bu, Jiazi and Zhou, Yujie and Xin, Yi and He, Junjun and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and others},
  journal={arXiv preprint arXiv:2510.18701},
  year={2025}
}

@article{Pref-GRPO&UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}

🏅 Evaluation Leaderboards

English Short Prompt Evaluation


English Long Prompt Evaluation


Chinese Short Prompt Evaluation


Chinese Long Prompt Evaluation

