KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities (TMLR 2026)
KITTEN evaluates the ability of text-to-image models to generate real-world entities grounded in knowledge sources.
Contact: Hsin-Ping Huang (hhuang79@ucmerced.edu)
[Paper]
[Project Page]
[Prior Work: OVEN]
The KITTEN benchmark is constructed from real-world entities across eight domains. For each selected entity, we define five evaluation tasks of image-generation prompts incorporating the entity. KITTEN includes a support set of entity images from the knowledge source for evaluating retrieval-augmented models, and an evaluation set for assessing the fidelity of the generated entities.
Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose Kitten, a benchmark for Knowledge-InTegrated image generaTion on real-world ENtities. Using Kitten, we conduct a systematic study of recent text-to-image models, retrieval-augmented models, and unified understanding and generation models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analyses using carefully designed human evaluations, automatic metrics, and MLLMs as judges show that even advanced text-to-image and unified models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entities in creative text prompts.
-
KITTEN focuses on evaluating faithfulness to knowledge-grounded concepts. To ensure diversity, we select entities from specialized domains and construct multiple prompts per domain for evaluation.
-
Our domains are sampled from OVEN-Wiki, a dataset constructed from entities in 14 existing image recognition and visual question answering datasets, with all entity labels grounded in Wikipedia. From the image recognition portion of OVEN-Wiki, we select seven representative datasets: iNaturalist2017, Cars196, Food101, Sports100, Aircraft, Oxford Flowers, and Google Landmarks v2. These correspond to the eight KITTEN domains of plant, insect, vehicle, cuisine, sport, aircraft, flower, and landmark (iNaturalist2017 covers both the plant and insect domains). Human identities are intentionally excluded due to ethical and privacy considerations.
-
For each domain, we select up to 50 entities. If a domain contains more than 50 entities, we first exclude entities with high Wikipedia page-click counts, then randomly sample 50 from the remaining set. This results in 322 entities across eight domains. For each entity, we randomly sample up to 10 images as the support set and up to 5 images as the evaluation set.
-
We apply a general safety filter to remove non-imageable classes, classes containing undesired social bias, and non-entity classes. The KITTEN dataset will be released under the Apache-2.0 license, consistent with OVEN-Wiki.
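The entity and image sampling described above can be sketched as follows. This is a minimal illustration, not the released construction script: the popularity filter (excluding entities with high Wikipedia page-click counts) and the safety filter are omitted, and `all_entities` / `entity_images` are hypothetical inputs.

```python
import random

def sample_domain(all_entities, entity_images, max_entities=50,
                  n_support=10, n_eval=5, seed=0):
    """Sample up to `max_entities` entities per domain, then split each
    entity's images into a support set (up to 10) and a disjoint
    evaluation set (up to 5)."""
    rng = random.Random(seed)
    entities = list(all_entities)
    if len(entities) > max_entities:
        entities = rng.sample(entities, max_entities)
    splits = {}
    for e in entities:
        imgs = list(entity_images[e])
        rng.shuffle(imgs)
        splits[e] = {
            "support": imgs[:n_support],                      # up to 10 images
            "eval": imgs[n_support:n_support + n_eval],       # up to 5, held out
        }
    return splits
```

Because the evaluation images are taken after the support slice of the shuffled list, the two sets are disjoint by construction.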
- Please follow the instructions from OVEN-eval to download the datasets used in this project.
- The required datasets for each domain are listed below:
| Domain | Dataset | Download Link |
|---|---|---|
| Aircraft | Aircraft | Download |
| Vehicle | Cars196 | Download |
| Flower | Oxford Flowers | Download |
| Insect | iNaturalist2017 | Download |
| Plant | iNaturalist2017 | Download |
| Landmark | GLDv2 | Download |
| Cuisine | Food101 | Download |
| Sport | Sports100 | Download |
For each domain in our dataset, we provide two main components: a prompt list and an entity list. Each entity has a set of associated support images, which serve as reference inputs for retrieval-augmented models, and evaluation images, which are used to assess the fidelity of the generated entities.
- Support images: For each entity, we provide up to 10 images as reference inputs to assess retrieval-augmented models. These images capture different appearances of the entity under various conditions.
- Evaluation images: For each entity, we provide up to 5 images for evaluation. These images are held out from the support set and are used to assess the fidelity of the generated entities.
- Evaluation prompts: Each prompt template is combined with each entity name to form an evaluation prompt, ensuring coverage across diverse scenarios.
The following table summarizes the dataset statistics across different domains:
| Domain | Prompts | Entities | Support Images | Eval Images | #(Prompt, Entity) |
|---|---|---|---|---|---|
| Aircraft | 20 (link) | 48 (link) | 480 (link) | 240 (link) | 960 |
| Vehicle | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Flower | 20 (link) | 18 (link) | 180 (link) | 90 (link) | 360 |
| Insect | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Plant | 20 (link) | 48 (link) | 480 (link) | 240 (link) | 960 |
| Landmark | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Cuisine | 20 (link) | 31 (link) | 310 (link) | 155 (link) | 620 |
| Sport | 20 (link) | 27 (link) | 270 (link) | 135 (link) | 540 |
Note: Each entity has up to 10 support images and up to 5 evaluation images. The #(Prompt, Entity) column reports the total number of evaluation prompts generated from all prompt-entity combinations.
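The per-domain counts above are internally consistent: each domain has 10 support and 5 evaluation images per entity, and #(Prompt, Entity) equals prompts × entities. A quick sanity check over the table:

```python
# (prompts, entities, support, eval, prompt_entity) per domain, copied from the table
stats = {
    "Aircraft": (20, 48, 480, 240, 960),
    "Vehicle":  (20, 50, 500, 250, 1000),
    "Flower":   (20, 18, 180, 90, 360),
    "Insect":   (20, 50, 500, 250, 1000),
    "Plant":    (20, 48, 480, 240, 960),
    "Landmark": (20, 50, 500, 250, 1000),
    "Cuisine":  (20, 31, 310, 155, 620),
    "Sport":    (20, 27, 270, 135, 540),
}

for domain, (p, e, s, v, pe) in stats.items():
    assert s == 10 * e, domain   # 10 support images per entity
    assert v == 5 * e, domain    # 5 evaluation images per entity
    assert pe == p * e, domain   # one evaluation prompt per (prompt, entity) pair

total_entities = sum(e for _, e, *_ in stats.values())
total_pairs = sum(pe for *_, pe in stats.values())
print(total_entities, total_pairs)  # 322 6440
```

The totals recover the 322 entities stated above and 6,440 (prompt, entity) pairs overall.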
We categorize evaluation tasks into five types: 1) generating the knowledge entity (Basic), 2) generating the knowledge entity in context (Location), 3) composition of entities (Composition), 4) creation in different styles (Style), and 5) creation in different materials (Material). The distribution is summarized below:
| Evaluation Task | # Prompts | Percentage (%) |
|---|---|---|
| Basic | 295 | 4.58 |
| Location | 1969 | 30.57 |
| Composition | 1467 | 22.78 |
| Style | 1365 | 21.20 |
| Material | 1344 | 20.87 |
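The Percentage column follows directly from the prompt counts, whose sum (6,440) matches the total number of (prompt, entity) pairs in the statistics table:

```python
counts = {"Basic": 295, "Location": 1969, "Composition": 1467,
          "Style": 1365, "Material": 1344}
total = sum(counts.values())  # 6440

for task, n in counts.items():
    print(f"{task}: {100 * n / total:.2f}%")  # matches the Percentage column
```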
- eval_kitten.ipynb evaluates generated images along two dimensions using GPT-based MLLMs via the OpenAI API: entity alignment, which measures how faithfully a generated image represents a reference (support) entity, and prompt alignment, which assesses how well the image captures the details described in the textual prompt. The notebook loads and encodes images, sends them to the OpenAI API, and returns numeric scores (1–5) with explanations.
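The request the notebook sends can be sketched as below. This is an illustrative shape only: the notebook's actual rubric, reference-image handling, and model choice may differ, and the prompt text and `gpt-4o` default here are assumptions. The helper builds a chat-completions payload with a base64-encoded image, the format the OpenAI API accepts for vision inputs.

```python
import base64

def build_entity_alignment_request(image_bytes, entity_name, model="gpt-4o"):
    """Build a chat.completions request asking an MLLM to rate (1-5)
    how faithfully an image depicts the named entity."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"On a scale of 1 to 5, how faithfully does this "
                          f"image depict the entity '{entity_name}'? Answer "
                          f"with a score and a brief explanation.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# Usage (requires the openai package and an API key):
# from openai import OpenAI
# client = OpenAI()
# with open("generated.png", "rb") as f:
#     req = build_entity_alignment_request(f.read(), "Eiffel Tower")
# resp = client.chat.completions.create(**req)
# print(resp.choices[0].message.content)
```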
- load_kitten.ipynb prepares the KITTEN dataset for evaluation by organizing entities, prompts, and associated images. For each entity, it provides a support set of reference images, and an evaluation set to assess generation fidelity. The loader generates all prompt-entity combinations by filling template prompts with each entity name, and links each combination to the corresponding support and evaluation images.
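The prompt-entity expansion the loader performs can be sketched as follows; the `{entity}` placeholder and record layout are assumptions for illustration, and the notebook's actual template format may differ:

```python
def expand_prompts(templates, entities, placeholder="{entity}"):
    """Fill each template prompt with each entity name, yielding one
    evaluation prompt per (template, entity) combination."""
    return [
        {"entity": e, "template": t, "prompt": t.replace(placeholder, e)}
        for t in templates
        for e in entities
    ]

templates = ["A photo of {entity}.", "{entity} made of wood."]
entities = ["Eiffel Tower", "Monarch butterfly"]
pairs = expand_prompts(templates, entities)
print(len(pairs))            # 4 = 2 templates x 2 entities
print(pairs[0]["prompt"])    # A photo of Eiffel Tower.
```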
KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities
Hsin-Ping Huang1,2
Xinyi Wang1
Yonatan Bitton1
Hagai Taitelbaum1
Gaurav Singh Tomar1
Ming-Wei Chang1
Xuhui Jia1
Kelvin C.K. Chan1
Hexiang Hu1
Yu-Chuan Su1
Ming-Hsuan Yang1,2
1Google DeepMind 2University of California, Merced
Transactions on Machine Learning Research (TMLR), 2026
Please cite our paper if you find it useful for your research.
@article{huang_2026_kitten,
title = {{KITTEN}: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities},
author={Huang, Hsin-Ping and Wang, Xinyi and Bitton, Yonatan and Taitelbaum, Hagai and Tomar, Gaurav Singh and Chang, Ming-Wei and Jia, Xuhui and Chan, Kelvin C.K. and Hu, Hexiang and Su, Yu-Chuan and Yang, Ming-Hsuan},
journal={Transactions on Machine Learning Research},
year={2026}
}
