KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities (TMLR 2026)
KITTEN evaluates the ability of text-to-image models to generate real-world entities grounded in knowledge sources.
Contact: Hsin-Ping Huang (hhuang79@ucmerced.edu)
[Paper]
[Project Page]
[Prior Work: OVEN]
The KITTEN benchmark is constructed from real-world entities across eight domains. For each selected entity, we define five evaluation tasks of image-generation prompts incorporating the entity. KITTEN includes a support set of entity images from the knowledge source for evaluating retrieval-augmented models, and an evaluation set for assessing the fidelity of the generated entities.
Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose Kitten, a benchmark for Knowledge-InTegrated image generaTion on real-world ENtities. Using Kitten, we conduct a systematic study of recent text-to-image models, retrieval-augmented models, and unified understanding and generation models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analyses using carefully designed human evaluations, automatic metrics, and MLLMs as judges show that even advanced text-to-image and unified models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entities in creative text prompts.
-
KITTEN focuses on evaluating faithfulness to knowledge-grounded concepts. To ensure diversity, we select entities from specialized domains and construct multiple prompts per domain for evaluation.
-
Our domains are sampled from OVEN-Wiki, a dataset constructed from entities in 14 existing image recognition and visual question answering datasets, with all entity labels grounded in Wikipedia. From the image recognition portion of OVEN-Wiki, we select seven representative datasets: iNaturalist2017, Cars196, Food101, Sports100, Aircraft, Oxford Flowers, and Google Landmarks v2. These correspond to the eight KITTEN domains of plant, insect, vehicle, cuisine, sport, aircraft, flower, and landmark (iNaturalist2017 covers both the plant and insect domains). Human identities are intentionally excluded due to ethical and privacy considerations.
-
For each domain, we select up to 50 entities. If a domain contains more than 50 entities, we first exclude entities with high Wikipedia page-click counts, then randomly sample 50 from the remaining set. This results in 322 entities across eight domains. For each entity, we randomly sample up to 10 images as the support set and up to 5 images as the evaluation set.
-
We apply a general safety filter to remove non-imageable classes, classes containing undesired social bias, and non-entity classes. The KITTEN dataset will be released under the Apache-2.0 license, consistent with OVEN-Wiki.
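The entity and image sampling described above can be sketched as follows. This is a minimal illustration, not the released construction script: the popularity filter (excluding entities with high Wikipedia page-click counts) and the safety filter are omitted, and `all_entities` / `entity_images` are hypothetical inputs.

```python
import random

def sample_domain(all_entities, entity_images, max_entities=50,
                  n_support=10, n_eval=5, seed=0):
    """Sample up to `max_entities` entities per domain, then split each
    entity's images into a support set (up to 10) and a disjoint
    evaluation set (up to 5)."""
    rng = random.Random(seed)
    entities = list(all_entities)
    if len(entities) > max_entities:
        entities = rng.sample(entities, max_entities)
    splits = {}
    for e in entities:
        imgs = list(entity_images[e])
        rng.shuffle(imgs)
        splits[e] = {
            "support": imgs[:n_support],                      # up to 10 images
            "eval": imgs[n_support:n_support + n_eval],       # up to 5, held out
        }
    return splits
```

Because the evaluation images are taken after the support slice of the shuffled list, the two sets are disjoint by construction.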
- Please follow the instructions from OVEN-eval to download the datasets used in this project.
- The required datasets for each domain are listed below:
| Domain | Dataset | Download Link |
|---|---|---|
| Aircraft | Aircraft | Download |
| Vehicle | Cars196 | Download |
| Flower | Oxford Flowers | Download |
| Insect | iNaturalist2017 | Download |
| Plant | iNaturalist2017 | Download |
| Landmark | GLDv2 | Download |
| Cuisine | Food101 | Download |
| Sport | Sports100 | Download |
For each domain in our dataset, we provide two main components: a prompt list and an entity list. Each entity has a set of associated support images, which serve as reference inputs for retrieval-augmented models, and evaluation images, which are used to assess the fidelity of the generated entities.
- Support images: For each entity, we provide up to 10 images as reference inputs to assess retrieval-augmented models. These images capture different appearances of the entity under various conditions.
- Evaluation images: For each entity, we provide up to 5 images for evaluation. These images are held out from the support set and are used to assess the fidelity of the generated entities.
- Evaluation prompts: Each prompt template is combined with each entity name to form an evaluation prompt, ensuring coverage across diverse scenarios.
The following table summarizes the dataset statistics across different domains:
| Domain | Prompts | Entities | Support Images | Eval Images | #(Prompt, Entity) |
|---|---|---|---|---|---|
| Aircraft | 20 (link) | 48 (link) | 480 (link) | 240 (link) | 960 |
| Vehicle | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Flower | 20 (link) | 18 (link) | 180 (link) | 90 (link) | 360 |
| Insect | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Plant | 20 (link) | 48 (link) | 480 (link) | 240 (link) | 960 |
| Landmark | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Cuisine | 20 (link) | 31 (link) | 310 (link) | 155 (link) | 620 |
| Sport | 20 (link) | 27 (link) | 270 (link) | 135 (link) | 540 |
Note: Each entity has up to 10 support images and up to 5 evaluation images. The #(Prompt, Entity) column reports the total number of evaluation prompts generated from all prompt-entity combinations.
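The per-domain counts above are internally consistent: each domain has 10 support and 5 evaluation images per entity, and #(Prompt, Entity) equals prompts × entities. A quick sanity check over the table:

```python
# (prompts, entities, support, eval, prompt_entity) per domain, copied from the table
stats = {
    "Aircraft": (20, 48, 480, 240, 960),
    "Vehicle":  (20, 50, 500, 250, 1000),
    "Flower":   (20, 18, 180, 90, 360),
    "Insect":   (20, 50, 500, 250, 1000),
    "Plant":    (20, 48, 480, 240, 960),
    "Landmark": (20, 50, 500, 250, 1000),
    "Cuisine":  (20, 31, 310, 155, 620),
    "Sport":    (20, 27, 270, 135, 540),
}

for domain, (p, e, s, v, pe) in stats.items():
    assert s == 10 * e, domain   # 10 support images per entity
    assert v == 5 * e, domain    # 5 evaluation images per entity
    assert pe == p * e, domain   # one evaluation prompt per (prompt, entity) pair

total_entities = sum(e for _, e, *_ in stats.values())
total_pairs = sum(pe for *_, pe in stats.values())
print(total_entities, total_pairs)  # 322 6440
```

The totals recover the 322 entities stated above and 6,440 (prompt, entity) pairs overall.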
We categorize evaluation tasks into five types: 1) generating the knowledge entity (Basic), 2) generating the knowledge entity in context (Location), 3) composition of entities (Composition), 4) creation in different styles (Style), and 5) creation in different materials (Material). The distribution is summarized below:
| Evaluation Task | # Prompts | Percentage (%) |
|---|---|---|
| Basic | 295 | 4.58 |
| Location | 1969 | 30.57 |
| Composition | 1467 | 22.78 |
| Style | 1365 | 21.20 |
| Material | 1344 | 20.87 |
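The Percentage column follows directly from the prompt counts, whose sum (6,440) matches the total number of (prompt, entity) pairs in the statistics table:

```python
counts = {"Basic": 295, "Location": 1969, "Composition": 1467,
          "Style": 1365, "Material": 1344}
total = sum(counts.values())  # 6440

for task, n in counts.items():
    print(f"{task}: {100 * n / total:.2f}%")  # matches the Percentage column
```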
- eval_kitten.ipynb evaluates generated images along two dimensions using GPT-based MLLMs via the OpenAI API: entity alignment, which measures how faithfully a generated image represents a reference (support) entity, and prompt alignment, which assesses how well the image captures the details described in the textual prompt. The notebook loads and encodes images, sends them to the OpenAI API, and returns numeric scores (1–5) with explanations.
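The request the notebook sends can be sketched as below. This is an illustrative shape only: the notebook's actual rubric, reference-image handling, and model choice may differ, and the prompt text and `gpt-4o` default here are assumptions. The helper builds a chat-completions payload with a base64-encoded image, the format the OpenAI API accepts for vision inputs.

```python
import base64

def build_entity_alignment_request(image_bytes, entity_name, model="gpt-4o"):
    """Build a chat.completions request asking an MLLM to rate (1-5)
    how faithfully an image depicts the named entity."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"On a scale of 1 to 5, how faithfully does this "
                          f"image depict the entity '{entity_name}'? Answer "
                          f"with a score and a brief explanation.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# Usage (requires the openai package and an API key):
# from openai import OpenAI
# client = OpenAI()
# with open("generated.png", "rb") as f:
#     req = build_entity_alignment_request(f.read(), "Eiffel Tower")
# resp = client.chat.completions.create(**req)
# print(resp.choices[0].message.content)
```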
- load_kitten.ipynb prepares the KITTEN dataset for evaluation by organizing entities, prompts, and associated images. For each entity, it provides a support set of reference images, and an evaluation set to assess generation fidelity. The loader generates all prompt-entity combinations by filling template prompts with each entity name, and links each combination to the corresponding support and evaluation images.
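The prompt-entity expansion the loader performs can be sketched as follows; the `{entity}` placeholder and record layout are assumptions for illustration, and the notebook's actual template format may differ:

```python
def expand_prompts(templates, entities, placeholder="{entity}"):
    """Fill each template prompt with each entity name, yielding one
    evaluation prompt per (template, entity) combination."""
    return [
        {"entity": e, "template": t, "prompt": t.replace(placeholder, e)}
        for t in templates
        for e in entities
    ]

templates = ["A photo of {entity}.", "{entity} made of wood."]
entities = ["Eiffel Tower", "Monarch butterfly"]
pairs = expand_prompts(templates, entities)
print(len(pairs))            # 4 = 2 templates x 2 entities
print(pairs[0]["prompt"])    # A photo of Eiffel Tower.
```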
KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities
Hsin-Ping Huang1,2
Xinyi Wang1
Yonatan Bitton1
Hagai Taitelbaum1
Gaurav Singh Tomar1
Ming-Wei Chang1
Xuhui Jia1
Kelvin C.K. Chan1
Hexiang Hu1
Yu-Chuan Su1
Ming-Hsuan Yang1,2
1Google DeepMind 2University of California, Merced
Transactions on Machine Learning Research (TMLR), 2026
Please cite our paper if you find it useful for your research.
@article{huang_2026_kitten,
title = {{KITTEN}: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities},
author={Huang, Hsin-Ping and Wang, Xinyi and Bitton, Yonatan and Taitelbaum, Hagai and Tomar, Gaurav Singh and Chang, Ming-Wei and Jia, Xuhui and Chan, Kelvin C.K. and Hu, Hexiang and Su, Yu-Chuan and Yang, Ming-Hsuan},
journal={Transactions on Machine Learning Research},
year={2026}
}
