
KITTEN Benchmark

KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities (TMLR 2026)

KITTEN evaluates the ability of text-to-image models to generate real-world entities grounded in knowledge sources.

Contact: Hsin-Ping Huang (hhuang79@ucmerced.edu)
[Paper] [Project Page] [Prior Work: OVEN]


KITTEN benchmark is constructed from real-world entities across eight domains. For each selected entity, we define five evaluation tasks of image-generation prompts incorporating the entity. KITTEN includes a support set of entity images from the knowledge source for evaluating retrieval-augmented models, and an evaluation set for assessing the fidelity of the generated entities.

Introduction

Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose Kitten, a benchmark for Knowledge-InTegrated image generaTion on real-world ENtities. Using Kitten, we conduct a systematic study of recent text-to-image models, retrieval-augmented models, and unified understanding and generation models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analyses using carefully designed human evaluations, automatic metrics, and MLLMs as judges show that even advanced text-to-image and unified models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entities in creative text prompts.

Dataset Documentation

  • KITTEN focuses on evaluating faithfulness to knowledge-grounded concepts. To ensure diversity, we select entities from specialized domains and construct multiple prompts per domain for evaluation.

  • Our domains are sampled from OVEN-Wiki, a dataset constructed from entities in 14 existing image recognition and visual question answering datasets, with all entity labels grounded in Wikipedia. From the image recognition portion of OVEN-Wiki, we select seven representative datasets: iNaturalist2017, Cars196, Food101, Sports100, Aircraft, Oxford Flowers, and Google Landmarks v2. These correspond to the eight KITTEN domains: plant and insect (both from iNaturalist2017), vehicle, cuisine, sport, aircraft, flower, and landmark. Human identities are intentionally excluded due to ethical and privacy considerations.

  • For each domain, we select up to 50 entities. If a domain contains more than 50 entities, we first exclude entities with high Wikipedia page-click counts, then randomly sample 50 from the remaining set. This results in 322 entities across eight domains. For each entity, we randomly sample up to 10 images as the support set and up to 5 images as the evaluation set.

  • We apply a general safety filter to remove non-imageable classes, classes containing undesired social bias, and non-entity classes. The KITTEN dataset will be released under the Apache-2.0 license, consistent with OVEN-Wiki.
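The per-domain sampling step above can be sketched as follows. This is an illustrative sketch, not the benchmark's actual implementation: the helper name, the `click_threshold` cutoff, and the fixed seed are all assumptions.

```python
import random

def sample_entities(entities, click_counts, max_entities=50,
                    click_threshold=10_000, seed=0):
    """Sketch of the entity-sampling step: if a domain has more than
    `max_entities` entities, first exclude the most-clicked ones, then
    randomly sample from the remainder. `click_threshold` is a
    hypothetical cutoff for "high Wikipedia page-click count"."""
    if len(entities) <= max_entities:
        return list(entities)
    # Exclude entities with high Wikipedia page-click counts.
    remaining = [e for e in entities
                 if click_counts.get(e, 0) < click_threshold]
    # Randomly sample 50 from the remaining set (seeded for repeatability).
    rng = random.Random(seed)
    return rng.sample(remaining, max_entities)
```

Domains with 50 or fewer entities are kept whole, which is why some rows in the statistics table below have fewer than 50 entities.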

Dataset Download

  • Please follow the instructions from OVEN-eval to download the datasets used in this project.
  • The required datasets for each domain are listed below:
| Domain | Dataset | Download Link |
| --- | --- | --- |
| Aircraft | Aircraft | Download |
| Vehicle | Cars196 | Download |
| Flower | Oxford Flowers | Download |
| Insect | iNaturalist2017 | Download |
| Plant | iNaturalist2017 | Download |
| Landmark | GLDv2 | Download |
| Cuisine | Food101 | Download |
| Sport | Sports100 | Download |

Dataset Structure

For each domain in our dataset, we provide two main components: a prompt list and an entity list. Each entity has a set of associated support images and evaluation images, which serve as reference inputs for retrieval-augmented models and as ground truth for evaluation, respectively.

  • Support images: For each entity, we provide up to 10 images as reference inputs to assess retrieval-augmented models. These images capture different appearances of the entity under various conditions.
  • Evaluation images: For each entity, we provide up to 5 images for evaluation. These images are separate from the support set and are used to assess the model’s ability to generalize.
  • Evaluation prompts: Each (prompt, entity) pair is used to construct a corresponding evaluation prompt, ensuring coverage across diverse scenarios.
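The prompt-construction step can be sketched as below. The `{entity}` placeholder name and the example templates are assumptions for illustration; the dataset's own template files may use different wording.

```python
def build_eval_prompts(templates, entities):
    """Fill each template prompt with each entity name, yielding all
    (prompt, entity) combinations for one domain."""
    return [(t.format(entity=e), e) for t in templates for e in entities]

# Example: 2 templates x 3 entities -> 6 evaluation prompts.
templates = ["A photo of {entity}.", "A watercolor painting of {entity}."]
entities = ["Boeing 747", "Eiffel Tower", "Monarch butterfly"]
pairs = build_eval_prompts(templates, entities)
```

With 20 prompts per domain, this yields 20 × (number of entities) pairs, matching the #(Prompt, Entity) column in the table below (e.g. 20 × 48 = 960 for Aircraft).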

The following table summarizes the dataset statistics across different domains:

| Domain | Prompts | Entities | Support Images | Eval Images | #(Prompt, Entity) |
| --- | --- | --- | --- | --- | --- |
| Aircraft | 20 (link) | 48 (link) | 480 (link) | 240 (link) | 960 |
| Vehicle | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Flower | 20 (link) | 18 (link) | 180 (link) | 90 (link) | 360 |
| Insect | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Plant | 20 (link) | 48 (link) | 480 (link) | 240 (link) | 960 |
| Landmark | 20 (link) | 50 (link) | 500 (link) | 250 (link) | 1000 |
| Cuisine | 20 (link) | 31 (link) | 310 (link) | 155 (link) | 620 |
| Sport | 20 (link) | 27 (link) | 270 (link) | 135 (link) | 540 |

Note: Each entity has up to 10 support images and up to 5 evaluation images. The #(Prompt, Entity) column represents the total number of evaluation prompts generated from all prompt-entity combinations.

We categorize evaluation tasks into five types: 1) generating the knowledge entity (Basic), 2) generating the knowledge entity in context (Location), 3) composition of entities (Composition), 4) creation in different styles (Style), and 5) creation in different materials (Material). The distribution is summarized below:

| Evaluation Task | # Prompts | Percentage (%) |
| --- | --- | --- |
| Basic | 295 | 4.58 |
| Location | 1969 | 30.57 |
| Composition | 1467 | 22.78 |
| Style | 1365 | 21.20 |
| Material | 1344 | 20.87 |

Notebooks

  • eval_kitten.ipynb evaluates generated images along two dimensions using GPT-based MLLMs via the OpenAI API: entity alignment, which measures how faithfully a generated image represents a reference (support) entity, and prompt alignment, which assesses how well the image captures the details described in the textual prompt. The notebook loads and encodes images, sends them to the OpenAI API, and returns numeric scores (1–5) with explanations.
  • load_kitten.ipynb prepares the KITTEN dataset for evaluation by organizing entities, prompts, and associated images. For each entity, it provides a support set of reference images, and an evaluation set to assess generation fidelity. The loader generates all prompt-entity combinations by filling template prompts with each entity name, and links each combination to the corresponding support and evaluation images.
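A minimal sketch of the GPT-based judging step in eval_kitten.ipynb, assuming the official `openai` Python package (v1 API, key read from `OPENAI_API_KEY`). The model name, prompt wording, and function names are illustrative assumptions, not the notebook's exact settings.

```python
import base64

def encode_image(path):
    """Read an image file and return its base64-encoded contents,
    as required for inline image inputs to the OpenAI API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def score_entity_alignment(image_path, reference_path, entity_name,
                           model="gpt-4o"):
    """Ask an MLLM judge for a 1-5 entity-alignment score comparing a
    generated image against a reference (support) image."""
    from openai import OpenAI  # deferred import; requires `pip install openai`
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"On a scale of 1-5, how faithfully does the first "
                          f"image depict the entity '{entity_name}' shown in "
                          f"the second (reference) image? Reply with the score "
                          f"and a one-sentence explanation.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(reference_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Prompt alignment can be scored the same way by sending only the generated image together with the text prompt it was conditioned on.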

Citation

KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities
Hsin-Ping Huang¹,², Xinyi Wang¹, Yonatan Bitton¹, Hagai Taitelbaum¹, Gaurav Singh Tomar¹,
Ming-Wei Chang¹, Xuhui Jia¹, Kelvin C.K. Chan¹, Hexiang Hu¹, Yu-Chuan Su¹, Ming-Hsuan Yang¹,²

1Google DeepMind  2University of California, Merced

Transactions on Machine Learning Research (TMLR), 2026

Please cite our paper if you find it useful for your research.

@article{huang_2026_kitten,
   title = {{KITTEN}: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities},
   author={Huang, Hsin-Ping and Wang, Xinyi and Bitton, Yonatan and Taitelbaum, Hagai and Tomar, Gaurav Singh and Chang, Ming-Wei and Jia, Xuhui and Chan, Kelvin C.K. and Hu, Hexiang and Su, Yu-Chuan and Yang, Ming-Hsuan},
   journal={Transactions on Machine Learning Research},
   year={2026}
}
