OSCBench: Benchmarking Object State Change in Text-to-Video Generation

1 National University of Singapore 2 Singapore Management University 3 Carnegie Mellon University 4 Fudan University
arXiv 2026
*Equal Contribution  †Corresponding Author

Current text-to-video (T2V) models generate visually impressive videos but still struggle with object state changes.

Regular scenario: A woman is slicing an apple
(Discontinuous state changes)

Minimal Prompt: Slicing apple
(Objects appear suddenly)

Novel scenario: A chef is sauteing pineapple on the grass
(Objects appear suddenly)

Compositional scenario: A chef is mincing and sauteing onion in a kitchen
(Artificial artifacts)

Object State Change in T2V Generation

  • Object state change (OSC) is common in daily life and indicates whether a task has been completed.
  • Existing T2V evaluations mainly focus on semantic alignment, visual quality, and physical plausibility, but overlook whether the object reaches the action-induced state.
  • T2V-generated videos may look plausible but still fail to produce correct and temporally consistent object state changes (see figure on the right).
Radar plot showing the gap between semantic alignment and OSC performance

Performance evaluated by humans.

Overview of OSCBench

OSCBench construction and evaluation pipeline

Overview of OSCBench. We build unified action and object categories from instructional cooking data via a human-in-the-loop process, and construct regular, novel, and compositional OSC scenarios as text prompts for video generation. The generated videos are evaluated by both humans and MLLMs, and we analyze the correlation between the two to assess the reliability of automatic evaluation.

Benchmark Statistics

20 action elements → 8 action categories

134 object elements → 28 object categories

108 regular scenarios

20 novel scenarios

12 compositional scenarios

Each scenario contains 8 action–object pairs

1,120 prompts overall

Action taxonomy diagram

Action taxonomy: 20 elements → 8 categories.

Benchmark statistics visualization

Example prompts and failure cases from regular, novel, and compositional OSC scenarios.

Main Results

1. Overall Performance of T2V Models

Main result figure 1

Overall performance of T2V models as rated by human and MLLM-based evaluators.

Main result figure 2

Performance across different evaluation dimensions, as evaluated by GPT-5.2.


2. Performance of Different Scenarios

Main results table

Regular scenarios achieve the highest OSC performance, while novel scenarios show the most severe degradation. Compositional scenarios perform better than novel ones.


3. OSC Performance of Different Actions

  • Models perform well on simple actions with clear and visually salient state changes, such as rolling and heating.
  • Performance drops on actions involving complex hand–object interactions or subtle visual changes, such as peeling, coating, and pressing.
Radar plot of OSC performance across different actions

OSC performance of different actions, evaluated by humans.

Rolling: A man is rolling dough in the bakery

Pressing: A woman in casual clothes is squeezing lemon on the grass (The lemon shows no obvious change)

Interesting Sampled Videos


Regular scenario: A chef is slicing leek at a street food stand
(Discontinuous state changes)

Regular scenario (minimal prompt): Peeling zucchini
(Artificial artifacts)

Novel scenario: A woman is zesting grapefruit in the kitchen
(Memorization rather than understanding)

Compositional scenario: A robot with robotic hands is mincing and sauteing ginger in the kitchen
(Only one state change)

BibTeX

@article{han2026oscbench,
  title={OSCBench: Benchmarking Object State Change in Text-to-Video Generation},
  author={Han, Xianjing and Zhu, Bin and Hu, Shiqi and Li, Franklin Mingzhe and Carrington, Patrick and Zimmermann, Roger and Chen, Jingjing},
  journal={arXiv preprint arXiv:2603.11698},
  year={2026}
}