A Very Big Video Reasoning Suite

We are betting on a future in which video reasoning becomes the next fundamental intelligence paradigm after language reasoning, one that captures spatiotemporal, embodied experience of the world more naturally.

locate_twelve_o_clock_arrows
Knowledge training set
The image contains 2 clocks, each with only an hour hand. Exactly one clock has its hour hand pointing to 12 o'clock. First find the single clock pointing to 12 o'clock, then draw a red circle around it. Do not change anything else. Show the complete solution step by step.
First Frame
Last Frame
sequence_completion
Abstraction in-domain testset
The scene shows a color_cycle sequence. Elements are arranged horizontally from left to right. The last position contains a question mark (?) indicating a missing element. Observe the pattern: the colors follow a cyclic order that repeats after a certain number of elements. Determine the element that should replace the question mark to complete the sequence according to the established pattern.
First Frame
Last Frame
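The cyclic rule described in the sequence_completion prompt can be sketched in a few lines. This is only an illustration, assuming the visible elements are available as a list of color names; the function names `smallest_period` and `next_in_cycle` are hypothetical and not part of the benchmark code.

```python
def smallest_period(seq):
    """Return the smallest p such that seq[i] == seq[i - p] for every valid i."""
    n = len(seq)
    for p in range(1, n + 1):
        if all(seq[i] == seq[i - p] for i in range(p, n)):
            return p
    return n


def next_in_cycle(seq):
    """Predict the element that replaces the question mark in a cyclic sequence."""
    p = smallest_period(seq)
    # The missing element repeats the one located exactly one period earlier.
    return seq[len(seq) - p]


# Hypothetical example: the visible elements before the question mark.
visible = ["red", "green", "blue", "red", "green"]
print(next_in_cycle(visible))  # blue
```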
grid_number_sequence
Spatiality in-domain testset
The scene shows a 10x10 grid with a green start point, a red end point, and yellow cells marked with numbers 1, 2, and 3. An orange circular agent is positioned at the green start point. The agent can move to adjacent cells (up, down, left, right). Starting from the green start point, the agent must visit the numbered yellow cells in numerical order (1, then 2, then 3), taking the shortest path between each consecutive pair of numbered cells. The agent is allowed to pass through the red end point when visiting the numbered cells if needed. After visiting all numbered cells in sequence, the agent must reach the red end point, also following the shortest path.
First Frame
Last Frame
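The route the grid_number_sequence prompt asks for can be computed by chaining shortest paths between consecutive stops. The sketch below uses breadth-first search on a 4-connected grid. The `(row, column)` coordinates, the function names, and the optional `blocked` set are assumptions made for illustration; the prompt itself mentions no obstacles.

```python
from collections import deque


def bfs_path(start, goal, size=10, blocked=frozenset()):
    """Shortest 4-connected path from start to goal as a list of (row, col) cells."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if inside and nxt not in blocked and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable


def route(start, waypoints, end, size=10, blocked=frozenset()):
    """Visit the waypoints in order, then the end, using a shortest path per leg."""
    stops = [start, *waypoints, end]
    full = [start]
    for a, b in zip(stops, stops[1:]):
        leg = bfs_path(a, b, size, blocked)
        if leg is None:
            return None
        full.extend(leg[1:])  # skip the first cell, already in the route
    return full


# Hypothetical layout: start, numbered cells 1 to 3, and the end point.
path = route((0, 0), [(0, 3), (2, 3), (5, 5)], (9, 9))
print(len(path) - 1)  # 18 moves
```

Because each leg is solved independently, the route may cross the end cell before the final leg, which matches the allowance stated in the prompt.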
reorder_objects
Transformation training set
The scene contains multiple objects arranged in a horizontal line. Keep all other objects unchanged. Swap the positions of the 4th and 5th objects from the left using shortest paths.
First Frame
Last Frame
mark_wave_peaks
Perception out-of-domain testset
The scene shows a continuous wave on a white background. Find all peaks (local maxima: each point where the wave value is greater than both immediate neighbors). Circle each peak with a red hollow outline and a solid red dot at its center, from left to right one by one, and show the solution step by step.
First Frame
Last Frame
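The peak definition in the mark_wave_peaks prompt (a value greater than both immediate neighbors) translates directly to code. This is a minimal sketch, assuming the wave has been sampled into a list of numbers; `find_peaks` is a hypothetical helper, not the benchmark's own implementation.

```python
def find_peaks(values):
    """Indices of strict local maxima, in left-to-right order."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


# Hypothetical sampled wave values.
samples = [0, 1, 0, 2, 1, 3, 0]
print(find_peaks(samples))  # [1, 3, 5]
```

Under this strict definition, endpoints and flat plateaus are not reported as peaks.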

Inference Results

Task Domains
Circle Largest Value (Knowledge, out-of-domain test set)
Construction Blueprint (Abstraction, in-domain test set)
Select Leftmost Shape (Spatiality, out-of-domain test set)
Separate Objects (No Spin) (Transformation, out-of-domain test set)
Color Triple Intersection (Perception, out-of-domain test set)
Each sample shows the prompt, the ground-truth first and final frames, and the outputs of 9 models:
VBVR-Wan2.2
CogVideoX 1.5
Kling 2.6
LTX-2
Runway Gen-4
Sora 2
Veo 3
Wan 2.2 I2V
Hunyuan I2V

Leaderboard

Legend: Reference, Strong Baseline, Proprietary, Open-source

Rank  Family        Model                 Score
#1    Human         Human                 97.4%
#2    VBVR          VBVR-Wan2.2           68.5%
#3    Sora 2        Sora 2                54.6%
#4    Veo 3.1       Veo 3.1               48.0%
#5    Runway        Runway Gen-4 Turbo    40.3%
#6    Wan2.2        Wan2.2-I2V-A14B       37.1%
#7    Kling         Kling 2.6             36.9%
#8    LTX-2         LTX-2                 31.3%
#9    CogVideoX     CogVideoX1.5-5B-I2V   27.3%
#9    HunyuanVideo  HunyuanVideo-I2V      27.3%
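The ranks can be reproduced from the scores with a tie-aware ranking. The sketch below assumes each rank marker belongs to the entry listed just before it, so that Human is #1 and the two 27.3% entries share #9. It also assumes standard competition ranking; the page does not say which tie rule it uses, and the function name `rank_with_ties` is hypothetical.

```python
# Scores copied from the leaderboard (percent).
scores = {
    "Human": 97.4,
    "VBVR-Wan2.2": 68.5,
    "Sora 2": 54.6,
    "Veo 3.1": 48.0,
    "Runway Gen-4 Turbo": 40.3,
    "Wan2.2-I2V-A14B": 37.1,
    "Kling 2.6": 36.9,
    "LTX-2": 31.3,
    "CogVideoX1.5-5B-I2V": 27.3,
    "HunyuanVideo-I2V": 27.3,
}


def rank_with_ties(scores):
    """Assign ranks in descending score order; equal scores share a rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


for name, rank in rank_with_ties(scores).items():
    print(f"#{rank} {name} {scores[name]}%")
```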