A Very Big Video Reasoning Suite

We bet on a future in which video reasoning is the next fundamental intelligence paradigm after language reasoning, one in which spatiotemporal, embodied world experience can be captured more naturally.

circle_largest_numerical_value
Knowledge out-of-domain test set
The scene shows 6 numbers on a white canvas. First compare the numerical values of all numbers, then draw one red circle around the single largest number. Do not circle any other numbers. Show the complete circling process step by step.
First Frame
Last Frame
return_to_correct_bin
Abstraction training set
Move each item into the bin that matches its color. Only move items, do not change anything else.
First Frame
Last Frame
grid_color_sequence
Spatiality training set
The scene shows a 10x10 grid with a green start point, a red end point, and colored cells (orange, yellow, and blue). A purple circular agent is positioned at the green start point. The agent can move to adjacent cells (up, down, left, right). Starting from the green start point, the agent must visit the colored cells in order (orange, then yellow, then blue), taking the shortest path between each consecutive pair of colored cells. The agent is allowed to pass through the red end point when visiting the colored cells if needed. After visiting all colored cells in sequence, the agent must reach the red end point, also following the shortest path.
First Frame
Last Frame
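The route described in the grid_color_sequence prompt can be sketched as a chain of shortest-path segments. The sketch below is only an illustration under stated assumptions: it uses breadth-first search on an open 10x10 grid (the prompt mentions no obstacles), and the cell coordinates, `shortest_path`, and `full_route` are hypothetical names and values, not taken from the benchmark.

```python
from collections import deque

GRID_SIZE = 10  # the prompt describes a 10x10 grid


def shortest_path(start, goal, size=GRID_SIZE):
    """Breadth-first search over 4-connected cells.

    Returns the list of cells from start to goal (inclusive).
    Assumes an open grid with no blocked cells.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        row, col = cell
        for nxt in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if inside and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    # Walk back from the goal to the start using the parent links.
    path = []
    cell = goal
    while cell is not None:
        path.append(cell)
        cell = parents[cell]
    return path[::-1]


def full_route(start, waypoints, end):
    """Visit the waypoints in order, then the end, one shortest segment at a time."""
    stops = [start, *waypoints, end]
    route = [start]
    for a, b in zip(stops, stops[1:]):
        route.extend(shortest_path(a, b)[1:])  # skip the repeated junction cell
    return route


# Hypothetical (row, col) positions chosen for illustration only.
start, end = (0, 0), (9, 9)
orange, yellow, blue = (2, 3), (5, 1), (7, 8)
route = full_route(start, [orange, yellow, blue], end)
print(len(route) - 1)  # number of moves: 22
```

Because the grid is assumed to be open, each segment has the same length as the Manhattan distance between its endpoints; breadth-first search is used so that the sketch would still apply if blocked cells were added to the neighbor check.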
separate_objects_no_spin
Transformation out-of-domain test set
The scene shows 2 objects on the left side and dashed target outlines on the right side. The dashed target outlines remain completely stationary. Move each object horizontally to the right so that it aligns exactly with and fits within its corresponding dashed target outline.
First Frame
Last Frame
color_mixing
Perception training set
The scene has two colored light sources positioned on the left and right sides, and a mixing zone marked by a white rectangular border in the center. In additive color mixing (light mixing), when two lights overlap, their RGB components add together: result_R = min(color1_R + color2_R, 255), same for G and B, with each channel clamped to 255 maximum. First identify the RGB values of the left light (an RGB(69, 80, 31) colored light) and the right light (an RGB(92, 60, 102) colored light), then calculate the mixed color by adding their RGB components channel by channel. Fill the white-bordered mixing zone in the center with the resulting mixed color and show the full calculation process step by step.
First Frame
Last Frame
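The clamped additive rule stated in the color_mixing prompt can be checked with a few lines of code. This is a minimal sketch; the function name `mix_additive` is a hypothetical helper, and the two input colors are the ones given in the prompt.

```python
def mix_additive(color1, color2):
    """Add two RGB colors channel by channel, clamping each channel at 255."""
    return tuple(min(a + b, 255) for a, b in zip(color1, color2))


left = (69, 80, 31)    # left light, from the prompt
right = (92, 60, 102)  # right light, from the prompt

mixed = mix_additive(left, right)
print(mixed)  # (161, 140, 133)
```

None of the three sums exceeds 255 for this pair, so the clamp has no effect here; it only matters for brighter inputs, for example a channel pair such as 200 + 100, which would be capped at 255.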

Inference Results

Task Domains
Ball Bounces (Knowledge, in-domain test set)
Sequence Completion (Abstraction, in-domain test set)
Locate Topmost Figure (Spatiality, out-of-domain test set)
2D Geometric Transform (Transformation, out-of-domain test set)
Majority Color (Perception, in-domain test set)
Model Outputs
VBVR-Wan2.2
CogVideoX 1.5
Kling 2.6
LTX-2
Runway Gen-4
Sora 2
Veo 3
Wan 2.2 I2V
Hunyuan I2V

Leaderboard

Categories: Reference, Strong Baseline, Proprietary, Open-source

Rank  Model                 Score
#1    Human                 97.4%
#2    VBVR-Wan2.2           68.5%
#3    Sora 2                54.6%
#4    Veo 3.1               48.0%
#5    Runway Gen-4 Turbo    40.3%
#6    Wan2.2-I2V-A14B       37.1%
#7    Kling 2.6             36.9%
#8    LTX-2                 31.3%
#9    CogVideoX1.5-5B-I2V   27.3%
#9    HunyuanVideo-I2V      27.3%