A Very Big Video Reasoning Suite

We bet on a future in which video reasoning is the next fundamental intelligence paradigm after language reasoning, one where spatiotemporal, embodied experience of the world can be captured more naturally.

communicating_vessels
Knowledge, in-domain test set
A system of 3 communicating vessels with equal-diameter vertical tubes is filled with water (a low-viscosity liquid), shown in blue. As shown in the initial frame, the liquid levels in the tubes are [43, 2, 54] cm respectively. Due to pressure differences between the tubes, the liquid begins to flow through the connecting channels at the bottom. The flow is governed by hydrostatic pressure equalization and damped by viscous resistance with coefficient k=3.64. As the liquid redistributes, the height differences gradually decrease, and the system evolves toward equilibrium. Eventually, through conservation of volume, all tubes reach the same final liquid level, which equals the average of the initial heights. Simulate this settling process from the initial unbalanced state to the final stable equilibrium.
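The settling process described in this prompt can be sketched numerically. The prompt does not give the exact dynamics, so the sketch below assumes a simple first-order model in which the flow between adjacent tubes is proportional to their height difference divided by k, with the tubes connected in a chain. The function name `simulate_vessels`, the time step, and the tolerance are illustrative choices, not part of the suite.

```python
def simulate_vessels(heights, k=3.64, dt=0.01, tol=1e-6, max_steps=1_000_000):
    """Relax liquid levels toward equilibrium with explicit Euler steps.

    Assumed model (not specified by the prompt): the flow between
    adjacent tubes is (h[i] - h[i + 1]) / k, and all tubes have equal
    cross-section, so the sum of heights stands in for total volume.
    """
    h = [float(x) for x in heights]
    for step in range(max_steps):
        # Compute all flows first so every tube is updated simultaneously.
        flows = [(h[i] - h[i + 1]) / k for i in range(len(h) - 1)]
        for i, q in enumerate(flows):
            h[i] -= q * dt       # liquid leaves the higher tube
            h[i + 1] += q * dt   # and enters its neighbour
        if max(h) - min(h) < tol:
            return h, step + 1
    return h, max_steps


initial = [43, 2, 54]
final, steps = simulate_vessels(initial)
equilibrium = sum(initial) / len(initial)
print(equilibrium)  # 33.0
```

Under this assumed model every level converges to the average of the initial heights, 33 cm, and the total height is conserved at each step.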
select_next_figure_large_small_alternating_sequence
Abstraction, in-domain test set
The scene has two separated areas: a top SEQUENCE area and a bottom CHOICES area. In the SEQUENCE area, the shapes are the same shape and the same color, and their sizes strictly alternate between LARGE and SMALL from left to right. First observe the size-alternation pattern and determine whether the next item should be LARGE or SMALL, then select the one correct option (out of 4) in the CHOICES area that continues the same shape, color, and large/small alternation pattern. Circle the correct option and show the full process step by step.
planar_warp_verification
Spatiality, training set
Transform the blue grid by aligning its four corners to the four corners of the red quadrilateral. Apply a perspective transformation so the grid matches the red outline. Keep all background elements, colored dots, and gray patches unchanged. Output the transformed grid.
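The corner-to-corner alignment in this prompt corresponds to estimating a perspective (homography) matrix from four point correspondences. The sketch below is one standard way to do that, written in plain Python for illustration. The helper names `solve_linear`, `homography`, and `warp_point` and the example coordinates are hypothetical and are not taken from the suite.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src, dst):
    """Return the 3x3 matrix mapping four src corners onto four dst corners."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations in 8 unknowns.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(hm, x, y):
    """Apply the perspective matrix to one point (with homogeneous divide)."""
    w = hm[2][0] * x + hm[2][1] * y + hm[2][2]
    return ((hm[0][0] * x + hm[0][1] * y + hm[0][2]) / w,
            (hm[1][0] * x + hm[1][1] * y + hm[1][2]) / w)


# Hypothetical corners: a unit grid and an arbitrary red quadrilateral.
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 10), (90, 20), (80, 95), (5, 70)]
hm = homography(src, dst)
center = warp_point(hm, 0.5, 0.5)  # where the grid centre lands
```

Warping every grid vertex with `warp_point` gives the transformed grid. Libraries such as OpenCV provide the same operation, but the plain version keeps the arithmetic visible.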
animal_matching
Transformation, out-of-domain test set
Colored animal faces are on the left side of the canvas, and dark outlines of animals are on the right side. Move each colored animal face to its matching outline via the shortest path.
color_triple_intersection_red
Perception, out-of-domain test set
A Venn diagram of circles is shown. Identify the region that lies in all three of the first three circles (triple intersection) and color that region red. Do not change anything else.
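The target region in this prompt is the set of points that lie inside each of the first three circles. A minimal sketch of that membership test follows, assuming circles are given as (center x, center y, radius) tuples in pixel coordinates. The representation, the function names, and the example circles are illustrative assumptions, not the suite's own format.

```python
def in_circle(px, py, circle):
    """True if the point (px, py) lies inside or on the circle."""
    cx, cy, r = circle
    return (px - cx) ** 2 + (py - cy) ** 2 <= r ** 2


def triple_intersection_mask(circles, width, height):
    """Boolean mask (rows of pixels) of the region inside the first three circles."""
    first_three = circles[:3]
    return [
        [all(in_circle(x, y, c) for c in first_three) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical diagram: three overlapping circles plus a fourth that is ignored.
circles = [(40, 40, 30), (60, 40, 30), (50, 58, 30), (20, 80, 15)]
mask = triple_intersection_mask(circles, 100, 100)
red_pixels = sum(row.count(True) for row in mask)  # area to color red
```

Only pixels where the mask is true would be recolored. Everything else in the frame stays unchanged, as the prompt requires.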

Inference Results

Task Domains
Mirror Reflection: Knowledge, in-domain test set
Sequence Completion: Abstraction, in-domain test set
Directed Graph Navigation: Spatiality, in-domain test set
Symbol Deletion: Transformation, out-of-domain test set
Attention Shift (Different): Perception, in-domain test set
Model Outputs
VBVR-Wan2.2
CogVideoX 1.5
Kling 2.6
LTX-2
Runway Gen-4
Sora 2
Veo 3
Wan 2.2 I2V
Hunyuan I2V

Leaderboard

Categories: Reference, Strong Baseline, Proprietary, Open-source

Rank  Family        Model                 Score
#1    Human         Human                 97.4%
#2    VBVR          VBVR-Wan2.2           68.5%
#3    Sora 2        Sora 2                54.6%
#4    Veo 3.1       Veo 3.1               48.0%
#5    Runway        Runway Gen-4 Turbo    40.3%
#6    Wan2.2        Wan2.2-I2V-A14B       37.1%
#7    Kling         Kling 2.6             36.9%
#8    LTX-2         LTX-2                 31.3%
#9    CogVideoX     CogVideoX1.5-5B-I2V   27.3%
#9    HunyuanVideo  HunyuanVideo-I2V      27.3%