Delong Chen\*, Willy Chung\*, Yejin Bang, Ziwei Ji, Pascale Fung
Meta FAIR Paris · HKUST · Sorbonne Université
\* Equal contribution
Humans are known to have an internal "world model" that enables us to plan actions based on world states. AI agents need such a world model for action planning as well, yet it remains unclear how current AI models, especially generative models, learn world models and carry out procedural planning in diverse environments.
We introduce WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative setup enables us to evaluate different types of world models and planners and to compare them thoroughly across different hypotheses.
The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide “action equivalents” – identical actions observed in different contexts – as candidates for selection.
The World Modeling task evaluates a model's ability to identify which action caused an observed state transition. Given:
- An initial state (visual observation before the action)
- A final state (visual observation after the action)
- A set of candidate actions (video clips depicting different actions)
The model must select the candidate action that most plausibly explains the transition from the initial to final state.
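For models that can simulate outcomes, a natural way to approach this task is to roll each candidate action forward from the initial state and check which predicted outcome best matches the observed final state. The snippet below is only a minimal sketch of that idea; `world_model.predict_final_state` and `similarity` are hypothetical placeholders, not functions provided by this repository.

```python
def select_action(initial_state, final_state, candidate_actions, world_model, similarity):
    """Sketch: pick the candidate whose simulated outcome best matches the observed final state.

    `world_model.predict_final_state` and `similarity` are hypothetical placeholders.
    """
    best_uid, best_score = None, float("-inf")
    for action in candidate_actions:
        # Simulate applying this candidate action to the initial state.
        predicted_final = world_model.predict_final_state(initial_state, action)
        # Compare the simulated outcome with the observed final state.
        score = similarity(predicted_final, final_state)
        if score > best_score:
            best_uid, best_score = action["segment_uid"], score
    return best_uid
```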
Annotation Format:

```json
{
  "states": {
    "segment_uid": "segment|COIN|U9bZX2wABWQ_000",
    "video": "U9bZX2wABWQ.mp4",
    "segment_start_time": 23.7,
    "segment_end_time": 42.3
  },
  "ground_truth": "segment|COIN|pCCVwljBs7M_005",
  "candidates": [
    {
      "video": "pCCVwljBs7M.mp4",
      "segment_start_time": 112.2,
      "segment_end_time": 123.8,
      "action_label": "set up the brackets",
      "segment_uid": "segment|COIN|pCCVwljBs7M_005"
    },
    ...
  ]
}
```

The Procedural Planning task extends world modeling to multi-step reasoning. Given:
- An initial state (starting visual observation)
- A final state (goal visual observation)
- A set of candidate plans (different orderings of action sequences)
- A pool of candidate actions (video clips for each action in the plans)
The model must select the plan (ordered sequence of actions) that correctly transforms the initial state into the final state.
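One way to extend the single-step sketch above to plan selection is to chain a world model through each candidate sequence and compare the simulated end state with the goal observation. Again, `world_model.predict_final_state` and `similarity` are hypothetical placeholders rather than repository APIs.

```python
def score_plan(initial_state, final_state, plan, action_segments, world_model, similarity):
    """Sketch: roll a single-step world model through a plan (a list of segment_uids)."""
    state = initial_state
    for segment_uid in plan:
        action = action_segments[segment_uid]  # metadata for this action clip
        state = world_model.predict_final_state(state, action)
    return similarity(state, final_state)

def select_plan(initial_state, final_state, candidate_plans, action_segments, world_model, similarity):
    """Sketch: return the index of the best-scoring candidate plan."""
    scores = [
        score_plan(initial_state, final_state, plan, action_segments, world_model, similarity)
        for plan in candidate_plans
    ]
    return scores.index(max(scores))
```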
Annotation Format:

```json
{
  "states": {
    "segment_uid": "PP|segment|COIN|dBukB_1cXBE_003|-|segment|COIN|dBukB_1cXBE_005",
    "video": "dBukB_1cXBE.mp4",
    "segment_start_time": 60.7,
    "segment_end_time": 76.3
  },
  "ground_truth": 3,
  "candidates": [
    ["segment|COIN|AVbdqzpyAxk_007", "segment|COIN|AVbdqzpyAxk_004", "segment|COIN|dMSqm5BB5jA_000"],
    ["segment|COIN|AVbdqzpyAxk_004", "segment|COIN|AVbdqzpyAxk_007", "segment|COIN|dMSqm5BB5jA_000"],
    ...
  ],
  "action_segments": {
    "segment|COIN|dMSqm5BB5jA_000": {
      "video": "dMSqm5BB5jA.mp4",
      "segment_start_time": 28.7,
      "segment_end_time": 42.3,
      "action_label": "take out the shell"
    },
    ...
  }
}
```

WorldPrediction includes implementations for evaluating diverse model architectures:
| Model Class | Description | Modality |
|---|---|---|
| `VLM` | Vision-Language Models (e.g., Qwen2.5-VL, InternVL, GPT-4o, Claude) | Vision + Language |
| `SocraticLLM` | LLMs with precomputed visual captions (e.g., Llama 3.1/3.3/4, Qwen2.5, DeepSeek, Gemini-2.0, Claude-3.5, GPT-4o) | Language-only |
| `ContinuityShortCut` | Feature-matching baselines (CLIP, DINOv2, pixel-level) | Vision-only |
| `VideoDiffusion` | Video generation models (CogVideoX, I2VGen-XL) | Video Generation |
| `ExampleModel` | Random baseline for sanity checking | N/A |
Requirements:

- Python 3.9+
```bash
# Create and activate conda environment
conda create -n WorldPrediction python=3.9 -y
conda activate WorldPrediction

# Install dependencies
pip install -r requirements.txt
```

WorldPrediction aggregates annotations from five established instructional video datasets:
| Dataset | Domain | Video Source | Repository |
|---|---|---|---|
| COIN | Diverse instructional tasks | YouTube | https://coin-dataset.github.io/ |
| CrossTask | Cooking & household | YouTube | https://github.com/DmZhukov/CrossTask |
| EgoExo4D | Egocentric & exocentric | 3rd Party | https://ego-exo4d-data.org/ |
| IKEA-ASM | Furniture assembly | 3rd Party | https://ikeaasm.github.io/ |
| EPIC-KITCHENS-100 | Kitchen activities | 3rd Party | https://epic-kitchens.github.io/2025 |
- Download the source videos from each dataset (refer to original dataset documentation)
- Update the video root paths in `data/video_roots.py`:
```python
VIDEO_ROOTS = {
    "COIN": "/path/to/COIN/",
    "CrossTask": "/path/to/CrossTask/",
    "EgoExo4D": "/path/to/EgoExo4D/",
    "IKEAASM": "/path/to/IKEA-ASM/",
    "EPIC-KITCHENS-100": "/path/to/EPIC-KITCHENS-100/"
}
```
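An optional quick check that the configured roots actually exist on disk. This is only a sketch using the standard library, and it assumes `data/video_roots.py` is importable as `data.video_roots` from the repository root.

```python
import os

from data.video_roots import VIDEO_ROOTS  # assumes `data` is importable from the repo root

# Warn about any dataset root that has not been set up yet.
for dataset, root in VIDEO_ROOTS.items():
    status = "ok" if os.path.isdir(root) else "MISSING"
    print(f"{dataset:>20}: {root} [{status}]")
```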
The benchmark annotations are provided in two JSON files:

- `data/WorldPrediction-WM.json` — World Modeling task annotations
- `data/WorldPrediction-PP.json` — Procedural Planning task annotations
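As a quick sanity check after downloading, you can load an annotation file and inspect one sample. This sketch assumes the file stores a collection of records shaped like the annotation format shown above; adjust the indexing if the top-level structure differs.

```python
import json

# Load the World Modeling annotations and print one sample.
with open("data/WorldPrediction-WM.json", "r") as f:
    wm_data = json.load(f)

# Handle either a list of records or a dict keyed by sample id.
sample = list(wm_data.values())[0] if isinstance(wm_data, dict) else wm_data[0]
print("states:", sample["states"]["segment_uid"])
print("ground truth:", sample["ground_truth"])
for candidate in sample["candidates"]:
    print(candidate["segment_uid"], "-", candidate["action_label"])
```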
See notebooks/example_run_eval.ipynb for an interactive walkthrough of the evaluation pipeline.
```python
import json

from evaluator import WorldPredictionEvaluator
from model import VLM

# Load annotations
data = json.load(open("data/WorldPrediction-WM.json", "r"))

# Initialize evaluator
evaluator = WorldPredictionEvaluator(
    data=data,
    task="WM",
    start_index=0,
    end_index=10  # Evaluate first 10 samples
)

# Load model
config = json.load(open("configs/vlm/qwenvl/Qwen2.5-VL-3B-Instruct-4f.json", "r"))
model = VLM(**config)

# Run evaluation
accuracy, results = evaluator.evaluate(model)
print(f"Accuracy: {accuracy:.3f}")
```

To run the same evaluation from the command line:

```bash
python run.py \
--task WM \
--data data/WorldPrediction-WM.json \
--start_index 0 \
--end_index 100 \
--model_class VLM \
--model_config configs/vlm/qwenvl/Qwen2.5-VL-32B-Instruct-4f.json \
--output_path results/wm_results.json
```

For large-scale evaluation using SLURM clusters:
```bash
python run_submitit.py \
--task PP \
--data data/WorldPrediction-PP.json \
--model_class VLM \
--model_config configs/vlm/qwenvl/Qwen2.5-VL-32B-Instruct-2f.json \
--output_dir results \
--samples_per_job 100 \
--gpus_per_node 8
```

| Task | Data File | Description |
|---|---|---|
| `WM` | `WorldPrediction-WM.json` | High-Level World Modeling (single-step) |
| `PP` | `WorldPrediction-PP.json` | Long-Horizon Procedural Planning (multi-step) |
Vision-Language Models:
```bash
--model_class VLM --model_config configs/vlm/qwenvl/Qwen2.5-VL-3B-Instruct-4f.json
--model_class VLM --model_config configs/vlm/internvl/InternVL2_5-1B-2f.json
--model_class VLM --model_config configs/vlm/gpt-4o.json
--model_class VLM --model_config configs/vlm/claude-3-5-sonnet-20241022-v2.json
--model_class VLM --model_config configs/vlm/gemini1.5-pro.json
```

Continuity Baselines (Feature Matching):
```bash
--model_class ContinuityShortCut --model_config configs/continuity/clip_vit_l_14_openai.json
--model_class ContinuityShortCut --model_config configs/continuity/dinov2_large.json
--model_class ContinuityShortCut --model_config configs/continuity/pixel_32.json
```

Socratic LLM (Text-only):
```bash
--model_class SocraticLLM --model_config configs/socratic_llm/Qwen2.5-3B-Instruct.json
```

Video Diffusion Models:
```bash
--model_class VideoDiffusion --model_config configs/videodiffusion/cogvideox/CogVideoX-5B-I2V_InfStep50_Frames49_Scale6.json
--model_class VideoDiffusion --model_config configs/videodiffusion/i2vgenxl/I2VGenXL_InfStep50_Frames16_Scale9.json
```

To evaluate your own model, implement a class with the following interface:
```python
class MyCustomModel:
    def __init__(self, **kwargs):
        """Initialize your model with configuration parameters."""
        pass

    def select_action(self, states: dict, candidate_actions: list) -> tuple[str, dict]:
        """
        Single-step action selection for World Modeling.

        Args:
            states: {
                "initial_state": PIL.Image,  # Initial world state
                "final_state": PIL.Image,    # Final world state
                "segment_uid": str,          # Unique identifier
                "sample_uid": str,           # Sample identifier
            }
            candidate_actions: List of dicts, each containing:
                - "video": str (path to video file)
                - "start_time": float
                - "end_time": float
                - "segment_uid": str

        Returns:
            Tuple of (selected_segment_uid: str, info: dict)
        """
        pass

    def select_plan(self, states: dict, candidate_plans: list,
                    candidate_actions: dict) -> tuple[int, dict]:
        """
        Multi-step plan selection for Procedural Planning.

        Args:
            states: Same as select_action
            candidate_plans: List of plans, where each plan is a list of segment_uids
            candidate_actions: Dict mapping segment_uid to action info

        Returns:
            Tuple of (selected_plan_index: int, info: dict)
        """
        pass
```
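Before wiring up a real model, it can help to see the contract satisfied end to end. The following is a minimal hypothetical implementation that picks answers at random (similar in spirit to the `ExampleModel` sanity-check baseline) and only illustrates the expected return types.

```python
import random

class MyCustomModel:
    """Hypothetical random-choice model; useful only to illustrate the interface."""

    def __init__(self, **kwargs):
        self.config = kwargs  # keep any configuration passed from the JSON config file

    def select_action(self, states: dict, candidate_actions: list) -> tuple[str, dict]:
        # Ignore the states and pick a candidate action uniformly at random.
        choice = random.choice(candidate_actions)
        return choice["segment_uid"], {"strategy": "random"}

    def select_plan(self, states: dict, candidate_plans: list,
                    candidate_actions: dict) -> tuple[int, dict]:
        # Ignore the states and pick a candidate plan index uniformly at random.
        index = random.randrange(len(candidate_plans))
        return index, {"strategy": "random"}
```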
Register your model in `model/__init__.py`:

```python
from model.my_custom_model import MyCustomModel
```

Then run evaluation:
```bash
python run.py \
--task WM \
--data data/WorldPrediction-WM.json \
--model_class MyCustomModel \
--model_config path/to/config.json \
--output_path results/my_model_results.json
```

WorldPrediction is CC BY-NC 4.0 licensed, as found in the LICENSE file.
```bibtex
@article{chen2025worldprediction,
  title={WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning},
  author={Chen, Delong and Chung, Willy and Bang, Yejin and Ji, Ziwei and Fung, Pascale},
  journal={arXiv preprint arXiv:2506.04363},
  year={2025}
}
```
