Delong Chen\*, Willy Chung\*, Yejin Bang, Ziwei Ji, Pascale Fung
Meta FAIR Paris · HKUST · Sorbonne Université
\* Equal contribution
Humans are known to have an internal "world model" that enables us to plan actions based on world states. AI agents need such a world model for action planning as well, yet it remains unclear how current AI models, especially generative models, learn world models and carry out procedural planning in diverse environments.
We introduce WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative setup enables us to evaluate different types of world models and planners and to compare them thoroughly across different hypotheses.
The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide “action equivalents” – identical actions observed in different contexts – as candidates for selection.
The World Modeling task evaluates a model's ability to identify which action caused an observed state transition. Given:
- An initial state (visual observation before the action)
- A final state (visual observation after the action)
- A set of candidate actions (video clips depicting different actions)
The model must select the candidate action that most plausibly explains the transition from the initial to final state.
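For models that can simulate outcomes, a natural way to approach this task is to roll each candidate action forward from the initial state and check which predicted outcome best matches the observed final state. The snippet below is only a minimal sketch of that idea; `world_model.predict_final_state` and `similarity` are hypothetical placeholders, not functions provided by this repository.

```python
def select_action(initial_state, final_state, candidate_actions, world_model, similarity):
    """Sketch: pick the candidate whose simulated outcome best matches the observed final state.

    `world_model.predict_final_state` and `similarity` are hypothetical placeholders.
    """
    best_uid, best_score = None, float("-inf")
    for action in candidate_actions:
        # Simulate applying this candidate action to the initial state.
        predicted_final = world_model.predict_final_state(initial_state, action)
        # Compare the simulated outcome with the observed final state.
        score = similarity(predicted_final, final_state)
        if score > best_score:
            best_uid, best_score = action["segment_uid"], score
    return best_uid
```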
Annotation Format:

```json
{
  "states": {
    "segment_uid": "segment|COIN|U9bZX2wABWQ_000",
    "video": "U9bZX2wABWQ.mp4",
    "segment_start_time": 23.7,
    "segment_end_time": 42.3
  },
  "ground_truth": "segment|COIN|pCCVwljBs7M_005",
  "candidates": [
    {
      "video": "pCCVwljBs7M.mp4",
      "segment_start_time": 112.2,
      "segment_end_time": 123.8,
      "action_label": "set up the brackets",
      "segment_uid": "segment|COIN|pCCVwljBs7M_005"
    },
    ...
  ]
}
```

The Procedural Planning task extends world modeling to multi-step reasoning. Given:
- An initial state (starting visual observation)
- A final state (goal visual observation)
- A set of candidate plans (different orderings of action sequences)
- A pool of candidate actions (video clips for each action in the plans)
The model must select the plan (ordered sequence of actions) that correctly transforms the initial state into the final state.
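One way to extend the single-step sketch above to plan selection is to chain a world model through each candidate sequence and compare the simulated end state with the goal observation. Again, `world_model.predict_final_state` and `similarity` are hypothetical placeholders rather than repository APIs.

```python
def score_plan(initial_state, final_state, plan, action_segments, world_model, similarity):
    """Sketch: roll a single-step world model through a plan (a list of segment_uids)."""
    state = initial_state
    for segment_uid in plan:
        action = action_segments[segment_uid]  # metadata for this action clip
        state = world_model.predict_final_state(state, action)
    return similarity(state, final_state)

def select_plan(initial_state, final_state, candidate_plans, action_segments, world_model, similarity):
    """Sketch: return the index of the best-scoring candidate plan."""
    scores = [
        score_plan(initial_state, final_state, plan, action_segments, world_model, similarity)
        for plan in candidate_plans
    ]
    return scores.index(max(scores))
```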
Annotation Format:

```json
{
  "states": {
    "segment_uid": "PP|segment|COIN|dBukB_1cXBE_003|-|segment|COIN|dBukB_1cXBE_005",
    "video": "dBukB_1cXBE.mp4",
    "segment_start_time": 60.7,
    "segment_end_time": 76.3
  },
  "ground_truth": 3,
  "candidates": [
    ["segment|COIN|AVbdqzpyAxk_007", "segment|COIN|AVbdqzpyAxk_004", "segment|COIN|dMSqm5BB5jA_000"],
    ["segment|COIN|AVbdqzpyAxk_004", "segment|COIN|AVbdqzpyAxk_007", "segment|COIN|dMSqm5BB5jA_000"],
    ...
  ],
  "action_segments": {
    "segment|COIN|dMSqm5BB5jA_000": {
      "video": "dMSqm5BB5jA.mp4",
      "segment_start_time": 28.7,
      "segment_end_time": 42.3,
      "action_label": "take out the shell"
    },
    ...
  }
}
```

WorldPrediction includes implementations for evaluating diverse model architectures:
| Model Class | Description | Modality |
|---|---|---|
| `VLM` | Vision-Language Models (e.g., Qwen2.5-VL, InternVL, GPT-4o, Claude) | Vision + Language |
| `SocraticLLM` | LLMs with precomputed visual captions (e.g., Llama 3.1/3.3/4, Qwen2.5, DeepSeek, Gemini-2.0, Claude-3.5, GPT-4o) | Language-only |
| `ContinuityShortCut` | Feature-matching baselines (CLIP, DINOv2, pixel-level) | Vision-only |
| `VideoDiffusion` | Video generation models (CogVideoX, I2VGen-XL) | Video Generation |
| `ExampleModel` | Random baseline for sanity checking | N/A |
Requirements:

- Python 3.9+
```bash
# Create and activate conda environment
conda create -n WorldPrediction python=3.9 -y
conda activate WorldPrediction

# Install dependencies
pip install -r requirements.txt
```

WorldPrediction aggregates annotations from five established instructional video datasets:
| Dataset | Domain | Video Source | Repository |
|---|---|---|---|
| COIN | Diverse instructional tasks | YouTube | https://coin-dataset.github.io/ |
| CrossTask | Cooking & household | YouTube | https://github.com/DmZhukov/CrossTask |
| EgoExo4D | Egocentric & exocentric | 3rd Party | https://ego-exo4d-data.org/ |
| IKEA-ASM | Furniture assembly | 3rd Party | https://ikeaasm.github.io/ |
| EPIC-KITCHENS-100 | Kitchen activities | 3rd Party | https://epic-kitchens.github.io/2025 |
- Download the source videos from each dataset (refer to original dataset documentation)
- Update the video root paths in `data/video_roots.py`:
```python
VIDEO_ROOTS = {
    "COIN": "/path/to/COIN/",
    "CrossTask": "/path/to/CrossTask/",
    "EgoExo4D": "/path/to/EgoExo4D/",
    "IKEAASM": "/path/to/IKEA-ASM/",
    "EPIC-KITCHENS-100": "/path/to/EPIC-KITCHENS-100/"
}
```
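An optional quick check that the configured roots actually exist on disk. This is only a sketch using the standard library, and it assumes `data/video_roots.py` is importable as `data.video_roots` from the repository root.

```python
import os

from data.video_roots import VIDEO_ROOTS  # assumes `data` is importable from the repo root

# Warn about any dataset root that has not been set up yet.
for dataset, root in VIDEO_ROOTS.items():
    status = "ok" if os.path.isdir(root) else "MISSING"
    print(f"{dataset:>20}: {root} [{status}]")
```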
The benchmark annotations are provided in two JSON files:

- `data/WorldPrediction-WM.json` — World Modeling task annotations
- `data/WorldPrediction-PP.json` — Procedural Planning task annotations
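As a quick sanity check after downloading, you can load an annotation file and inspect one sample. This sketch assumes the file stores a collection of records shaped like the annotation format shown above; adjust the indexing if the top-level structure differs.

```python
import json

# Load the World Modeling annotations and print one sample.
with open("data/WorldPrediction-WM.json", "r") as f:
    wm_data = json.load(f)

# Handle either a list of records or a dict keyed by sample id.
sample = list(wm_data.values())[0] if isinstance(wm_data, dict) else wm_data[0]
print("states:", sample["states"]["segment_uid"])
print("ground truth:", sample["ground_truth"])
for candidate in sample["candidates"]:
    print(candidate["segment_uid"], "-", candidate["action_label"])
```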
See notebooks/example_run_eval.ipynb for an interactive walkthrough of the evaluation pipeline.
```python
import json

from evaluator import WorldPredictionEvaluator
from model import VLM

# Load annotations
data = json.load(open("data/WorldPrediction-WM.json", "r"))

# Initialize evaluator
evaluator = WorldPredictionEvaluator(
    data=data,
    task="WM",
    start_index=0,
    end_index=10  # Evaluate first 10 samples
)

# Load model
config = json.load(open("configs/vlm/qwenvl/Qwen2.5-VL-3B-Instruct-4f.json", "r"))
model = VLM(**config)

# Run evaluation
accuracy, results = evaluator.evaluate(model)
print(f"Accuracy: {accuracy:.3f}")
```

To run the same evaluation from the command line:

```bash
python run.py \
--task WM \
--data data/WorldPrediction-WM.json \
--start_index 0 \
--end_index 100 \
--model_class VLM \
--model_config configs/vlm/qwenvl/Qwen2.5-VL-32B-Instruct-4f.json \
--output_path results/wm_results.json
```

For large-scale evaluation using SLURM clusters:
```bash
python run_submitit.py \
--task PP \
--data data/WorldPrediction-PP.json \
--model_class VLM \
--model_config configs/vlm/qwenvl/Qwen2.5-VL-32B-Instruct-2f.json \
--output_dir results \
--samples_per_job 100 \
--gpus_per_node 8
```

| Task | Data File | Description |
|---|---|---|
| `WM` | `WorldPrediction-WM.json` | High-Level World Modeling (single-step) |
| `PP` | `WorldPrediction-PP.json` | Long-Horizon Procedural Planning (multi-step) |
Vision-Language Models:
```bash
--model_class VLM --model_config configs/vlm/qwenvl/Qwen2.5-VL-3B-Instruct-4f.json
--model_class VLM --model_config configs/vlm/internvl/InternVL2_5-1B-2f.json
--model_class VLM --model_config configs/vlm/gpt-4o.json
--model_class VLM --model_config configs/vlm/claude-3-5-sonnet-20241022-v2.json
--model_class VLM --model_config configs/vlm/gemini1.5-pro.json
```

Continuity Baselines (Feature Matching):
```bash
--model_class ContinuityShortCut --model_config configs/continuity/clip_vit_l_14_openai.json
--model_class ContinuityShortCut --model_config configs/continuity/dinov2_large.json
--model_class ContinuityShortCut --model_config configs/continuity/pixel_32.json
```

Socratic LLM (Text-only):
```bash
--model_class SocraticLLM --model_config configs/socratic_llm/Qwen2.5-3B-Instruct.json
```

Video Diffusion Models:
```bash
--model_class VideoDiffusion --model_config configs/videodiffusion/cogvideox/CogVideoX-5B-I2V_InfStep50_Frames49_Scale6.json
--model_class VideoDiffusion --model_config configs/videodiffusion/i2vgenxl/I2VGenXL_InfStep50_Frames16_Scale9.json
```

To evaluate your own model, implement a class with the following interface:
```python
class MyCustomModel:
    def __init__(self, **kwargs):
        """Initialize your model with configuration parameters."""
        pass

    def select_action(self, states: dict, candidate_actions: list) -> tuple[str, dict]:
        """
        Single-step action selection for World Modeling.

        Args:
            states: {
                "initial_state": PIL.Image,  # Initial world state
                "final_state": PIL.Image,    # Final world state
                "segment_uid": str,          # Unique identifier
                "sample_uid": str,           # Sample identifier
            }
            candidate_actions: List of dicts, each containing:
                - "video": str (path to video file)
                - "start_time": float
                - "end_time": float
                - "segment_uid": str

        Returns:
            Tuple of (selected_segment_uid: str, info: dict)
        """
        pass

    def select_plan(self, states: dict, candidate_plans: list,
                    candidate_actions: dict) -> tuple[int, dict]:
        """
        Multi-step plan selection for Procedural Planning.

        Args:
            states: Same as select_action
            candidate_plans: List of plans, where each plan is a list of segment_uids
            candidate_actions: Dict mapping segment_uid to action info

        Returns:
            Tuple of (selected_plan_index: int, info: dict)
        """
        pass
```
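Before wiring up a real model, it can help to see the contract satisfied end to end. The following is a minimal hypothetical implementation that picks answers at random (similar in spirit to the `ExampleModel` sanity-check baseline) and only illustrates the expected return types.

```python
import random

class MyCustomModel:
    """Hypothetical random-choice model; useful only to illustrate the interface."""

    def __init__(self, **kwargs):
        self.config = kwargs  # keep any configuration passed from the JSON config file

    def select_action(self, states: dict, candidate_actions: list) -> tuple[str, dict]:
        # Ignore the states and pick a candidate action uniformly at random.
        choice = random.choice(candidate_actions)
        return choice["segment_uid"], {"strategy": "random"}

    def select_plan(self, states: dict, candidate_plans: list,
                    candidate_actions: dict) -> tuple[int, dict]:
        # Ignore the states and pick a candidate plan index uniformly at random.
        index = random.randrange(len(candidate_plans))
        return index, {"strategy": "random"}
```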
Register your model in `model/__init__.py`:

```python
from model.my_custom_model import MyCustomModel
```

Then run evaluation:
```bash
python run.py \
--task WM \
--data data/WorldPrediction-WM.json \
--model_class MyCustomModel \
--model_config path/to/config.json \
--output_path results/my_model_results.json
```

WorldPrediction is CC BY-NC 4.0 licensed, as found in the LICENSE file.
```bibtex
@article{chen2025worldprediction,
  title={WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning},
  author={Chen, Delong and Chung, Willy and Bang, Yejin and Ji, Ziwei and Fung, Pascale},
  journal={arXiv preprint arXiv:2506.04363},
  year={2025}
}
```
