WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

WorldBench Team

Is your driving world model an all-around player?

Generative world models are reshaping embodied AI, enabling agents to synthesize 4D driving environments that look convincing yet often fail physically or behaviorally.

Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control.

We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world.



Evaluation Protocol

Generation

Measuring whether a model can synthesize visually realistic, temporally stable, and semantically consistent scenes. Even state-of-the-art models that achieve low perceptual error (e.g., LPIPS, FVD) often suffer from view flickering or motion instability, revealing the limits of current diffusion-based architectures.
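As a toy illustration of the temporal-stability side of this axis (a simplified proxy, not the benchmark's actual metric), flicker can be scored as the mean absolute difference between consecutive frames:

```python
def flicker_score(frames):
    """Toy temporal-stability proxy: mean absolute per-pixel difference
    between consecutive frames (each frame a flat list of intensities).
    Lower is more stable; 0.0 means a perfectly static clip."""
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, curr))
        count += len(curr)
    return total / count

# A static clip scores 0; an alternating (flickering) clip scores high.
static = [[0.5] * 4] * 3
flicker = [[0.0] * 4, [1.0] * 4, [0.0] * 4]
print(flicker_score(static), flicker_score(flicker))  # 0.0 1.0
```

A real evaluator would of course operate on decoded video tensors and combine this with perceptual metrics; the sketch only shows why low per-frame error can coexist with high inter-frame instability.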

Reconstruction

Probing whether generated videos can be reprojected into a coherent 4D scene using differentiable rendering. Models that appear sharp in 2D frequently collapse when reconstructed, producing geometric "floaters": a gap that exposes how weakly 2D appearance and 3D structure are coupled in most pipelines.
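A minimal sketch of the underlying consistency check, assuming a simple pinhole camera with hypothetical intrinsics (the benchmark's differentiable-rendering pipeline is far more involved): a reconstructed 3D point should reproject close to where it appears in the generated frame.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixels."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_error(p3d, observed_uv, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Euclidean pixel distance between where a reconstructed 3D point
    projects and where it was actually observed in the generated frame."""
    u, v = project(p3d, fx, fy, cx, cy)
    du, dv = u - observed_uv[0], v - observed_uv[1]
    return (du * du + dv * dv) ** 0.5

# With these intrinsics, a point at (1, 0.5, 10) projects to (370, 265).
err = reprojection_error((1.0, 0.5, 10.0), (370.0, 265.0))
print(err)  # 0.0
```

Averaging such residuals over many tracked points across frames is one simple way "floaters" show up numerically: points that look plausible in any single view accumulate large reprojection error across views.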

Action-Following

Testing whether a pre-trained action planner can operate safely inside the generated world. High open-loop realism does not guarantee safe closed-loop control; almost all existing world models trigger collisions or off-road drifts, underscoring that photometric realism alone cannot yield functional fidelity.
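Closed-loop outcomes of this kind are naturally summarized as event rates. The sketch below is schematic, not the benchmark's exact scoring: it aggregates hypothetical per-rollout collision and off-road flags.

```python
def safety_rates(rollouts):
    """Aggregate closed-loop rollouts into collision and off-road rates.
    Each rollout is a dict with boolean 'collision' and 'offroad' flags."""
    n = len(rollouts)
    if n == 0:
        return {"collision_rate": 0.0, "offroad_rate": 0.0}
    return {
        "collision_rate": sum(r["collision"] for r in rollouts) / n,
        "offroad_rate": sum(r["offroad"] for r in rollouts) / n,
    }

# Four hypothetical rollouts of a planner inside a generated world.
rollouts = [
    {"collision": True,  "offroad": False},
    {"collision": False, "offroad": False},
    {"collision": False, "offroad": True},
    {"collision": False, "offroad": False},
]
print(safety_rates(rollouts))  # {'collision_rate': 0.25, 'offroad_rate': 0.25}
```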

Downstream Task

Evaluating whether synthetic data can support downstream perception models trained on real-world datasets. Even visually appealing worlds may degrade detection or segmentation accuracy by 30-50%, highlighting that alignment with task distributions, not just image quality, is vital for practical usability.
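Degradation of this kind is typically reported as a relative drop in the downstream metric. A minimal sketch with illustrative numbers (not benchmark results):

```python
def relative_drop(real_score, synthetic_score):
    """Percentage degradation of a downstream metric (e.g., mAP or mIoU)
    when a perception model is fed synthetic instead of real data."""
    return 100.0 * (real_score - synthetic_score) / real_score

# E.g., detection mAP falling from 0.60 to 0.36 is a 40% relative drop,
# inside the 30-50% range reported above.
print(round(relative_drop(0.60, 0.36), 1))  # 40.0
```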

Human Preference

Capturing subjective judgments of world realism, physical plausibility, and behavioral safety through large-scale human annotation. Our study reveals that models with strong geometric consistency are generally rated as more "real", confirming that perceptual fidelity is inseparable from structural coherence.


A Journey of Evaluation

WorldLens: Full-Spectrum Evaluations

Generative world models must go beyond visual realism to achieve geometric consistency, physical plausibility, and functional reliability. WorldLens is a unified benchmark that evaluates these capabilities across five complementary aspects, from low-level appearance fidelity to high-level behavioral realism.

Each aspect is decomposed into fine-grained, interpretable dimensions, forming a comprehensive framework that bridges human perception, physical reasoning, and downstream utility.


Benchmarked Models
DriveDreamer-2
DreamForge
MagicDrive-V2
MagicDrive
DrivingSphere
. . .




Official Leaderboard



Call for Participation

We invite researchers and practitioners to submit their models for evaluation on the leaderboard, enabling consistent comparison and supporting progress in world model research.

For more details, please refer to our Evaluation Toolkit.




WorldLens-26K


To bridge the gap between human judgment and automated evaluation, we curate a large-scale human-annotated dataset comprising 26,808 scoring records of generated videos.

Each entry includes a discrete score and a concise textual rationale written by annotators, capturing both quantitative assessment and qualitative explanation.

The dataset covers complementary dimensions of perceptual quality. This balanced design ensures comprehensive coverage across spatial, temporal, and behavioral aspects of world-model realism.

We envision WorldLens-26K as a foundational resource for training auto-evaluation agents and constructing human-aligned reward or advantage functions for reinforcement fine-tuning of generative world models.
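A record pairing a discrete score with a textual rationale might be consumed as follows. The field names below are hypothetical placeholders for illustration, not the released schema:

```python
import json
from statistics import mean

# Hypothetical layout for WorldLens-26K-style entries: each record pairs
# a discrete score with a concise annotator-written rationale.
records_json = """[
  {"video_id": "clip_001", "dimension": "Vehicle Realism", "score": 4,
   "rationale": "Cars keep consistent shape across frames."},
  {"video_id": "clip_001", "dimension": "Temporal Consistency", "score": 2,
   "rationale": "Background buildings flicker between frames."}
]"""

records = json.loads(records_json)
by_dim = {}
for r in records:
    by_dim.setdefault(r["dimension"], []).append(r["score"])

# Per-dimension mean score, the simplest aggregation one might feed into
# a reward model or an auto-evaluation agent.
scores = {dim: mean(vals) for dim, vals in by_dim.items()}
print(scores)
```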




WorldLens-Agent

WorldLens-Agent: SFT from Human Feedback

Evaluating generated worlds hinges on human-centered criteria (e.g., physical plausibility) and subjective preferences (e.g., perceived realism) that quantitative metrics inherently miss, underscoring the need for a human-aligned evaluator.

To this end, we introduce WorldLens-Agent, a vision-language critic agent trained on WorldLens-26K. Through LoRA-based supervised fine-tuning, we distill human perceptual and physical judgments into a Qwen3-VL model, enabling it to internalize criteria such as realism, plausibility, and behavioral safety.

This provides consistent, human-aligned assessments, offering a scalable preference oracle for benchmarking future world models.
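The LoRA update itself is compact. As a dependency-free sketch (not the actual Qwen3-VL adapter code), the adapted linear layer adds a scaled low-rank product to a frozen weight:

```python
def matmul(X, Y):
    """Naive matrix multiply for small lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16.0, r=2):
    """LoRA-adapted linear layer: y = x @ (W + (alpha / r) * A @ B).
    The base weight W (d_in x d_out) stays frozen; only the low-rank
    factors A (d_in x r) and B (r x d_out) are updated during SFT."""
    scale = alpha / r
    delta = matmul(A, B)  # rank-r update, same shape as W
    W_eff = [[w + scale * d for w, d in zip(wrow, drow)]
             for wrow, drow in zip(W, delta)]
    return matmul(x, W_eff)

# With the conventional zero init for B, the adapter is a no-op and the
# frozen base layer's output is reproduced exactly.
x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0], [0.0]]             # trainable, rank r = 1
B_zero = [[0.0, 0.0]]          # zero-initialized trainable factor
print(lora_forward(x, W, A, B_zero, alpha=1.0, r=1))  # [[1.0, 2.0]]
```

This is why LoRA suits distilling judgments into a large frozen backbone: only the small A and B factors need gradients, while the base model's behavior is preserved at initialization.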




Generated Examples

Dimension: Vehicle Realism
Good Case
Failure Case


Dimension: Pedestrian Realism
Good Case
Failure Case


Dimension: Subject Fidelity
Good Case
Failure Case


Dimension: Depth Discrepancy
Good Case
Failure Case


Dimension: Temporal Consistency
Good Case
Failure Case


Dimension: Semantic Consistency
Good Case
Failure Case


Dimension: Photometric Discrepancy
Good Case
Failure Case


Dimension: Geometric Discrepancy
Good Case
Failure Case


Dimension: Novel View Quality
Good Case
Failure Case


Dimension: Novel View Discrepancy
Good Case
Failure Case


Dimension: Map Segmentation
Good Case
Failure Case


Dimension: 3D Object Tracking
Good Case
Failure Case


Dimension: Occupancy Prediction
Good Case
Failure Case


Contributors

Ao Liang
Core Contributor
Lingdong Kong
Core Contributor, Project Lead
Tianyi Yan
Core Contributor
Hongsi Liu
Core Contributor
Wesley Yang
Core Contributor
Ziqi Huang
Contributor, Human Preference
Wei Yin
Contributor, Closed-Loop Simulation
Jialong Zuo
Contributor, Object ReID
Yixuan Hu
Contributor, Depth
Dekai Zhu
Contributor, Depth
Dongyue Lu
Contributor, Cross-View Matching
Youquan Liu
Contributor, Human Preference
Guangfeng Jiang
Contributor, Reconstruction
Linfeng Li
Contributor, Human Preference
Xiangtai Li
Contributor, Human Preference
Long Zhuo
Contributor, Human Preference
Lai Xing Ng
Advisor
Benoit R. Cottereau
Advisor
Changxin Gao
Advisor
Liang Pan
Advisor
Wei Tsang Ooi
Advisor
Ziwei Liu
Advisor