UrbanVerse
Scaling Urban Simulation by Watching City-Tour Videos
Mingxuan Liu1,2,*, Honglin He1,*, Elisa Ricci2,3, Wayne Wu1, Bolei Zhou1
1University of California, Los Angeles; 2University of Trento; 3Fondazione Bruno Kessler
ICLR 2026

Introducing UrbanVerse, a system that converts real-world urban scenes from city-tour videos into physics-aware, interactive simulation environments, enabling scalable robot learning in urban spaces with real-world generalization.

Overview.
UrbanVerse is a scalable system that converts real-world urban scenes from city-tour videos into physics-aware, interactive simulation environments. By extracting scene layouts from uncalibrated RGB footage and populating them with assets from the UrbanVerse-100K database, it enables robot learning in faithful digital replicas of real streets—with zero-shot sim-to-real transfer.
Real-to-sim Scene Generation Results.

Using the extracted scene layout from YouTube videos as a blueprint and assets retrieved from UrbanVerse-100K, UrbanVerse generates simulation environments faithfully grounded in the real-world layout.

Generated Digital Cousin Scenes: Beijing, China.

For the same city-tour video and its layout, UrbanVerse generates multiple diverse digital cousin scenes by instantiating the layout with different retrieved assets.

[Videos: Input Video; Digital Cousin Scenes 01 to 05]
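
The idea of instantiating one layout with different retrieved assets can be illustrated with a minimal sketch. This is not the UrbanVerse implementation: the `LayoutObject` and `PlacedAsset` types, the tiny `ASSET_INDEX`, and the asset IDs are all hypothetical, and uniform random sampling within a category is a stand-in for the actual retrieval from UrbanVerse-100K. The sketch only shows that object categories and poses stay fixed across cousins while the concrete assets vary.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutObject:
    """One object in an extracted scene layout (hypothetical format)."""
    category: str      # semantic class, e.g. "car"
    position: tuple    # (x, y) on the ground plane, assumed to be in metres
    yaw_deg: float     # heading of the object


@dataclass(frozen=True)
class PlacedAsset:
    """A concrete asset placed at a layout object's pose."""
    asset_id: str
    position: tuple
    yaw_deg: float


# Hypothetical stand-in for an asset database indexed by category.
ASSET_INDEX = {
    "car": ["car_001", "car_002", "car_003"],
    "tree": ["tree_001", "tree_002"],
    "bench": ["bench_001", "bench_002", "bench_003"],
}


def make_cousin(layout, asset_index, rng):
    """Fill every layout slot with one asset of the matching category."""
    return [
        PlacedAsset(rng.choice(asset_index[obj.category]), obj.position, obj.yaw_deg)
        for obj in layout
    ]


def make_cousins(layout, asset_index, n, seed=0):
    """Build n scenes that share the layout but may differ in assets."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [make_cousin(layout, asset_index, rng) for _ in range(n)]


# Toy layout with three objects; poses are illustrative only.
layout = [
    LayoutObject("car", (2.0, 0.5), 90.0),
    LayoutObject("tree", (5.0, 3.0), 0.0),
    LayoutObject("bench", (7.5, 3.2), 180.0),
]
cousins = make_cousins(layout, ASSET_INDEX, n=5, seed=0)
```

Each entry of `cousins` keeps the same positions and headings as `layout`, so every variant remains grounded in the same street geometry while the visual and physical assets change.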

Generated Digital Cousin Scenes: Tangier, Morocco.

For the same city-tour video and its layout, UrbanVerse generates multiple diverse digital cousin scenes by instantiating the layout with different retrieved assets.

[Videos: Input Video; Digital Cousin Scenes 01 to 09]

UrbanVerse Scenes Populated with Dynamic Agents.
UrbanVerse-100K Asset Database.


Example of Per-object Annotation
Road PBRs
Sidewalk PBRs
Sky HDRIs
CraftBench Test Scenes Gallery.

[Gallery: Scenes 01 to 10]

Diverse Real-world City Tour Video Collection.

Real-world city-tour videos from across the globe provide the foundation for urban simulation layout grounding.

Real-world Scene Layout Distillation.

Given uncalibrated RGB city-tour videos, the UrbanVerse-Gen pipeline extracts real-world semantic scene layouts.

Real-World Urban Navigation Results.
On Diverse Street Environments.
Side-by-Side Comparison: COCO Wheeled Robot (3x Speed).
Side-by-Side Comparison: Go2 Quadruped Robot (3x Speed).
Mapless Long-horizon Urban Navigation Deployment.
Other Applications.

Beyond Navigation: Mobile Manipulation in UrbanVerse Scenes

Immersive VR Interaction and Scene Editing

BibTeX
@inproceedings{liu2026urbanverse,
  title={UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos},
  author={Mingxuan Liu and Honglin He and Elisa Ricci and Wayne Wu and Bolei Zhou},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
}