Introducing UrbanVerse: a system that converts real-world urban scenes from city-tour videos into physics-aware, interactive simulation environments, enabling scalable robot learning in urban spaces with real-world generalization.
Using the extracted scene layout from YouTube videos as a blueprint and assets retrieved from UrbanVerse-100K, UrbanVerse generates simulation environments faithfully grounded in the real-world layout.
For the same city-tour video and its layout, UrbanVerse generates multiple diverse digital cousin scenes by instantiating the layout with different retrieved assets.
Input Video
Digital Cousin Scene 01
Digital Cousin Scene 02
Digital Cousin Scene 03
Digital Cousin Scene 04
Digital Cousin Scene 05
For a second city-tour video and its layout, UrbanVerse likewise generates multiple diverse digital cousin scenes by instantiating the layout with different retrieved assets.
Input Video
Digital Cousin Scene 01
Digital Cousin Scene 02
Digital Cousin Scene 03
Digital Cousin Scene 04
Digital Cousin Scene 05
Digital Cousin Scene 06
Digital Cousin Scene 07
Digital Cousin Scene 08
Digital Cousin Scene 09
Scene 01
Scene 02
Scene 03
Scene 04
Scene 05
Scene 06
Scene 07
Scene 08
Scene 09
Scene 10
Real-world city-tour videos from across the globe provide the grounding for urban simulation layouts.
Given an uncalibrated RGB city-tour video, the UrbanVerse-Gen pipeline extracts the real-world semantic scene layout.
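The digital-cousin idea described above can be sketched as sampling different retrieved assets into the slots of one extracted layout. The sketch below is a minimal illustration, assuming a layout is a list of (category, pose) pairs and an asset database maps each category to candidate asset IDs; all names here (`instantiate_cousins`, `asset_db`, etc.) are hypothetical and not the released UrbanVerse API.

```python
import random

def instantiate_cousins(layout, asset_db, n_cousins=5, seed=0):
    """Create several digital-cousin scenes from one extracted layout.

    layout:   list of (category, pose) pairs extracted from a video
    asset_db: dict mapping category -> list of candidate asset IDs
    """
    rng = random.Random(seed)
    cousins = []
    for _ in range(n_cousins):
        # Keep the real-world layout fixed; swap only the asset in each slot.
        scene = [(cat, pose, rng.choice(asset_db[cat])) for cat, pose in layout]
        cousins.append(scene)
    return cousins

# Toy example: one layout, three cousin scenes with different assets.
layout = [("tree", (3.0, 1.0)), ("bench", (5.5, 2.0)), ("hydrant", (7.2, 0.5))]
asset_db = {
    "tree": ["tree_01", "tree_02", "tree_03"],
    "bench": ["bench_01", "bench_02"],
    "hydrant": ["hydrant_01", "hydrant_02"],
}
cousins = instantiate_cousins(layout, asset_db, n_cousins=3)
```

Each cousin preserves the grounded layout (same categories at the same poses) while varying the concrete asset, which is what makes the scenes diverse yet faithful to the source video.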
Beyond Navigation: Mobile Manipulation in UrbanVerse Scenes
Immersive VR Interaction and Scene Editing
@inproceedings{liu2026urbanverse,
  title={UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos},
  author={Mingxuan Liu and Honglin He and Elisa Ricci and Wayne Wu and Bolei Zhou},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
}
