Paper | Website | City Environments (Download)
Lingjun Mao1, Jiawei Ren1, Kun Zhou1, Jixuan Chen1, Ziqiao Ma2, Lianhui Qin1
1 University of California, San Diego   2 University of Michigan
DeliveryBench is a city-scale embodied benchmark that evaluates whether VLM agents can earn profit under realistic, long-horizon constraints. Agents act as autonomous couriers in 3D cities, accepting and completing delivery orders across multiple in-game hours. They must manage resources (e.g., stamina, e-scooter battery), adapt to changing conditions, and balance efficiency, timing, and cost. When multiple agents coexist, they also face social dynamics such as competition and collaboration. By jointly modeling economic, physical, and social dynamics within a unified embodied environment, DeliveryBench provides a realistic, action-driven setting to test whether VLM-based agents can plan and act strategically to improve financial outcomes.
Compared with prior embodied benchmarks, DeliveryBench supports long-horizon tasks (several in-game hours; typically > 100 action steps) with multi-dimensional real-world constraints, covering:
- Time Constraints: Tasks have deadlines and time windows that determine when they can be performed. Agents must schedule actions to avoid late deliveries and make efficient use of limited working time.
- Spatial Constraints: Some actions are only valid at specific locations, so agents must navigate 3D cities and visit the right POIs in the right order (e.g., restaurants, charging stations).
- Resource Constraints: Agents must manage consumables such as stamina, vehicle battery, and cash to stay operational, sometimes transforming one resource into another (e.g., buying an energy drink to restore stamina).
- Physical Constraints: Environmental dynamics (e.g., temperature, motion, collisions) affect food quality, requiring agents to consider item fragility and perishability in route planning.
- Economic Constraints: Agents earn income but also pay operational costs (e.g., recharging, renting, buying supplies), forcing them to balance short-term expenses against long-term profit.
- Social Constraints: In multi-agent settings, couriers collaborate and compete for limited opportunities (e.g., high-value orders, charging spots), shaping both strategy and outcomes.
evaluation/              # Evaluation and analysis utilities
maps/                    # Test city maps and map configs used in benchmark tasks
simworld/                # Core simulation backend (Python API for the UE-based SimWorld engine);
                         # see the SimWorld repo for detailed documentation
vlm_delivery/            # VLM-based delivery agent implementation and runtime
    actions/             # Concrete agent actions (e.g., ACCEPT_ORDER, MOVE_TO, BUY)
    base/                # Shared base classes (e.g., timers, type definitions)
    communicator/        # Interface between Python and the Unreal Engine environment (UnrealCV API)
    entities/            # Entity classes (e.g., DeliveryMan, Order, vehicles)
    gameplay/            # Runtime logic such as run_recorders, prompt construction
    gym_like_interface/  # Gym-style wrappers and RL-compatible environment interface
    input/               # Task and environment configuration (food types, agent count, etc.)
    map/                 # Map abstractions, waypoint systems, and visualization utilities used by the agent
    scripts/             # Test scripts to quickly run DeliveryBench
    utils/               # Helper utilities and common functions
    vlm/                 # Core VLM wrapper classes and model interfaces
.gitignore
README.md

Make sure to use Python 3.10 or later.
git clone https://github.com/mao1207/DeliveryBench.git
cd DeliveryBench
conda create -n deliverybench python=3.10
conda activate deliverybench
pip install -e .

Our DeliveryBench UE server is built on top of SimWorld. Please first install the SimWorld base Unreal Engine backend by following the installation guide. Then, download the DeliveryBench Unreal Engine package (.pak) from HuggingFace and add it to SimWorld as an additional environment following the additional environments (plug-in) guide.
This UE server renders the 3D city and runs the underlying simulation for delivery tasks. Please choose the package that matches your operating system.
- Windows: DeliveryBench Windows
- Linux: DeliveryBench Linux
Start the DeliveryBench UE server first, then run the Python examples. From the extracted UE server package directory:
- Windows: double-click SimWorld.exe, or launch it from the command line:
  SimWorld.exe <map_name>
- Linux: run:
  ./SimWorld.sh <map_name> -RenderOffscreen
Supported map_name options include (examples):
small-city-11, medium-city-22, large-city-26
See maps/ for the full list of available cities.
Configuration files are under vlm_delivery/input/:
- experiment_config.json: Experiment-facing settings (e.g., which map to run). Edit this file for most runs.
- game_mechanics_config.json: Game mechanics parameters such as vehicle speed/cost and e-scooter charging rate. We recommend keeping this file unchanged to stay aligned with our default experimental setup.
In experiment_config.json, make sure the following fields are set correctly:
- map_name: Must match the map name used when launching gym_citynav.exe
- ue_port: Must match the UE server port (default: 9000)
- multi_agent: Enable/disable multi-agent mode
- agent_count: Number of courier agents to spawn (only used when multi_agent is enabled)
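As an orientation aid only, an experiment_config.json covering the fields above might look like the sketch below; the exact structure and any additional keys should be taken from the file shipped in vlm_delivery/input/, not from this example.

```json
{
  "map_name": "small-city-11",
  "ue_port": 9000,
  "multi_agent": true,
  "agent_count": 2
}
```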
For detailed configuration documentation, see configuration.md.
The VLM model is defined in:
vlm_delivery/input/model.json
You can directly swap in models supported by OpenRouter or OpenAI (e.g., gpt-4o, gpt-4.1, llama-3.1); just replace the model name and corresponding API key fields.
Before running, export your API key in the shell you will launch Jupyter from (e.g., export OPENROUTER_API_KEY=... or export OPENAI_API_KEY=...).
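A missing key otherwise only surfaces as a model-call failure mid-run, so a small pre-flight check in the notebook can save time. The snippet below is an illustrative helper of our own (the function name is not part of the repo); it only assumes the two environment variable names mentioned above.

```python
import os

def find_api_key() -> str:
    """Return the name of the first API-key env var that is set.

    Checks OpenRouter first, then OpenAI, mirroring the export
    instructions above. Raises if neither is available.
    """
    for var in ("OPENROUTER_API_KEY", "OPENAI_API_KEY"):
        if os.environ.get(var):
            return var
    raise RuntimeError(
        "No API key found; run e.g. `export OPENROUTER_API_KEY=...` "
        "in the shell before launching Jupyter."
    )
```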
Open the quick-start notebook:
vlm_delivery/scripts/run_deliverybench.ipynb
This notebook will:
- connect to the UE server
- spawn courier agents
- run delivery episodes
- log and visualize results
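To make the control flow concrete, here is a tiny mock of the kind of reset/step episode loop such a notebook drives. This is purely illustrative: the real environment lives in vlm_delivery/gym_like_interface/, and the class and method names below are stand-ins, not the repo's actual API.

```python
import random

class MockDeliveryEnv:
    """Stand-in for a gym-style DeliveryBench environment (illustrative)."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return {"stamina": 100, "battery": 100, "orders": []}  # observation

    def step(self, action: str):
        self.step_count += 1
        obs = {"stamina": 100 - self.step_count, "battery": 100, "orders": []}
        reward = random.random()              # profit earned this step
        done = self.step_count >= self.max_steps
        return obs, reward, done, {}

def run_episode(env) -> float:
    """Run one delivery episode and return the total profit."""
    obs, total_profit, done = env.reset(), 0.0, False
    while not done:
        action = "MOVE_TO"                    # a real agent queries the VLM here
        obs, reward, done, _ = env.step(action)
        total_profit += reward
    return total_profit
```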
After runs finish, JSON result files will be exported to the directory specified by lifecycle.export_path in vlm_delivery/input/experiment_config.json. You can then aggregate them into CSV summaries using:
python vlm_delivery/evaluation/agent_performance_analysis.py \
/path/to/result_json_folder \
    -o /path/to/output_dir

- /path/to/result_json_folder should point to a directory containing one or more JSON result files.
- The script will automatically load all JSON files in the folder, compute aggregate statistics (e.g., per-model averages), and write the CSV reports into /path/to/output_dir.
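For readers who want to post-process results themselves, the aggregation step can be sketched as below. This is not the repo's script (which lives in vlm_delivery/evaluation/agent_performance_analysis.py), and the result-file schema assumed here ("model" and "profit" keys) is our own illustration.

```python
import csv
import json
from collections import defaultdict
from pathlib import Path

def aggregate_results(json_dir: str, out_csv: str) -> dict:
    """Load all JSON result files in json_dir and write per-model averages."""
    profits = defaultdict(list)
    for path in Path(json_dir).glob("*.json"):
        result = json.loads(path.read_text())
        profits[result["model"]].append(result["profit"])
    averages = {model: sum(v) / len(v) for model, v in profits.items()}
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "avg_profit", "episodes"])
        for model in sorted(averages):
            writer.writerow([model, f"{averages[model]:.2f}", len(profits[model])])
    return averages
```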
DeliveryBench maps under maps/<map_name>/ are generated in three stages. You can create your own city map assets (and the corresponding UE city environment) for training or evaluation by following the steps below.
Generate a new procedural city layout (roads + buildings) and export the raw map assets under maps/<map_name>/.
python city_generation/generate_city_layout.py \
--map-name <map_name> \
--num-segments 35 \
    --seed 42

This will create the following files in maps/<map_name>/: roads.json, buildings.json, elements.json, routes.json (routes / bus routes), and progen_world.json (UE-compatible world objects).
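Before moving on to the enrichment step, it can be worth confirming that Step 1 actually produced all of the listed assets. The helper below is an illustrative check of our own, built only from the file list above.

```python
from pathlib import Path

# The files that Step 1 is documented to export into maps/<map_name>/.
EXPECTED_FILES = [
    "roads.json", "buildings.json", "elements.json",
    "routes.json", "progen_world.json",
]

def check_map_assets(map_dir: str) -> list[str]:
    """Return the expected files that are missing from the map directory."""
    root = Path(map_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```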
Take the base layout from Step 1 and add DeliveryBench-specific annotations (POI tags, bus routes, chargers). This produces the DeliveryBench-ready map file.
python city_generation/enrich_deliverybench_map.py \
--map-dir maps/<map_name> \
    --seed 42

This will generate maps/<map_name>/progen_world_enriched.json, the DeliveryBench-ready map file. It enriches the base layout with DeliveryBench-specific annotations, including POI tags (e.g., restaurant, store, roadside chargers) and bus routes (bus_routes).
Optional sanity check (headless): render the generated map to a PNG and quickly verify the layout/POIs/bus routes look correct:
python vlm_delivery/scripts/test_map.py --map-name <map_name> --out-global /path/to/map.png

If --out-global is not provided, the image will be saved under outputs/map_debug/ by default.
After obtaining progen_world.json, you can generate the environment in UE. DeliveryBench leverages SimWorld's generation functionality; please refer to the world_generation guide for detailed instructions on generating the world in UE.
We also support plugging in local VLMs. As a reference, we provide a minimal implementation of LLaVA-OneVision in:
vlm_delivery/vlm/base_model.py
You can adapt this file to wrap your own local model (e.g., by following the same forward / generate interface and image/text preprocessing pipeline).
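As a rough sketch of the shape such a wrapper takes, the class below mirrors the generate-style interface and preprocessing pipeline described above. The class and method names are assumptions for illustration, not the exact API of vlm_delivery/vlm/base_model.py, and the real model/processor would typically come from a library such as transformers.

```python
class LocalVLMWrapper:
    """Illustrative wrapper around a local vision-language model."""

    def __init__(self, model=None, processor=None):
        # In a real wrapper, `model` and `processor` would be loaded from
        # e.g. transformers (AutoModel / AutoProcessor) for LLaVA-OneVision.
        self.model = model
        self.processor = processor

    def preprocess(self, image, text: str) -> dict:
        """Turn raw image/text inputs into model-ready inputs (stubbed here)."""
        return {"image": image, "prompt": text}

    def generate(self, image, text: str) -> str:
        """Produce a text response for one observation + instruction."""
        inputs = self.preprocess(image, text)
        if self.model is None:                # stub path for illustration only
            return f"[stub reply to: {inputs['prompt']}]"
        return self.model.generate(**inputs)
```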
The lightweight agentic workflow (including chain-of-thought reasoning and future planning) is implemented through:
- vlm_delivery/gameplay/prompt.py: prompt templates for actions, CoT, and future plans
- vlm_delivery/gameplay/action_space.py: parsing model outputs into structured actions
- vlm_delivery/utils/vlm_prompt.py: runtime prompt assembly (e.g., feeding the previous plan, observations, or action history back into the model)
You are free to extend this workflow with additional modules, such as:
- memory modules (e.g., episodic or long-term memory over past orders and routes)
- reflection / self-correction loops (e.g., asking the model to critique or refine its own plan)
- tool-use modules (e.g., calling external routing APIs or heuristic planners before acting)
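As one concrete example of such an extension, a minimal episodic memory could record past orders and routes and surface the most recent events back into the prompt. The class below is a sketch of our own, not part of the repo.

```python
from collections import deque

class EpisodicMemory:
    """Minimal episodic memory over past events (illustrative extension)."""

    def __init__(self, capacity: int = 20):
        # Bounded buffer: oldest events are dropped once capacity is reached.
        self.events = deque(maxlen=capacity)

    def record(self, step: int, event: str) -> None:
        self.events.append((step, event))

    def as_prompt_context(self, last_n: int = 5) -> str:
        """Render the most recent events as text to append to the prompt."""
        recent = list(self.events)[-last_n:]
        return "\n".join(f"[t={t}] {e}" for t, e in recent)
```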
To run multiple DeliveryBench instances in parallel, launch multiple SimWorld UE servers on different ports.
For each instance, edit the port in the extracted UE server package at:
gym_citynav/Binaries/Linux/unrealcv.ini
Then set the matching port in vlm_delivery/input/experiment_config.json:
gym_env.ue_port: Must match the UE server port for that instance
Once each server uses a unique port (e.g., 9000, 9001, 9002, ...), you can run multiple experiments concurrently (one per port).
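To avoid hand-editing the config for every instance, one option is to stamp out a per-port copy of experiment_config.json. The helper below is an illustrative sketch; it assumes the ue_port field is nested under a "gym_env" key as suggested by the gym_env.ue_port notation above.

```python
import json
from pathlib import Path

def write_instance_configs(base_config: dict, ports: list[int], out_dir: str) -> None:
    """Write one experiment config per UE server port (illustrative helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for port in ports:
        cfg = dict(base_config)
        # Assumed nesting: gym_env.ue_port, per the config notation above.
        cfg["gym_env"] = {**cfg.get("gym_env", {}), "ue_port": port}
        (out / f"experiment_config_{port}.json").write_text(json.dumps(cfg, indent=2))
```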
We welcome contributions from the community! Whether you want to report bugs, suggest features, or submit code improvements, your input is valuable. Please check out our Contributing Guidelines for details on how to get started.

