Welcome to the RobotArena ∞, an open-source toolkit for evaluating robotic policies under diverse scene perturbations.
Our framework leverages custom-built simulated environments—called Robot Arenas—and supports a range of popular models including:
- CogAct
- RoboVLM
- Octo
- SpatialVLA
- open-pi-zero
- X-VLA
First, update the submodules in the repository so that all dependencies are correctly initialized:
git submodule update --init --recursive
We have set up separate servers for each policy for evaluation. Please follow the steps below to set up the environments.
environment cogact
conda env create -f env/cogact.yml
conda activate cogact
cd ./CogACT
pip install -e .
pip install uvicorn fastapi "tomli>=1.1.0" "rpds-py>=0.7.1" "traitlets>=5.3"
cd ../SimplerEnv
python -m pip install pip==25.0.1
pip install -e .
pip install -r requirements_full_install.txt
cd ManiSkill2_real2sim
pip install -e .
pip install --upgrade typing_extensions
pip install "numpy<1.25"
pip install tensorflow_datasets==4.9.3
pip install --upgrade pydantic fastapi
cd ../..
cp ./SimplerEnv/simpler_env/policies/sim_cogact/adaptive_ensemble.py ./SimplerEnv/simpler_env/policies/sim_cogact/CogACT/adaptive_ensemble.py
environment robovlms
conda env create -f env/robovlm.yml
conda activate robovlms
cd ./RoboVLM
pip install -e .
cd ../SimplerEnv
pip install -e .
cd ManiSkill2_real2sim
pip install -e .
pip install "opencv-python<4.10" "numpy<2.0" "pyarrow<21.0.0"
pip install matplotlib fastapi json_numpy draccus uvicorn
cd ../..
environment simpler_env
conda create -n simpler_env python=3.10
conda activate simpler_env
cd ./octo
pip install -e .
pip install -r requirements.txt
pip install --upgrade "jax[cuda11_pip]==0.4.20" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
cd ../SimplerEnv
pip install -e .
cd ManiSkill2_real2sim
pip install -e .
cd ../..
pip install uvicorn fastapi json_numpy draccus
pip install "scipy<1.13"
pip3 install torch torchvision torchaudio
pip install "transformers==4.47.0"
environment open-pi-zero
cd open-pi-zero
cd SimplerEnv
git submodule update --init
cd ../open-pi-zero
uv sync
uv pip install -e ../SimplerEnv
uv pip install -e ../SimplerEnv/ManiSkill2_real2sim
uv pip install uvicorn fastapi json-numpy
source scripts/set_path.sh
environment X-VLA
conda create -n XVLA python=3.10 -y
conda activate XVLA
cd ./X-VLA
pip install -r requirements.txt
If you encounter any problems installing av, try replacing the contents of the original requirements.txt with the following:
av==15.0.0
transformers<=4.51.3
fastapi
tensorboard
peft==0.17.1
uvicorn==0.34.3
json_numpy==2.1.0
safetensors==0.4.5
numpy==1.26.3
scipy==1.15.0
einops==0.8.1
timm==1.0.12
mmengine==0.10.5
pyarrow==20.0.0
h5py==3.12.1
accelerate==1.2.1
mediapy==1.2.4
environment genesis
cd ..
git clone https://github.com/Genesis-Embodied-AI/Genesis
cd Genesis
git checkout v0.3.3
conda create -n genesis python=3.10
conda activate genesis
pip install -e .
pip3 install torch torchvision
pip install pyyaml sapien
pip install json-numpy
environment gemini
conda env create -f env/gemini.yml
Before running any evaluation scripts, the respective policy server must be active. These servers generate actions from the observations and instructions provided by the evaluation environment.
def run(self, host: str = "0.0.0.0", port: int = 9030) -> None:
Each policy is assigned a default port, as in the signature above. To use a different port, modify it directly in the corresponding script.
Detailed instructions for preparing checkpoint and config files
Before running the script, make sure to set the model path to your local model path in the model_config.json file.
{
  "robovlm_ckpt": "<path_to_your_model>",
  "robovlm_config": "<path_to_your_model_config>"
}
You can get the checkpoint and config from the RoboVLMs Hugging Face repository. (We use kosmos_ph_bridge-post-train.pt and kosmos_ph_bridge-post-train.json as the default checkpoint and config file.)
Then download the kosmos-2-patch14-224 folder from here and place it in RoboVLM/.vlms/kosmos-2-patch14-224.
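Before starting a server, it can help to confirm that the paths in model_config.json actually exist. The following is a minimal sketch, not part of the repository: the helper name `missing_model_paths` is hypothetical, and it assumes the config is a flat JSON object whose values are local paths, as in the example above.

```python
import json
from pathlib import Path


def missing_model_paths(config_file, required_keys):
    """Return a list of problems found in a model_config.json-style file.

    Hypothetical helper: assumes a flat JSON object mapping keys to local paths.
    """
    config_path = Path(config_file)
    if not config_path.is_file():
        return [f"config file not found: {config_file}"]

    config = json.loads(config_path.read_text())
    problems = []
    for key in required_keys:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: not set")
        elif not Path(value).expanduser().exists():
            problems.append(f"{key}: path does not exist ({value})")
    return problems
```

For the RoboVLM server, one might call `missing_model_paths("model_config.json", ["robovlm_ckpt", "robovlm_config"])` and start the server only if the returned list is empty.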
After setting up the model path, you can run the server with the following command:
cd RoboVLM
conda activate robovlms
python eval/simpler/server_robovlm.py
- The RoboVLM server defaults to port 9000.
Activate the Conda environment and run the server:
conda activate simpler_env
export PYTHONPATH=$(pwd)
python src/server/server_octo.py
- The Octo server defaults to port 9010.
Detailed instructions for preparing checkpoint
Before running the script, make sure to set the model path to your local model path in the model_config.json file.
{
  "spatial_path": "<path_to_your_model>"
}
You can download the model by following the instructions in the SpatialVLA repository.
After setting up the model path, you can run the server with the following command:
conda activate simpler_env
export PYTHONPATH=$(pwd)
python src/server/server_spatial.py
- The SpatialVLA server defaults to port 9020.
- Refer to the CogACT repository for instructions on downloading and using the CogACT model.
Activate the Conda environment and run the server:
conda activate cogact
python src/server/server_cogact.py
- The CogACT server defaults to port 9030.
Details on downloading from Hugging Face
- `google/paligemma-3b-pt-224` must be downloaded.
- You can test `google/paligemma-3b-pt-224` with:
cd open-pi-zero
uv run src/model/vla/pizero.py --text_only --load_pretrained_weights --use_bf16
- The author has provided these checkpoints on Hugging Face: Bridge-Uniform | Bridge-Beta | Fractal-Uniform | Fractal-Beta
- Remember to confirm that the checkpoint location in `slurm/eval_simpler_bridge_server.sh` is correct.
cd open-pi-zero/open-pi-zero
bash slurm/eval_simpler_bridge_server.sh
- The open-pi-zero server defaults to port 9040.
- Activate the conda environment and start the server:
conda activate XVLA
cd X-VLA
python deploy.py \
--model_path 2toINF/X-VLA-WidowX \
--host 0.0.0.0 \
--port 9050
- The X-VLA server defaults to port 9050.
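Since every test script expects its policy server to be reachable already, a quick port check can save a failed run. The sketch below uses only the Python standard library; the port numbers are the defaults listed in this README, while the helper names `is_listening` and `server_status` are hypothetical and not part of the repository. It only tests that something accepts TCP connections on the port, not that it is the expected policy server.

```python
import socket

# Default ports as listed in the server sections of this README.
DEFAULT_PORTS = {
    "robovlm": 9000,
    "octo": 9010,
    "spatial": 9020,
    "cogact": 9030,
    "open_pi_zero": 9040,
    "xvla": 9050,
}


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def server_status(ports=DEFAULT_PORTS, host="127.0.0.1"):
    """Map each policy name to whether its default port accepts connections."""
    return {name: is_listening(port, host) for name, port in ports.items()}
```

Calling `server_status()` returns a dictionary such as `{"robovlm": False, "octo": True, ...}`, depending on which servers are running locally. If you changed a port in a server script, pass your own mapping instead of `DEFAULT_PORTS`.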
We provide sample data for scene generation along with test scripts. The data is located in the examples/data directory and follows the structure shown below.
The full benchmark dataset, formatted to match the examples/data structure, is available on Hugging Face:
sourlreapwr/RobotArena-Benchmark
examples/data
├── assets/
├── bridge/
├── scene_background/
We provide several evaluation scripts to test the performance of the policy in different scenarios.
We provide bash scripts for running all the tests in the `bash_scripts` folder. You can modify the arguments in the bash scripts as described below.
bash bash_scripts/default_test.bash
bash bash_scripts/default_test_droid.bash
bash bash_scripts/default_test_rh20t.bash
bash bash_scripts/background_test.bash
bash bash_scripts/adv_background_test.bash
bash bash_scripts/camera_test.bash
bash bash_scripts/permute_test.bash
bash bash_scripts/pose_test.bash
bash bash_scripts/asset_test.bash
bash bash_scripts/all_test_default.bash
bash bash_scripts/all_test_generate.bash
- If your conda installation lives elsewhere, change the line `source ~/miniconda3/etc/profile.d/conda.sh` in the bash scripts to match your conda installation path.
Here in configs/default.yaml, we provide a default configuration file for the test scripts.
base_folder: "./examples/data" # Root path to your generated scene data
robot: "WidowX" # Specifies the robot model
scene_name: "default1" # Identifier for the base scene for this test; change it to the scene you want to test, e.g., `scene2` or `default3`
This test evaluates the performance of the policy in a simulated scene with all default settings (camera angle, background, object positions, etc.).
The test is performed on both the default scene (e.g., default1) and the generated scenes (e.g., scene2).
Before running the test, make sure you have already run the server for the policy you want to test on the desired port.
| Policy | Command |
|---|---|
| SpatialVLA | bash bash_scripts/default_test.bash 9020 spatial generate |
| RoboVLM | bash bash_scripts/default_test.bash 9000 robovlm generate |
| Octo | bash bash_scripts/default_test.bash 9010 octo generate |
| CogAct | bash bash_scripts/default_test.bash 9030 cogact generate |
| open-pi-zero | bash bash_scripts/default_test.bash 9040 open_pi_zero generate |
| X-VLA | bash bash_scripts/default_test.bash 9050 xvla generate |
If you want to run the test on the default scenes, you can change from generate to default in the command.
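The commands in the table all follow the same positional pattern: port, policy name, then test mode. As a rough illustration, the pattern can be captured in a small helper. `build_test_command` is a hypothetical name, not something the repository provides, and it covers only the individual test scripts (the `all_test_*` scripts take a different last argument).

```python
import shlex

# The two test modes mentioned in this README.
TEST_MODES = ("default", "generate")


def build_test_command(script_name, port, policy, mode="generate"):
    """Build the argv list for one of the bash_scripts/<script_name>.bash tests.

    Hypothetical helper: argument order is port, policy, mode, as in the table.
    """
    if mode not in TEST_MODES:
        raise ValueError(f"mode must be one of {TEST_MODES}, got {mode!r}")
    return ["bash", f"bash_scripts/{script_name}.bash", str(port), policy, mode]


# Reproduces the SpatialVLA row of the table.
print(shlex.join(build_test_command("default_test", 9020, "spatial")))
# prints: bash bash_scripts/default_test.bash 9020 spatial generate
```

The returned list can be passed directly to `subprocess.run` from the repository root if you prefer to drive the tests from Python.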
You can also modify the following arguments in the bash script to customize the test:
- `--run_all`: Set to `True` to run the test on all scenes in the dataset, or `False` to run it on the single scene specified by `scene_name` in the config file. For example, `--run_all True` with test mode `generate` runs the test on all 20 generated scenes, while `--run_all False` runs it only on the scene specified by `scene_name` in the config file.
- `--config`: Path to the config file; defaults to `configs/default.yaml`. Point it to your own config file if you customize the test settings.
This test evaluates the performance of the policy in a simulated scene with a different background image. It runs the scene once for every background image in the specified folder. Five example background images are provided in the examples/background folder.
- Command:
bash bash_scripts/background_test.bash 9020 spatial generate
As with the default_test.bash script, you can specify the policy name and port number to run the test against the desired policy server, and adjust the relevant arguments in the bash script.
Additionally, you can choose which background images to use by changing the background_folder parameter in the config file. The default background folder is examples/background.
This script evaluates how changing the color composition of the background in a simulated scene affects the robustness of a robotic policy. The background image is gradually blended with its RGB-transformed variant at various strengths, and a predefined test pipeline is executed on each variant.
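To make the blending idea concrete, here is a minimal NumPy sketch of a linear blend between an image and a color-transformed copy at several strengths. The README does not specify which RGB transform or which strengths the script uses, so both are assumptions here: `rgb_variant` rotates the channel order purely as a stand-in, and the strength values are illustrative.

```python
import numpy as np


def rgb_variant(image):
    """Stand-in RGB transform (assumption): reorder channels (R, G, B) -> (G, B, R)."""
    return image[..., [1, 2, 0]]


def blend_backgrounds(image, strengths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Linearly blend an HxWx3 uint8 image with its RGB variant.

    A strength of 0 returns the original image; 1 returns the full variant.
    """
    image = np.asarray(image, dtype=np.float32)
    variant = rgb_variant(image)
    blended = []
    for strength in strengths:
        mix = (1.0 - strength) * image + strength * variant
        blended.append(np.clip(np.rint(mix), 0, 255).astype(np.uint8))
    return blended
```

Each returned array could then be saved as a background image and fed through the same test pipeline, one run per strength.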
Command:
bash bash_scripts/adv_background_test.bash 9020 spatial generate
This test evaluates the performance of the policy in a simulated scene with a different camera angle. It moves the camera up, down, left, right, forward, and backward by a certain distance and runs the scene from each of the resulting viewpoints to see how the policy performs.
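The six camera shifts can be pictured as unit offsets scaled by a distance. The sketch below is only an illustration: the axis convention (x forward, y left, z up), the helper name `shifted_camera_positions`, and the example distance are all assumptions, and the actual test script may define directions differently.

```python
# Assumed axis convention: x = forward, y = left, z = up.
DIRECTIONS = {
    "up": (0, 0, 1),
    "down": (0, 0, -1),
    "left": (0, 1, 0),
    "right": (0, -1, 0),
    "forward": (1, 0, 0),
    "backward": (-1, 0, 0),
}


def shifted_camera_positions(base_position, distance):
    """Return one shifted (x, y, z) camera position per direction."""
    return {
        name: tuple(b + distance * d for b, d in zip(base_position, direction))
        for name, direction in DIRECTIONS.items()
    }
```

For example, `shifted_camera_positions((1.0, 0.0, 0.5), 0.25)["up"]` yields `(1.0, 0.0, 0.75)` under the assumed convention.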
- Command:
bash bash_scripts/camera_test.bash 9020 spatial generate
This test evaluates only the generated scenes. It swaps the positions of the objects in the scene, trying different permutations of the objects to see how the policy performs.
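As an illustration of what permuting object positions means, the following sketch enumerates every reassignment of a fixed set of positions to the same objects, skipping the original layout. The function name and the object names in the usage note are hypothetical, and the README does not say whether the test script tries all permutations or only a subset.

```python
from itertools import permutations


def permuted_layouts(positions):
    """Return every reassignment of the given positions, except the original.

    `positions` maps object name -> position; each result is a new dict in
    which the same positions are redistributed among the same objects.
    """
    names = list(positions)
    original = tuple(positions[name] for name in names)
    layouts = []
    for candidate in permutations(original):
        if candidate != original:
            layouts.append(dict(zip(names, candidate)))
    return layouts
```

With three objects, for instance `{"spoon": (0.1, 0.0), "cup": (0.2, 0.1), "banana": (0.0, 0.3)}`, this produces the five layouts that differ from the original.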
- Command:
bash bash_scripts/permute_test.bash 9020 spatial generate
This test evaluates only the default scenes. It randomly generates different poses and rotations for the objects in the scene and runs the scene with each of them to see how the policy performs.
- Command:
bash bash_scripts/pose_test.bash 9020 spatial default
This test evaluates only the default scenes. The original target object for the task is replaced (e.g., the default spoon is swapped for an object generated from another real scene, selected by obj_cnt in the config), and the task is repeated.
- Only this script requires the `obj_cnt` parameter in the config file, which specifies the object to use for the target object variation. Its config defaults to `configs/simpler.yaml`.
Configuration Example (configs/simpler.yaml):
base_folder: < Root path to your generated scene data>
robot: "WidowX" # Specifies the robot model
scene_name: "default1" # Identifier for the base scene for this test
replace_name: "scene1" # Scene to be used for background variation and target object variation
obj_cnt: 4 # Index of the object to be used for target object variation [in this case it will be the banana in scene1]
- You can choose which object to use for the target object variation by changing the `obj_cnt` parameter in the config file. The object indices and names are listed in the `masks/result.json` file in each scene's folder.
- Command:
bash bash_scripts/asset_test.bash 9020 spatial default
You can run all the tests in one go by using the following bash scripts:
- For default scenes:
bash bash_scripts/all_test_default.bash 9020 spatial true
This runs all the tests against the spatial policy server on port 9020 on all 4 provided default scenes. Change the last argument to false to run the tests on the single scene specified by scene_name in the config file.
- For generated scenes:
bash bash_scripts/all_test_generate.bash 9020 spatial true
This runs all the tests against the spatial policy server on port 9020 on all 20 generated scenes. Change the last argument to false to run the tests on the single scene specified by scene_name in the config file.
If you have run all the tests, you will have the following structure in your output folder:
default_test
├── <policy_A>
│ ├── adv_background_test
│ ├── <scene_name>
│ ├── asset_test
│ ├── background_test
│ ├── camera_test
│ ├── default_test
│ ├── pose_test
├── <policy_B>
...
generate_test
├── <policy_A>
│ ├── adv_background_test
│ ├── <scene_name>
│ ├── background_test
│ ├── camera_test
│ ├── default_test
│ ├── permute_test
├── <policy_B>
...
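When post-processing results, it can be convenient to build the result folder path programmatically. The sketch below follows the `./<mode>_test/<policy>/<variant>/` pattern used in this README and the test lists shown in the tree above. The helper name `results_dir` is hypothetical, and the layout below the variant folder is not modeled.

```python
from pathlib import Path

# Test variants per mode, as listed in the output tree above.
VARIANTS = {
    "default": {
        "adv_background_test", "asset_test", "background_test",
        "camera_test", "default_test", "pose_test",
    },
    "generate": {
        "adv_background_test", "background_test", "camera_test",
        "default_test", "permute_test",
    },
}


def results_dir(mode, policy, variant, root="."):
    """Return the expected results folder for a mode, policy, and test variant."""
    if mode not in VARIANTS:
        raise ValueError(f"unknown mode: {mode!r}")
    if variant not in VARIANTS[mode]:
        raise ValueError(f"{variant!r} is not produced in {mode!r} mode")
    return Path(root) / f"{mode}_test" / policy / variant
```

For example, `results_dir("generate", "spatial", "camera_test")` gives `generate_test/spatial/camera_test`, while asking for `pose_test` in `generate` mode raises a `ValueError` because that test runs only on default scenes.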
This script provides automated task success and progress scoring using Gemini 2.5 Pro Preview, conditioning on both video and privileged simulation state. It supports multithreaded processing of trajectories and saves per-trajectory evaluation scores. An API key is required for accessing Gemini.
- Automated multimodal evaluation from video and simulation state (object and robot states)
- Fast multithreaded inference for many trials
- Supports single-shot and zero-shot prompting
- Saves structured results for downstream analysis
- Gemini 2.5 Pro API key
bash bash_scripts/GVL.bash
Use the bash script bash_scripts/GVL.bash to run this evaluation. In GVL.bash, you can modify the following arguments to customize the evaluation:
- Use `policy` to specify the policy you want to evaluate, e.g., `spatial`, `robovlm`, `cogact`, `octo`, `open_pi_zero`, `xvla`.
- Use `variant` to specify the variant of the test you want to evaluate, e.g., `background_test`, `default_test`, `camera_test`, etc.
- Specify in `inference` the path to the folder where you saved the results of the test you want to evaluate, e.g., `./generate_test/$policy/$variant/` for the generated scenes or `./default_test/$policy/$variant/` for the default scenes.
- Put your Gemini API key in the `--key` argument in the bash script.
- Put the output folder path in the `--dir` argument in the bash script.
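The multithreaded, per-trajectory scoring pattern described for this evaluation can be sketched with a thread pool. This is not the repository's implementation: `score_trajectories` is a hypothetical helper, and `score_fn` is a placeholder for whatever function sends a video and simulation state to Gemini and returns a score. No Gemini API call is shown here.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def score_trajectories(video_paths, score_fn, out_dir, max_workers=4):
    """Score trajectories in parallel and save one JSON file per trajectory.

    `score_fn` is a placeholder (assumption): any callable that takes a video
    path and returns a JSON-serializable score.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # pool.map preserves input order, so scores line up with video_paths.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, video_paths))

    results = {}
    for path, score in zip(video_paths, scores):
        name = Path(path).stem
        results[name] = score
        record = {"trajectory": str(path), "score": score}
        (out_dir / f"{name}.json").write_text(json.dumps(record))
    return results
```

Threads suit this workload because each call mostly waits on a network response. The sketch assumes trajectory file names are unique, since each result file is named after the video's stem.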
- Benchmark code release: Core benchmark environments, tasks, and evaluation scripts.
- Scene generation code release: Automated scene generation and dataset creation pipeline.
If you encounter bugs, have feature requests, or would like to contribute improvements:
- Issues: Please open a GitHub Issue on this repository with a clear description and, if possible, a minimal reproduction.
- Pull requests: We welcome PRs that improve documentation, add tests, or extend the benchmark. Please include a short summary and link to any related Issues.
- Questions: For general questions or collaboration inquiries, feel free to contact the maintainers by email.
Maintainers
- Yash Jangir: offjangir@gmail.com
- Yidi Zhang: zhangyidi.lily@gmail.com
- Pang Chi Lo: pcseanlo@gmail.com
This project builds upon several excellent open-source efforts in robot learning and vision-language-action models. We thank the contributors of SimplerEnv · Octo · CogACT · X-VLA · RoboVLMs · Open-PI-Zero · SpatialVLA
If you use RobotArena in your research, please cite:
@misc{jangir2025robotarenainftyscalablerobot,
title={RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation},
author={Yash Jangir and Yidi Zhang and Pang-Chi Lo and Kashu Yamazaki and Chenyu Zhang and Kuan-Hsun Tu and Tsung-Wei Ke and Lei Ke and Yonatan Bisk and Katerina Fragkiadaki},
year={2025},
eprint={2510.23571},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2510.23571},
}
This repository is released under the MIT License. See LICENSE for details.
