This guide walks you through setting up the OSWorld benchmark in AgentLab for GUI automation testing.
- Clone and install OSWorld repository:
  make osworld
- Complete OSWorld setup:
  - Navigate to the OSWorld/ directory
  - Follow the detailed setup instructions in the OSWorld README
  - Download required VM images and configure virtual machines
The main entry point experiments/run_osworld.py is currently configured with hardcoded parameters. To modify the execution:
- Edit the script directly to change:
  - n_jobs: Number of parallel jobs (default: 4, set to 1 for debugging)
  - use_vmware: Set to True for VMware, False for other platforms
  - relaunch: Whether to continue incomplete studies
  - agent_args: List of agents to test (OSWORLD_CLAUDE, OSWORLD_OAI)
  - test_set_name: Choose between "test_small.json" or "test_all.json"
- Environment Variables:
  - AGENTLAB_DEBUG=1: Automatically runs the debug subset (7 tasks from osworld_debug_task_ids.json)
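
As a rough illustration of how these settings relate, the sketch below groups them in a plain dataclass and applies the AGENTLAB_DEBUG override. `RunConfig`, `resolve_test_set`, and the `agent_names` field are hypothetical names, not part of AgentLab, and the defaults shown for `use_vmware` and `relaunch` are placeholders; the actual script may organize these parameters differently.

```python
import os
from dataclasses import dataclass, field


@dataclass
class RunConfig:
    """Hypothetical container mirroring the parameters listed above."""

    n_jobs: int = 4            # default stated in this guide; use 1 for debugging
    use_vmware: bool = True    # placeholder default; False selects other platforms
    relaunch: bool = False     # placeholder default; True continues an incomplete study
    agent_names: list = field(default_factory=lambda: ["OSWORLD_CLAUDE"])
    test_set_name: str = "test_small.json"


def resolve_test_set(config, environ=None):
    """Pick the task file, assuming AGENTLAB_DEBUG=1 selects the debug subset."""
    environ = os.environ if environ is None else environ
    if environ.get("AGENTLAB_DEBUG") == "1":
        return "osworld_debug_task_ids.json"
    return config.test_set_name
```

With this sketch, `resolve_test_set(RunConfig(test_set_name="test_all.json"), {})` returns the configured file, while passing `{"AGENTLAB_DEBUG": "1"}` returns the debug subset file regardless of the configured name.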
We provide different subsets of tasks:
- Debug subset: 7 tasks defined in experiments/osworld_debug_task_ids.json
- Small subset: Tasks from test_small.json
- Full subset: All tasks from test_all.json
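
To check how many tasks a subset contains before launching a run, a small helper along these lines may be useful. It is only a sketch: `load_task_ids` is a hypothetical name, and the file layout is an assumption (either a flat JSON list of task IDs or a JSON object mapping a domain name to a list of IDs), so inspect the actual files before relying on it.

```python
import json
from pathlib import Path


def load_task_ids(path):
    """Return the task IDs stored in a subset file as a flat list.

    Assumed layouts (not confirmed by this guide): a flat JSON list of IDs,
    or a JSON object mapping a domain name to a list of IDs.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Flatten {"domain": [id, ...], ...} into a single list of IDs.
        return [task_id for ids in data.values() for task_id in ids]
    return list(data)
```

For example, `len(load_task_ids("experiments/osworld_debug_task_ids.json"))` should report 7 if the debug file matches one of the assumed layouts.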
# Run with default debug subset using sequential execution in VMware VM
python experiments/run_osworld.py

To run OSWorld in parallel using Docker, ensure you have Docker installed and configured. To install it, follow the Docker setup section of the OSWorld README. Ensure that your Docker installation supports KVM, as OSWorld requires it for running VMs. We also recommend pulling the latest Docker image for OSWorld before running the benchmark:

docker pull happysixd/osworld-docker

After setting up Docker, you can change the use_vmware parameter in the script to False and run:

python experiments/run_osworld.py

You can control the number of parallel jobs by setting the n_jobs parameter in the script; the default is 4.
We recommend setting n_jobs to your_number_of_cpu_cores - 2 to leave some resources for the host system and the benchmark itself.
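
The recommendation above can be computed rather than guessed. The following is a minimal sketch; `recommended_n_jobs` is a hypothetical helper, not something AgentLab provides, and it simply clamps the result so it never drops below one job.

```python
import os


def recommended_n_jobs(total_cores=None, reserved_cores=2):
    """Return the core count minus the reserved cores, never below 1."""
    if total_cores is None:
        # os.cpu_count() can return None when the count is undetermined.
        total_cores = os.cpu_count() or 1
    return max(1, total_cores - reserved_cores)


if __name__ == "__main__":
    print(f"Suggested n_jobs: {recommended_n_jobs()}")
```

On an 8-core host this suggests 6 jobs; on a 1- or 2-core host it falls back to 1.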
- VMware path: Currently hardcoded to "OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"
- Parallel execution: Automatically switches to sequential when using VMware
- Relaunch capability: Can continue incomplete studies by loading the most recent study
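
The last two behaviors can be pictured with the sketch below. Both functions are hypothetical illustrations rather than AgentLab code. In particular, this guide does not say how the "most recent study" is identified, so picking the subdirectory with the latest modification time is an assumption.

```python
from pathlib import Path


def effective_n_jobs(n_jobs, use_vmware):
    """VMware runs are sequential, so force a single job in that case."""
    return 1 if use_vmware else max(1, n_jobs)


def most_recent_study_dir(results_root):
    """Return the most recently modified subdirectory, or None if there is none.

    Assumption: each study lives in its own directory under results_root.
    """
    root = Path(results_root)
    if not root.is_dir():
        return None
    candidates = [p for p in root.iterdir() if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

Under these assumptions, `effective_n_jobs(4, use_vmware=True)` yields 1, and `most_recent_study_dir` returns `None` when the directory is missing or contains no studies.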