SpaceTools provides Toolshed, a distributed framework for hosting and replicating tools such as neural network models, VLMs, and VLAs in a Ray cluster, and making them available for inference or reinforcement learning workflows. SpaceTools also ships a collection of ready-to-use vision and spatial reasoning tools built on top of Toolshed.
- Prerequisites
- Environment & Installation
- Quick Start: Agent Web UI
- Launching Toolshed in Python
- Toolshed Dashboard
- 🤖 Agentic Workflow
- Direct Tool Usage in Python
- Code Execution with Tool Access
- Multinode Deployment
- Toolshed CLI
- Creating New Tools
- Examples
Key features of Toolshed:
- Ray-based Architecture: Scaling across multiple heterogeneous machines
- Load Balancing: Host multiple tool instances with automatic request routing
- Queue Management: Automatic queuing when all actors are busy
- Environment Isolation: Each tool has a dedicated Python environment with its own dependencies
- Schema Generation: Converting tool docstrings to JSON schemas for interfacing with LLM frameworks
- Code Execution Interface: A Pythonic interface to the full toolkit, allowing tools to be used in generated code
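As a rough illustration of the schema-generation idea, the toy below converts a function's signature and docstring into an OpenAI-style tool schema using Python's `inspect` module. This is illustrative only, not Toolshed's actual implementation:

```python
import inspect

def schema_from_function(fn):
    """Toy docstring-to-schema converter (illustrative sketch only)."""
    type_map = {int: "integer", float: "number", str: "string", bool: "boolean"}
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": type_map.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    doc = inspect.getdoc(fn) or ""
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            # First docstring line becomes the tool description
            "description": doc.splitlines()[0] if doc else "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

schema = schema_from_function(add)
print(schema["function"]["name"], schema["function"]["parameters"]["required"])
```

The real framework derives richer parameter descriptions from Google-style docstrings; this sketch only maps type annotations.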
Included tools (with the ability to add your own):
Core Tools:
- Code Executor: Execute Python code with access to the toolkit
Vision Tools:
- VLM: Vision-Language Model using Molmo for object detection and image Q&A (requires GPU)
- RoboRefer: Vision-Language Model using RoboRefer checkpoint for visual grounding and pointing (requires GPU)
- Depth Estimator: Monocular depth estimation and 3D point cloud generation using ml-depth-pro (requires GPU)
- SAM2 Segmentation: Instance segmentation using SAM-2 (requires GPU)
- Vision Ops: Computer vision operations for basic image and 3D vision manipulation
- Visual IO: Show images from variables to the LLM, or store images as variables for later use
3D & Spatial Tools:
- Bounding Box: Compute oriented bounding boxes for masked point clouds
- Grasp Generator: Generate grasps using GraspGen and project them into RGB images (requires GPU)
Robot Integration Tools:
- Robot: Control a robot system via HTTP API
Before installing SpaceTools, ensure you have the following:
| Requirement | Details |
|---|---|
| OS | Linux (tested on Ubuntu 20.04/22.04) |
| Python | 3.11 |
| Conda | Miniconda or Anaconda for environment management |
| CUDA | 11.8+ (required for GPU-accelerated vision tools) |
| GPU | NVIDIA GPU with >=40 GB VRAM recommended for vision tools |
| RAM | ≥32 GB recommended |
| Disk | ~30 GB free for model checkpoints (downloaded on first launch) |
💡 CPU-only tools (calculator, greeting, code executor, vision ops) work without a GPU. You can try the minimal examples without any GPU.
Toolshed uses separate conda environments for different vision tools because their dependencies are often incompatible. The key requirement is that Python and Ray versions must match exactly across all environments; numpy/PIL major versions must match only for tools that send numpy/PIL data as tool inputs or outputs. We provide scripts to easily create matching environments and install tool dependencies.
```bash
conda create -n toolshed python==3.11 -y
conda activate toolshed
cd SpaceTools
pip install -e .
```
💡 If you want to try Toolshed without installing vision tools, you can skip to Quick Start: Agent Web UI and run the minimal config. All minimal/non-vision examples should already work.
Follow these commands to install all SpaceTools tools in separate conda environments with their expected names. Some commands will ask questions about download locations.
The syntax for the commands is `source install_tools/setup_tool_env.sh conda_env_name tool_name`
```bash
# Make sure you're in the base environment
conda activate toolshed

# One of two pointing tools:
# Molmo VLM pointing tool (redundant w.r.t. RoboRefer)
source install_tools/setup_tool_env.sh tool_vlm vlm

# RoboRefer pointing tool (better, but requires cloning a fork of RoboRefer and takes longer to install)
# Will prompt for checkpoint location
source install_tools/setup_tool_env.sh tool_roborefer roborefer

# Depth Estimator
# Will prompt for checkpoint location
source install_tools/setup_tool_env.sh tool_depth depth

# SAM2 Segmentation
source install_tools/setup_tool_env.sh tool_sam2 sam2

# Bounding Box tool
source install_tools/setup_tool_env.sh tool_bbox bbox

# GraspGen
# Will prompt for GraspGen clone location and checkpoint location
source install_tools/setup_tool_env.sh tool_graspgen graspgen
```
⚠️ If you use different environment names or checkpoint locations, you will have to change them in config files and some examples as well.
Command to test that all vision tools work (takes ~10 mins on first launch while model weights are downloaded):
```bash
export OPENAI_API_KEY=...
python examples/agentic.py --provider openai --model gpt-5 --config configs/vision_full.json --image examples/media/example_image.jpg "Estimate the 3D volume of the potato"
```
To use external LLMs with Toolshed, set up the corresponding API keys:
```bash
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
```
The web interface exposes the toolshed agentic workflow, which connects tools with existing LLMs/VLMs, e.g. Anthropic, OpenAI, or your own models hosted with SGLang.
Minimal demo: launch the Web UI with only the calculator and greeting tools:
```bash
python toolshed/web_ui.py --config configs/minimal.json
```
Full vision demo: launch the Web UI with all vision tools, using RoboRefer for pointing:
```bash
python toolshed/web_ui.py --config configs/vision_full.json
```
or using Molmo for pointing:
```bash
python toolshed/web_ui.py --config configs/vision_full_molmo.json
```
⏰ These will take some time to launch, longer on the first launch while model weights are downloaded.

Running the web UI with viewers for saved conversations enabled:
```bash
bash toolshed/scripts/serve.sh
```
Running the web UI with your own fine-tuned models with SGLang:
```bash
bash toolshed/scripts/serve_sglang.sh
```
Use the `start_toolkit` function to start toolshed:
`examples/minimal_start_tools.py`:
```python
from time import sleep

from toolshed import start_toolkit

# Configure and start the toolkit
tool_configs = {
    "greeting": {"num_actors": 2, "resources": {"num_cpus": 1, "num_gpus": 0}},
    "calculator": {"num_actors": 2, "resources": {"num_cpus": 1, "num_gpus": 0}},
}
actor_handle = start_toolkit(tool_configs)

while True:
    # Sleep forever to keep actor_handle alive.
    sleep(1)
```
You can do this in a separate process (as shown above), or in the same process where you call your tools.
`tool_configs` is the only required argument; the config format follows this example:
```python
tool_configs = {
    "tool_name": {
        "num_actors": 2,    # Number of tool instances to host
        "conda_env": None,  # Conda environment (optional)
        "resources": {      # Resource requirements for each instance
            "num_cpus": 1,
            "num_gpus": 0
        },
        "args": {           # Tool-specific constructor arguments (optional)
            "param1": "value1",
            "param2": "value2"
        }
    }
}
```
The `start_toolkit` function supports the following additional arguments:
- `tool_configs` (required): Dictionary mapping tool names to their configurations
- `router_name`: Name for the router actor (default: `"toolshed_router"`)
- `namespace`: Ray namespace for cluster-wide discovery (default: `"toolshed"`)
- `ray_address`: Ray cluster address to connect to (default: `None`, uses the current Ray runtime or starts one if none is found)
- `detached`: Run in detached mode — actors persist after the script exits (default: `False`)
- `dashboard`: Launch the Toolshed dashboard (default: `False`)
- `dashboard_port`: Port for the dashboard server (default: `7001`)
- `placement_group`: Optional Ray placement group for resource management
- `placement_group_size`: Size for automatic placement group creation (int or `"auto"`). Pass `8` to ensure that all tools occupy one node with 8 GPUs (default: `"auto"`)
⚠️ Important: It is recommended to use `detached=False`, but maintain a reference to the actor handle returned by `start_toolkit`. Once the handle gets garbage collected, the entire toolkit, including all tools, will get shut down and cleaned up nicely.
The toolshed dashboard shows a timeline visualization of tool states (available, starting, running). Sound effects included, so you can listen to your tool orchestra. If you started toolshed with `dashboard=True`, it runs on port 7001 by default.
The toolshed.agent module implements the agentic workflow: sending messages to the LLM, executing tool calls, and returning results. It supports different LLMs (OpenAI, Anthropic, Bedrock, SGLang) via provider wrappers.
It is easy to make agents that solve problems using tools:
```python
from toolshed import start_toolkit, get_toolkit
from toolshed.agent import create_tool_agent

# Start toolkit and create agent
tool_configs = {"calculator": {"resources": {"num_cpus": 1, "num_gpus": 0}}}
handle = start_toolkit(tool_configs)
toolkit = get_toolkit()
agent = create_tool_agent(toolkit, provider="openai", model="gpt-4o")

# Create a session and send a message
session = agent.create_session("my_session")
session.add_system_message("You are a helpful assistant that uses tools to solve problems.")  # Optional system prompt
session.add_user_message("What is 5 factorial?", images=[])  # images optional

# Get response (handles tool calls automatically)
response = agent.get_response("my_session")
print(response["response"])
```
The above blocks until the agent provides a final answer. For real-time progress updates, use `get_response_async`, passing a step callback:
```python
async def handle_step(step):
    print(f"{step['type']}: {step.get('status', '')}")

response = await agent.get_response_async(session_id, step_callback=handle_step)
```
The agent will work until it has produced the final answer, calling the step callback on every new event.
Step types emitted during execution:
| Step Type | Description |
|---|---|
| `synthesizing` | A signal that the LLM is being called |
| `reasoning` | LLM responded with text |
| `tool_decision` | Summary of tools to be called in this step |
| `tool_executing` | Tool execution started |
| `tool_result` | Tool execution completed |
| `complete` | Final response |
Each step is a dictionary whose keys vary by step type. Common keys include `type`, `iteration`, `status`, and `timestamp`. See `toolshed/agent_step_types.py` for the complete schema of each step type.
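As an illustration of consuming these steps, the sketch below feeds hypothetical step payloads (keys assumed from the common keys above; the real schemas live in `toolshed/agent_step_types.py`) through a callback like `handle_step`:

```python
import asyncio

# Hypothetical step payloads following the step types listed above.
FAKE_STEPS = [
    {"type": "synthesizing", "iteration": 1, "status": "calling LLM"},
    {"type": "tool_decision", "iteration": 1, "status": "calculator.factorial"},
    {"type": "tool_executing", "iteration": 1, "status": "running"},
    {"type": "tool_result", "iteration": 1, "status": "done"},
    {"type": "complete", "iteration": 1, "status": "final answer ready"},
]

log = []

async def handle_step(step):
    # A real callback might update a progress bar or stream events to a UI.
    log.append(f"[{step['type']}] {step.get('status', '')}")
    print(log[-1])

async def main():
    for step in FAKE_STEPS:
        await handle_step(step)

asyncio.run(main())
```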
See examples/agentic.py for a complete example of using the agent. You can try running the examples below:
Simple text-only example with calculator and greeting tools:
```bash
# Claude via Bedrock
python examples/agentic.py --provider bedrock --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 --config configs/minimal.json "Compute 5! and repeat it to me backwards"

# GPT-5
python examples/agentic.py --provider openai --model gpt-5 --config configs/minimal.json "Compute 5! using the calculator tool and repeat it to me backwards"
```
Example with vision tools and image input:
```bash
# Claude via Bedrock
python examples/agentic.py --provider bedrock --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 --config configs/vision_full.json --image examples/media/example_image.jpg "Would the alarm clock fit on the shelf?"

# GPT-5
python examples/agentic.py --provider openai --model gpt-5 --config configs/vision_full.json --image examples/media/example_image.jpg "Would the alarm clock fit on the shelf?"
```
Initialization Options:
```python
agent = create_tool_agent(
    toolkit,
    provider="openai",                  # LLM provider: "openai", "anthropic", "bedrock", "sglang"
    model="gpt-4o",                     # Model name: "gpt-5", "gpt-5-mini", "gpt-4o", etc. (provider-specific defaults if None)
    enable_variables=True,              # Enable variable storage/resolution (default: True)
    inject_variable_instructions=True,  # Auto-append variable handling prompt to system messages (default: True)
    hide_tool_images=False,             # Hide tool output images from LLM (default: False)
    keep_dots_in_tool_names=False,      # Keep dots in tool names vs. convert to "__" (default: False)
    api_key="...",                      # Provider-specific client kwargs
    base_url="..."                      # e.g., for SGLang or custom endpoints
)
```
Session Management:
Sessions track conversation history, images, and variables:
```python
# Create session with optional initial images
session = agent.create_session("my_session", initial_images=[image1, image2])

# Add messages
session.add_system_message("You are a helpful assistant.")
session.add_user_message("What's in this image?", images=[image])

# Get response (handles tool calls automatically)
response = agent.get_response("my_session", max_iterations=5)
```
Tool Schema Injection:
Tool schemas are automatically generated from toolshed tools and injected into each agent behind the scenes regardless of the system prompt. The injection method differs by provider:
- OpenAI/Anthropic/Bedrock: Tools are passed as structured API parameters in the `tools` field. The agent calls `toolkit.export_openai_schemas()` to generate schemas, converts them to provider-specific format via `provider.format_tools()`, and includes them in the API call.
- SGLang: Tool schemas are formatted as text and injected into the system message (or prepended to the first user message).
Variables:
Tools can store outputs as variables that persist across tool calls. LLMs can reference them with `$var_name` syntax:
```python
# Tool returns: ToolResult(..., variables={"depth_map": array, "focal_length": 123.4})
# Next tool call can use: {"input": "$depth_map", "focal": "$focal_length"}
```
Variables are resolved automatically when tools are called. The variable engine is enabled by default; disable it with `enable_variables=False`.
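The resolution step can be pictured as a simple substitution pass over the call arguments. The sketch below is illustrative, not Toolshed's actual code:

```python
# Illustrative sketch of $var_name resolution (not Toolshed's implementation).
def resolve_variables(args, variables):
    resolved = {}
    for key, value in args.items():
        if isinstance(value, str) and value.startswith("$"):
            # "$depth_map" -> variables["depth_map"]
            resolved[key] = variables[value[1:]]
        else:
            resolved[key] = value
    return resolved

stored = {"depth_map": [[0.5, 0.7]], "focal_length": 123.4}
call_args = {"input": "$depth_map", "focal": "$focal_length", "scale": 2}
resolved = resolve_variables(call_args, stored)
print(resolved)
```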
System Prompts:
The `toolshed/prompts/` directory contains reusable prompts:
- `COORDINATE_CONVENTIONS_PROMPT`: Standardized coordinate system definitions for 2D and 3D coordinates
- `VARIABLE_HANDLING_PROMPT`: Instructions for using the variable system (`$var_name` syntax)
- `SYSTEM_PROMPT`: A complete system prompt template for vision tasks (in `spatial_reasoning.py`)
You can import and use these prompts in your own code.
The `VARIABLE_HANDLING_PROMPT` is auto-injected by the agent if `inject_variable_instructions=True` (the default).
Tool Filtering:
Control which of the running tools are exposed to the LLM agent:
```python
# Filter to only vision tools
agent.set_tool_filter(lambda tool: "vision" in tool["function"]["name"].lower())

# Disable filtering (show all tools)
agent.set_tool_filter(None)
```
LLM Provider Options:
Pass provider-specific parameters via `**kwargs` or `client_kwargs`. See the modules in `toolshed/integration/providers` for the available options. E.g.:
```python
# Disable thinking on Claude via the Bedrock provider
response = agent.get_response("session", thinking_mode="false")

# SGLang custom endpoint
agent = create_tool_agent(toolkit, provider="sglang", base_url="http://localhost:30000/v1")
```
You can use tools directly and build your own agents or tool pipelines. Tools can be called programmatically via the toolkit client.
E.g. `examples/minimal_call_running_tools.py`:
```python
from toolshed import get_toolkit

# Retrieve the toolkit client - a class that provides access to the tools
toolkit = get_toolkit()

# Print available tools:
print(f"Available tools: {toolkit.get_available_tools()}")

# Call a tool via the pythonic interface
result = toolkit.calculator.add(4, 5)
print(f"Tool result value: {result.value}")
print(f"Tool result text: {result.text}")
print(f"Tool result images: {result.image}")
print(f"Tool result variables: {result.variables}")

# Call a tool via the function interface
result = toolkit.call_tool("calculator", "add", 4, 5)
result = toolkit.call_tool("calculator", "add", a=4, b=5)

# Access the documentation of the running tools:
docstring = toolkit.get_documentation()

# Access the tool schemas in OpenAI format:
json_schemas = toolkit.export_openai_schemas()
```
The code executor tool gives agents the ability to run Python code with toolkit access. You can equip it by adding it to your tool config:
```json
{
    "code_executor": {
        "num_actors": 1,
        "resources": {"num_cpus": 1, "num_gpus": 0},
        "conda_env": "toolshed",
        "timeout": 600
    }
}
```
The `CodeExecutor` class provides a way to run Python code with access to the full toolkit:
```python
from toolshed import CodeExecutor

# Create a code executor
executor = CodeExecutor()

# Generate documentation to tell the LLM about available tools:
docs = executor.get_toolkit().get_documentation()

# Execute code with toolkit access
result, stdout, stderr = executor.exec("toolkit.greeting.greet('World')")
print(f"Result: {result}")

# Execute more complex code
code = """
# Use multiple tools
name = "Alice"
greeting = toolkit.greeting.greet(name)
sum_result = toolkit.calculator.add(10, 5)

# The result variable will be returned
result = {
    'greeting': greeting,
    'sum': sum_result
}
"""
result, stdout, stderr = executor.exec(code)
print(f"Complex result: {result}")
```
The code execution context includes:
- All toolkit tools, available as `toolkit.tool_name.method_name()`
- Common imports: `json`, `time`, `math`, `datetime`, `random`, `re`
- NumPy as `np` (if available)
- Standard built-ins: `print`, `len`, `range`, etc.
⚠️ Important: This code executor is not sandboxed! Be extremely careful when using it.
When starting toolshed, Ray will look for a locally running Ray cluster and connect to it. If none is found, it will start a cluster that spans one machine and uses all of its CPU and GPU resources.
If you wish to run tools across multiple nodes or computers, you can manually bring up a ray cluster using ray start commands.
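A typical manual bring-up with the standard Ray CLI might look like the following; `<head_node_ip>` is a placeholder for your head node's address, and you may need additional resource or networking flags for your setup:

```bash
# On the head node:
ray start --head --port=6379

# On each worker node:
ray start --address=<head_node_ip>:6379
```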
An example script that does this on SLURM:
```bash
sbatch --nnodes 2 -A your_ppp examples/slurm_ray_cluster_launch.sh
```
Toolshed provides two CLI commands:
- `toolshed-launch`: Launch the toolkit with tools configured via JSON
- `toolshed-export-schemas`: Export OpenAI-compatible tool schemas from a running toolkit
Launch toolshed with a JSON config file and the dashboard enabled:
```bash
# With dashboard visualization and custom port
toolshed-launch --config configs/vision_full.json --dashboard --dashboard-port 8080

# Custom namespace and router name to allow more than one instance on a single Ray cluster
toolshed-launch --config configs/minimal.json --namespace my_namespace --router-name my_router
```
Export schemas from a running toolkit for use in RL training or LLM frameworks:
```bash
# Basic export
toolshed-export-schemas -o tools.yaml

# Include code executor tool
toolshed-export-schemas -o tools.yaml --include-code-executor

# With tracing configuration for Weave
toolshed-export-schemas -o tools.yaml \
  --trace-project my_project \
  --trace-experiment exp1

# Custom router name/namespace (must match what was used in toolshed-launch)
toolshed-export-schemas -o tools.yaml \
  --router-name my_router \
  --namespace my_namespace
```
Note: If you use a custom namespace/router name with `toolshed-launch`, you must specify the same values when using `toolshed-export-schemas`, or when getting the toolkit in Python code with `get_toolkit(router_name="my_router", namespace="my_namespace")`.
If you're developing a new tool, use the blank environment script to get a clean slate with matching Python and Ray versions:
```bash
# From the base environment
conda activate toolshed

# Creates an env with Python + Ray only (no toolshed, no tool deps)
source install_tools/create_tool_env.sh my_new_tool_env

# Now install what you need
pip install -e .                # Install toolshed
pip install torch transformers  # Install your dependencies
```
To add a tool inside the toolshed package:
- Copy `toolshed/tools/tool_template.py` to a new Python file in `toolshed/tools/`, implement the `__init__`, and add any tool methods.
- Register the tool in `toolshed/tool_manifest.py`:

```python
TOOL_MANIFEST = {
    # ... existing tools ...
    "my_tool": "toolshed.tools.my_tool:MyTool",
}
```
To use a tool defined in your own project:
- Copy `toolshed/tools/tool_template.py` to a new Python file in your own project, implement the `__init__`, and add any tool methods.
- Specify `import_path` in the tool config — this tells toolshed where to find your tool class:

```python
tool_configs = {
    "my_tool": {
        "import_path": "myproject.tools.my_tool:MyTool",  # module:ClassName
        "num_actors": 1,
        "resources": {"num_gpus": 1}
    }
}
```
The `import_path` must be a valid Python import path in `module.path:ClassName` format. The module must be importable from both the base environment and the tool environment (if used).
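Resolving this format can be sketched with `importlib`; the helper below is illustrative only (toolshed's actual loader may differ), and it resolves a standard-library class in place of a real tool:

```python
import importlib

def load_tool_class(import_path):
    """Resolve a "module.path:ClassName" string to a class
    (illustrative sketch; toolshed's loader may differ)."""
    module_path, class_name = import_path.split(":")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Demonstrate with a standard-library class instead of a real tool:
cls = load_tool_class("collections:OrderedDict")
print(cls.__name__)
```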
You can then use your new tool:
```python
executor = CodeExecutor()
result, stdout, stderr = executor.exec("toolkit.my_tool.do_my_thing()")
```
- Inherit from `BaseTool`: Import from `toolshed.tools.base`
- Implement the required method: `get_name()`
- Write a class docstring: The first line becomes the tool description shown to LLMs
- Use the `@tool_method` decorator: Only decorated methods are exposed to LLMs
- Return `ToolResult`: Wrap return values in `ToolResult(value, text=..., image=..., variables=...)`
- Write Google-style docstrings: Method descriptions and parameters are automatically converted into LLM-compatible tool schemas
Tool Results
The `ToolResult` class supports multiple output channels for rich responses:

| Field | Type | Purpose |
|---|---|---|
| `value` | `T` | The actual return value used in code execution |
| `text` | `str` | Text description shown to the LLM (defaults to `str(value)`) |
| `image` | `Any \| List[Any]` | Image(s) to display inline in the LLM conversation |
| `variables` | `Dict[str, Any]` | Variables to store in the execution context for later use |
| `is_error` | `bool` | Whether this result represents an error condition (default: `False`) |
Conditional documentation syntax:
The docstrings may include conditional text blocks that are included or omitted depending on how the tool is launched:
- `[[if:text]]...[[/if:text]]` — Included when text output is enabled
- `[[if:image]]...[[/if:image]]` — Included when image output is enabled
- `[[if:vars]]...[[/if:vars]]` — Included when variable output is enabled
These blocks are processed when generating documentation, so the LLM only sees descriptions relevant to the outputs it will actually receive.
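A toy version of this conditional processing can be written with a regular expression; this is illustrative only, and toolshed's actual implementation may differ:

```python
import re

# Toy processor for [[if:channel]]...[[/if:channel]] blocks
# (illustrative; not toolshed's actual code).
def process_conditionals(doc, enabled):
    def repl(match):
        channel, body = match.group(1), match.group(2)
        # Keep the body only if that output channel is enabled
        return body if channel in enabled else ""
    return re.sub(r"\[\[if:(\w+)\]\](.*?)\[\[/if:\1\]\]", repl, doc, flags=re.DOTALL)

doc = "Returns a mask.[[if:image]] Also shows an overlay image.[[/if:image]]"
print(process_conditionals(doc, {"image"}))
print(process_conditionals(doc, set()))
```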
The `examples/` directory contains various usage examples. They have default values for their arguments, so they should work when called with no arguments. Their outputs are saved in the `outputs/` directory.
| Example | Description |
|---|---|
| Agentic Workflow | |
| `examples/agentic.py` | LLM integration with tools, step callbacks, and optional image input |
| Minimal (no vision tools) | |
| `examples/minimal_start_tools.py` | Starting the toolkit with basic tools |
| `examples/minimal_call_running_tools.py` | Calling tools from a running toolkit |
| `examples/code_executor_usage.py` | CodeExecutor usage for executing Python code that calls tools |
| `examples/distributed_usage.py` | Multiple processes simultaneously contending for limited tools |
| Multi-Tool Vision Pipelines | |
| `examples/grasp_demo.py` | Full RoboRefer → SAM → Depth → GraspGen pipeline (validates vision tool setup) |
| `examples/visual_multi_tool_demo.py` | VLM → SAM2 → Depth pipeline to answer "How far is the object?" |
| `examples/visual_reasoning_scaling_demo.py` | Multi-GPU parallel processing: detect, segment, and estimate depth for multiple objects |
| Individual Tools | |
| `examples/tool_usage_*.py` | Per-tool usage demos to verify individual tool and environment setup |
| Advanced | |
| `examples/annotation/` | Multi-agent concurrent annotation pipeline: processes multiple images in parallel using concurrent agents that share toolshed tools, finds 3D volumes of objects, and saves annotations in JSON format. Uses the roborefer, depth estimator, SAM2, and bounding box tools, and a custom save_data tool. Run with `python examples/annotation/agentic_annotation_pipeline.py --num-images 4`. Supports launching toolshed separately for faster iteration (see `launch_toolshed_with_custom_tool.py`). |
```bibtex
@misc{chen2025spacetoolstoolaugmentedspatialreasoning,
      title={SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL},
      author={Siyi Chen and Mikaela Angelina Uy and Chan Hee Song and Faisal Ladhak and Adithyavairavan Murali and Qing Qu and Stan Birchfield and Valts Blukis and Jonathan Tremblay},
      year={2025},
      eprint={2512.04069},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.04069}
}
```

