Dumb-E: agentic tabletop robotics, chess is only the demo
A voice-first, multi-agent robotic system for safe, explainable tabletop work. Chess is the clean showcase of the full loop, but the stack generalizes to constrained pick-and-place, kit checks, puzzle assembly, rule-based bin sorting, and expressive gesture interaction. Everything below is implemented and wired into a single, transparent system.
What the demo proves
- It knows when to move. A stability gate monitors frame-difference history and brightness deltas, so motion plans only publish when the scene is steady and hands are out.
- Watch and respond. It reads the last human move, updates state, chooses a legal reply, validates safety, then executes with velocity and acceleration limits.
- Voice in, voice out. You speak, it parses intent, confirms what it sees, states its plan, and narrates execution with a selectable persona that can bias style within legal play.
- Live transparency. The UI streams short, structured status from Vision, Coordinator, and Motion agents, so non-experts see what the robot believes and why it acts.
Why this is more than chess
Replace chess rules with a task grammar. Keep the loop: perception to discrete state, rule validation, candidate generation, safety verification, narrated execution. Because agents expose confidence, time, and cost, the system delegates to cheaper or faster tools when quality allows, then escalates to heavier models only as needed. The same control and explanation surfaces apply to checkers, peg boards, bin rules, or puzzle assembly.
Multi-agent architecture
Coordinator Agent
- Built with Gemini ADK and A2A
- Grounds dialog into structured intents, enforces task rules, chooses actions using a thinking budget that trades latency, confidence, and cost
- Calls tools: validate_state, propose_move or propose_action, simulate_collisions, execute, summarize_for_voice
Vision Agent
- Handles homography, pose, rectification, square-crop generation, and whole-image detection
- Fuses per-square and per-detection probabilities, tracks stability and occlusion signals
- Publishes position hypotheses with calibrated likelihoods
Motion Agent
- Solves IK, plans time-parametrized trajectories, enforces soft joint limits, keep-out zones, approach heights, and end-effector presets
- Exposes guarded skills: pick(square), place(square), wave, nod_yes, nod_no, point, reset
Orchestration
- Agents communicate over WebSockets and A2A; each streams a compact state machine: idle, thinking, verifying, executing, error (status format sketched after this list)
- A capability registry describes tools with cost, expected latency, and quality bands
- A cost router selects model and tool chains per step, with fallbacks and caching
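As a rough illustration, here is what a per-agent status message and a registry entry could look like; the field names (`agent`, `state`, `detail`, `cost_usd`, `quality`) are assumptions for the sketch, not the project's actual wire format.

```python
import json
import time
from dataclasses import asdict, dataclass

STATES = ("idle", "thinking", "verifying", "executing", "error")

@dataclass
class AgentStatus:
    agent: str   # "vision" | "coordinator" | "motion"
    state: str   # one of STATES
    detail: str  # short human-readable line for the UI
    ts: float    # unix timestamp

@dataclass
class Capability:
    tool: str        # e.g. "simulate_collisions"
    cost_usd: float  # expected cost per call
    latency_ms: int  # expected latency
    quality: str     # coarse quality band: "low" | "med" | "high"

status = AgentStatus("vision", "verifying", "board stable, reading squares", time.time())
payload = json.dumps(asdict(status))  # streamed down the WebSocket to the UI
```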
Tool routing and model delegation
- Fast path: local CV and heuristics, Gemini 2.5 Flash for short intent parsing, persona text, speech scripts
- Vision-language action: Gemini Robotics ER 1.5 for scene-grounded queries and tool calls when language must point at pixels
- Deep planning or multi-step tool synthesis: Gemini 2.5 Pro
- Speech: local TTS for low latency, ElevenLabs for show quality
- External apps: Computer-Use for logging PGN, saving run artifacts, posting summaries, or retrieving an opening line without leaving the demo
- Cost control: per-tool ceilings, dynamic step-downs when confidence is already high, upgrades on disagreement (router policy sketched after this list)
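A hedged sketch of the step-down/upgrade policy described above. The model names come from the writeup; the thresholds, the `route_step` signature, and the registry shape are assumptions.

```python
from typing import NamedTuple

class Route(NamedTuple):
    model: str
    est_cost: float  # rough cost per call, checked against per-tool ceilings

REGISTRY = {
    "fast": Route("gemini-2.5-flash", 0.001),
    "vla":  Route("gemini-robotics-er-1.5", 0.01),
    "deep": Route("gemini-2.5-pro", 0.02),
}

def route_step(task: str, confidence: float, ceiling: float,
               disagreement: bool) -> Route:
    """Pick the cheapest route whose quality suffices for this step."""
    # Step down: stay on the fast path when confidence is already high.
    if confidence >= 0.9 and not disagreement:
        return REGISTRY["fast"]
    # Upgrade on disagreement or low margin, respecting the ceiling.
    wanted = "vla" if task == "scene_grounding" else "deep"
    chosen = REGISTRY[wanted]
    return chosen if chosen.est_cost <= ceiling else REGISTRY["fast"]
```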
Perception and training
We use two complementary paths and a rules-aware selector.
Per-square classifier path
- Synthetic pretraining on rendered boards across many palettes and piece sets
- Compact CNN similar to AlexNet for 13 classes: empty plus the six white and six black piece types
- Top-view adaptation using our camera, single corner pick to warp, crop 64 tiles, heavy augmentation
- Inception-family fine-tune with frozen early blocks and heads retrained for 13 classes, tracked via MLflow (see the sketch below)
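A minimal sketch of the fine-tune recipe, using torchvision's `inception_v3` as a stand-in for whichever Inception variant the project uses; which blocks stay frozen and the learning rate are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 13  # empty + 6 white + 6 black piece types

model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)

# Freeze the whole backbone, then unfreeze only the late mixed blocks.
for p in model.parameters():
    p.requires_grad = False
for block in (model.Mixed_7a, model.Mixed_7b, model.Mixed_7c):
    for p in block.parameters():
        p.requires_grad = True

# Replace the main and auxiliary heads for the 13-class problem.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```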
Whole-image detection path
- YOLOv8 trained with rotation and scale jitter on annotated top-view frames
- Detections mapped to rectified space, with centers assigned to squares by linear assignment (sketched after this list)
- Confidences converted to per-square distributions and fused with the classifier stream when both are present
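A sketch of the square-assignment step with SciPy's `linear_sum_assignment`; the tile size and the `assign_to_squares` helper are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

TILE = 64  # rectified tile size in pixels (assumed)

def assign_to_squares(centers: np.ndarray) -> dict[int, tuple[int, int]]:
    """Map detection centers (N, 2, rectified pixels) to (file, rank) squares."""
    files, ranks = np.meshgrid(np.arange(8), np.arange(8))
    square_centers = np.stack(
        [(files.ravel() + 0.5) * TILE, (ranks.ravel() + 0.5) * TILE], axis=1
    )  # (64, 2), index = rank * 8 + file
    # Cost matrix: distance from every detection to every square center.
    cost = np.linalg.norm(centers[:, None, :] - square_centers[None, :, :], axis=2)
    det_idx, sq_idx = linear_sum_assignment(cost)  # one square per detection
    return {int(d): (int(s % 8), int(s // 8)) for d, s in zip(det_idx, sq_idx)}
```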
Stability and occlusion gating
- Rolling grayscale frame-diff ratios prevent inference while hands move (gate sketched after this list)
- Brightness deltas against last stable frame suppress occluded reads
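A minimal sketch of such a gate in OpenCV; the window length, pixel-change threshold, and brightness threshold are illustrative assumptions.

```python
from collections import deque

import cv2
import numpy as np

DIFF_THRESH = 0.02    # max fraction of changed pixels for a "steady" frame
BRIGHT_THRESH = 8.0   # max mean-brightness delta vs. the last stable frame

class StabilityGate:
    def __init__(self, window: int = 5):
        self.history = deque(maxlen=window)  # rolling changed-pixel ratios
        self.prev = None
        self.last_stable = None

    def update(self, frame_bgr: np.ndarray) -> bool:
        """Return True only when inference is allowed on this frame."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        if self.prev is not None:
            diff = cv2.absdiff(gray, self.prev)
            self.history.append(float(np.mean(diff > 25)))
        self.prev = gray

        steady = (len(self.history) == self.history.maxlen
                  and max(self.history) < DIFF_THRESH)
        if not steady:
            return False
        # Occlusion check: a hand over the board shifts mean brightness.
        if (self.last_stable is not None
                and abs(float(gray.mean()) - float(self.last_stable.mean())) > BRIGHT_THRESH):
            return False
        self.last_stable = gray
        return True
```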
Rules-aware selector
From per-square probabilities, evaluate every legal successor board plus a no-move hypothesis, and score each candidate with

\[ \mathcal{L} = \sum_{i=1}^{64} -\log\left(p_{i,c_i} + \varepsilon\right), \qquad \varepsilon = 10^{-8} \]

where \(p_{i,c_i}\) is the probability the perception stack assigns to the piece class \(c_i\) that the candidate board places on square \(i\). Pick the minimum-loss candidate, run a backward check when the top score is weak, and correct minor per-square errors before they ever reach the planner, as sketched below.
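A minimal sketch of this selector with python-chess; the class ordering and the `select_position` interface are assumptions, not the project's exact code.

```python
import math

import chess
import numpy as np

EPS = 1e-8
# Assumed class order: 0 = empty, 1-6 = white P N B R Q K, 7-12 = black.
PIECE_TO_CLASS = {None: 0}
for i, sym in enumerate("PNBRQK", start=1):
    PIECE_TO_CLASS[chess.Piece.from_symbol(sym)] = i
for i, sym in enumerate("pnbrqk", start=7):
    PIECE_TO_CLASS[chess.Piece.from_symbol(sym)] = i

def board_loss(board: chess.Board, probs: np.ndarray) -> float:
    """probs: (64, 13) per-square class probabilities in a1..h8 order."""
    return -sum(
        math.log(probs[sq, PIECE_TO_CLASS[board.piece_at(sq)]] + EPS)
        for sq in chess.SQUARES
    )

def select_position(prev: chess.Board, probs: np.ndarray):
    """Return (loss, move) for the best candidate; move is None for no-move."""
    candidates = [(board_loss(prev, probs), None)]  # no-move hypothesis
    for move in prev.legal_moves:
        nxt = prev.copy(stack=False)
        nxt.push(move)
        candidates.append((board_loss(nxt, probs), move))
    return min(candidates, key=lambda c: c[0])
```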
YouTube mining for scalable supervision
- Select games with a visible digital board overlay and the physical board camera
- Extract two synchronized crops per time step, digital and real
- A small CNN reads the digital board tile by tile, producing logits that are converted to probabilities and used to derive a FEN sequence over time
- Align the real-board crop sequence to this ground truth with a robust step schedule (sketched after this list)
- Emit structured artifacts per video: gt, irl, pred, fen.csv, min_losses.csv
- A helper tool speeds up corner picking and time-segment selection, and flags overlays or arrows to avoid corrupt labels
- This pipeline bootstraps real, messy supervision without manual move labels and feeds both detector and selector tuning
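A heavily hedged sketch of the alignment step, reusing `board_loss` from the selector sketch above: advance through the ground-truth FEN sequence only when the next position fits the current frame clearly better. The margin rule stands in for the project's actual step schedule.

```python
import chess
import numpy as np

def align(frame_probs: list, gt_fens: list, margin: float = 2.0) -> list:
    """Assign each frame a ground-truth index; frame_probs holds (64, 13) arrays."""
    boards = [chess.Board(f) for f in gt_fens]
    labels, gt_i = [], 0
    for probs in frame_probs:
        cur = board_loss(boards[gt_i], probs)  # board_loss from the selector sketch
        if gt_i + 1 < len(boards):
            nxt = board_loss(boards[gt_i + 1], probs)
            if nxt + margin < cur:  # step only on a clear win, for robustness
                gt_i += 1
        labels.append(gt_i)
    return labels
```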
Real-time inference and state management
- The camera pipeline stabilizes exposure and color and rectifies the board view via homography; ArUco and AprilTag utilities are available for optional auto-cornering
- Position estimates and losses enter a Kalman-like filter at the board level to smooth transient jitter
- The Coordinator reconciles voice intent with the observed board, rejects illegal states, requests clarification when needed
- The game engine maintains FEN, PGN, clocks, and side effects like captures and promotions, then proposes candidates to Planning (see the sketch below)
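A minimal sketch of that bookkeeping with python-chess; the `apply_move` helper and its return shape are assumptions.

```python
import chess
import chess.pgn

board = chess.Board()
game = chess.pgn.Game()
node = game  # current position in the PGN tree

def apply_move(uci: str) -> dict:
    """Validate and record a move, returning state the planner needs."""
    global node
    move = chess.Move.from_uci(uci)
    if move not in board.legal_moves:
        raise ValueError(f"illegal in {board.fen()}: {uci}")
    capture = board.is_capture(move)  # side effect the gripper must handle
    board.push(move)
    node = node.add_variation(move)   # extends the PGN record
    return {"fen": board.fen(), "capture": capture,
            "promotion": move.promotion is not None}
```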
Planning and narrated execution
- Planning toolchain runs validate_state, propose_move or propose_action, simulate_collisions, execute
- Motion skills are parameterized by square coordinate, pick height, grip profile, approach and retreat vectors
- Narration composes short, factual lines, for example: "board stable, I read Nf3, replying with d5, executing"
- Personas can bias openings or style, never legality, and can choose between principled play and entertaining tactics based on audience mode (see the sketch below)
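A sketch of persona bias that cannot break legality, because it only re-weights moves python-chess has already validated; the personas and style scores are illustrative.

```python
import random

import chess

def pick_move(board: chess.Board, persona: str) -> chess.Move:
    """Sample only from legal moves; the persona can only re-weight them."""
    legal = list(board.legal_moves)

    def style(m: chess.Move) -> float:
        if persona == "showman":  # prefers captures and checks
            return 3.0 if board.is_capture(m) or board.gives_check(m) else 1.0
        return 1.0  # "principled" persona defers to engine ordering

    return random.choices(legal, weights=[style(m) for m in legal], k=1)[0]
```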
Safety envelope
- Soft joint and workspace limits, keep-out zone above pieces, velocity and acceleration caps near the board
- Continuous monitor of planner expectations versus actual end-effector velocity and pose
- Any mismatch, stale frame, or stability failure pauses the trajectory and requests a re-scan (monitor sketched after this list)
- Gesture skills share the same limits, so wave and nod remain safe
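A sketch of the mismatch monitor; the tolerance values and the `check_step` interface are assumptions.

```python
import time

VEL_TOL = 0.05       # m/s allowed deviation from the planned velocity
FRAME_MAX_AGE = 0.5  # seconds before a camera frame counts as stale

def check_step(expected_vel: float, measured_vel: float,
               last_frame_ts: float, scene_stable: bool) -> str:
    """Return 'ok' or a pause reason; any pause triggers a re-scan."""
    if abs(expected_vel - measured_vel) > VEL_TOL:
        return "pause: end-effector velocity mismatch"
    if time.time() - last_frame_ts > FRAME_MAX_AGE:
        return "pause: stale camera frame"
    if not scene_stable:
        return "pause: stability gate failed"
    return "ok"
```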
Teleop, imitation, and sim
- Leader-follower teleoperation mirrors a back-drivable leader arm onto a stronger follower, logging synchronized joint states, end-effector poses, and video
- Logged demos bootstrap imitation datasets and motion priors
- Isaac Sim and Isaac Lab scenes mirror table geometry for collision checks and quick trajectory validation
Evaluation, tracking, and debugging
- MLflow tracks runs, hyperparameters, checkpoints, confusion matrices
- The Debug Studio shows camera feeds, rectified overlays, heatmaps, FEN and PGN, latency per stage, alerts
- The Agent Graph is the default surface; a lanes view gives a timeline of thinking, verifying, executing, and error states
- A triage panel highlights frames with high minimum loss, likely occlusion, or model disagreement
Generalizing beyond chess
- Replace chess legality with a task grammar: slot occupancy, color rules, peg patterns, kit manifests
- Swap the game-engine module for a constraint engine that enumerates valid next states and candidate picks (sketched after this list)
- The same loss-over-legal-states trick applies when perception is noisy and decisions are discrete
- Voice intents can target non-chess verbs, for example sort by color, assemble pattern one, verify kit two
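A toy constraint engine for rule-based bin sorting, showing the same legal-successor enumeration that chess rules provide for free; the `State` shape and rules are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    table: frozenset  # items still on the table, e.g. ("red", 3)
    bins: tuple       # bins[i] is a frozenset of items placed in bin i

RULES = {0: "red", 1: "blue"}  # bin 0 accepts red items, bin 1 blue

def legal_next_states(s: State):
    """Enumerate valid (pick, place) successors, like legal chess moves."""
    for item in s.table:
        color, _ = item
        for b, accepted in RULES.items():
            if color == accepted:
                bins = list(s.bins)
                bins[b] = bins[b] | {item}
                yield State(s.table - {item}, tuple(bins))

start = State(frozenset({("red", 1), ("blue", 2)}),
              (frozenset(), frozenset()))
print(len(list(legal_next_states(start))))  # 2: one legal placement per item
```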
Privacy, deployment, and cost
- Default mode is local processing; only high-level state such as FEN, intents, and agent thoughts leaves the device when a cloud voice is selected
- The cost router caps spend per session, caches tool outputs, steps down to cheaper models when confidence is already high, steps up only on disagreement or low margin
- The same policy applies to speech and intent parsing, so demos remain smooth and inexpensive
Built With
- agent2agent-protocol
- apriltag
- aruco
- computer-use
- docker
- elevenlabs
- fastapi
- flask
- gemini-2.5-flash
- gemini-2.5-pro
- gemini-adk
- gemini-live
- gemini-robotics-er-1.5
- gemma
- isaac-lab
- isaac-sim
- lerobot-so-101-sdk
- mcp-opencv-server
- mlflow
- node.js
- numpy
- nvidia-jetson-v4l2-uvc
- onnx-runtime-rocm
- opencv
- python
- python-chess
- pytorch
- react
- rocm-pytorch
- scipy
- shadcn/ui
- tailwind
- ultralytics-yolov8
- vite
- websockets
