Dumb-E: agentic tabletop robotics, chess is only the demo
A voice-first, multi-agent robotic system for safe, explainable tabletop work. Chess is the clean showcase of the full loop, but the stack generalizes to constrained pick-and-place, kit checks, puzzle assembly, rule-based bin sorting, and expressive gesture interaction. Everything below is implemented and wired into a single, transparent system.
What the demo proves
- It knows when to move. A stability gate monitors frame-difference history and brightness deltas, so motion plans only publish when the scene is steady and hands are out.
- Watch and respond. It reads the last human move, updates state, chooses a legal reply, validates safety, then executes with velocity and acceleration limits.
- Voice in, voice out. You speak, it parses intent, confirms what it sees, states its plan, and narrates execution with a selectable persona that can bias style within legal play.
- Live transparency. The UI streams short, structured status from Vision, Coordinator, and Motion agents, so non-experts see what the robot believes and why it acts.
Why this is more than chess
Replace chess rules with a task grammar. Keep the loop: perception to discrete state, rule validation, candidate generation, safety verification, narrated execution. Because agents expose confidence, time, and cost, the system delegates to cheaper or faster tools when quality allows, then escalates to heavier models only as needed. The same control and explanation surfaces apply to checkers, peg boards, bin rules, or puzzle assembly.
Multi-agent architecture
Coordinator Agent
- Built with Gemini ADK and A2A
- Grounds dialog into structured intents, enforces task rules, chooses actions using a thinking budget that trades latency, confidence, and cost
- Calls tools: validate_state, propose_move or propose_action, simulate_collisions, execute, summarize_for_voice
Vision Agent
- Handles homography, pose, rectification, square-crop generation, and whole-image detection
- Fuses per-square and per-detection probabilities, tracks stability and occlusion signals
- Publishes position hypotheses with calibrated likelihoods
Motion Agent
- Solves IK, plans time-parametrized trajectories, enforces soft joint limits, keep-out zones, approach heights, and end-effector presets
- Exposes guarded skills: pick(square), place(square), wave, nod_yes, nod_no, point, reset
Orchestration
- Agents communicate over WebSockets and A2A; each streams a compact state machine: idle, thinking, verifying, executing, error (status format sketched after this list)
- A capability registry describes tools with cost, expected latency, and quality bands
- A cost router selects model and tool chains per step, with fallbacks and caching
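As a rough illustration, here is what a per-agent status message and a registry entry could look like; the field names (`agent`, `state`, `detail`, `cost_usd`, `quality`) are assumptions for the sketch, not the project's actual wire format.

```python
import json
import time
from dataclasses import asdict, dataclass

STATES = ("idle", "thinking", "verifying", "executing", "error")

@dataclass
class AgentStatus:
    agent: str   # "vision" | "coordinator" | "motion"
    state: str   # one of STATES
    detail: str  # short human-readable line for the UI
    ts: float    # unix timestamp

@dataclass
class Capability:
    tool: str        # e.g. "simulate_collisions"
    cost_usd: float  # expected cost per call
    latency_ms: int  # expected latency
    quality: str     # coarse quality band: "low" | "med" | "high"

status = AgentStatus("vision", "verifying", "board stable, reading squares", time.time())
payload = json.dumps(asdict(status))  # streamed down the WebSocket to the UI
```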
Tool routing and model delegation
- Fast path: local CV and heuristics, Gemini 2.5 Flash for short intent parsing, persona text, speech scripts
- Vision-language action: Gemini Robotics ER 1.5 for scene-grounded queries and tool calls when language must point at pixels
- Deep planning or multi-step tool synthesis: Gemini 2.5 Pro
- Speech: local TTS for low latency, ElevenLabs for show quality
- External apps: Computer-Use for logging PGN, saving run artifacts, posting summaries, or retrieving an opening line without leaving the demo
- Cost control: per-tool ceilings, dynamic step-downs when confidence is already high, upgrades on disagreement (router policy sketched after this list)
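A hedged sketch of the step-down/upgrade policy described above. The model names come from the writeup; the thresholds, the `route_step` signature, and the registry shape are assumptions.

```python
from typing import NamedTuple

class Route(NamedTuple):
    model: str
    est_cost: float  # rough cost per call, checked against per-tool ceilings

REGISTRY = {
    "fast": Route("gemini-2.5-flash", 0.001),
    "vla":  Route("gemini-robotics-er-1.5", 0.01),
    "deep": Route("gemini-2.5-pro", 0.02),
}

def route_step(task: str, confidence: float, ceiling: float,
               disagreement: bool) -> Route:
    """Pick the cheapest route whose quality suffices for this step."""
    # Step down: stay on the fast path when confidence is already high.
    if confidence >= 0.9 and not disagreement:
        return REGISTRY["fast"]
    # Upgrade on disagreement or low margin, respecting the ceiling.
    wanted = "vla" if task == "scene_grounding" else "deep"
    chosen = REGISTRY[wanted]
    return chosen if chosen.est_cost <= ceiling else REGISTRY["fast"]
```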
Perception and training
We use two complementary paths and a rules-aware selector.
Per-square classifier path
- Synthetic pretraining on rendered boards across many palettes and piece sets
- Compact CNN similar to AlexNet for 13 classes: empty plus the six white and six black piece types
- Top-view adaptation using our camera, single corner pick to warp, crop 64 tiles, heavy augmentation
- Inception-family fine-tune with frozen early blocks and heads retrained for 13 classes, tracked via MLflow (see the sketch below)
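A minimal sketch of the fine-tune recipe, using torchvision's `inception_v3` as a stand-in for whichever Inception variant the project uses; which blocks stay frozen and the learning rate are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 13  # empty + 6 white + 6 black piece types

model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)

# Freeze the whole backbone, then unfreeze only the late mixed blocks.
for p in model.parameters():
    p.requires_grad = False
for block in (model.Mixed_7a, model.Mixed_7b, model.Mixed_7c):
    for p in block.parameters():
        p.requires_grad = True

# Replace the main and auxiliary heads for the 13-class problem.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```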
Whole-image detection path
- YOLOv8 trained with rotation and scale jitter on annotated top-view frames
- Detections mapped to rectified space, with centers assigned to squares by linear assignment (sketched after this list)
- Confidences converted to per-square distributions and fused with the classifier stream when both are present
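A sketch of the square-assignment step with SciPy's `linear_sum_assignment`; the tile size and the `assign_to_squares` helper are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

TILE = 64  # rectified tile size in pixels (assumed)

def assign_to_squares(centers: np.ndarray) -> dict[int, tuple[int, int]]:
    """Map detection centers (N, 2, rectified pixels) to (file, rank) squares."""
    files, ranks = np.meshgrid(np.arange(8), np.arange(8))
    square_centers = np.stack(
        [(files.ravel() + 0.5) * TILE, (ranks.ravel() + 0.5) * TILE], axis=1
    )  # (64, 2), index = rank * 8 + file
    # Cost matrix: distance from every detection to every square center.
    cost = np.linalg.norm(centers[:, None, :] - square_centers[None, :, :], axis=2)
    det_idx, sq_idx = linear_sum_assignment(cost)  # one square per detection
    return {int(d): (int(s % 8), int(s // 8)) for d, s in zip(det_idx, sq_idx)}
```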
Stability and occlusion gating
- Rolling grayscale frame-diff ratios prevent inference while hands move (gate sketched after this list)
- Brightness deltas against last stable frame suppress occluded reads
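A minimal sketch of such a gate in OpenCV; the window length, pixel-change threshold, and brightness threshold are illustrative assumptions.

```python
from collections import deque

import cv2
import numpy as np

DIFF_THRESH = 0.02    # max fraction of changed pixels for a "steady" frame
BRIGHT_THRESH = 8.0   # max mean-brightness delta vs. the last stable frame

class StabilityGate:
    def __init__(self, window: int = 5):
        self.history = deque(maxlen=window)  # rolling changed-pixel ratios
        self.prev = None
        self.last_stable = None

    def update(self, frame_bgr: np.ndarray) -> bool:
        """Return True only when inference is allowed on this frame."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        if self.prev is not None:
            diff = cv2.absdiff(gray, self.prev)
            self.history.append(float(np.mean(diff > 25)))
        self.prev = gray

        steady = (len(self.history) == self.history.maxlen
                  and max(self.history) < DIFF_THRESH)
        if not steady:
            return False
        # Occlusion check: a hand over the board shifts mean brightness.
        if (self.last_stable is not None
                and abs(float(gray.mean()) - float(self.last_stable.mean())) > BRIGHT_THRESH):
            return False
        self.last_stable = gray
        return True
```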
Rules-aware selector
From per-square probabilities, evaluate every legal successor board plus a no-move hypothesis, and score each candidate with

\[ \mathcal{L} = \sum_{i=1}^{64} -\log\left(p_{i,c_i} + \varepsilon\right), \qquad \varepsilon = 10^{-8} \]

where \(p_{i,c_i}\) is the probability the perception stack assigns to the piece class \(c_i\) that the candidate board places on square \(i\). Pick the minimum-loss candidate, run a backward check when the top score is weak, and correct minor per-square errors before they ever reach the planner, as sketched below.
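A minimal sketch of this selector with python-chess; the class ordering and the `select_position` interface are assumptions, not the project's exact code.

```python
import math

import chess
import numpy as np

EPS = 1e-8
# Assumed class order: 0 = empty, 1-6 = white P N B R Q K, 7-12 = black.
PIECE_TO_CLASS = {None: 0}
for i, sym in enumerate("PNBRQK", start=1):
    PIECE_TO_CLASS[chess.Piece.from_symbol(sym)] = i
for i, sym in enumerate("pnbrqk", start=7):
    PIECE_TO_CLASS[chess.Piece.from_symbol(sym)] = i

def board_loss(board: chess.Board, probs: np.ndarray) -> float:
    """probs: (64, 13) per-square class probabilities in a1..h8 order."""
    return -sum(
        math.log(probs[sq, PIECE_TO_CLASS[board.piece_at(sq)]] + EPS)
        for sq in chess.SQUARES
    )

def select_position(prev: chess.Board, probs: np.ndarray):
    """Return (loss, move) for the best candidate; move is None for no-move."""
    candidates = [(board_loss(prev, probs), None)]  # no-move hypothesis
    for move in prev.legal_moves:
        nxt = prev.copy(stack=False)
        nxt.push(move)
        candidates.append((board_loss(nxt, probs), move))
    return min(candidates, key=lambda c: c[0])
```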
YouTube mining for scalable supervision
- Select games with a visible digital board overlay and the physical board camera
- Extract two synchronized crops per time step, digital and real
- A small CNN reads the digital board tile by tile, producing logits that are converted to probabilities and used to derive a FEN sequence over time
- Align the real-board crop sequence to this ground truth with a robust step schedule (sketched after this list)
- Emit structured artifacts per video: gt, irl, pred, fen.csv, min_losses.csv
- A helper tool speeds up corner picking and time-segment selection, and flags overlays or arrows to avoid corrupt labels
- This pipeline bootstraps real, messy supervision without manual move labels and feeds both detector and selector tuning
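A heavily hedged sketch of the alignment step, reusing `board_loss` from the selector sketch above: advance through the ground-truth FEN sequence only when the next position fits the current frame clearly better. The margin rule stands in for the project's actual step schedule.

```python
import chess
import numpy as np

def align(frame_probs: list, gt_fens: list, margin: float = 2.0) -> list:
    """Assign each frame a ground-truth index; frame_probs holds (64, 13) arrays."""
    boards = [chess.Board(f) for f in gt_fens]
    labels, gt_i = [], 0
    for probs in frame_probs:
        cur = board_loss(boards[gt_i], probs)  # board_loss from the selector sketch
        if gt_i + 1 < len(boards):
            nxt = board_loss(boards[gt_i + 1], probs)
            if nxt + margin < cur:  # step only on a clear win, for robustness
                gt_i += 1
        labels.append(gt_i)
    return labels
```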
Real-time inference and state management
- The camera pipeline stabilizes exposure and color and rectifies the board view via homography; ArUco and AprilTag utilities are available for optional auto-cornering
- Position estimates and losses enter a Kalman-like filter at the board level to smooth transient jitter
- The Coordinator reconciles voice intent with the observed board, rejects illegal states, requests clarification when needed
- The game engine maintains FEN, PGN, clocks, and side effects like captures and promotions, then proposes candidates to Planning (see the sketch below)
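A minimal sketch of that bookkeeping with python-chess; the `apply_move` helper and its return shape are assumptions.

```python
import chess
import chess.pgn

board = chess.Board()
game = chess.pgn.Game()
node = game  # current position in the PGN tree

def apply_move(uci: str) -> dict:
    """Validate and record a move, returning state the planner needs."""
    global node
    move = chess.Move.from_uci(uci)
    if move not in board.legal_moves:
        raise ValueError(f"illegal in {board.fen()}: {uci}")
    capture = board.is_capture(move)  # side effect the gripper must handle
    board.push(move)
    node = node.add_variation(move)   # extends the PGN record
    return {"fen": board.fen(), "capture": capture,
            "promotion": move.promotion is not None}
```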
Planning and narrated execution
- Planning toolchain runs validate_state, propose_move or propose_action, simulate_collisions, execute
- Motion skills are parameterized by square coordinate, pick height, grip profile, approach and retreat vectors
- Narration composes short, factual lines, for example: "board stable, I read Nf3, replying with d5, executing"
- Personas can bias openings or style, never legality, and can choose between principled play and entertaining tactics based on audience mode (see the sketch below)
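A sketch of persona bias that cannot break legality, because it only re-weights moves python-chess has already validated; the personas and style scores are illustrative.

```python
import random

import chess

def pick_move(board: chess.Board, persona: str) -> chess.Move:
    """Sample only from legal moves; the persona can only re-weight them."""
    legal = list(board.legal_moves)

    def style(m: chess.Move) -> float:
        if persona == "showman":  # prefers captures and checks
            return 3.0 if board.is_capture(m) or board.gives_check(m) else 1.0
        return 1.0  # "principled" persona defers to engine ordering

    return random.choices(legal, weights=[style(m) for m in legal], k=1)[0]
```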
Safety envelope
- Soft joint and workspace limits, keep-out zone above pieces, velocity and acceleration caps near the board
- Continuous monitor of planner expectations versus actual end-effector velocity and pose
- Any mismatch, stale frame, or stability failure pauses the trajectory and requests a re-scan (monitor sketched after this list)
- Gesture skills share the same limits, so wave and nod remain safe
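A sketch of the mismatch monitor; the tolerance values and the `check_step` interface are assumptions.

```python
import time

VEL_TOL = 0.05       # m/s allowed deviation from the planned velocity
FRAME_MAX_AGE = 0.5  # seconds before a camera frame counts as stale

def check_step(expected_vel: float, measured_vel: float,
               last_frame_ts: float, scene_stable: bool) -> str:
    """Return 'ok' or a pause reason; any pause triggers a re-scan."""
    if abs(expected_vel - measured_vel) > VEL_TOL:
        return "pause: end-effector velocity mismatch"
    if time.time() - last_frame_ts > FRAME_MAX_AGE:
        return "pause: stale camera frame"
    if not scene_stable:
        return "pause: stability gate failed"
    return "ok"
```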
Teleop, imitation, and sim
- Leader-follower teleoperation mirrors a back-drivable leader arm onto a stronger follower, logging synchronized joint states, end-effector poses, and video
- Logged demos bootstrap imitation datasets and motion priors
- Isaac Sim and Isaac Lab scenes mirror table geometry for collision checks and quick trajectory validation
Evaluation, tracking, and debugging
- MLflow tracks runs, hyperparameters, checkpoints, confusion matrices
- The Debug Studio shows camera feeds, rectified overlays, heatmaps, FEN and PGN, latency per stage, alerts
- The Agent Graph is the default surface; a lanes view gives a timeline of thinking, verifying, executing, and error states
- A triage panel highlights frames with high minimum loss, likely occlusion, or model disagreement
Generalizing beyond chess
- Replace chess legality with a task grammar: slot occupancy, color rules, peg patterns, kit manifests
- Swap the game-engine module for a constraint engine that enumerates valid next states and candidate picks (sketched after this list)
- The same loss-over-legal-states trick applies when perception is noisy and decisions are discrete
- Voice intents can target non-chess verbs, for example sort by color, assemble pattern one, verify kit two
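A toy constraint engine for rule-based bin sorting, showing the same legal-successor enumeration that chess rules provide for free; the `State` shape and rules are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    table: frozenset  # items still on the table, e.g. ("red", 3)
    bins: tuple       # bins[i] is a frozenset of items placed in bin i

RULES = {0: "red", 1: "blue"}  # bin 0 accepts red items, bin 1 blue

def legal_next_states(s: State):
    """Enumerate valid (pick, place) successors, like legal chess moves."""
    for item in s.table:
        color, _ = item
        for b, accepted in RULES.items():
            if color == accepted:
                bins = list(s.bins)
                bins[b] = bins[b] | {item}
                yield State(s.table - {item}, tuple(bins))

start = State(frozenset({("red", 1), ("blue", 2)}),
              (frozenset(), frozenset()))
print(len(list(legal_next_states(start))))  # 2: one legal placement per item
```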
Privacy, deployment, and cost
- Default mode is local processing; only high-level state such as FEN, intents, and agent thoughts leaves the device when a cloud voice is selected
- The cost router caps spend per session, caches tool outputs, steps down to cheaper models when confidence is already high, steps up only on disagreement or low margin
- The same policy applies to speech and intent parsing, so demos remain smooth and inexpensive
Built With
- agent2agent-protocol
- apriltag
- aruco
- computer-use
- docker
- elevenlabs
- fastapi
- flask
- gemini-2.5-flash
- gemini-2.5-pro
- gemini-adk
- gemini-live
- gemini-robotics-er-1.5
- gemma
- isaac-lab
- isaac-sim
- lerobot-so-101-sdk
- mcp-opencv-server
- mlflow
- node.js
- numpy
- nvidia-jetson-v4l2-uvc
- onnx-runtime-rocm
- opencv
- python
- python-chess
- pytorch
- react
- rocm-pytorch
- scipy
- shadcn/ui
- tailwind
- ultralytics-yolov8
- vite
- websockets
