Myogen is a full-stack system that turns a live camera stream and on-hand biosignals into reliable grasp intent for a robotic hand. Unlike existing solutions that rely on pose cycling or fixed macro grasps, Myogen makes intent explicit and auditable in language, then executes it via a lean, safety-gated controller.

A lightweight perception stage (YOLO) compresses the scene into a compact text description; a fine-tuned language model chooses a finger-curl template; a controller maps curls into calibrated servo targets and executes them on an Arduino-driven hand. The hand itself is not passive: an onboard IMU (accelerometer + gyroscope) and EMG inputs gate activation, stabilize closing, and protect against unsafe actuation. This division of responsibility makes the policy portable across cameras and hands while keeping the control loop simple, inspectable, and fast.

Inspiration

Most visuomotor pipelines entangle perception, policy, and control into a brittle monolith. We wanted a system where perception is interchangeable, policy is transparent and auditable (via text prompts and categorical outputs), and control is deterministic and safe. Rich datasets like HO3D already contain strong supervisory signals for hand–object interaction (joints, contacts, distances). We convert that supervision into purely textual preference pairs and use ORPO to train a model that prefers contact-inducing curl templates over non-contact ones. At runtime, the same discrete language serves as a “contract” between perception, policy, and control: YOLO emits text bins, the LLM responds with a constrained template, and the controller maps that template to the hand. By keeping the policy in text, we gain editability, debuggability, and broad compatibility with tooling and deployment environments.

System Architecture Overview

At runtime, the camera feed is processed by YOLO to detect the most salient object (by confidence and size). From the detection’s bounding box and category, we construct a textual prompt that encodes object identity and discretized context: approximate size, distance, lateral/vertical position, and coarse orientation magnitude and axis. This prompt is fed to the fine-tuned language model, which responds with an explicit string specifying the curl state for each finger in a fixed order. The controller parses that response, consults a per-finger calibration table to translate curl states to servo angles, and sends a compact command to the Arduino over serial or BLE. The Arduino validates safety gates using the IMU and EMG signals; if the wrist is stable and user activation is present, the curl motion executes using a smooth velocity profile. The system logs every stage—prompt, model output, command, acknowledgements, IMU/EMG status—so any failure can be traced.
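
To make the controller-side step concrete, here is a minimal Python sketch of parsing the model’s fixed-order curl string and looking up servo angles in a per-finger calibration table. The calibration values, command string format, and function names are illustrative placeholders, not the exact code.

```python
# Hypothetical per-finger calibration table: curl state -> servo angle (degrees).
# Real angle values come from the per-hand calibration pass, not from this sketch.
CALIBRATION = {
    "pinky":  {"no curl": 10, "half curl": 80, "full curl": 150},
    "ring":   {"no curl": 10, "half curl": 85, "full curl": 155},
    "middle": {"no curl": 12, "half curl": 90, "full curl": 160},
    "index":  {"no curl": 12, "half curl": 88, "full curl": 158},
    "thumb":  {"no curl": 20, "half curl": 70, "full curl": 130},
}
FINGER_ORDER = ["pinky", "ring", "middle", "index", "thumb"]  # matches the prompt format

def parse_curls(response: str) -> dict:
    """Parse 'pinky: full curl; ring: half curl; ...' into {finger: state}."""
    curls = {}
    for part in response.strip().split(";"):
        if ":" in part:
            finger, state = part.split(":", 1)
            curls[finger.strip().lower()] = state.strip().lower()
    return curls

def curls_to_command(curls: dict) -> str:
    """Translate curl states into a compact angle command for the Arduino;
    the caller writes the string over serial or BLE."""
    angles = [CALIBRATION[f][curls.get(f, "no curl")] for f in FINGER_ORDER]
    return "S " + " ".join(str(a) for a in angles) + "\n"

reply = "pinky: full curl; ring: full curl; middle: half curl; index: half curl; thumb: no curl"
print(curls_to_command(parse_curls(reply)))  # -> 'S 150 155 90 88 20'
```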

Robotics System: Hardware, Sensing, and Control

The robotic hand comprises five actuated fingers (Thumb/Index/Middle/Ring/Pinky) driven by servos, either direct-drive or tendon-driven via Bowden cables. The actuators are powered from a regulated 5 V rail, with a dedicated ground and adequate decoupling to prevent brownouts during high-current draws. The microcontroller (Arduino-compatible) handles PWM generation for the servos and reads sensors over I²C/SPI (for the IMU) and analog inputs (for EMG).

The IMU provides 6-axis sensing at 100–200 Hz. We fuse accelerometer and gyroscope with a complementary filter to estimate short-term angular rate and tilt. Before honoring any curl command, the firmware checks that the wrist’s angular velocity is below a configurable threshold (for example, <60 °/s averaged in a 50–100 ms window) and that acceleration spikes are absent, preventing actuation during violent motion. This both protects the mechanism and improves grasp consistency. The EMG signal, sampled at 500–1000 Hz and bandpass filtered, is rectified and passed through an envelope detector (e.g., 50–100 ms time constant). We apply a debounced threshold with hysteresis to detect an intent “pulse”; only when EMG is “armed” within a configurable window (e.g., 300–800 ms) does the hand accept a curl command.
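
The gating itself lives in the Arduino firmware, but the logic is simple enough to sketch. The Python snippet below illustrates the idea with assumed parameter values (the gyro limit, EMG hysteresis thresholds, and arming window are placeholders chosen from the ranges quoted above).

```python
from collections import deque

# Illustrative gate parameters; the firmware values are set during calibration.
GYRO_LIMIT_DPS = 60.0          # max mean |angular velocity| over the window, deg/s
WINDOW_MS = 80                 # averaging window for wrist stability
EMG_ON, EMG_OFF = 0.35, 0.25   # hysteresis thresholds on the normalized EMG envelope
ARM_WINDOW_MS = 500            # curl commands accepted this long after an EMG pulse

class SafetyGate:
    """Sketch of the firmware's IMU + EMG gating, updated once per sensor tick."""

    def __init__(self, imu_rate_hz: int = 200):
        self.gyro = deque(maxlen=int(imu_rate_hz * WINDOW_MS / 1000))
        self.emg_armed_until = -1.0
        self.emg_high = False

    def update_imu(self, gyro_mag_dps: float) -> None:
        self.gyro.append(gyro_mag_dps)

    def update_emg(self, envelope: float, t_ms: float) -> None:
        # Debounced threshold with hysteresis: arm on a rising edge above EMG_ON.
        if not self.emg_high and envelope > EMG_ON:
            self.emg_high = True
            self.emg_armed_until = t_ms + ARM_WINDOW_MS
        elif self.emg_high and envelope < EMG_OFF:
            self.emg_high = False

    def allow_actuation(self, t_ms: float) -> bool:
        wrist_stable = (
            len(self.gyro) == self.gyro.maxlen
            and sum(self.gyro) / len(self.gyro) < GYRO_LIMIT_DPS
        )
        return wrist_stable and t_ms <= self.emg_armed_until
```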

YOLO processes the camera stream on a laptop with 6–12 ms latency. From the top detection, we derive discrete text bins: approximate distance is estimated by the normalized bounding-box area with a camera-specific calibration (mapping area to “very close/close/arm’s-length/far/very far”); lateral and vertical bins are computed from the bounding box center relative to the image center (“left/centered/right” and “below/level/above” with a small deadband). Axis-angle orientation is emulated via simple heuristics: aspect ratio and box skew (or optional keypoints/PCA on the mask) translate to “slightly/moderately/strongly/heavily rotated” with a dominant axis label (x/y/z). These heuristics keep the prompt short, stable, and model-agnostic.
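
A minimal sketch of the binning step might look like the following; the area breakpoints and deadband here are illustrative and would be replaced by the camera-specific calibration.

```python
def bin_detection(box, img_w, img_h,
                  area_breaks=(0.02, 0.06, 0.15, 0.35), deadband=0.08):
    """Map a YOLO box (x1, y1, x2, y2) in pixels to the discrete text bins
    used in the prompt. Breakpoints and deadband are placeholder values."""
    x1, y1, x2, y2 = box
    area = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)   # normalized box area
    cx = ((x1 + x2) / 2 - img_w / 2) / img_w           # horizontal offset, -0.5..0.5
    cy = ((y1 + y2) / 2 - img_h / 2) / img_h           # vertical offset; image y grows downward

    distances = ["very far", "far", "arm's-length", "close", "very close"]
    distance = distances[sum(area > b for b in area_breaks)]  # larger box -> closer

    lateral = "centered" if abs(cx) < deadband else ("left" if cx < 0 else "right")
    vertical = "level" if abs(cy) < deadband else ("above" if cy < 0 else "below")
    return {"distance": distance, "lateral": lateral, "vertical": vertical}
```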

The prompt is rendered as a short, deterministic paragraph to minimize token count and improve reproducibility.

For example: “Scene: A single everyday object is visible. Object identity: banana. Object size: small. Object position: close, right, above relative to the camera. Object orientation: slightly rotated around the y-axis. Task: Output only the finger curls in this exact format: pinky: ; ring: ; middle: ; index: ; thumb:”
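
Rendering is just string formatting over the bins. A minimal template, with illustrative field names, could look like this:

```python
PROMPT_TEMPLATE = (
    "Scene: A single everyday object is visible. "
    "Object identity: {identity}. "
    "Object size: {size}. "
    "Object position: {distance}, {lateral}, {vertical} relative to the camera. "
    "Object orientation: {rotation} rotated around the {axis}-axis. "
    "Task: Output only the finger curls in this exact format: "
    "pinky: ; ring: ; middle: ; index: ; thumb:"
)

prompt = PROMPT_TEMPLATE.format(
    identity="banana", size="small", distance="close", lateral="right",
    vertical="above", rotation="slightly", axis="y",
)
```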

A typical end-to-end latency budget on a Runpod A100 is 20–40 ms for the decision stage (YOLO + prompt + LLM decode) plus 400–800 ms for actuation. Because the policy is text-first and the controller is deterministic, we can audit both intent and execution with human-readable logs.

Data and Learning Pipeline

To supervise the policy with strong signals, we process the HO3D v3 meta .pkl files. HO3D v3 contains object and hand-pose annotations produced with the HOnnotate pipeline. Finger curls are computed per finger from three joint angles (MCP, PIP, DIP) and discretized with simple thresholds: angles below 40° map to “no curl”, below 110° to “half curl”, and the remainder to “full curl”. Textual prompts are constructed from object identity and discretized bins for size, distance, lateral/vertical position, and rotation magnitude/axis; we avoid raw numbers to keep the prompt distribution tight at training and inference.
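
For illustration, a per-finger discretization under these thresholds could look like the sketch below; it assumes the three joint angles are reduced to a single flexion value (here, their mean) before thresholding, which may differ in detail from the dataset scripts.

```python
def curl_bin(mcp_deg: float, pip_deg: float, dip_deg: float) -> str:
    """Discretize one finger's flexion into the three curl states used in training."""
    flexion = (mcp_deg + pip_deg + dip_deg) / 3.0  # assumed aggregation: mean of the three joints
    if flexion < 40.0:
        return "no curl"
    if flexion < 110.0:
        return "half curl"
    return "full curl"
```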

Contact supervision leverages HO3D’s vertex-level signals: a frame is considered “contact” if the number of contact vertices exceeds a threshold, or if any vertices lie within the object mesh (intersection), or if the minimum vertex-to-surface distance is below 4 mm. For each sequence, we rank all contact frames by a deterministic key that prioritizes intersection first, then higher contact count, then smaller min/mean vertex distance, and finally earlier frame index for tie-breaking. The chosen positive is paired with the nearest-in-time negative frame in the same sequence that also exhibits a different finger-curl tuple; this keeps prompts identical across the pair while isolating the effect of curl choice on contact outcome. The dataset is exported as JSONL lines with fields prompt, chosen, and rejected.
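
The ranking and pairing logic can be sketched as follows; the frame field names are hypothetical stand-ins for the actual HO3D-derived records.

```python
def rank_key(frame: dict):
    """Deterministic ordering of contact frames: intersection first, then more
    contact vertices, then smaller min/mean vertex distance, then earlier index."""
    return (
        not frame["has_intersection"],     # intersecting frames sort first
        -frame["num_contact_vertices"],    # then higher contact count
        frame["min_vertex_dist_mm"],       # then smaller minimum distance
        frame["mean_vertex_dist_mm"],      # then smaller mean distance
        frame["index"],                    # ties broken by earlier frame
    )

def make_pair(frames: list) -> dict | None:
    """Build one ORPO preference pair (prompt/chosen/rejected) from one sequence."""
    contact_frames = sorted((f for f in frames if f["is_contact"]), key=rank_key)
    if not contact_frames:
        return None
    pos = contact_frames[0]
    negatives = [f for f in frames
                 if not f["is_contact"] and f["curls"] != pos["curls"]]
    if not negatives:
        return None
    neg = min(negatives, key=lambda f: abs(f["index"] - pos["index"]))  # nearest in time
    return {"prompt": pos["prompt"], "chosen": pos["curl_text"], "rejected": neg["curl_text"]}
```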

We fine-tune with ORPO using LoRA adapters. For small data regimes, we increase epochs (e.g., 50–200 for a few dozen pairs) and keep the learning rate conservative (5e-6 to 1e-5), with gradient checkpointing and short max sequence length (256–512) to minimize VRAM. For larger base models or datasets, we expand sequence length to 1024 and use cosine learning rate scheduling with a 5% warmup. The objective encourages the model to assign higher preference to the “chosen” completion over the “rejected” one under identical prompts, which empirically sharpens the model’s grasp intent selection on out-of-sample objects.
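
A representative training setup with TRL’s ORPOTrainer and a LoRA adapter might look like the sketch below; the base model name, batch size, and LoRA rank are placeholders, and exact argument names can vary slightly across TRL versions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

MODEL_NAME = "base-model-id"  # placeholder for the actual base model
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# JSONL with prompt/chosen/rejected fields, as exported by the pairing script.
dataset = load_dataset("json", data_files="pairs.jsonl", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

args = ORPOConfig(
    output_dir="orpo-out",
    learning_rate=5e-6,
    num_train_epochs=100,            # small-data regime: many passes over few pairs
    per_device_train_batch_size=2,
    gradient_checkpointing=True,
    max_length=512,
    max_prompt_length=256,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    logging_steps=1,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,      # older TRL releases call this argument tokenizer=
    peft_config=peft_config,
)
trainer.train()
```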

Deployment on Runpod A100 with uvicorn

We deploy on a Runpod A100 PyTorch CUDA pod using Transformers directly. The service is a small FastAPI app served by uvicorn. At container start, we set caches to a persistent NVMe directory (HF_HOME=/workspace/hf-cache) to avoid filling the root filesystem; we also authenticate to Hugging Face if the model is private. The app loads either a merged bf16/fp16 checkpoint or a supported base model plus a LoRA adapter. We keep decoding greedy and short (max_new_tokens ≤ 64) to bound latency, and we add a /health route that touches the model to ensure it is resident on GPU before we accept traffic. For security, we enforce a bearer token or place the service behind a private network. The service logs prompts, outputs, and decode times, and exposes the IMU/EMG/arbitration decisions from the controller for complete traceability.
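
A condensed sketch of such a service is shown below; MODEL_ID, the /grasp route name, and the token handling are illustrative rather than the exact production code. Run it with `uvicorn app:app --host 0.0.0.0 --port 8000`.

```python
import os
import time

import torch
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed environment variables; HF_HOME is set to /workspace/hf-cache at container start.
MODEL_ID = os.environ.get("MODEL_ID", "your-org/myogen-orpo")  # placeholder
API_TOKEN = os.environ.get("API_TOKEN", "change-me")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda")
model.eval()

app = FastAPI()

class Query(BaseModel):
    prompt: str

@app.get("/health")
def health():
    # Touch the model so we only report healthy once weights are resident on GPU.
    with torch.no_grad():
        ids = tokenizer("ping", return_tensors="pt").to(model.device)
        model.generate(**ids, max_new_tokens=1, do_sample=False)
    return {"status": "ok"}

@app.post("/grasp")
def grasp(q: Query, authorization: str = Header(default="")):
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="invalid token")
    t0 = time.time()
    ids = tokenizer(q.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=64, do_sample=False)  # greedy, bounded
    text = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"curls": text.strip(), "decode_ms": round((time.time() - t0) * 1000, 1)}
```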

Challenges and Lessons

On the robotics side, the interplay of IMU gating and EMG activation required careful tuning: thresholds that were too conservative made the hand feel “sticky”, while thresholds that were too permissive allowed unsafe actuation. A structured calibration process and good visibility (oscilloscope traces for EMG, logged IMU magnitudes, event timestamps) proved invaluable. With limited training sequences, the fine-tune provided only marginal gains; the system’s robustness currently leans more on deterministic pairing, tight prompts, and the controller’s IMU/EMG gating than on large-scale optimization.

What’s Next

From a robotics perspective, we plan to extend the output with approach-side tokens (left/right/front/behind) and action phases (open/close), making the policy even more informative to the controller. We will collect on-robot preference pairs conditioned on the same prompt bins (distance, lateral/vertical, orientation) to close the sim-to-real gap and continue ORPO on those distributions. We also plan to integrate tactile or contact sensing on the fingertips; this would enable closed-loop micro-adjustments, slip detection, and confidence-aware re-queries to the policy. Finally, we will package the full system—YOLO prompt builder, ORPO dataset/pairing scripts, training configs, Arduino firmware, controller daemon, and a production-ready uvicorn service—so others can clone, calibrate, and operate myogen on their own hardware with minimal friction.
