Project Story
What inspired us
The robotics industry faces a latency crisis when it tries to put general-purpose AI into homes. Processing real-time video, mapping depth, and querying an LLM to decide what to do creates massive bottlenecks: you either get a robot that is fast but dumb, or one that is smart but paralyzed by processing delays. We were inspired by biological systems to create a "reflex-cognition split": a system where the "reflexes" (vision and depth) operate instantly, while the "cognition" (LLM reasoning) runs asynchronously in the background. We wanted to build the brain for the next generation of household robots.
How we built it
We built Remop as a monorepo with two distinct, decoupled pipelines:
- The Vision Path (Reflexes): A FastAPI server connects to a Next.js frontend via WebSockets. It decodes raw WebP/JPEG bytes from the webcam and runs them through a Vision Pipeline utilizing Ultralytics YOLO (for object detection) and MiDaS (for monocular depth estimation). This gives the system instant, low-latency spatial awareness.
- The Agent Path (Cognition): In parallel, an asynchronous thread pool takes snapshots of the visual data and feeds it to Google's Gemini 2.5 Flash Lite model. Gemini acts as our "household tidying agent," analyzing the grounded objects and outputting strict JSON commands (e.g., move_forward, pick_up {"target": "cup"}).
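The reflex path can be thought of as one per-frame step: decode, detect, attach depth, serialize for the frontend. Here is a minimal, stdlib-only sketch of that step; `detect` and `depth_at` stand in for the YOLO and MiDaS calls, and the field names are illustrative rather than Remop's actual schema.

```python
import json
from typing import Callable

def reflex_step(frame_bytes: bytes,
                detect: Callable[[bytes], list[dict]],
                depth_at: Callable[[bytes, tuple], float]) -> str:
    """One iteration of the low-latency 'reflex' path: detect objects in a
    frame, attach a depth estimate at each box's center, and serialize the
    result as the JSON message pushed to the client over the WebSocket."""
    detections = []
    for det in detect(frame_bytes):
        # Sample depth at the center of the bounding box (x1, y1, x2, y2).
        cx = (det["box"][0] + det["box"][2]) / 2
        cy = (det["box"][1] + det["box"][3]) / 2
        detections.append({**det, "depth_m": depth_at(frame_bytes, (cx, cy))})
    return json.dumps({"detections": detections})
```

Injecting the two model calls as plain callables keeps this step trivially testable and keeps the hot loop free of any model-loading concerns.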
Finally, we built a Next.js client that renders the camera feed, draws the bounding boxes, and features a custom "Voice Gate" that manages Text-to-Speech so the robot only speaks when necessary (like when task states change) instead of constantly babbling.
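On the cognition side, every command the LLM emits has to be validated and grounded before execution. The real project uses Pydantic models, Gemini's structured outputs, and a grounding.py module; the stdlib-only sketch below just illustrates the idea, so the action vocabulary and field names are assumptions.

```python
import json

# Hypothetical action vocabulary; the real schema may differ.
ALLOWED_ACTIONS = {"move_forward", "turn_left", "turn_right", "pick_up", "place"}

def parse_command(raw: str, visible_objects: set[str]) -> dict:
    """Validate one JSON command emitted by the LLM before execution."""
    cmd = json.loads(raw)
    action = cmd.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    target = cmd.get("target")
    # Grounding check: only allow targets the vision pipeline actually sees.
    if target is not None and target not in visible_objects:
        raise ValueError(f"ungrounded target: {target!r}")
    return {"action": action, "target": target}

print(parse_command('{"action": "pick_up", "target": "cup"}', {"cup", "table"}))
# {'action': 'pick_up', 'target': 'cup'}
```

Rejecting any target that is not in the live detection set is what keeps the agent from reaching for hallucinated objects.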
Challenges we faced
- Concurrency and Blocking: The biggest challenge was ensuring that the slow Gemini API calls (which take seconds) didn't block the high-speed WebSocket vision loop (running at 30+ FPS). We solved this using dedicated ThreadPoolExecutors and atomic state updates (LATEST_STATE).
- Grounding LLM Hallucinations: Standard LLMs will try to interact with objects that aren't there. We had to heavily engineer the system prompt and build a Python grounding.py module to strictly filter the objects Gemini was allowed to interact with based on the live camera depth data.
- UX / Voice Gating: Early iterations of the agent talked too much. We had to build a complex state machine (voice_gate.py) that tracks "motor fingerprints" and "dwell times" to decide when the AI should interrupt with speech versus silently queue actions.
What we learned
We learned how to efficiently push binary frame formats over WebSockets, how to wrangle multimodal models into outputting strict, robot-executable JSON schemas (using Pydantic and Gemini's structured outputs), and, most importantly, how to architect embodied AI software so it behaves naturally in the physical world.
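The dwell-time idea behind voice_gate.py can be sketched as a small state machine: speak only once a state "fingerprint" has both changed and remained stable for a settling period. The threshold value and fingerprint format below are assumptions, not the project's actual tuning.

```python
class VoiceGate:
    """Minimal sketch: announce a task state only after it has settled."""

    def __init__(self, dwell_s: float = 1.5):
        self.dwell_s = dwell_s      # how long a state must hold before speaking
        self._last_fp = None        # most recently observed fingerprint
        self._last_change = 0.0     # when the fingerprint last changed
        self._spoken_fp = None      # last fingerprint actually announced

    def should_speak(self, fingerprint: str, now: float) -> bool:
        if fingerprint != self._last_fp:
            self._last_fp = fingerprint
            self._last_change = now
            return False  # state just changed; wait for it to settle
        stable = (now - self._last_change) >= self.dwell_s
        fresh = fingerprint != self._spoken_fp
        if stable and fresh:
            self._spoken_fp = fingerprint
            return True   # announce once per settled state
        return False
```

Passing the clock in as `now` (rather than calling a time function internally) makes the gate deterministic and easy to test.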
Built With
- api
- fastapi
- html5-canvas-api
- next.js
- openai
- pydantic
- python
- react
- speech
- uvicorn
- web
- websockets