ApronAI | Devpost

Inspiration

ApronAI explores how real-time multimodal AI can support physical tasks by understanding human actions through vision and audio. It began as a personal exploration of how AI is transforming hands-on work and grew into a demo of a real-time kitchen copilot that watches what you are doing, listens to you, and guides you step by step.

What it does

Streams microphone audio and camera frames to Gemini Live over WebSocket.
Returns low-latency voice responses.
Tracks recipe progress with explicit memory checkpoints.
Surfaces progress via a live AR HUD.
Supports multiple recipes loaded dynamically from knolwedge/*.json.
Provides two frontends:
/ for AR/WebXR mode.
/eval for smartphone camera/chat evaluation mode.
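
As a rough illustration of the dynamic recipe loading listed above, the sketch below scans a directory for *.json files and builds one recipe object per file. The JSON schema (a "title" string and a "steps" list) and the names Recipe and load_recipes are assumptions made for this example; the project's actual file layout may differ.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Recipe:
    recipe_id: str          # derived from the file name, e.g. "omelette"
    title: str
    steps: tuple[str, ...]


def load_recipes(directory: Path) -> dict[str, Recipe]:
    """Load every *.json file in `directory`, keyed by file stem."""
    recipes: dict[str, Recipe] = {}
    for path in sorted(directory.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Hypothetical schema: {"title": str, "steps": [str, ...]}
        recipes[path.stem] = Recipe(
            recipe_id=path.stem,
            title=data["title"],
            steps=tuple(data["steps"]),
        )
    return recipes
```

With a loader like this, adding a recipe means dropping a new file into the knowledge directory; the frontend can then ask the backend for the available recipe IDs instead of hardcoding prompts.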

How we built it

Backend: FastAPI with a WebSocket bridge (main.py).
Model integration: Gemini Live wrapper using google-genai (gemini_live.py).
Frontend: Vanilla JS media pipeline + Three.js AR UI.
Memory and progress: Recipe-aware explicit memory and shared progress store (progress_tracker.py).
Knowledge system: Recipe prompts and steps in JSON files under knolwedge/.
Reliability: Session restart/resumption flow, queue backpressure handling, mobile HTTPS support.
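
One common way to handle queue backpressure in a real-time media bridge is to bound the queue and discard the oldest frame when it fills, so the model always sees recent input. The sketch below shows that policy with the standard library's asyncio.Queue. The helper name put_latest and the drop-oldest policy are assumptions for illustration; the write-up does not say which policy ApronAI uses.

```python
import asyncio


def put_latest(queue: asyncio.Queue, item: bytes) -> bool:
    """Enqueue `item`; if the queue is full, discard the oldest entry first.

    Returns True when an older item had to be dropped.
    """
    dropped = False
    if queue.full():
        try:
            queue.get_nowait()      # make room by dropping the stalest frame
            dropped = True
        except asyncio.QueueEmpty:
            pass
    queue.put_nowait(item)
    return dropped


async def demo() -> list[bytes]:
    frames: asyncio.Queue = asyncio.Queue(maxsize=2)
    for i in range(5):
        put_latest(frames, f"frame-{i}".encode())
    remaining = []
    while not frames.empty():
        remaining.append(frames.get_nowait())
    return remaining


if __name__ == "__main__":
    print(asyncio.run(demo()))  # the two most recent frames survive
```

Dropping stale camera frames is usually acceptable because a newer frame supersedes them. Audio is less forgiving, so a real bridge might use a larger bound or a different policy for the microphone stream.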

Challenges we ran into

Session quality degraded in longer multimodal conversations when both audio and video were active.
Context continuity dropped over time, causing repeated questions.
Mobile and AR constraints (TLS trust, permissions, browser-specific behavior) complicated startup reliability.

To address these issues, we combined:
Context window compression with a sliding/context-shifting strategy for longer sessions.
Explicit memory management to preserve task state across context compression and session restarts.
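
To make the explicit memory idea concrete, here is a minimal sketch of task state that lives outside the model's context window. It records completed steps, can be serialized across a restart, and renders a short summary that could be fed into a fresh session. The class name RecipeProgress and its methods are hypothetical and are not taken from progress_tracker.py.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RecipeProgress:
    recipe_id: str
    steps: list[str]
    completed: list[int] = field(default_factory=list)  # indices of finished steps

    def checkpoint(self, step_index: int) -> None:
        """Mark a step as done; repeated checkpoints are ignored."""
        if not 0 <= step_index < len(self.steps):
            raise IndexError(f"no step {step_index}")
        if step_index not in self.completed:
            self.completed.append(step_index)

    def next_step(self) -> Optional[str]:
        """Return the first step not yet completed, or None when finished."""
        for i, step in enumerate(self.steps):
            if i not in self.completed:
                return step
        return None

    def to_prompt(self) -> str:
        """Render the state as text to re-inject after compression or restart."""
        done = [self.steps[i] for i in sorted(self.completed)]
        return "\n".join([
            f"Recipe: {self.recipe_id}",
            "Completed: " + ("; ".join(done) if done else "none"),
            "Next: " + (self.next_step() or "recipe finished"),
        ])

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "RecipeProgress":
        return cls(**json.loads(raw))
```

Because the state is kept outside the conversation, it does not matter whether older turns were compressed away or the session was reconnected: the assistant can be told where the cook is, rather than having to ask again.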

Accomplishments that we're proud of

Stable real-time audio/video + voice-response loop.
Long practical session duration through compression + session recovery.
Explicit step memory that keeps the assistant on track.
AR mode with in-scene HUD/transcript plus /eval fallback for broader device support.
Dynamic recipe selection from backend knowledge files without hardcoded frontend prompts.

What we learned

Compression extends session length, but explicit memory is critical for continuity quality.
Real-time systems need robust queueing/restart behavior, not only model prompts.
Mobile and AR browser security models must be treated as first-class engineering constraints.
Good observability (tests, logs, API checks) dramatically reduces debugging time.

What's next for ApronAI

Richer structured memory (timers, ingredient state, parallel steps).
More recipes and tool integrations (timers, substitutions, pantry-aware guidance).
Better AR-first interaction patterns for hands-busy usage.
Better AR overlay UI (e.g., using SLAM mapping information to anchor overlays onto objects).
Production hardening: auth, analytics, and multi-user deployment posture.

Updates

Homer Quan posted an update — Mar 10, 2026 02:16 PM EDT

To test AR mode, you need XR glasses or a headset that supports the WebXR standard; I am using a Meta Quest 3S. Open the first link, https://apronai-live-958177896637.us-central1.run.app, in the device's browser. Start the session first, then click "Start AR".

Homer Quan started this project — Mar 10, 2026 02:09 PM EDT

Leave feedback in the comments!