## Inspiration

We wanted to reimagine how beginners learn to code. Traditional tutorials are passive — you read, copy-paste, and hope it clicks. We asked: what if learning to code felt like having a patient expert sitting next to you, watching your screen, talking you through problems in real time, and editing your code live as you learn?

## What it does

CodeBuddy is an AI-powered voice coding tutor that teaches Python through real-time conversation. Students talk to an AI tutor that can see their code, hear their questions, and respond with both voice and live code edits.

Each lesson has two phases:

  • Demo phase — the AI narrates and live-types code into the editor, walking through concepts step by step
  • Practice phase — the student writes code while the AI watches, listens, and helps via voice conversation and live editor actions (inserting code, highlighting errors, showing hints)

The AI tutor combines three pieces: Gemini's Live API for natural voice conversation with barge-in support (students can interrupt mid-sentence), a sidecar model that analyzes code in real time using tool calls (highlight_error, insert_code, replace_code, show_hint, mark_topics_covered), and context compression to maintain long teaching sessions without losing track of what's been covered.
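
To make the tool calls concrete, here is a minimal sketch of how the five tools could be applied to an in-memory copy of the editor. The tool names come from the list above; the `EditorState` class and every argument name (`line`, `start_line`, `end_line`, `code`, `message`, `text`, `topics`) are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class EditorState:
    """In-memory stand-in for the editor and lesson state (hypothetical)."""
    lines: list[str] = field(default_factory=list)
    highlights: list[tuple[int, str]] = field(default_factory=list)
    hints: list[str] = field(default_factory=list)
    covered: set[str] = field(default_factory=set)


def apply_tool_call(state: EditorState, name: str, args: dict) -> None:
    """Apply one sidecar tool call. All argument names are assumptions."""
    if name == "insert_code":
        # Insert the new lines before the given 1-based line number.
        idx = args["line"] - 1
        state.lines[idx:idx] = args["code"].splitlines()
    elif name == "replace_code":
        # Replace the inclusive 1-based range start_line..end_line.
        start, end = args["start_line"] - 1, args["end_line"]
        state.lines[start:end] = args["code"].splitlines()
    elif name == "highlight_error":
        state.highlights.append((args["line"], args["message"]))
    elif name == "show_hint":
        state.hints.append(args["text"])
    elif name == "mark_topics_covered":
        state.covered.update(args["topics"])
    else:
        raise ValueError(f"unknown tool: {name}")


state = EditorState(lines=["x = 1", "print(y)"])
apply_tool_call(state, "highlight_error", {"line": 2, "message": "y is not defined"})
apply_tool_call(state, "replace_code", {"start_line": 2, "end_line": 2, "code": "print(x)"})
apply_tool_call(state, "mark_topics_covered", {"topics": ["variables"]})
```

Keeping the dispatch in one function means an unknown tool name fails loudly instead of silently dropping an edit.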

## How we built it

The backend is a FastAPI server that orchestrates two Gemini models simultaneously:

  1. Gemini Live API — handles real-time voice conversation with streaming audio in/out, server-side VAD, and proactive responses
  2. Gemini Flash (sidecar) — a text model with 5 tool declarations that analyzes the student's code after each pause in speech, then calls tools to edit the code editor, highlight errors, or show hints

The frontend is a React 19 + TypeScript app with a Monaco code editor, real-time audio capture/playback with gapless 24 kHz streaming, and a chat transcript. Communication happens over WebSocket (binary audio frames + JSON control messages) and REST endpoints for code execution and lesson management.
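
Mixing binary and JSON traffic on one socket means every incoming frame has to be classified before it is handled. A minimal sketch of that routing step, assuming binary frames carry raw audio and text frames carry a JSON object with a `type` field; the `AudioChunk` and `ControlMessage` classes and the `code_update` message type are hypothetical names, not the project's wire format.

```python
import json
from dataclasses import dataclass


@dataclass
class AudioChunk:
    pcm: bytes


@dataclass
class ControlMessage:
    type: str
    payload: dict


def decode_frame(frame):
    """Classify one WebSocket frame: bytes are audio, str is JSON control."""
    if isinstance(frame, (bytes, bytearray)):
        return AudioChunk(pcm=bytes(frame))
    data = json.loads(frame)
    if not isinstance(data, dict) or "type" not in data:
        raise ValueError("control message must be a JSON object with a 'type'")
    msg_type = data.pop("type")
    return ControlMessage(type=msg_type, payload=data)


audio = decode_frame(b"\x00\x01")
control = decode_frame('{"type": "code_update", "code": "x = 1"}')
```

Returning typed objects keeps the audio path and the control path separate downstream, so a malformed JSON message cannot be mistaken for audio.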

Firebase handles authentication (Google Sign-In) and persistence (Firestore stores lesson progress, code snapshots, transcripts, and covered topics so the AI never re-teaches concepts).

Key technical challenges we solved:

  • Barge-in without echo loops: the AI would hear its own audio output and interrupt itself. We added state tracking to distinguish student speech from echo
  • Sidecar injection timing: the text model's code updates could cut off the voice model mid-sentence, so we debounce analysis until 2 seconds after speech ends
  • Context window management: we use Gemini's sliding window compression (triggered at 25K tokens, compressing down to a target size) to keep long tutoring sessions coherent
  • Session resumption: auto-reconnect with conversation history baked into the system prompt so students don't lose context
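
The debounce step can be sketched with plain `asyncio`. This is an illustration under stated assumptions, not the project's code: the `Debouncer` class and the `analyze` callback are hypothetical names, and the demo uses a 0.05 second delay in place of the 2 seconds described above so it finishes quickly.

```python
import asyncio


class Debouncer:
    """Run `callback` once, `delay` seconds after the most recent trigger."""

    def __init__(self, delay: float, callback):
        self.delay = delay
        self.callback = callback
        self._task = None

    def trigger(self) -> None:
        # A new trigger cancels the pending run (including a callback that
        # is still in flight) and restarts the timer from zero.
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = asyncio.get_running_loop().create_task(self._run())

    async def _run(self) -> None:
        await asyncio.sleep(self.delay)
        await self.callback()


async def demo() -> list:
    calls = []

    async def analyze():
        # Stand-in for the sidecar's code analysis.
        calls.append("analyze")

    debouncer = Debouncer(0.05, analyze)  # 2.0 in the write-up; shortened here
    debouncer.trigger()                   # speech pauses
    await asyncio.sleep(0.01)
    debouncer.trigger()                   # speech resumes and pauses again
    await asyncio.sleep(0.15)
    return calls


result = asyncio.run(demo())
```

Two triggers in quick succession produce a single analysis run, which is the property that keeps the sidecar from firing while the student is still mid-thought.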

## What we learned

Real-time multimodal AI requires careful orchestration. The hardest problems weren't the AI itself; they were the coordination between two models, the audio pipeline timing, and preventing the system from talking over itself. We also learned that the Live API's proactive audio feature is powerful for tutoring: the AI can decide when to jump in with a hint versus staying quiet while the student thinks.
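
One way to keep the system from talking over itself is to gate barge-in on playback state. The sketch below is a simplified illustration, not the state tracking we shipped: the `EchoGate` class, the 0.3 second tail window, and the 0.6 loudness threshold are all assumed values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class EchoGate:
    """Decide whether detected speech should count as a student barge-in."""
    tail_seconds: float = 0.3    # assumed: echo can linger after playback ends
    loud_threshold: float = 0.6  # assumed: normalized level that beats echo
    playing: bool = False
    playback_ended_at: float = float("-inf")

    def playback_started(self) -> None:
        self.playing = True

    def playback_stopped(self, now: float) -> None:
        self.playing = False
        self.playback_ended_at = now

    def is_barge_in(self, now: float, level: float) -> bool:
        in_echo_window = (
            self.playing or (now - self.playback_ended_at) < self.tail_seconds
        )
        if not in_echo_window:
            # The tutor is silent, so any detected speech is the student.
            return True
        # While the tutor is audible, only clearly loud input counts.
        return level >= self.loud_threshold


gate = EchoGate()
gate.playback_started()
quiet_during_playback = gate.is_barge_in(now=1.0, level=0.2)
loud_during_playback = gate.is_barge_in(now=1.0, level=0.9)
```

Passing timestamps in explicitly, rather than reading the clock inside the gate, makes the timing behavior easy to test.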

## Challenges we faced

  • Voice cutting off mid-sentence from echo-triggered barge-in was our most persistent bug — it took multiple iterations to solve
  • Balancing sidecar responsiveness (analyzing code quickly) against interruption (not injecting tool results while the voice model is speaking)
  • Managing dynamic system prompts that grow with conversation history without exceeding context limits
  • TTS pre-rendering for demo phases required parallel batch fetching to avoid slow lesson starts
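
The parallel batch fetch for demo narration can be sketched with `asyncio.gather` and a concurrency cap. The `prerender_all` helper, the `max_concurrency` value, and `fake_synthesize` (a stand-in that does not call any real TTS service) are all hypothetical names used for illustration.

```python
import asyncio


async def prerender_all(lines, synthesize, max_concurrency: int = 4):
    """Synthesize every narration line concurrently, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(text):
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await synthesize(text)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(text) for text in lines))


async def fake_synthesize(text: str) -> bytes:
    # Stand-in for a real TTS request; sleeps to simulate network latency.
    await asyncio.sleep(0.01)
    return text.encode("utf-8")


clips = asyncio.run(
    prerender_all(["Hello", "Let's write a loop"], fake_synthesize)
)
```

Because the results come back in input order, the clips can be played back in lesson order without any extra bookkeeping.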

## Built With

  • cloud-firestore
  • fastapi
  • firebase-auth
  • gemini-flash
  • gemini-live-api
  • google-cloud-tts
  • monaco-editor
  • python
  • react
  • typescript
  • uvicorn
  • vite
  • websocket