Inspiration

Years ago, my English speaking held me back. I grew up in a small Chinese town where English class meant "deaf and mute" learning: reading and writing only, never speaking. When I finally took my IELTS exam in 2022 and scored 6.5, it was one of the most exciting days of my life. But after moving to Canada for graduate school in 2024, I realized my speaking hadn't improved much despite months of effort. The issue wasn't quantity; it was quality. I was practicing without meaningful feedback.

In late 2023, when OpenAI released GPTs with voice, I built an IELTS Speaking Simulator GPT and open-sourced it. It went viral on Chinese social media. 50,000+ learners used it, with a 4.4/5 rating. But I noticed a critical flaw in my own usage: I rarely reviewed the feedback. I kept making the same mistakes. That's when I knew I needed to build something better: a tool focused not just on practice, but on deliberate practice with structured feedback and progress tracking.

Standardized tests like IELTS provide a well-defined rubric, which makes them an ideal evaluation framework for voice AI. The structured scoring criteria give both the model and the user a clear, measurable benchmark. This insight became the foundation of Joe Speaking.

What it does

Joe Speaking is an end-to-end English speaking practice platform with two core modes:

1. Real-time IELTS Speaking Simulator: Users have a natural voice conversation with an AI examiner that follows the official 3-part IELTS format. The AI examiner speaks, listens, and responds in real-time. After the session, users get instant band scores extracted via structured scoring.

2. Daily Recording Practice: Users record themselves speaking on any topic (IELTS, CELPIP, or freestyle) and get comprehensive AI feedback: band scores with criterion-level checks, edited transcripts with position-marked corrections, vocabulary pattern analysis, self-correction challenges, mini quizzes generated from their own errors, model responses, and native-speaker comprehensibility ratings.

Beyond individual sessions, the platform provides:

  • Version comparison: practice the same topic multiple times and see what's improving
  • Daily and weekly AI-generated progress reviews that identify persistent patterns
  • Spaced repetition for vocabulary and error correction
  • Full cloud sync across devices with local-first architecture

All at approximately $1 per session vs. $25-60/hour for a human tutor: an order of magnitude cheaper, with fully personalized feedback.

How we built it

Joe Speaking is a full-stack Next.js 14 application with Gemini as the AI backbone across four pillars:

  1. Real-time IELTS Speaking Simulator (Gemini Live API): A bidirectional voice WebSocket connection enables natural conversation with an AI examiner. The simulator uses part-transition detection, cue card display, and Gemini's function calling to extract structured band scores deterministically.

  2. Comprehensive AI Feedback (Gemini Structured Output): Every recording gets 10+ sections of feedback returned as typed JSON via Gemini's responseSchema. A modular prompt architecture with test-specific rubrics (IELTS/CELPIP) ensures consistent, actionable feedback (a minimal sketch follows this list).

  3. Smart Transcription (Transformers.js + Cloud ASR): A dual-mode ASR system runs Whisper in the browser via Transformers.js (free, private, WebGPU/WASM) or calls cloud providers (AssemblyAI for higher accuracy). Intelligent provider selection adapts to the user's context (an in-browser sketch follows the architecture note below).

  4. AI-Powered Progress Reviews: Gemini analyzes aggregated daily/weekly practice data to generate comprehensive progress reports, identifying persistent patterns and improvement areas.
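
To make pillar 2 concrete, here is a minimal sketch of requesting typed feedback as JSON via responseSchema with the @google/genai SDK. The model name and schema fields are illustrative placeholders, not the app's actual feedback schema.

```ts
// Minimal sketch: typed IELTS feedback via responseSchema (@google/genai).
// Field names below are illustrative; the real app uses a larger, modular schema.
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function scoreTranscript(transcript: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash", // illustrative model choice
    contents: `Score this IELTS answer against the official rubric:\n${transcript}`,
    config: {
      responseMimeType: "application/json",
      responseSchema: {
        type: Type.OBJECT,
        properties: {
          fluencyCoherence: { type: Type.NUMBER },
          lexicalResource: { type: Type.NUMBER },
          grammaticalRange: { type: Type.NUMBER },
          pronunciation: { type: Type.NUMBER },
          overallBand: { type: Type.NUMBER },
          corrections: {
            type: Type.ARRAY,
            items: {
              type: Type.OBJECT,
              properties: {
                original: { type: Type.STRING },
                suggested: { type: Type.STRING },
                explanation: { type: Type.STRING },
              },
              required: ["original", "suggested"],
            },
          },
        },
        required: ["overallBand"],
      },
    },
  });

  // The schema constrains decoding, so the text parses directly into the declared shape.
  return JSON.parse(response.text ?? "{}");
}
```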

The architecture supports both BYOK (Bring Your Own Key) and a credit-based server mode, using the same prompts as a single source of truth. A combined API call strategy reduces input tokens by ~50% compared to parallel calls.
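
For the in-browser half of pillar 3, a rough Transformers.js sketch might look like the following. It assumes 16 kHz mono Float32Array audio from the recorder; the model choice, options, and helper shape are assumptions rather than the app's actual code.

```ts
// Sketch: in-browser Whisper transcription with Transformers.js.
// Assumes audio already decoded to 16 kHz mono; model and options are illustrative.
import { pipeline } from "@huggingface/transformers";

// Cache the pipeline so the model weights download only once per page load.
let transcriber: any = null;

export async function transcribeLocally(audio: Float32Array): Promise<string> {
  if (!transcriber) {
    // Prefer WebGPU when the browser exposes it; otherwise fall back to WASM.
    const device = (navigator as any).gpu ? "webgpu" : "wasm";
    transcriber = await pipeline(
      "automatic-speech-recognition",
      "Xenova/whisper-base", // illustrative model choice
      { device }
    );
  }

  const result = await transcriber(audio, {
    chunk_length_s: 30,      // Whisper operates on 30-second windows
    return_timestamps: true, // keep segment positions for position-marked corrections
  });

  return Array.isArray(result)
    ? result.map((r: { text: string }) => r.text).join(" ")
    : result.text;
}
```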

The entire product was built solo in under 80 days using Claude Code (Opus 4.5) and Google Antigravity (Gemini 3 Pro).

Challenges we ran into

Building a production-quality real-time voice application on the Gemini Live API was the biggest challenge. What was planned as a one-week integration became five weeks of deep iteration. Here are the core issues we encountered and solved:

  • WebSocket disconnects (1008/1011). Sessions disconnect mid-conversation after 8-12 minutes with error codes 1008 (Policy Violation) and 1011 (Internal Error), and error messages such as "Failed to run inference", "Thread was cancelled", "RPC::DEADLINE_EXCEEDED", and "RESOURCE_EXHAUSTED". We built auto-reconnection with exponential backoff, resumption tokens, context re-injection from saved transcripts, and a server-side scoring fallback for when the Live API fails (the reconnect loop is sketched after this list).

  • Function calling unreliability. The reportScoringResults function is called in only ~60-70% of sessions, and the Live API does not support toolConfig.functionCallingConfig with mode: ANY (unlike the standard Gemini API). We built a three-layer extraction workaround: function calling (primary), text-block parsing with markers, and regex extraction from the spoken transcript as a final fallback. This brought reliability to ~90%+ (see the extraction sketch after this list).

  • Transcription truncation during long speech. inputTranscription events stop or degrade during continuous speech longer than ~30 seconds, which is critical for IELTS Part 2 monologues (1-2 minutes): a 2-minute answer would return only ~250 characters instead of ~2,000+. After 12 rounds of debugging, the fix came from the community: disable automaticActivityDetection and flush the buffer every 15 seconds with activityEnd/activityStart signals (sketched after this list).

  • Band score extraction non-determinism. After the AI examiner says "That is the end of the Speaking test", it outputs the scoring blocks correctly only 50-70% of the time, sometimes ending the turn immediately and ignoring the scoring prompt. We confirmed this is non-deterministic model behavior, not a code bug. We documented 15 incident fixes between December 2025 and January 2026, covering race conditions, incomplete block extraction from streaming data, and context-length degradation in full-test sessions.

  • Per-session cost tracking. usageMetadata appears only intermittently in API responses rather than on every message, and Live API sessions do not appear in Google Cloud Console logs. We built a custom CostTracker that parses whatever metadata is available and otherwise estimates cost from audio token rates (32 tokens/second). All sessions run at less than $1 each (the estimation arithmetic is sketched after this list).

  • No API-side logging. Unlike the standard Gemini API, Live API sessions are invisible in Google Cloud Console logs. We cannot debug 1011 errors server-side or verify token counts. We built comprehensive client-side analytics with 11 documented failure mode categories, scoring diagnostics in Supabase, and Sentry correlation for error investigation.
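
To show the shape of the disconnect handling from the first bullet, here is a minimal reconnect-with-backoff sketch. connectLiveSession and reinjectContext are hypothetical stand-ins for the app's own Live API connect and transcript re-injection code, and the backoff values are illustrative.

```ts
// Sketch of the reconnect loop only; the two function parameters are hypothetical
// placeholders for the app's own Live API connect / context-restore logic.
async function reconnectWithBackoff(
  connectLiveSession: (resumptionHandle?: string) => Promise<unknown>,
  reinjectContext: (session: unknown, savedTranscript: string) => Promise<void>,
  savedTranscript: string,
  resumptionHandle?: string,
  maxAttempts = 5
) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // Try to resume the server-side session first; otherwise start fresh.
      const session = await connectLiveSession(resumptionHandle);
      if (!resumptionHandle) {
        // Fresh session: replay the saved transcript so the examiner keeps its place.
        await reinjectContext(session, savedTranscript);
      }
      return session;
    } catch (err) {
      // Exponential backoff with jitter: ~1 s, 2 s, 4 s, 8 s, 16 s.
      const delay = Math.min(1000 * 2 ** attempt, 16_000) + Math.random() * 250;
      console.warn(`Live session reconnect attempt ${attempt + 1} failed`, err);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("Live session could not be re-established; fall back to server-side scoring");
}
```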
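The three-layer score extraction can be sketched as a simple fallback chain. The function-call shape, text markers, and regexes below are illustrative, not the app's exact formats.

```ts
// Sketch of the three-layer band-score extraction chain (markers and regexes illustrative).
interface BandScores {
  fluency: number;
  lexical: number;
  grammar: number;
  pronunciation: number;
}

export function extractBandScores(
  functionCallArgs: Partial<BandScores> | null, // layer 1: reportScoringResults args, if called
  modelText: string,                            // layer 2: text output that may contain a marked block
  spokenTranscript: string                      // layer 3: what the examiner said out loud
): Partial<BandScores> | null {
  // Layer 1: structured function call (primary path, ~60-70% of sessions).
  if (functionCallArgs && functionCallArgs.fluency != null) return functionCallArgs;

  // Layer 2: a delimited JSON block in the text stream, e.g. [[SCORES]]{...}[[/SCORES]].
  const block = modelText.match(/\[\[SCORES\]\]([\s\S]*?)\[\[\/SCORES\]\]/);
  if (block) {
    try {
      return JSON.parse(block[1]) as Partial<BandScores>;
    } catch {
      // fall through to layer 3
    }
  }

  // Layer 3: regex over the spoken transcript, e.g. "Fluency and coherence: 6.5".
  const grab = (label: string) => {
    const m = spokenTranscript.match(new RegExp(`${label}[^0-9]{0,20}([0-9](?:\\.5)?)`, "i"));
    return m ? Number(m[1]) : undefined;
  };
  const scores = {
    fluency: grab("fluency"),
    lexical: grab("lexical"),
    grammar: grab("grammat"),
    pronunciation: grab("pronunciation"),
  };
  return Object.values(scores).some((v) => v != null) ? (scores as Partial<BandScores>) : null;
}
```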
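The transcription workaround amounts to disabling automatic VAD and signalling activity boundaries manually. A rough sketch against the @google/genai Live API follows: the disabled automaticActivityDetection, the activityStart/activityEnd signals, and the 15-second interval are the pieces described above, while the model id and helper shape are illustrative.

```ts
// Sketch of the manual activity-flush workaround for long monologues.
import { GoogleGenAI, Modality } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function startMonologueCapture(onTranscript: (text: string) => void) {
  let flushTimer: ReturnType<typeof setInterval> | undefined;

  const session = await ai.live.connect({
    model: "gemini-2.5-flash-native-audio-preview-09-2025", // illustrative model id
    config: {
      responseModalities: [Modality.AUDIO],
      inputAudioTranscription: {},
      // The key part of the fix: take voice-activity detection away from the API.
      realtimeInputConfig: { automaticActivityDetection: { disabled: true } },
    },
    callbacks: {
      onmessage: (msg) => {
        const text = msg.serverContent?.inputTranscription?.text;
        if (text) onTranscript(text);
      },
      onerror: (e) => console.error("Live API error", e),
      onclose: () => clearInterval(flushTimer),
    },
  });

  // Open an activity window, then close and reopen it every 15 s so inputTranscription
  // never has to span more than ~15 s of continuous audio.
  session.sendRealtimeInput({ activityStart: {} });
  flushTimer = setInterval(() => {
    session.sendRealtimeInput({ activityEnd: {} });
    session.sendRealtimeInput({ activityStart: {} });
  }, 15_000);

  return {
    sendAudioChunk: (base64Pcm: string) =>
      session.sendRealtimeInput({
        audio: { data: base64Pcm, mimeType: "audio/pcm;rate=16000" },
      }),
    stop: () => {
      clearInterval(flushTimer);
      session.sendRealtimeInput({ activityEnd: {} });
      session.close();
    },
  };
}
```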
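Finally, the cost-estimation fallback is simple arithmetic: when usageMetadata is missing, estimate tokens from elapsed audio at 32 tokens/second. The price constant below is a placeholder to be filled in from current Gemini pricing, not an actual rate.

```ts
// Sketch of the audio-cost fallback; 32 tokens/second comes from the bullet above,
// the price constant is a placeholder for the current Live API audio-token rate.
const AUDIO_TOKENS_PER_SECOND = 32;
const USD_PER_MILLION_AUDIO_TOKENS = 0; // TODO: fill in from current Gemini pricing

export function estimateAudioCost(audioSeconds: number, reportedTokens?: number): number {
  const tokens = reportedTokens ?? audioSeconds * AUDIO_TOKENS_PER_SECOND;
  return (tokens / 1_000_000) * USD_PER_MILLION_AUDIO_TOKENS;
}

// Example: a 14-minute full test ≈ 840 s ≈ 26,880 audio tokens per direction.
```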

We documented all these issues and shared them with the Gemini team in a detailed report: "Gemini Live API Issues: 1008/1011 Disconnects, Per-Session Cost, Function Calling, API Logs".

Accomplishments that we're proud of

  • Production-quality product built solo in 80 days. From zero to a live app at joespeaking.com with real users, a payment system, and full test infrastructure.
  • 50,000+ learners have used the original IELTS Speaking Simulator GPT (4.4/5 rating), validating the core concept before building Joe Speaking.
  • Deep Gemini integration across 4 pillars: Live API for real-time voice, structured output for typed feedback, function calling for deterministic scoring, and multi-model strategy for different use cases.
  • Natural voice conversation experience. Users can speak naturally with an AI examiner, not click through menus or read prompts. The conversation flows like a real test.
  • Full practice lifecycle. Not just a test simulator, but an end-to-end practice platform with recording, transcription, feedback, version comparison, progress tracking, and spaced repetition review.
  • Comprehensive Gemini Live API feedback to Google. We documented all issues and shared a detailed feedback report with the Gemini team, contributing to the ecosystem.
  • Registered a company: Just Joe Technologies Inc. with the tagline "Building What Was Impossible Yesterday."

What we learned

  • Standardized tests are the ideal evaluation framework for voice AI. Tests like IELTS provide a well-defined rubric with structured scoring criteria. This gives both the AI model and the user a clear, measurable benchmark. It's the perfect starting point for building reliable voice AI feedback, and a natural hook that millions of learners already need.
  • Gemini Live API is production-ready for voice applications, with caveats. Building a real-time bidirectional voice conversation required solving disconnection, transcription, and scoring extraction challenges. The API delivers a natural conversation experience, but stability for sessions over 10 minutes and function calling reliability remain areas for improvement.
  • Structured output and function calling are game-changers. Getting Gemini to return reliable, typed JSON with 10+ feedback sections and deterministic band scores was the key to building a trustworthy scoring system. However, function calling in the Live API behaves differently from the standard API and requires multi-layer fallback strategies.
  • Building robust workarounds is essential. Every major Gemini Live API limitation required a non-trivial workaround: three-layer score extraction, periodic buffer flushing for transcription, auto-reconnection with context re-injection, and client-side analytics to compensate for missing server logs.
  • Solo development with AI assistance is a new paradigm. Claude Code (Opus 4.5) and Google Antigravity (Gemini 3 Pro) made it possible for one person to build a full production app in under 80 days.

What's next for Joe Speaking

  • More stable live conversation + additional providers. We hope the Gemini Live API becomes more stable for 10+ minute conversations. We are also adding live-conversation providers beyond Gemini and OpenAI for better coverage and fallback.
  • Evaluation framework for Live API and prompt improvement. Building a systematic eval pipeline to measure conversation quality, scoring accuracy, and function calling reliability across sessions. Using eval results to iteratively improve prompts and model configuration.
  • AI agent for personalized learning analysis. An AI agent that analyzes each user's strengths and weaknesses across sessions to provide more realistic, actionable suggestions tailored to their specific improvement areas.
  • TTS and pronunciation analysis. Text-to-speech integration and phoneme-level pronunciation analysis to give users detailed feedback on how they sound, not just what they say.
  • Mobile app. Native iOS/Android app for a better on-the-go experience (currently PWA).
  • Distribution features. Tools for sharing, embedding, and distributing content to reach more learners across platforms.

Built With

  • ai-sdk/google
  • assemblyai
  • auth
  • backblaze-b2
  • chart.js
  • claude-code
  • cloudflare-workers
  • ffmpeg-wasm
  • framer-motion
  • gemini-function-calling
  • gemini-live-api
  • gemini-structured-output
  • google-antigravity
  • google-gemini
  • google/genai
  • indexeddb
  • next.js-14
  • playwright
  • posthog
  • react
  • sentry
  • stripe
  • supabase
  • tailwind-css
  • three.js
  • transformers.js
  • typescript
  • upstash-redis
  • vercel
  • vercel-ai-sdk
  • vitest
  • zod
  • zustand