Inspiration
Years ago, my English speaking held me back. I grew up in a small Chinese town where English class meant "deaf and mute" learning: reading and writing only, never speaking. When I finally took the IELTS in 2022 and scored 6.5, it was one of the most exciting days of my life. But after moving to Canada for graduate school in 2024, I realized my speaking hadn't improved much despite months of effort. The issue wasn't quantity; it was quality. I was practicing without meaningful feedback.
In late 2023, when OpenAI released GPTs with voice, I built an IELTS Speaking Simulator GPT and open-sourced it. It went viral on Chinese social media. 50,000+ learners used it, with a 4.4/5 rating. But I noticed a critical flaw in my own usage: I rarely reviewed the feedback. I kept making the same mistakes. That's when I knew I needed to build something better: a tool focused not just on practice, but on deliberate practice with structured feedback and progress tracking.
Standardized tests like IELTS provide a well-defined rubric, which makes them an ideal evaluation framework for voice AI. The structured scoring criteria give both the model and the user a clear, measurable benchmark. This insight became the foundation of Joe Speaking.
What it does
Joe Speaking is an end-to-end English speaking practice platform with two core modes:
1. Real-time IELTS Speaking Simulator: Users have a natural voice conversation with an AI examiner that follows the official 3-part IELTS format. The AI examiner speaks, listens, and responds in real-time. After the session, users get instant band scores extracted via structured scoring.
2. Daily Recording Practice: Users record themselves speaking on any topic (IELTS, CELPIP, or freestyle) and get comprehensive AI feedback: band scores with criterion-level checks, edited transcripts with position-marked corrections, vocabulary pattern analysis, self-correction challenges, mini quizzes generated from their own errors, model responses, and native-speaker comprehensibility ratings.
Beyond individual sessions, the platform provides:
- Version comparison: practice the same topic multiple times and see what's improving
- Daily and weekly AI-generated progress reviews that identify persistent patterns
- Spaced repetition for vocabulary and error correction
- Full cloud sync across devices with local-first architecture
All of this runs at roughly $1 per session, versus $25-60 per hour for a human tutor: an order of magnitude cheaper, with fully personalized feedback.
How we built it
Joe Speaking is a full-stack Next.js 14 application with Gemini as the AI backbone across four pillars:
Real-time IELTS Speaking Simulator (Gemini Live API): A bidirectional voice WebSocket connection enables natural conversation with an AI examiner. The simulator uses part-transition detection, cue card display, and Gemini's function calling to extract structured band scores deterministically.
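For illustration, the sketch below shows roughly how such a session could be wired up, assuming the @google/genai SDK's ai.live.connect entry point. The model id, system instruction, and the exact shape of the reportScoringResults declaration are placeholders rather than the production configuration.
```typescript
import { GoogleGenAI, Modality, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Illustrative tool declaration: the examiner calls this at the end of the test
// so band scores arrive as structured arguments instead of free text.
const reportScoringResults = {
  name: "reportScoringResults",
  description: "Report IELTS band scores at the end of the speaking test.",
  parameters: {
    type: Type.OBJECT,
    properties: {
      fluency: { type: Type.NUMBER },
      lexicalResource: { type: Type.NUMBER },
      grammaticalRange: { type: Type.NUMBER },
      pronunciation: { type: Type.NUMBER },
      overall: { type: Type.NUMBER },
    },
    required: ["fluency", "lexicalResource", "grammaticalRange", "pronunciation", "overall"],
  },
};

// Placeholder persistence hook; the real app writes scores to app state / Supabase.
function saveScores(args: Record<string, unknown> | undefined) {
  console.log("band scores", args);
}

// Open the bidirectional voice session with the AI examiner.
const session = await ai.live.connect({
  model: "gemini-2.0-flash-live-001", // placeholder model id
  config: {
    responseModalities: [Modality.AUDIO],
    systemInstruction: "You are an IELTS speaking examiner. Follow the 3-part format.",
    tools: [{ functionDeclarations: [reportScoringResults] }],
  },
  callbacks: {
    onmessage: (msg) => {
      // Only the scoring path is shown; audio playback and transcription
      // events are also handled here in the real app.
      for (const call of msg.toolCall?.functionCalls ?? []) {
        if (call.name === "reportScoringResults") saveScores(call.args);
      }
    },
    onerror: (e) => console.error("Live session error", e),
    onclose: (e) => console.warn("Live session closed", e?.reason),
  },
});
```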
Comprehensive AI Feedback (Gemini Structured Output): Every recording gets 10+ sections of feedback returned as typed JSON via Gemini's responseSchema. A modular prompt architecture with test-specific rubrics (IELTS/CELPIP) ensures consistent, actionable feedback.
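As a rough sketch of this structured output path (assuming the @google/genai SDK's generateContent with responseSchema; the schema below covers only two of the 10+ sections and its field names are illustrative):
```typescript
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Trimmed-down schema: the real feedback has 10+ sections; two are shown here.
const feedbackSchema = {
  type: Type.OBJECT,
  properties: {
    bandScores: {
      type: Type.OBJECT,
      properties: {
        fluency: { type: Type.NUMBER },
        lexicalResource: { type: Type.NUMBER },
        grammaticalRange: { type: Type.NUMBER },
        pronunciation: { type: Type.NUMBER },
      },
    },
    corrections: {
      type: Type.ARRAY,
      items: {
        type: Type.OBJECT,
        properties: {
          original: { type: Type.STRING },
          corrected: { type: Type.STRING },
          startIndex: { type: Type.INTEGER }, // position-marked corrections
        },
      },
    },
  },
};

async function getFeedback(transcript: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash", // placeholder model id
    contents: `Score this IELTS speaking transcript and list corrections:\n${transcript}`,
    config: {
      responseMimeType: "application/json",
      responseSchema: feedbackSchema,
    },
  });
  // The response text is a JSON string conforming to the schema.
  return JSON.parse(response.text ?? "{}");
}
```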
Smart Transcription (Transformers.js + Cloud ASR): A dual-mode ASR system uses in-browser Whisper via Transformers.js (free, private, WebGPU/WASM) or cloud providers (AssemblyAI for higher accuracy). Intelligent provider selection adapts based on user context.
AI-Powered Progress Reviews: Gemini analyzes aggregated daily/weekly practice data to generate comprehensive progress reports, identifying persistent patterns and improvement areas.
The architecture supports both BYOK (Bring Your Own Key) and credit-based server mode, using the same prompts as a single source of truth. A combined API call strategy reduces input tokens by ~50% compared to parallel calls.
The entire product was built solo in under 80 days using Claude Code (Opus 4.5) and Google Antigravity (Gemini 3 Pro).
Challenges we ran into
Building a production-quality real-time voice application on the Gemini Live API was the biggest challenge. What was planned as a one-week integration became five weeks of deep iteration. Here are the core issues we encountered and solved:
WebSocket disconnects (1008/1011). Sessions disconnect mid-conversation after 8-12 minutes with error codes 1008 (Policy Violation) and 1011 (Internal Error). Error messages include "Failed to run inference", "Thread was cancelled", "RPC::DEADLINE_EXCEEDED", and "RESOURCE_EXHAUSTED". We built auto-reconnection with exponential backoff, resumption tokens, context re-injection from saved transcripts, and a server-side scoring fallback when the Live API fails.
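A simplified sketch of the recovery loop is shown below; the connect callback and reinjectContext helper are illustrative abstractions, and the real implementation also handles resumption tokens, mid-session disconnect events, and the server-side scoring fallback.
```typescript
// Illustrative types: the real session object wraps the Gemini Live connection.
type LiveSession = { close(): void };
type ConnectFn = (opts: { resumptionToken?: string }) => Promise<LiveSession>;

async function connectWithRecovery(
  connect: ConnectFn,
  savedTranscript: string[],
  maxAttempts = 5,
): Promise<LiveSession> {
  let attempt = 0;
  // In the real app this is updated from session-resumption messages.
  let resumptionToken: string | undefined;

  while (true) {
    try {
      const session = await connect({ resumptionToken });
      // On reconnect, re-inject the conversation so far so the examiner
      // continues from the right part of the test instead of restarting.
      if (attempt > 0) await reinjectContext(session, savedTranscript);
      return session;
    } catch (err) {
      attempt += 1;
      if (attempt >= maxAttempts) throw err; // give up -> server-side scoring fallback
      // Exponential backoff with jitter: ~1s, 2s, 4s, 8s ... capped at 30s.
      const delayMs = Math.min(1000 * 2 ** (attempt - 1), 30_000) * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Placeholder: sends the saved transcript back as context after a reconnect.
async function reinjectContext(session: LiveSession, transcript: string[]) {
  /* app-specific: replay transcript turns into the new session */
}
```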
Function calling unreliability. The reportScoringResults function is called in only ~60-70% of sessions. The Live API does not support toolConfig.functionCallingConfig with mode: ANY (unlike the standard Gemini API). We built a three-layer extraction workaround: function calling (primary), text block parsing with markers, and regex extraction from the spoken transcript as a final fallback. This brought reliability to ~90%+.
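A condensed sketch of that three-layer chain follows; the marker format and regexes are illustrative rather than the exact production ones.
```typescript
interface BandScores {
  fluency: number;
  lexical: number;
  grammar: number;
  pronunciation: number;
}

// Layer 1: structured arguments from the reportScoringResults tool call.
function fromFunctionCall(args: Record<string, unknown> | undefined): BandScores | null {
  if (!args) return null;
  const { fluency, lexical, grammar, pronunciation } = args as Partial<BandScores>;
  return fluency != null && lexical != null && grammar != null && pronunciation != null
    ? { fluency, lexical, grammar, pronunciation }
    : null;
}

// Layer 2: a marked text block the examiner is prompted to emit, e.g.
// [SCORES]{"fluency":6.5,"lexical":6,"grammar":6,"pronunciation":7}[/SCORES]
function fromMarkedBlock(text: string): BandScores | null {
  const match = text.match(/\[SCORES\]([\s\S]*?)\[\/SCORES\]/);
  if (!match) return null;
  try {
    return JSON.parse(match[1]) as BandScores;
  } catch {
    return null;
  }
}

// Layer 3: last resort, pull numbers out of the spoken transcript.
function fromSpokenTranscript(transcript: string): BandScores | null {
  const grab = (pattern: RegExp) => {
    const m = transcript.match(pattern);
    return m ? parseFloat(m[1]) : null;
  };
  const fluency = grab(/fluency[^\d]*(\d(?:\.\d)?)/i);
  const lexical = grab(/lexical[^\d]*(\d(?:\.\d)?)/i);
  const grammar = grab(/gramm[^\d]*(\d(?:\.\d)?)/i);
  const pronunciation = grab(/pronunciation[^\d]*(\d(?:\.\d)?)/i);
  return fluency != null && lexical != null && grammar != null && pronunciation != null
    ? { fluency, lexical, grammar, pronunciation }
    : null;
}

export function extractScores(
  toolArgs: Record<string, unknown> | undefined,
  modelText: string,
  transcript: string,
): BandScores | null {
  return fromFunctionCall(toolArgs) ?? fromMarkedBlock(modelText) ?? fromSpokenTranscript(transcript);
}
```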
Transcription truncation during long speech. inputTranscription events stop or degrade during continuous speech longer than ~30 seconds, which is critical for IELTS Part 2 monologues (1-2 minutes). A 2-minute speech would return only ~250 characters instead of ~2,000+. After 12 rounds of debugging, the solution came from the community: disable automaticActivityDetection and flush the buffer every 15 seconds with activityEnd/activityStart signals.
Band score extraction non-determinism. After the AI examiner says "That is the end of the Speaking test", it outputs scoring blocks correctly only 50-70% of the time, sometimes ending the turn immediately and ignoring the scoring prompt. This is confirmed non-deterministic model behavior, not a code bug. We documented 15 incident fixes between December 2025 and January 2026, including race conditions, incomplete block extraction from streaming data, and context-length degradation in full-test sessions.
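For the transcription truncation issue above, here is a hedged sketch of the manual activity-signal workaround, assuming the @google/genai Live API's realtimeInputConfig and sendRealtimeInput fields; the 15-second interval comes from our fix, while the model id and audio format are placeholders.
```typescript
import { GoogleGenAI, Modality } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const session = await ai.live.connect({
  model: "gemini-2.0-flash-live-001", // placeholder model id
  config: {
    responseModalities: [Modality.AUDIO],
    // Turn off server-side VAD so the client controls activity boundaries.
    realtimeInputConfig: { automaticActivityDetection: { disabled: true } },
  },
  callbacks: { onmessage: () => {} },
});

// Signal the start of the candidate's speech once recording begins.
session.sendRealtimeInput({ activityStart: {} });

// Every 15 seconds, close and immediately reopen the activity window so the
// input transcription buffer is flushed instead of silently degrading on
// long Part 2 monologues.
const flushTimer = setInterval(() => {
  session.sendRealtimeInput({ activityEnd: {} });
  session.sendRealtimeInput({ activityStart: {} });
}, 15_000);

// Audio chunks keep streaming as usual between flushes.
function sendAudioChunk(base64Pcm: string) {
  session.sendRealtimeInput({
    audio: { data: base64Pcm, mimeType: "audio/pcm;rate=16000" },
  });
}

// When the monologue ends, stop flushing and close the activity window.
function stopSpeaking() {
  clearInterval(flushTimer);
  session.sendRealtimeInput({ activityEnd: {} });
}
```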
Per-session cost tracking. usageMetadata appears intermittently in API responses, not with every message. Live API sessions do not appear in Google Cloud Console logs. We built a custom CostTracker that parses available metadata and estimates costs based on audio token rates (32 tokens/second). All sessions run at less than $1 each.
No API-side logging. Unlike the standard Gemini API, Live API sessions are invisible in Google Cloud Console logs. We cannot debug 1011 errors server-side or verify token counts. We built comprehensive client-side analytics with 11 documented failure mode categories, scoring diagnostics in Supabase, and Sentry correlation for error investigation.
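To make the cost-tracking approach concrete, here is a minimal sketch of the estimator; the per-token prices are placeholders (real values belong on the current Gemini pricing page), and the usageMetadata field names are illustrative.
```typescript
// Illustrative per-million-token prices; look up real values before relying on them.
const INPUT_AUDIO_USD_PER_M_TOKENS = 3.0;
const OUTPUT_AUDIO_USD_PER_M_TOKENS = 12.0;
const AUDIO_TOKENS_PER_SECOND = 32;

class CostTracker {
  private inputTokens = 0;
  private outputTokens = 0;

  // Preferred path: usageMetadata, whenever the API happens to include it.
  addUsageMetadata(meta?: { promptTokenCount?: number; responseTokenCount?: number }) {
    if (!meta) return;
    this.inputTokens += meta.promptTokenCount ?? 0;
    this.outputTokens += meta.responseTokenCount ?? 0;
  }

  // Fallback path: estimate tokens from elapsed audio time at 32 tokens/second.
  addEstimatedAudio(userSeconds: number, modelSeconds: number) {
    this.inputTokens += Math.round(userSeconds * AUDIO_TOKENS_PER_SECOND);
    this.outputTokens += Math.round(modelSeconds * AUDIO_TOKENS_PER_SECOND);
  }

  get estimatedCostUsd(): number {
    return (
      (this.inputTokens / 1_000_000) * INPUT_AUDIO_USD_PER_M_TOKENS +
      (this.outputTokens / 1_000_000) * OUTPUT_AUDIO_USD_PER_M_TOKENS
    );
  }
}
```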
We documented all of these issues and shared them with the Gemini team in a detailed report: "Gemini Live API Issues: 1008/1011 Disconnects, Per-Session Cost, Function Calling, API Logs."
Accomplishments that we're proud of
- Production-quality product built solo in 80 days. From zero to a live app at joespeaking.com with real users, payment system, and full test infrastructure.
- 50,000+ learners have used the original IELTS Speaking Simulator GPT (4.4/5 rating), validating the core concept before building Joe Speaking.
- Deep Gemini integration across 4 pillars: Live API for real-time voice, structured output for typed feedback, function calling for deterministic scoring, and multi-model strategy for different use cases.
- Natural voice conversation experience. Users speak naturally with an AI examiner instead of clicking through menus or reading prompts. The conversation flows like a real test.
- Full practice lifecycle. Not just a test simulator, but an end-to-end practice platform with recording, transcription, feedback, version comparison, progress tracking, and spaced repetition review.
- Comprehensive Gemini Live API feedback to Google. We documented all issues and shared a detailed feedback report with the Gemini team, contributing to the ecosystem.
- Registered a company: Just Joe Technologies Inc. with the tagline "Building What Was Impossible Yesterday."
What we learned
- Standardized tests are the ideal evaluation framework for voice AI. Tests like IELTS provide a well-defined rubric with structured scoring criteria. This gives both the AI model and the user a clear, measurable benchmark. It's the perfect starting point for building reliable voice AI feedback, and a natural hook that millions of learners already need.
- Gemini Live API is production-ready for voice applications, with caveats. Building a real-time bidirectional voice conversation required solving disconnection, transcription, and scoring extraction challenges. The API delivers a natural conversation experience, but stability for sessions over 10 minutes and function calling reliability remain areas for improvement.
- Structured output and function calling are game-changers. Getting Gemini to return reliable, typed JSON with 10+ feedback sections and deterministic band scores was the key to building a trustworthy scoring system. However, function calling in the Live API behaves differently from the standard API and requires multi-layer fallback strategies.
- Building robust workarounds is essential. Every major Gemini Live API limitation required a non-trivial workaround: three-layer score extraction, periodic buffer flushing for transcription, auto-reconnection with context re-injection, and client-side analytics to compensate for missing server logs.
- Solo development with AI assistance is a new paradigm. Claude Code (Opus 4.5) and Google Antigravity (Gemini 3 Pro) made it possible for one person to build a full production app in under 80 days.
What's next for Joe Speaking
- More stable live conversation + additional providers. We hope the Gemini Live API becomes more stable for 10+ minute conversations. We are also adding live conversation providers beyond Gemini and OpenAI for better coverage and fallback.
- Evaluation framework for Live API and prompt improvement. Building a systematic eval pipeline to measure conversation quality, scoring accuracy, and function calling reliability across sessions. Using eval results to iteratively improve prompts and model configuration.
- AI agent for personalized learning analysis. An AI agent that analyzes each user's strengths and weaknesses across sessions to provide more realistic, actionable suggestions tailored to their specific improvement areas.
- TTS and pronunciation analysis. Text-to-speech integration and phoneme-level pronunciation analysis to give users detailed feedback on how they sound, not just what they say.
- Mobile app. Native iOS/Android app for a better on-the-go experience (currently PWA).
- Distribution features. Tools for sharing, embedding, and distributing content to reach more learners across platforms.
Built With
- ai-sdk-google
- assemblyai
- auth
- backblaze-b2
- chart.js
- claude-code
- cloudflare-workers
- ffmpeg-wasm
- framer-motion
- gemini-function-calling
- gemini-live-api
- gemini-structured-output
- google-antigravity
- google-gemini
- google-genai
- indexeddb
- google-generative-ai
- next.js-14
- playwright
- posthog
- react
- sentry
- stripe
- supabase
- tailwind-css
- three.js
- transformers.js
- typescript
- upstash-redis
- vercel
- vercel-ai-sdk
- vitest
- zod
- zustand