Inspiration

Years ago, my English speaking held me back. I grew up in a small Chinese town where English class meant "deaf and mute" learning: reading and writing only, never speaking. When I finally took my IELTS exam in 2022 and scored 6.5, it was one of the most exciting days of my life. But after moving to Canada for graduate school in 2024, I realized my speaking hadn't improved much despite months of effort. The issue wasn't quantity; it was quality. I was practicing without meaningful feedback.

In late 2023, when OpenAI released GPTs with voice, I built an IELTS Speaking Simulator GPT and open-sourced it. It went viral on Chinese social media. 50,000+ learners used it, with a 4.4/5 rating. But I noticed a critical flaw in my own usage: I rarely reviewed the feedback. I kept making the same mistakes. That's when I knew I needed to build something better: a tool focused not just on practice, but on deliberate practice with structured feedback and progress tracking.

Standardized tests like IELTS provide a well-defined rubric, which makes them an ideal evaluation framework for voice AI. The structured scoring criteria give both the model and the user a clear, measurable benchmark. This insight became the foundation of Joe Speaking.

What it does

Joe Speaking is an end-to-end English speaking practice platform with two core modes:

1. Real-time IELTS Speaking Simulator: Users have a natural voice conversation with an AI examiner that follows the official 3-part IELTS format. The AI examiner speaks, listens, and responds in real-time. After the session, users get instant band scores extracted via structured scoring.

2. Daily Recording Practice: Users record themselves speaking on any topic (IELTS, CELPIP, or freestyle) and get comprehensive AI feedback: band scores with criterion-level checks, edited transcripts with position-marked corrections, vocabulary pattern analysis, self-correction challenges, mini quizzes generated from their own errors, model responses, and native-speaker comprehensibility ratings.

Beyond individual sessions, the platform provides:

  • Version comparison: practice the same topic multiple times and see what's improving
  • Daily and weekly AI-generated progress reviews that identify persistent patterns
  • Spaced repetition for vocabulary and error correction
  • Full cloud sync across devices with local-first architecture

All at approximately $1 per session vs. $25-60/hour for a human tutor: an order of magnitude cheaper, with fully personalized feedback.

How we built it

Joe Speaking is a full-stack Next.js 14 application with Gemini as the AI backbone across four pillars:

  1. Real-time IELTS Speaking Simulator (Gemini Live API): A bidirectional voice WebSocket connection enables natural conversation with an AI examiner. The simulator uses part-transition detection, cue card display, and Gemini's function calling to extract structured band scores deterministically.

  2. Comprehensive AI Feedback (Gemini Structured Output): Every recording gets 10+ sections of feedback returned as typed JSON via Gemini's responseSchema. A modular prompt architecture with test-specific rubrics (IELTS/CELPIP) ensures consistent, actionable feedback (a minimal sketch follows this list).

  3. Smart Transcription (Transformers.js + Cloud ASR): A dual-mode ASR system runs Whisper in the browser via Transformers.js (free, private, WebGPU/WASM) or calls cloud providers (AssemblyAI for higher accuracy). Intelligent provider selection adapts to the user's context (an in-browser sketch follows the architecture note below).

  4. AI-Powered Progress Reviews: Gemini analyzes aggregated daily/weekly practice data to generate comprehensive progress reports, identifying persistent patterns and improvement areas.
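
To make pillar 2 concrete, here is a minimal sketch of requesting typed feedback as JSON via responseSchema with the @google/genai SDK. The model name and schema fields are illustrative placeholders, not the app's actual feedback schema.

```ts
// Minimal sketch: typed IELTS feedback via responseSchema (@google/genai).
// Field names below are illustrative; the real app uses a larger, modular schema.
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function scoreTranscript(transcript: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash", // illustrative model choice
    contents: `Score this IELTS answer against the official rubric:\n${transcript}`,
    config: {
      responseMimeType: "application/json",
      responseSchema: {
        type: Type.OBJECT,
        properties: {
          fluencyCoherence: { type: Type.NUMBER },
          lexicalResource: { type: Type.NUMBER },
          grammaticalRange: { type: Type.NUMBER },
          pronunciation: { type: Type.NUMBER },
          overallBand: { type: Type.NUMBER },
          corrections: {
            type: Type.ARRAY,
            items: {
              type: Type.OBJECT,
              properties: {
                original: { type: Type.STRING },
                suggested: { type: Type.STRING },
                explanation: { type: Type.STRING },
              },
              required: ["original", "suggested"],
            },
          },
        },
        required: ["overallBand"],
      },
    },
  });

  // The schema constrains decoding, so the text parses directly into the declared shape.
  return JSON.parse(response.text ?? "{}");
}
```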

The architecture supports both BYOK (Bring Your Own Key) and a credit-based server mode, using the same prompts as a single source of truth. A combined API call strategy reduces input tokens by ~50% compared to parallel calls.
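
For the in-browser half of pillar 3, a rough Transformers.js sketch might look like the following. It assumes 16 kHz mono Float32Array audio from the recorder; the model choice, options, and helper shape are assumptions rather than the app's actual code.

```ts
// Sketch: in-browser Whisper transcription with Transformers.js.
// Assumes audio already decoded to 16 kHz mono; model and options are illustrative.
import { pipeline } from "@huggingface/transformers";

// Cache the pipeline so the model weights download only once per page load.
let transcriber: any = null;

export async function transcribeLocally(audio: Float32Array): Promise<string> {
  if (!transcriber) {
    // Prefer WebGPU when the browser exposes it; otherwise fall back to WASM.
    const device = (navigator as any).gpu ? "webgpu" : "wasm";
    transcriber = await pipeline(
      "automatic-speech-recognition",
      "Xenova/whisper-base", // illustrative model choice
      { device }
    );
  }

  const result = await transcriber(audio, {
    chunk_length_s: 30,      // Whisper operates on 30-second windows
    return_timestamps: true, // keep segment positions for position-marked corrections
  });

  return Array.isArray(result)
    ? result.map((r: { text: string }) => r.text).join(" ")
    : result.text;
}
```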

The entire product was built solo in under 80 days using Claude Code (Opus 4.5) and Google Antigravity (Gemini 3 Pro).

Challenges we ran into

Building a production-quality real-time voice application on the Gemini Live API was the biggest challenge. What was planned as a one-week integration became five weeks of deep iteration. Here are the core issues we encountered and solved:

  • WebSocket disconnects (1008/1011). Sessions disconnect mid-conversation after 8-12 minutes with error codes 1008 (Policy Violation) and 1011 (Internal Error), and error messages such as "Failed to run inference", "Thread was cancelled", "RPC::DEADLINE_EXCEEDED", and "RESOURCE_EXHAUSTED". We built auto-reconnection with exponential backoff, resumption tokens, context re-injection from saved transcripts, and a server-side scoring fallback for when the Live API fails (the reconnect loop is sketched after this list).

  • Function calling unreliability. The reportScoringResults function is called in only ~60-70% of sessions, and the Live API does not support toolConfig.functionCallingConfig with mode: ANY (unlike the standard Gemini API). We built a three-layer extraction workaround: function calling (primary), text-block parsing with markers, and regex extraction from the spoken transcript as a final fallback. This brought reliability to ~90%+ (see the extraction sketch after this list).

  • Transcription truncation during long speech. inputTranscription events stop or degrade during continuous speech longer than ~30 seconds, which is critical for IELTS Part 2 monologues (1-2 minutes): a 2-minute answer would return only ~250 characters instead of ~2,000+. After 12 rounds of debugging, the fix came from the community: disable automaticActivityDetection and flush the buffer every 15 seconds with activityEnd/activityStart signals (sketched after this list).

  • Band score extraction non-determinism. After the AI examiner says "That is the end of the Speaking test", it outputs the scoring blocks correctly only 50-70% of the time, sometimes ending the turn immediately and ignoring the scoring prompt. We confirmed this is non-deterministic model behavior, not a code bug. We documented 15 incident fixes between December 2025 and January 2026, covering race conditions, incomplete block extraction from streaming data, and context-length degradation in full-test sessions.

  • Per-session cost tracking. usageMetadata appears only intermittently in API responses rather than on every message, and Live API sessions do not appear in Google Cloud Console logs. We built a custom CostTracker that parses whatever metadata is available and otherwise estimates cost from audio token rates (32 tokens/second). All sessions run at less than $1 each (the estimation arithmetic is sketched after this list).

  • No API-side logging. Unlike the standard Gemini API, Live API sessions are invisible in Google Cloud Console logs. We cannot debug 1011 errors server-side or verify token counts. We built comprehensive client-side analytics with 11 documented failure mode categories, scoring diagnostics in Supabase, and Sentry correlation for error investigation.
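
To show the shape of the disconnect handling from the first bullet, here is a minimal reconnect-with-backoff sketch. connectLiveSession and reinjectContext are hypothetical stand-ins for the app's own Live API connect and transcript re-injection code, and the backoff values are illustrative.

```ts
// Sketch of the reconnect loop only; the two function parameters are hypothetical
// placeholders for the app's own Live API connect / context-restore logic.
async function reconnectWithBackoff(
  connectLiveSession: (resumptionHandle?: string) => Promise<unknown>,
  reinjectContext: (session: unknown, savedTranscript: string) => Promise<void>,
  savedTranscript: string,
  resumptionHandle?: string,
  maxAttempts = 5
) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // Try to resume the server-side session first; otherwise start fresh.
      const session = await connectLiveSession(resumptionHandle);
      if (!resumptionHandle) {
        // Fresh session: replay the saved transcript so the examiner keeps its place.
        await reinjectContext(session, savedTranscript);
      }
      return session;
    } catch (err) {
      // Exponential backoff with jitter: ~1 s, 2 s, 4 s, 8 s, 16 s.
      const delay = Math.min(1000 * 2 ** attempt, 16_000) + Math.random() * 250;
      console.warn(`Live session reconnect attempt ${attempt + 1} failed`, err);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("Live session could not be re-established; fall back to server-side scoring");
}
```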
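The three-layer score extraction can be sketched as a simple fallback chain. The function-call shape, text markers, and regexes below are illustrative, not the app's exact formats.

```ts
// Sketch of the three-layer band-score extraction chain (markers and regexes illustrative).
interface BandScores {
  fluency: number;
  lexical: number;
  grammar: number;
  pronunciation: number;
}

export function extractBandScores(
  functionCallArgs: Partial<BandScores> | null, // layer 1: reportScoringResults args, if called
  modelText: string,                            // layer 2: text output that may contain a marked block
  spokenTranscript: string                      // layer 3: what the examiner said out loud
): Partial<BandScores> | null {
  // Layer 1: structured function call (primary path, ~60-70% of sessions).
  if (functionCallArgs && functionCallArgs.fluency != null) return functionCallArgs;

  // Layer 2: a delimited JSON block in the text stream, e.g. [[SCORES]]{...}[[/SCORES]].
  const block = modelText.match(/\[\[SCORES\]\]([\s\S]*?)\[\[\/SCORES\]\]/);
  if (block) {
    try {
      return JSON.parse(block[1]) as Partial<BandScores>;
    } catch {
      // fall through to layer 3
    }
  }

  // Layer 3: regex over the spoken transcript, e.g. "Fluency and coherence: 6.5".
  const grab = (label: string) => {
    const m = spokenTranscript.match(new RegExp(`${label}[^0-9]{0,20}([0-9](?:\\.5)?)`, "i"));
    return m ? Number(m[1]) : undefined;
  };
  const scores = {
    fluency: grab("fluency"),
    lexical: grab("lexical"),
    grammar: grab("grammat"),
    pronunciation: grab("pronunciation"),
  };
  return Object.values(scores).some((v) => v != null) ? (scores as Partial<BandScores>) : null;
}
```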
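The transcription workaround amounts to disabling automatic VAD and signalling activity boundaries manually. A rough sketch against the @google/genai Live API follows: the disabled automaticActivityDetection, the activityStart/activityEnd signals, and the 15-second interval are the pieces described above, while the model id and helper shape are illustrative.

```ts
// Sketch of the manual activity-flush workaround for long monologues.
import { GoogleGenAI, Modality } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function startMonologueCapture(onTranscript: (text: string) => void) {
  let flushTimer: ReturnType<typeof setInterval> | undefined;

  const session = await ai.live.connect({
    model: "gemini-2.5-flash-native-audio-preview-09-2025", // illustrative model id
    config: {
      responseModalities: [Modality.AUDIO],
      inputAudioTranscription: {},
      // The key part of the fix: take voice-activity detection away from the API.
      realtimeInputConfig: { automaticActivityDetection: { disabled: true } },
    },
    callbacks: {
      onmessage: (msg) => {
        const text = msg.serverContent?.inputTranscription?.text;
        if (text) onTranscript(text);
      },
      onerror: (e) => console.error("Live API error", e),
      onclose: () => clearInterval(flushTimer),
    },
  });

  // Open an activity window, then close and reopen it every 15 s so inputTranscription
  // never has to span more than ~15 s of continuous audio.
  session.sendRealtimeInput({ activityStart: {} });
  flushTimer = setInterval(() => {
    session.sendRealtimeInput({ activityEnd: {} });
    session.sendRealtimeInput({ activityStart: {} });
  }, 15_000);

  return {
    sendAudioChunk: (base64Pcm: string) =>
      session.sendRealtimeInput({
        audio: { data: base64Pcm, mimeType: "audio/pcm;rate=16000" },
      }),
    stop: () => {
      clearInterval(flushTimer);
      session.sendRealtimeInput({ activityEnd: {} });
      session.close();
    },
  };
}
```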
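Finally, the cost-estimation fallback is simple arithmetic: when usageMetadata is missing, estimate tokens from elapsed audio at 32 tokens/second. The price constant below is a placeholder to be filled in from current Gemini pricing, not an actual rate.

```ts
// Sketch of the audio-cost fallback; 32 tokens/second comes from the bullet above,
// the price constant is a placeholder for the current Live API audio-token rate.
const AUDIO_TOKENS_PER_SECOND = 32;
const USD_PER_MILLION_AUDIO_TOKENS = 0; // TODO: fill in from current Gemini pricing

export function estimateAudioCost(audioSeconds: number, reportedTokens?: number): number {
  const tokens = reportedTokens ?? audioSeconds * AUDIO_TOKENS_PER_SECOND;
  return (tokens / 1_000_000) * USD_PER_MILLION_AUDIO_TOKENS;
}

// Example: a 14-minute full test ≈ 840 s ≈ 26,880 audio tokens per direction.
```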

We documented all these issues and shared them with the Gemini team in a detailed report: "Gemini Live API Issues: 1008/1011 Disconnects, Per-Session Cost, Function Calling, API Logs".

Accomplishments that we're proud of

  • Production-quality product built solo in 80 days. From zero to a live app at joespeaking.com with real users, a payment system, and full test infrastructure.
  • 50,000+ learners have used the original IELTS Speaking Simulator GPT (4.4/5 rating), validating the core concept before building Joe Speaking.
  • Deep Gemini integration across 4 pillars: Live API for real-time voice, structured output for typed feedback, function calling for deterministic scoring, and multi-model strategy for different use cases.
  • Natural voice conversation experience. Users can speak naturally with an AI examiner, not click through menus or read prompts. The conversation flows like a real test.
  • Full practice lifecycle. Not just a test simulator, but an end-to-end practice platform with recording, transcription, feedback, version comparison, progress tracking, and spaced repetition review.
  • Comprehensive Gemini Live API feedback to Google. We documented all issues and shared a detailed feedback report with the Gemini team, contributing to the ecosystem.
  • Registered a company: Just Joe Technologies Inc. with the tagline "Building What Was Impossible Yesterday."

What we learned

  • Standardized tests are the ideal evaluation framework for voice AI. Tests like IELTS provide a well-defined rubric with structured scoring criteria. This gives both the AI model and the user a clear, measurable benchmark. It's the perfect starting point for building reliable voice AI feedback, and a natural hook that millions of learners already need.
  • Gemini Live API is production-ready for voice applications, with caveats. Building a real-time bidirectional voice conversation required solving disconnection, transcription, and scoring extraction challenges. The API delivers a natural conversation experience, but stability for sessions over 10 minutes and function calling reliability remain areas for improvement.
  • Structured output and function calling are game-changers. Getting Gemini to return reliable, typed JSON with 10+ feedback sections and deterministic band scores was the key to building a trustworthy scoring system. However, function calling in the Live API behaves differently from the standard API and requires multi-layer fallback strategies.
  • Building robust workarounds is essential. Every major Gemini Live API limitation required a non-trivial workaround: three-layer score extraction, periodic buffer flushing for transcription, auto-reconnection with context re-injection, and client-side analytics to compensate for missing server logs.
  • Solo development with AI assistance is a new paradigm. Claude Code (Opus 4.5) and Google Antigravity (Gemini 3 Pro) made it possible for one person to build a full production app in under 80 days.

What's next for Joe Speaking

  • More stable live conversation + additional providers. We hope the Gemini Live API becomes more stable for 10+ minute conversations. We are also adding live-conversation providers beyond Gemini and OpenAI for better coverage and fallback.
  • Evaluation framework for Live API and prompt improvement. Building a systematic eval pipeline to measure conversation quality, scoring accuracy, and function calling reliability across sessions. Using eval results to iteratively improve prompts and model configuration.
  • AI agent for personalized learning analysis. An AI agent that analyzes each user's strengths and weaknesses across sessions to provide more realistic, actionable suggestions tailored to their specific improvement areas.
  • TTS and pronunciation analysis. Text-to-speech integration and phoneme-level pronunciation analysis to give users detailed feedback on how they sound, not just what they say.
  • Mobile app. Native iOS/Android app for a better on-the-go experience (currently PWA).
  • Distribution features. Tools for sharing, embedding, and distributing content to reach more learners across platforms.

Built With

  • ai-sdk/google
  • assemblyai
  • auth
  • backblaze-b2
  • chart.js
  • claude-code
  • cloudflare-workers
  • ffmpeg-wasm
  • framer-motion
  • gemini-function-calling
  • gemini-live-api
  • gemini-structured-output
  • google-antigravity
  • google-gemini
  • google/genai
  • indexeddb
  • next.js-14
  • playwright
  • posthog
  • react
  • sentry
  • stripe
  • supabase
  • tailwind-css
  • three.js
  • transformers.js
  • typescript
  • upstash-redis
  • vercel
  • vercel-ai-sdk
  • vitest
  • zod
  • zustand