SpeechLabs

Inspiration

By some survey estimates, 67% of people fear public speaking more than death. We've all experienced it - shaking hands, racing heart, mind going blank during a presentation. Traditional speech coaching costs $100-300 per hour and doesn't scale to the millions struggling with public speaking anxiety.

We built SpeechLabs to democratize access to world-class speech coaching. Our platform uses cutting-edge AI to analyze the emotion in your speech, optimize your delivery, and provide personalized feedback that helps anyone speak with confidence - whether preparing for job interviews, class presentations, or TED Talks.

What it does

SpeechLabs is a comprehensive AI-powered speech improvement platform with two core features:

Feature 1: Video Speech Analysis

  • Upload any video of yourself speaking
  • Accurate speech-to-text transcription with word-level timestamps using Deepgram Nova-2
  • Emotion detection throughout your speech using a Wav2Vec2 model
  • Speaking rate analysis with words-per-second calculations and optimal range indicators (2.0-3.0 WPS)
  • Visual timeline graphs showing pacing patterns
  • Speech clarity scoring based on transcription confidence
  • Google Gemini AI generates personalized coaching insights from your transcript, emotions, and pacing patterns
  • Text-to-speech feedback using Deepgram TTS for accessibility
  • PDF export for professional reports

Feature 2: Practice Conversations with Voice Agent

  • Real-time conversational practice using Deepgram's Voice Agent API with streaming Nova-3
  • Natural back-and-forth dialogue to improve everyday speaking skills
  • Autonomous LLM function calling: the agent decides when to analyze your conversation and when to save results
  • Google Gemini AI processes conversation transcripts to analyze filler words, conversational flow, and engagement quality
  • Structured insights automatically saved to database for progress tracking
  • Chat history with text and audio responses

Progress Tracking

  • Google OAuth authentication for secure login
  • SQLite database stores complete analysis history for both speech analyses and practice sessions
  • Dashboard showing all past speeches and conversations
  • Timeline graphs tracking improvement over time
  • Unified history view for comprehensive progress monitoring

How we built it

Backend Architecture (Flask + Python)

  • Flask 3.1.2 REST API, served by Gunicorn in production
  • FFmpeg for video-to-audio extraction and segmentation
  • Parallel processing pipeline for efficiency

AI Integration Stack

  • Deepgram SDK for comprehensive speech services:
    • Nova-2 model for batch video transcription with Smart Formatting
    • Nova-3 model with streaming for real-time voice conversations
    • Voice Agent API orchestrating STT, LLM, and TTS in a single unified pipeline
    • Aura text-to-speech for natural voice responses and accessibility features
  • Google Gemini API (gemini-2.0-flash-exp) for:
    • Intelligent feedback generation analyzing speech patterns
    • Interactive coaching responses
    • Conversation analysis processing (filler words, flow, engagement)
    • Orchestrating autonomous function calls in voice agent
  • Wav2Vec2 (Hugging Face) for emotion recognition throughout speech
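
Here's a minimal sketch of that emotion step, assuming Hugging Face's audio-classification pipeline; the checkpoint id and segment path are illustrative, not necessarily what we ship:

from transformers import pipeline

# Illustrative checkpoint; any Wav2Vec2 emotion-recognition model slots in here.
emotion = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

# Score one of the audio segments produced by the FFmpeg splitter.
# Returns labels with confidences, e.g. [{"label": "hap", "score": 0.71}, ...]
print(emotion("segments/segment_003.wav"))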

Data Pipeline

Video Upload → Audio Extraction (FFmpeg) → Segmentation → 
Deepgram Nova-2 STT → Wav2Vec2 Emotion Analysis → 
Metrics Calculation → Gemini Feedback → Database Storage → PDF Generation

Practice Conversation → Deepgram Voice Agent (Nova-3 streaming) →
LLM Function Call (analyze_conversation_practice) → Gemini Analysis →
LLM Function Call (save_conversation_to_history) → Database Storage
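
To make the first pipeline's opening stages concrete, here's a condensed sketch assuming the Deepgram Python SDK v3+ (method names can differ slightly between SDK versions):

import subprocess
from deepgram import DeepgramClient, PrerecordedOptions

def extract_audio(video_path, audio_path):
    # Drop the video track and downmix to 16 kHz mono WAV for analysis.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )

def transcribe(audio_path):
    client = DeepgramClient()  # reads DEEPGRAM_API_KEY from the environment
    options = PrerecordedOptions(model="nova-2", smart_format=True)
    with open(audio_path, "rb") as f:
        source = {"buffer": f.read()}
    # The response carries the word-level start/end timestamps used for WPS.
    return client.listen.rest.v("1").transcribe_file(source, options)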

LLM Function Calling Architecture

  • Agent autonomously decides when to call backend functions based on conversation context
  • analyze_conversation_practice: Sends transcript to Gemini for structured insight generation
  • save_conversation_to_history: Writes Gemini-generated analysis to SQLite database (schemas for both are sketched after this list)
  • Client-side function execution with FunctionCallRequest/Response messaging pattern
  • Seamless integration with Deepgram's Voice Agent WebSocket protocol
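
The schemas we advertise to the agent are plain JSON; here's a simplified sketch (descriptions abridged, parameter details illustrative):

# Simplified function schemas included in the Voice Agent Settings message.
FUNCTIONS = [
    {
        "name": "analyze_conversation_practice",
        "description": "Analyze the conversation so far. Call when the user "
                       "asks for feedback or the practice reaches a natural end.",
        "parameters": {
            "type": "object",
            "properties": {
                "transcript": {"type": "string", "description": "Full conversation transcript"},
            },
            "required": ["transcript"],
        },
    },
    {
        "name": "save_conversation_to_history",
        "description": "Persist the generated analysis to the user's history.",
        "parameters": {
            "type": "object",
            "properties": {
                "analysis": {"type": "object", "description": "Structured Gemini insights"},
            },
            "required": ["analysis"],
        },
    },
]

Because both functions run client-side, the browser intercepts each FunctionCallRequest, executes the matching handler against our Flask API, and returns the result in a FunctionCallResponse so the conversation never stalls.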

Frontend (React 18)

  • React Router for SPA navigation
  • Recharts for interactive WPS timeline graphs and emotion visualizations
  • Deepgram SDK integration for Voice Agent WebSocket connections
  • AudioWorklet for low-latency audio processing in voice mode
  • Responsive CSS3 design

Database & Auth

  • SQLite with Flask-SQLAlchemy ORM
  • Two primary models: Analysis (video speeches) and PracticeSession (conversations) - both sketched below
  • Google OAuth 2.0 via Authlib
  • Flask-Login for session management
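
A trimmed sketch of those two models (column names here are illustrative; the real schema carries more metric fields):

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class Analysis(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(db.String(64), index=True)  # Google OAuth subject id
    transcript = db.Column(db.Text)
    wps = db.Column(db.Float)      # average words-per-second
    emotions = db.Column(db.JSON)  # per-segment emotion scores
    created_at = db.Column(db.DateTime, server_default=db.func.now())

class PracticeSession(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(db.String(64), index=True)
    insights = db.Column(db.JSON)  # Gemini conversation analysis
    created_at = db.Column(db.DateTime, server_default=db.func.now())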

Document Generation

  • ReportLab for professional PDF reports (minimal sketch below)
  • Plotly for visualizations in exports
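
The export itself is a straightforward ReportLab document build; a minimal sketch with the layout simplified:

from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Table

def build_report(path, transcript, metrics):
    # metrics is a dict such as {"WPS": 2.4, "Clarity": 0.93}.
    styles = getSampleStyleSheet()
    doc = SimpleDocTemplate(path, pagesize=letter)
    doc.build([
        Paragraph("SpeechLabs Analysis Report", styles["Title"]),
        Paragraph(transcript, styles["BodyText"]),
        Table([["Metric", "Value"], *[[k, str(v)] for k, v in metrics.items()]]),
    ])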

Challenges we ran into

Multi-AI Coordination - Synchronizing Deepgram and Gemini services while maintaining fast response times was complex. We implemented efficient processing pipelines that handle transcription and metric calculation before feeding results to Gemini for analysis.
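
The parallelism is plain Python concurrency. A sketch, reusing the transcribe helper from the pipeline sketch above (analyze_emotions and generate_gemini_feedback are hypothetical stand-ins for our other stages):

from concurrent.futures import ThreadPoolExecutor

def analyze_video(audio_path, segment_paths):
    # Transcription and emotion analysis don't depend on each other,
    # so start both and only block when Gemini needs the results.
    with ThreadPoolExecutor() as pool:
        transcript_future = pool.submit(transcribe, audio_path)
        emotion_future = pool.submit(analyze_emotions, segment_paths)
        transcript = transcript_future.result()
        emotions = emotion_future.result()
    return generate_gemini_feedback(transcript, emotions)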

Real-Time Voice Agent Integration - Implementing Deepgram's Voice Agent API with client-side function calling required careful WebSocket state management, proper event handling for FunctionCallRequest messages, and robust error recovery. We had to ensure autonomous function execution didn't break conversation flow.

LLM Function Calling Reliability - Getting the agent to consistently call functions at appropriate times required extensive prompt engineering. We had to teach the LLM when to analyze conversations versus when to just respond naturally, and ensure function parameters were properly formatted for backend processing.

WPS Calculation Accuracy - Computing accurate words-per-second from Deepgram's word-level timestamps required careful handling of pauses, speech overlaps, and segment boundaries. We developed algorithms to identify natural speech units and filter out long pauses that would skew metrics.
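
The heart of the fix is treating long silences as unit boundaries rather than speaking time. A minimal sketch, assuming Deepgram-style word dicts with start/end timestamps in seconds (the pause threshold is illustrative):

PAUSE_GAP = 1.5  # silence longer than this ends a speech unit (illustrative)

def words_per_second(words):
    # words: [{"word": "hello", "start": 1.02, "end": 1.31}, ...] in time order
    if not words:
        return 0.0
    spoken = 0.0
    unit_start = words[0]["start"]
    prev_end = words[0]["end"]
    for w in words[1:]:
        if w["start"] - prev_end > PAUSE_GAP:
            spoken += prev_end - unit_start  # close the current speech unit
            unit_start = w["start"]          # the pause starts a new one
        prev_end = max(prev_end, w["end"])   # tolerate overlapping words
    spoken += prev_end - unit_start
    return len(words) / spoken if spoken > 0 else 0.0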

Gemini JSON Consistency - Getting reliable structured JSON outputs from Gemini for both speech analysis and conversation insights required extensive prompt engineering. We implemented robust error handling with JSON extraction from mixed text responses and fallback mechanisms when parsing fails.
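
The fallback is the usual "strict parse, then scrape" pattern; a minimal sketch:

import json
import re

def extract_json(text):
    # Gemini sometimes wraps JSON in prose or ```json fences, so try a
    # strict parse first, then fall back to the outermost {...} block.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return None  # caller substitutes a default insight payload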

OAuth Integration - Implementing Google OAuth while maintaining seamless user experience across development and production environments required careful redirect URI configuration and state management.
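
The Authlib registration itself is small; a sketch (environment variable names are illustrative):

import os
from authlib.integrations.flask_client import OAuth
from flask import Flask

app = Flask(__name__)
app.secret_key = os.environ["FLASK_SECRET_KEY"]  # OAuth state lives in the session
oauth = OAuth(app)
# Google publishes its endpoints via OpenID Connect discovery, so only
# credentials and scopes change between development and production.
oauth.register(
    "google",
    client_id=os.environ["GOOGLE_CLIENT_ID"],
    client_secret=os.environ["GOOGLE_CLIENT_SECRET"],
    server_metadata_url="https://accounts.google.com/.well-known/openid-configuration",
    client_kwargs={"scope": "openid email profile"},
)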

Large File Handling - Managing video uploads up to 700MB required optimizing Flask's multipart form handling, implementing streaming uploads, and automated temporary file cleanup to prevent server disk overflow.
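
Flask's built-in size cap is the first line of defense; a sketch with the cleanup strategy simplified (run_analysis is a hypothetical entry point into the pipeline above):

import os
import tempfile
from flask import Flask, request

app = Flask(__name__)
# Reject request bodies above 700 MB early; Flask responds 413 automatically.
app.config["MAX_CONTENT_LENGTH"] = 700 * 1024 * 1024

@app.post("/api/analyze")
def analyze():
    video = request.files["video"]
    fd, path = tempfile.mkstemp(suffix=".mp4")
    try:
        with os.fdopen(fd, "wb") as f:
            video.save(f)  # stream the upload to disk in chunks
        return run_analysis(path)
    finally:
        os.remove(path)  # guarantee temp cleanup even on failure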

Audio Streaming Architecture - Building a reliable audio pipeline with AudioWorklet processors for capturing microphone input, converting to the correct format (linear16 PCM), and streaming to Deepgram's Voice Agent required understanding Web Audio APIs and browser compatibility nuances.
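
The format contract is simple even though the plumbing lives in the browser: Float32 samples in [-1, 1] become little-endian 16-bit PCM. Here's the conversion our worklet performs in JavaScript, sketched in Python for clarity:

import struct

def float32_to_linear16(samples):
    # Clamp to [-1, 1], scale to the int16 range, pack little-endian.
    ints = (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    return struct.pack(f"<{len(samples)}h", *ints)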

Accomplishments that we're proud of

Dual-Feature Platform - We built not one but two complete features: video speech analysis for prepared presentations and conversational practice for everyday speaking skills - addressing the full spectrum of communication improvement.

Streaming Voice Agent with Autonomous Function Calling - Successfully implemented Deepgram's Voice Agent API with client-side LLM function calling, where the agent autonomously decides when to analyze conversations and persist results to database - satisfying all hackathon requirements with true AI reasoning.

AI Integration Pipeline - Successfully orchestrated Deepgram (Nova-2, Nova-3, Voice Agent, TTS), Gemini AI, and Wav2Vec2 into seamless pipelines that deliver comprehensive analysis.

Production-Ready Authentication - Implemented secure Google OAuth with complete user session management and data isolation.

Beautiful Data Visualization - Complex speech metrics presented through intuitive interactive timelines that anyone can understand.

Professional Export Capability - PDF generation with formatted tables, charts, and comprehensive analysis summaries.

Persistent User Experience - Full database integration allowing users to track improvement over weeks and months with unified history for both speech analyses and practice conversations.

What we learned

AI Orchestration

  • How to coordinate multiple AI services (Gemini, Deepgram Nova-2/Nova-3, Wav2Vec2) in production
  • Implementing Deepgram's Voice Agent API with streaming speech-to-text for real-time conversations
  • Building client-side function calling with proper WebSocket event handling
  • Balancing trade-offs between API cost, performance, and accuracy
  • Prompt engineering for consistent structured outputs from LLMs and autonomous function calling decisions

Audio Processing & Speech Analysis

  • Deep dive into FFmpeg for audio extraction and manipulation
  • Understanding speech signal processing with Deepgram's Nova models
  • Computing meaningful speech metrics (WPS, clarity, speech patterns)
  • Web Audio API, AudioWorklet, and browser audio processing
  • Real-time audio streaming with low latency considerations

Two Distinct LLM Function Calling Patterns

  • Designing function schemas that guide LLM decision-making
  • Handling FunctionCallRequest/FunctionCallResponse message flows
  • Error recovery when functions fail during live conversations
  • Balancing function complexity vs. agent reasoning capabilities
  • Creating analysis data: the analyze_conversation_practice function turns the live transcript into structured Gemini insights
  • Persisting history: the save_conversation_to_history function writes that analysis to the user's saved conversation history

Full-Stack OAuth Implementation

  • Google OAuth 2.0 flow with proper state management
  • Secure session handling with Flask-Login
  • CORS configuration for authenticated API requests

Database Design for Analytics

  • SQLAlchemy ORM patterns for user data and analysis relationships
  • Efficient querying for dashboard analytics
  • Data schema design for speech metrics and conversation history storage

Real-Time Web Technologies

  • WebSocket integration for Voice Agent connections
  • AudioWorklet for low-latency audio capture and processing
  • Managing WebSocket state across React component lifecycles

Production Deployment Considerations

  • Environment variable management across development and production
  • File upload security and validation
  • Error handling and user feedback strategies
  • Rate limiting and API quota management for multiple AI services

What's next for SpeechLabs

Enhanced Voice Agent Features - Expand function calling capabilities to include goal setting, personalized practice routines, and adaptive difficulty levels based on user progress.

Live Practice Mode - Real-time feedback while speaking with instant WPS meter, filler word detection, and immediate visual feedback during practice sessions.

Advanced Analytics Dashboard - Trend graphs showing improvement over time across both prepared speeches and conversational practice, predictive insights for goal achievement, and benchmark comparisons against optimal ranges.

Multi-Language Support - Expand Deepgram and Gemini integration to support speech analysis and practice in multiple languages, making the platform accessible globally.

Mobile Application - React Native app with on-device processing for practice anywhere, push notifications for practice reminders, and offline mode with sync.

Group Practice Sessions - Collaborative features where multiple users can practice conversations together with AI moderation and group insights.

Built With

Python, Flask, Gunicorn, React, Deepgram (Nova-2, Nova-3, Voice Agent, Aura), Google Gemini, Hugging Face Wav2Vec2, FFmpeg, SQLite, Flask-SQLAlchemy, Authlib, Flask-Login, Recharts, ReportLab, Plotly
