Inspiration

In today's digital world, video content dominates communication, from online meetings and lectures to social media and entertainment. Yet 26 million Americans with hearing disabilities face barriers to accessing this content. Traditional closed captions rely solely on audio, which fails in noisy environments, with accented speech, or when audio quality is poor.

We were inspired by the challenge: What if we could "see" speech when we can't hear it? Just as humans naturally lip-read during loud conversations, we envisioned an AI system that combines audio transcription, visual lip reading, and facial recognition to create the most accurate, accessible captions possible.

What it does

Gorggle was born from the belief that accessibility shouldn't be an afterthought: it should be intelligent, robust, built for real-world conditions, and at the forefront of bringing the world to our hands. Gorggle is an AI-powered video transcription platform, ideally integrated into Meta Ray-Ban glasses, that creates multimodal captions by fusing three data streams:

  • Audio Transcription - AWS Transcribe analyzes speech with speaker diarization
  • AI Lip Reading - Deep learning models (AV-HuBERT/AVSRCocktail) extract visual speech from lip movements
  • Face Tracking - AWS Rekognition detects and tracks speakers across video frames

The system intelligently fuses these sources to:

  • Provide accurate captions even in noisy environments
  • Identify "who said what" by matching speakers to faces
  • Fill gaps when audio is garbled or muted
  • Generate time-aligned overlays for accessibility
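The fusion step above can be sketched as a simple confidence arbiter: trust the audio transcript when the ASR confidence is high, and fall back to the lip-reading hypothesis when audio is garbled. This is an illustrative sketch, not our shipped code; the `Segment` fields, threshold, and source labels are assumptions.

```python
# Hypothetical caption-fusion sketch: per time segment, choose between
# the audio transcript and the lip-reading hypothesis by confidence.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # seconds
    end: float
    speaker: str        # diarization label, e.g. "spk_0"
    audio_text: str
    audio_conf: float   # 0.0 - 1.0, from the ASR service
    lip_text: str
    lip_conf: float     # 0.0 - 1.0, from the lip-reading model

def fuse(segments, audio_threshold=0.8):
    """Pick the better caption per segment and keep it time-aligned."""
    captions = []
    for seg in segments:
        if seg.audio_conf >= audio_threshold:
            text, source = seg.audio_text, "audio"       # audio is reliable
        elif seg.lip_conf > seg.audio_conf:
            text, source = seg.lip_text, "lips"          # fill the audio gap
        else:
            text, source = seg.audio_text, "audio"       # least-bad option
        captions.append({
            "start": seg.start, "end": seg.end,
            "speaker": seg.speaker, "text": text, "source": source,
        })
    return captions

caps = fuse([
    Segment(0.0, 2.1, "spk_0", "hello everyone", 0.95, "hello everyone", 0.6),
    Segment(2.1, 4.0, "spk_1", "[inaudible]", 0.2, "thanks for joining", 0.7),
])
```

Attaching a `source` tag to each caption also makes it easy to render lip-read spans differently in the overlay, so viewers know which words came from vision rather than audio.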

Use cases:

  • Meeting accessibility (Zoom, Teams recordings)
  • Educational content (lectures, online courses)
  • Media production (automatic subtitle generation)

How we built it

Architecture: Serverless + GPU-Accelerated Hybrid

Backend (serverless):

  • AWS Lambda - 8 microservices (S3 trigger, media extraction, transcription, face detection, lip reading invocation, result fusion, API)
  • AWS Step Functions - Orchestrates the parallel processing pipeline
  • AWS S3 - Video storage (uploads + processed outputs)
  • AWS DynamoDB - Job tracking and metadata
  • AWS Transcribe - Audio speech recognition with speaker diarization
  • AWS Rekognition - Face detection, tracking, and emotion analysis
  • AWS API Gateway - RESTful API for caption retrieval
  • AWS CloudWatch - Logging and monitoring
  • Terraform - Infrastructure as Code for reproducible deployments

GPU inference:

  • AWS EC2 (g6.xlarge) - NVIDIA L4 GPU instances for model inference
  • LipCoordNet - Combines raw image sequences with corresponding lip landmark coordinates for a more holistic view of speech articulation
  • Docker - Containerized model deployment
  • FastAPI - API server (EC2)
  • FFmpeg - Video/audio extraction and preprocessing
  • dlib (19.24.0+) - Facial landmark detection
  • Pillow (9.3.0+) - Image manipulation
  • torchvision (0.16.0+) - Computer vision
  • torchaudio (2.1.0+) - Audio processing

Frontend:

  • JavaScript - Lightweight web interface
  • HTML5/CSS3 - Responsive UI with drag-and-drop upload
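The glue between the pieces above is the S3-trigger Lambda, which starts one Step Functions execution per uploaded video. A minimal sketch, assuming the state machine ARN is supplied via an environment variable (the `STATE_MACHINE_ARN` name, event shape, and injectable client are illustrative, not our exact deployment):

```python
# Hypothetical S3-trigger Lambda: for each uploaded object, start a
# Step Functions execution carrying the bucket/key of the video.
import json
import os
import urllib.parse

def handler(event, context, client=None):
    """Start one pipeline execution per S3 upload record."""
    if client is None:
        import boto3  # deferred so tests can inject a stub client
        client = boto3.client("stepfunctions")
    started = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys are URL-encoded; decode before passing downstream.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        resp = client.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps({"bucket": bucket, "key": key}),
        )
        started.append(resp["executionArn"])
    return {"statusCode": 200, "body": json.dumps(started)}
```

From there, the state machine can fan out Transcribe, Rekognition, and the lip-reading call in parallel branches and join them in the fusion Lambda.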

Challenges we ran into

Lip reading model complexity: We experimented with several lip reading models. AV-HuBERT was too large and, despite producing both audio and visual embeddings, still required heavy video preprocessing; we were unable to fit it onto SageMaker, which forced a pivot to LipCoordNet. With LipCoordNet, we were able to connect a SageMaker endpoint to the model and receive a response: we fed video into the model, which processed some of the frames. However, LipCoordNet's output was too inaccurate to be usable for our purposes, so we were ultimately limited by the range of available lip reading models.
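The endpoint round trip we eventually got working looked roughly like the sketch below, assuming a JSON payload of per-frame lip features. The endpoint name, payload shape, and response key are illustrative assumptions, not the exact contract we deployed.

```python
# Hypothetical SageMaker invocation sketch for the lip-reading endpoint.
import json

def read_lips(frames, endpoint="lipcoordnet-endpoint", runtime=None):
    """Send per-frame lip data to the endpoint, return the decoded text."""
    if runtime is None:
        import boto3  # deferred so a stub runtime can be injected in tests
        runtime = boto3.client("sagemaker-runtime")
    resp = runtime.invoke_endpoint(
        EndpointName=endpoint,
        ContentType="application/json",
        Body=json.dumps({"frames": frames}),
    )
    # The response Body is a stream; read and decode the JSON payload.
    return json.loads(resp["Body"].read())["text"]
```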

Accomplishments that we're proud of

  • Successfully deployed a model through SageMaker and received a response from that model
  • Implemented a front-end aimed at an easily accessible website for anyone who wants to transcribe videos
  • Set up S3 buckets and Lambda layers for audio and frame extraction, and familiarized ourselves with other AWS features (Transcribe, Rekognition, EC2)
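The audio and frame extraction that those Lambda layers perform boils down to two FFmpeg invocations. A minimal sketch, with example sample rate and frame rate (the exact flags we shipped may differ):

```python
# Illustrative FFmpeg preprocessing sketch: one command pulls a mono
# 16 kHz WAV for AWS Transcribe, the other dumps fixed-rate frames for
# the face-detection and lip-reading models.
import subprocess

def audio_cmd(video, wav, rate=16000):
    """Build the ffmpeg argv for audio extraction (-vn drops video)."""
    return ["ffmpeg", "-y", "-i", video,
            "-vn", "-ac", "1", "-ar", str(rate), wav]

def frames_cmd(video, pattern, fps=25):
    """Build the ffmpeg argv for a fixed-rate frame dump."""
    return ["ffmpeg", "-y", "-i", video,
            "-vf", f"fps={fps}", pattern]

def extract(video, wav, pattern):
    """Run both extractions; raises CalledProcessError on failure."""
    for cmd in (audio_cmd(video, wav), frames_cmd(video, pattern)):
        subprocess.run(cmd, check=True)
```

In a Lambda layer, a static FFmpeg binary is bundled alongside the function and the paths point into `/tmp`, the only writable filesystem in the Lambda sandbox.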

What we learned

  • We gained hands-on experience with AWS services
  • We deepened our understanding of audio-visual speech recognition architectures
  • We learned how to deploy large PyTorch models on AWS infrastructure

What's next for Gorggle

With more time, we hope to train our own lip reading model and continue developing this project.

Additional Features:

  • Smart glasses integration - Real-time captions on AR glasses for deaf/hard-of-hearing users
  • Sign language translation - ASL recognition and translation, allowing for a seamless conversation
