Inspiration

In today's digital world, video content dominates communication, from online meetings and lectures to social media and entertainment. Yet 26 million Americans with hearing disabilities face barriers to accessing this content. Traditional closed captions rely solely on audio, which fails in noisy environments, with accented speech, or when audio quality is poor.

We were inspired by the challenge: What if we could "see" speech when we can't hear it? Just as humans naturally lip-read during loud conversations, we envisioned an AI system that combines audio transcription, visual lip reading, and facial recognition to create the most accurate, accessible captions possible.

What it does

Gorggle was born from the belief that accessibility shouldn't be an afterthought: it should be intelligent, robust, built for real-world conditions, and at the forefront of bringing the world to our hands. Gorggle is an AI-powered video transcription platform, ideally integrated into Meta Ray-Ban glasses, that creates multimodal captions by fusing three data streams:

  • Audio Transcription - AWS Transcribe analyzes speech with speaker diarization
  • AI Lip Reading - Deep learning models (AV-HuBERT/AVSRCocktail) extract visual speech from lip movements
  • Face Tracking - AWS Rekognition detects and tracks speakers across video frames

The system intelligently fuses these sources to:

  • Provide accurate captions even in noisy environments
  • Identify "who said what" by matching speakers to faces
  • Fill gaps when audio is garbled or muted
  • Generate time-aligned overlays for accessibility
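The fusion step above can be sketched as a simple confidence arbiter: trust the audio transcript when the ASR confidence is high, and fall back to the lip-reading hypothesis when audio is garbled. This is an illustrative sketch, not our shipped code; the `Segment` fields, threshold, and source labels are assumptions.

```python
# Hypothetical caption-fusion sketch: per time segment, choose between
# the audio transcript and the lip-reading hypothesis by confidence.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # seconds
    end: float
    speaker: str        # diarization label, e.g. "spk_0"
    audio_text: str
    audio_conf: float   # 0.0 - 1.0, from the ASR service
    lip_text: str
    lip_conf: float     # 0.0 - 1.0, from the lip-reading model

def fuse(segments, audio_threshold=0.8):
    """Pick the better caption per segment and keep it time-aligned."""
    captions = []
    for seg in segments:
        if seg.audio_conf >= audio_threshold:
            text, source = seg.audio_text, "audio"       # audio is reliable
        elif seg.lip_conf > seg.audio_conf:
            text, source = seg.lip_text, "lips"          # fill the audio gap
        else:
            text, source = seg.audio_text, "audio"       # least-bad option
        captions.append({
            "start": seg.start, "end": seg.end,
            "speaker": seg.speaker, "text": text, "source": source,
        })
    return captions

caps = fuse([
    Segment(0.0, 2.1, "spk_0", "hello everyone", 0.95, "hello everyone", 0.6),
    Segment(2.1, 4.0, "spk_1", "[inaudible]", 0.2, "thanks for joining", 0.7),
])
```

Attaching a `source` tag to each caption also makes it easy to render lip-read spans differently in the overlay, so viewers know which words came from vision rather than audio.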

Use cases:

  • Meeting accessibility (Zoom, Teams recordings)
  • Educational content (lectures, online courses)
  • Media production (automatic subtitle generation)

How we built it

Architecture: Serverless + GPU-Accelerated Hybrid

Backend (serverless):

  • AWS Lambda - 8 microservices (S3 trigger, media extraction, transcription, face detection, lip reading invocation, result fusion, API)
  • AWS Step Functions - Orchestrates the parallel processing pipeline
  • AWS S3 - Video storage (uploads + processed outputs)
  • AWS DynamoDB - Job tracking and metadata
  • AWS Transcribe - Audio speech recognition with speaker diarization
  • AWS Rekognition - Face detection, tracking, and emotion analysis
  • AWS API Gateway - RESTful API for caption retrieval
  • AWS CloudWatch - Logging and monitoring
  • Terraform - Infrastructure as Code for reproducible deployments

GPU inference:

  • AWS EC2 (g6.xlarge) - NVIDIA L4 GPU instances for model inference
  • LipCoordNet - Combines raw image sequences with corresponding lip landmark coordinates for a more holistic view of speech articulation
  • Docker - Containerized model deployment
  • FastAPI - API server (EC2)
  • FFmpeg - Video/audio extraction and preprocessing
  • dlib (19.24.0+) - Facial landmark detection
  • Pillow (9.3.0+) - Image manipulation
  • torchvision (0.16.0+) - Computer vision
  • torchaudio (2.1.0+) - Audio processing

Frontend:

  • JavaScript - Lightweight web interface
  • HTML5/CSS3 - Responsive UI with drag-and-drop upload
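The glue between the pieces above is the S3-trigger Lambda, which starts one Step Functions execution per uploaded video. A minimal sketch, assuming the state machine ARN is supplied via an environment variable (the `STATE_MACHINE_ARN` name, event shape, and injectable client are illustrative, not our exact deployment):

```python
# Hypothetical S3-trigger Lambda: for each uploaded object, start a
# Step Functions execution carrying the bucket/key of the video.
import json
import os
import urllib.parse

def handler(event, context, client=None):
    """Start one pipeline execution per S3 upload record."""
    if client is None:
        import boto3  # deferred so tests can inject a stub client
        client = boto3.client("stepfunctions")
    started = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys are URL-encoded; decode before passing downstream.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        resp = client.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps({"bucket": bucket, "key": key}),
        )
        started.append(resp["executionArn"])
    return {"statusCode": 200, "body": json.dumps(started)}
```

From there, the state machine can fan out Transcribe, Rekognition, and the lip-reading call in parallel branches and join them in the fusion Lambda.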

Challenges we ran into

Lip reading model complexity: We experimented with several lip reading models. AV-HuBERT was too large and, despite producing both audio and visual embeddings, still required heavy video preprocessing; we were unable to fit it onto SageMaker, which forced a pivot to LipCoordNet. With LipCoordNet, we were able to connect a SageMaker endpoint to the model and receive a response: we fed video into the model, which processed some of the frames. However, LipCoordNet's output was too inaccurate to be usable for our purposes, so we were ultimately limited by the range of available lip reading models.
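The endpoint round trip we eventually got working looked roughly like the sketch below, assuming a JSON payload of per-frame lip features. The endpoint name, payload shape, and response key are illustrative assumptions, not the exact contract we deployed.

```python
# Hypothetical SageMaker invocation sketch for the lip-reading endpoint.
import json

def read_lips(frames, endpoint="lipcoordnet-endpoint", runtime=None):
    """Send per-frame lip data to the endpoint, return the decoded text."""
    if runtime is None:
        import boto3  # deferred so a stub runtime can be injected in tests
        runtime = boto3.client("sagemaker-runtime")
    resp = runtime.invoke_endpoint(
        EndpointName=endpoint,
        ContentType="application/json",
        Body=json.dumps({"frames": frames}),
    )
    # The response Body is a stream; read and decode the JSON payload.
    return json.loads(resp["Body"].read())["text"]
```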

Accomplishments that we're proud of

  • Successfully deployed a model through SageMaker and received a response from that model
  • Implemented a front-end aimed at an easily accessible website for anyone who wants to transcribe videos
  • Set up S3 buckets and Lambda layers for audio and frame extraction, and familiarized ourselves with other AWS features (Transcribe, Rekognition, EC2)
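The audio and frame extraction that those Lambda layers perform boils down to two FFmpeg invocations. A minimal sketch, with example sample rate and frame rate (the exact flags we shipped may differ):

```python
# Illustrative FFmpeg preprocessing sketch: one command pulls a mono
# 16 kHz WAV for AWS Transcribe, the other dumps fixed-rate frames for
# the face-detection and lip-reading models.
import subprocess

def audio_cmd(video, wav, rate=16000):
    """Build the ffmpeg argv for audio extraction (-vn drops video)."""
    return ["ffmpeg", "-y", "-i", video,
            "-vn", "-ac", "1", "-ar", str(rate), wav]

def frames_cmd(video, pattern, fps=25):
    """Build the ffmpeg argv for a fixed-rate frame dump."""
    return ["ffmpeg", "-y", "-i", video,
            "-vf", f"fps={fps}", pattern]

def extract(video, wav, pattern):
    """Run both extractions; raises CalledProcessError on failure."""
    for cmd in (audio_cmd(video, wav), frames_cmd(video, pattern)):
        subprocess.run(cmd, check=True)
```

In a Lambda layer, a static FFmpeg binary is bundled alongside the function and the paths point into `/tmp`, the only writable filesystem in the Lambda sandbox.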

What we learned

  • We gained hands-on experience with AWS services
  • We deepened our understanding of audio-visual speech recognition architectures
  • We learned how to deploy large PyTorch models on AWS infrastructure

What's next for Gorggle

With more time, we hope to train our own lip reading model and continue developing this project.

Additional Features:

  • Smart glasses integration - Real-time captions on AR glasses for deaf/hard-of-hearing users
  • Sign language translation - ASL recognition and translation, allowing for a seamless conversation
