ASLoud — About the Project
Inspiration
Video calls, classes, and streams still assume everyone can speak into a mic. Friends who sign (ASL) or prefer typing are often left out. We built ASLoud so hands and text can be heard anywhere—turning signs or sentences into a continuous voice that any app recognizes as a microphone.
What We Built
- ASL-to-words with YOLO: A fine-tuned YOLO model detects hands/fingerspelling and segments signing into recognizable units, extracting letters and word tokens from video in real time.
- Phrase context with Gemini: Recognized tokens/glosses flow into a Gemini prompt that connects, disambiguates, and rephrases them into natural, context-aware English (e.g., resolving pronouns, idioms, and domain terms).
- Continuous speech queue: Finalized sentences are enqueued and played strictly in order—no manual “next.”
- “Integrated mic” output: The voice stream can be routed into Zoom, Discord, OBS, etc., so anywhere that accepts a microphone can “hear” ASL or text.
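Concretely, the flow is: frames go through the YOLO detector, recognized tokens are buffered until a sentence boundary, Gemini rewrites the buffer as English, and the sentence joins the speech queue. A minimal sketch under those assumptions follows; the function names (`detect_tokens`, `phrase_with_gemini`) and the `<eos>` boundary token are illustrative placeholders, not the project's actual API.

```python
# Illustrative pipeline skeleton: YOLO detection -> token buffer -> Gemini
# phrasing -> speech queue. All names here are placeholders for this sketch.
from queue import Queue
from typing import Iterable, List

speech_queue: "Queue[str]" = Queue()  # finalized sentences, consumed strictly in order

def detect_tokens(frame) -> List[str]:
    """Placeholder for the fine-tuned YOLO pass that returns letter/word tokens."""
    return []

def phrase_with_gemini(tokens: List[str]) -> str:
    """Placeholder for the Gemini call that rewrites glosses as fluent English."""
    return " ".join(tokens)

def run_pipeline(frames: Iterable) -> None:
    buffer: List[str] = []
    for frame in frames:
        buffer.extend(detect_tokens(frame))
        if buffer and buffer[-1] == "<eos>":        # hypothetical sentence boundary
            speech_queue.put(phrase_with_gemini(buffer[:-1]))
            buffer.clear()
```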
How We Built It (high level)
- Queue-first design so every sentence is spoken to completion before the next clip begins.
- Background TTS worker generates audio files atomically and signals the player when each is ready (sketched after this list).
- Playback track stitches clips gaplessly and deletes them after play; emits clean silence while idle.
- Frontend keeps it simple: type/paste text or enable camera capture, select voice, view queue status.
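A minimal sketch of that queue-first TTS worker, assuming a placeholder `synthesize(text, path)` backend. The write-to-temp-then-`os.replace` pattern and zero-padded filenames are what keep clips atomic and ordered; the real audio generation is stubbed out.

```python
# Sketch of the background TTS worker: ordered filenames, atomic writes,
# and a ready-queue that signals the player. synthesize() is a stub here.
import itertools
import os
import queue
import threading

sentences: "queue.Queue[str]" = queue.Queue()    # finalized sentences, FIFO
ready_clips: "queue.Queue[str]" = queue.Queue()  # clip paths the player may consume

def synthesize(text: str, path: str) -> None:
    """Placeholder for the real TTS call that writes audio bytes to `path`."""
    with open(path, "wb") as f:
        f.write(b"")

def tts_worker(out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    for n in itertools.count():
        text = sentences.get()                          # block until work arrives
        final = os.path.join(out_dir, f"{n:06d}.wav")   # ordered filenames
        tmp = final + ".part"
        synthesize(text, tmp)
        os.replace(tmp, final)                          # atomic rename: never half-written
        ready_clips.put(final)                          # signal the player

threading.Thread(target=tts_worker, args=("clips",), daemon=True).start()
```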
Challenges (especially ML accuracy)
Data & Labels
- Limited labeled ASL data and inconsistent gloss conventions made generalization hard.
- Annotation noise (different labeling styles for the same sign) introduced instability.
- Class imbalance: common signs dominated; rare or domain-specific signs underperformed.
What helped
- Normalizing gloss conventions and adding a small mapping layer to reduce label drift.
- Targeted augmentation for webcam realities (crop, blur, lighting jitter, tempo changes).
- Oversampling rare classes and curriculum-style training schedules.
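For the oversampling piece, a common way to do it is PyTorch's `WeightedRandomSampler`; this is a generic sketch rather than the project's training code, and the toy labels are only for illustration.

```python
# Generic class-rebalancing sketch with WeightedRandomSampler: rare classes
# get proportionally larger sampling weights. Labels here are a toy example.
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = [0, 0, 0, 0, 1, 2]                  # class 0 (a common sign) dominates
counts = Counter(labels)
weights = [1.0 / counts[y] for y in labels]  # inverse-frequency weight per sample

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
dataset = TensorDataset(torch.arange(len(labels)), torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, sampler=sampler)  # batches come out rebalanced
```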
Signer & Environment Variability
- Different signers (hand shapes, pace, motion range) and recording conditions (webcams, lighting, angles) created domain shifts.
- Non-manual markers (facial expression, mouth movements) carry meaning but are easy to miss.
What helped
- Augmentations tuned to low-light, motion blur, and off-axis angles (see the sketch after this list).
- Fusing hand/pose features with face cues where available.
- Domain-adaptive fine-tuning on a small, diverse validation set.
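One way to approximate those webcam conditions is with torchvision transforms; the specific parameter values below are illustrative guesses, not the tuned values we ended up with.

```python
# Webcam-realism augmentation sketch: exposure drift, blur, off-axis angle,
# and crop variation. Parameter values are illustrative, not tuned.
from torchvision import transforms

webcam_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.3),       # low light / exposure drift
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # motion / focus blur
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),  # off-axis camera angles
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),        # framing / crop variation
])
```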
Temporal Modeling & Coarticulation
- Signs blend together; deciding boundaries between signs and fingerspelling is non-trivial.
- Fast or partially occluded fingerspelling led the model to confuse letters with neighboring signs.
What helped
- Sliding-window inference with overlap to reduce boundary errors (sketched after this list).
- Lightweight sequence models for temporal context and rescoring.
- A dedicated fingerspelling branch and dictionary constraints for names/OOV words.
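A bare-bones version of that overlapping sliding window; the window and stride sizes and the `classify_window` callable are assumptions for illustration.

```python
# Overlapping sliding-window inference sketch: 50% overlap by default, so
# sign boundaries are seen by more than one window.
from typing import Callable, List, Sequence, Tuple

def sliding_windows(frames: Sequence, window: int = 16, stride: int = 8):
    """Yield (start_index, window_of_frames) pairs with overlap."""
    last_start = max(len(frames) - window, 0)
    for start in range(0, last_start + 1, stride):
        yield start, frames[start:start + window]

def infer_sequence(frames: Sequence, classify_window: Callable) -> List[Tuple[int, object]]:
    """Run the per-window classifier; overlapping predictions are merged downstream."""
    return [(start, classify_window(chunk)) for start, chunk in sliding_windows(frames)]
```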
From Recognized Signs to Natural English
- Literal gloss sequences don’t read like English; early outputs felt robotic or wrong.
- ASL grammar doesn’t map 1:1 with English word order or pronouns.
What helped
- Using Gemini as a contextualizer to transform tokens/glosses into fluent phrases with minimal latency.
- Rule-based post-processing for names, numbers, and common idioms.
- Confidence-weighted merging of recognizer output before phrasing to avoid cascading errors.
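To make the Gemini step and the confidence-weighted merge concrete, here is a simplified sketch; `call_gemini` stands in for whichever Gemini client call is used, and the 0.5 confidence cutoff is an assumed value, not our tuned threshold.

```python
# Simplified phrasing step: drop low-confidence tokens, then ask the model
# to rewrite the remaining gloss as natural English. call_gemini is a stub.
from typing import List, Tuple

def build_phrasing_prompt(tokens: List[Tuple[str, float]], min_conf: float = 0.5) -> str:
    kept = [tok for tok, conf in tokens if conf >= min_conf]  # confidence filter, simplified
    return (
        "You rewrite ASL gloss sequences as natural, context-aware English.\n"
        "Resolve pronouns, fix word order, and keep names and numbers exact.\n"
        f"Gloss: {' '.join(kept)}\n"
        "English:"
    )

def call_gemini(prompt: str) -> str:
    """Placeholder for the real Gemini API call."""
    return prompt.rsplit("Gloss: ", 1)[-1].splitlines()[0]  # echo the gloss for this sketch

sentence = call_gemini(build_phrasing_prompt([("ME", 0.9), ("STORE", 0.8), ("GO", 0.7)]))
```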
Real-Time Constraints
- Latency vs. accuracy was a constant trade-off: bigger models helped accuracy but hurt responsiveness.
- On-device vs. cloud affected both performance and privacy.
What helped
- Distilling larger models into smaller students for inference.
- Early-exit heuristics on high-confidence frames; deferring heavy passes when confidence dips (sketched after this list).
- Batching TTS requests without introducing audible gaps.
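The early-exit idea in code form; the 0.85 threshold and the two model callables are assumptions for illustration, not production settings.

```python
# Early-exit sketch: trust the fast model when it is confident, fall back to
# the heavier model otherwise. Threshold and models are illustrative.
from typing import Callable, Tuple

Prediction = Tuple[str, float]  # (label, confidence)

def predict_with_early_exit(
    frame,
    fast_model: Callable[[object], Prediction],
    heavy_model: Callable[[object], Prediction],
    threshold: float = 0.85,
) -> Prediction:
    label, conf = fast_model(frame)
    if conf >= threshold:          # high confidence: skip the expensive pass
        return label, conf
    return heavy_model(frame)      # low confidence: defer to the bigger model
```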
What We Learned
- Accuracy is holistic: data consistency, signer diversity, and temporal handling matter as much as model choice.
- Small UX details unlock adoption: atomic writes, ordered filenames, and silence frames kept the mic “alive” and trustworthy.
- Human-in-the-loop helps: quick correction tools (for names/new terms) dramatically improved downstream phrasing quality.
What’s Next
- Expand signer diversity and environments; add targeted active-learning loops.
- Improve fingerspelling with specialized detectors and better camera guidance in the UI.
- Offer user dictionaries and domain packs (classroom, medical, gaming) for context-aware phrasing.
- Optional on-device inference paths for privacy-sensitive users.
Tagline: ASLoud — From hands to voice, live.
