ASLoud — About the Project

Inspiration

Video calls, classes, and streams still assume everyone can speak into a mic. Friends who sign (ASL) or prefer typing are often left out. We built ASLoud so hands and text can be heard anywhere—turning signs or sentences into a continuous voice that any app recognizes as a microphone.

What We Built

  • ASL-to-words with YOLO: A fine-tuned YOLO model detects hands/fingerspelling and segments signing into recognizable units, extracting letters and word tokens from video in real time.
  • Phrase context with Gemini: Recognized tokens/glosses flow into a Gemini prompt that connects, disambiguates, and rephrases them into natural, context-aware English (e.g., resolving pronouns, idioms, and domain terms).
  • Continuous speech queue: Finalized sentences are enqueued and played strictly in order—no manual “next.”
  • “Integrated mic” output: The voice stream can be routed into Zoom, Discord, OBS, etc., so anywhere that accepts a microphone can “hear” ASL or text.
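
The pieces above compose into a simple loop. The sketch below is a hypothetical glue loop, not our exact code: detect_tokens, ends_utterance, and phrase_with_gemini are placeholder names for the YOLO recognizer, the utterance-boundary heuristic, and the Gemini phrasing step.

```python
# Hypothetical glue loop: webcam frames -> YOLO tokens -> Gemini phrasing -> speech queue.
# All function names here are placeholders, not the project's actual API.

def run_pipeline(camera, speech_queue):
    tokens = []                                    # recognized letters/glosses awaiting phrasing
    while True:
        frame = camera.read_frame()                # grab the next webcam frame
        tokens += detect_tokens(frame)             # YOLO: letters and word tokens
        if tokens and ends_utterance(tokens):      # pause / end-of-sign heuristic
            sentence = phrase_with_gemini(tokens)  # natural, context-aware English
            speech_queue.enqueue(sentence)         # spoken strictly in order
            tokens.clear()
```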

How We Built It (high level)

  • Queue-first design so every sentence is spoken to completion before the next clip begins.
  • Background TTS worker generates audio files atomically and signals the player when each is ready (see the sketch after this list).
  • Playback track stitches clips together gaplessly and deletes each one after playback; it emits clean silence while idle.
  • Frontend keeps it simple: type/paste text or enable camera capture, select a voice, and view queue status.
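
A minimal sketch of the queue-first pipeline above, assuming Python's standard queue and threading modules; synthesize(), play(), and emit_silence() stand in for the TTS engine and audio output we actually wired up.

```python
import os
import queue
import threading
import uuid

sentences = queue.Queue()    # finalized sentences, strictly FIFO
ready_clips = queue.Queue()  # file paths handed to the player in order

def tts_worker():
    """Synthesize each sentence in the background, write the clip atomically,
    then signal the player. synthesize() is a placeholder for the TTS engine."""
    while True:
        text = sentences.get()
        tmp = f"/tmp/asloud-{uuid.uuid4()}.wav.part"
        final = tmp[:-len(".part")]
        with open(tmp, "wb") as f:
            f.write(synthesize(text))   # placeholder TTS call returning audio bytes
        os.replace(tmp, final)          # atomic rename: the player never sees a half-written file
        ready_clips.put(final)

def player():
    """Play clips back-to-back, delete each after playback, and emit silence while idle."""
    while True:
        try:
            clip = ready_clips.get(timeout=0.1)
        except queue.Empty:
            emit_silence()              # keeps the virtual mic "alive" between sentences
            continue
        play(clip)                      # blocking playback preserves strict order
        os.remove(clip)

threading.Thread(target=tts_worker, daemon=True).start()
threading.Thread(target=player, daemon=True).start()
```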

Challenges (especially ML accuracy)

Data & Labels

  • Limited labeled ASL data and inconsistent gloss conventions made generalization hard.
  • Annotation noise (different labeling styles for the same sign) introduced instability.
  • Class imbalance: common signs dominated; rare or domain-specific signs underperformed.

What helped

  • Normalizing gloss conventions and adding a small mapping layer to reduce label drift.
  • Targeted augmentation for webcam realities (crop, blur, lighting jitter, tempo changes).
  • Oversampling rare classes and curriculum-style training schedules.
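
For the class-imbalance point, one standard recipe is inverse-frequency oversampling with PyTorch's WeightedRandomSampler. This is a sketch of the idea, assuming a dataset that exposes a labels list, not the exact schedule we trained with.

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

# dataset.labels is assumed: one class id per training clip.
counts = Counter(dataset.labels)
weights = [1.0 / counts[y] for y in dataset.labels]   # inverse frequency: rare signs get drawn more often

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```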

Signer & Environment Variability

  • Different signers (hand shapes, pace, motion range) and recording conditions (webcams, lighting, angles) created domain shifts.
  • Non-manual markers (facial expression, mouth movements) carry meaning but are easy to miss.

What helped

  • Augmentations tuned to low-light, motion blur, and off-axis angles (see the sketch after this list).
  • Fusing hand/pose features with face cues where available.
  • Domain-adaptive fine-tuning on a small, diverse validation set.
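
One way to express the webcam-focused augmentations above is an albumentations pipeline like the sketch below; the transforms and parameters are illustrative, not the values we settled on.

```python
import albumentations as A

train_aug = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.5),  # low light / exposure drift
    A.MotionBlur(blur_limit=7, p=0.3),                                            # fast hands on cheap webcams
    A.Perspective(scale=(0.02, 0.08), p=0.3),                                     # off-axis camera angles
    A.GaussNoise(p=0.2),                                                          # sensor noise
])
```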

Temporal Modeling & Coarticulation

  • Signs blend together; deciding boundaries between signs and fingerspelling is non-trivial.
  • Fast, partially occluded fingerspelling led to letters being confused with nearby signs.

What helped

  • Sliding-window inference with overlap to reduce boundary errors (sketched after this list).
  • Lightweight sequence models for temporal context and rescoring.
  • A dedicated fingerspelling branch and dictionary constraints for names/OOV words.
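
The sliding-window idea looks roughly like the sketch below: overlapping windows are scored and per-frame scores are averaged, which smooths decisions near sign boundaries. model(window) returning per-frame class scores and model.num_classes are assumptions for illustration.

```python
import numpy as np

def sliding_window_scores(frames, model, win=16, stride=8):
    """Score overlapping windows and average per-frame class scores.
    Overlap reduces errors right at sign/fingerspelling boundaries."""
    scores = np.zeros((len(frames), model.num_classes))
    hits = np.zeros(len(frames))
    for start in range(0, max(1, len(frames) - win + 1), stride):
        window = frames[start:start + win]
        scores[start:start + len(window)] += model(window)   # (len(window), num_classes) per-frame scores
        hits[start:start + len(window)] += 1
    return scores / np.maximum(hits, 1)[:, None]             # overlap-aware average
```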

From Recognized Signs to Natural English

  • Literal gloss sequences don’t read like English; early outputs felt robotic or wrong.
  • ASL grammar doesn’t map 1:1 with English word order or pronouns.

What helped

  • Using Gemini as a contextualizer to transform tokens/glosses into fluent phrases with minimal latency (see the sketch after this list).
  • Rule-based post-processing for names, numbers, and common idioms.
  • Confidence-weighted merging of recognizer output before phrasing to avoid cascading errors.
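
A minimal sketch of the Gemini contextualizer step, assuming the google-generativeai Python SDK; the model name and prompt wording are illustrative rather than our production prompt.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")              # key handling omitted for brevity
model = genai.GenerativeModel("gemini-1.5-flash")    # model name is a placeholder

def phrase_glosses(glosses, context=""):
    """Turn a raw gloss sequence into one fluent English sentence."""
    prompt = (
        "You convert ASL gloss sequences into natural, concise English.\n"
        f"Conversation context so far: {context}\n"
        f"Gloss sequence: {' '.join(glosses)}\n"
        "Return exactly one fluent English sentence, resolving pronouns and idioms."
    )
    return model.generate_content(prompt).text.strip()

# e.g. phrase_glosses(["YESTERDAY", "ME", "GO", "STORE"]) might yield "I went to the store yesterday."
```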

Real-Time Constraints

  • Latency vs. accuracy was a constant trade-off: bigger models helped accuracy but hurt responsiveness.
  • On-device vs. cloud affected both performance and privacy.

What helped

  • Distilling larger models into smaller students for inference.
  • Early-exit heuristics on high-confidence frames; deferring heavy passes when confidence dips (sketched after this list).
  • Batching TTS requests without introducing audible gaps.
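
The distillation plus early-exit combination can be sketched as a confidence-gated two-pass classifier: trust the small student when it is sure, and fall back to the larger model only when confidence dips. The threshold and model handles are illustrative.

```python
def classify(clip, student, teacher, threshold=0.85):
    """Fast pass with the distilled student; escalate to the teacher only on low confidence.
    Both models are assumed to return a 1-D array of class probabilities."""
    probs = student(clip)                       # cheap pass, keeps latency low
    if probs.max() >= threshold:
        return int(probs.argmax()), float(probs.max())
    probs = teacher(clip)                       # slower, more accurate pass
    return int(probs.argmax()), float(probs.max())
```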

What We Learned

  • Accuracy is holistic: data consistency, signer diversity, and temporal handling matter as much as model choice.
  • Small UX details unlock adoption: atomic writes, ordered filenames, and silence frames kept the mic “alive” and trustworthy.
  • Human-in-the-loop helps: quick correction tools (for names/new terms) dramatically improved downstream phrasing quality.

What’s Next

  • Expand signer diversity and environments; add targeted active-learning loops.
  • Improve fingerspelling with specialized detectors and better camera guidance in the UI.
  • Offer user dictionaries and domain packs (classroom, medical, gaming) for context-aware phrasing.
  • Optional on-device inference paths for privacy-sensitive users.

Tagline: ASLoud — From hands to voice, live.
