ASLoud — About the Project

Inspiration

Video calls, classes, and streams still assume everyone can speak into a mic. Friends who sign (ASL) or prefer typing are often left out. We built ASLoud so hands and text can be heard anywhere—turning signs or sentences into a continuous voice that any app recognizes as a microphone.

What We Built

  • ASL-to-words with YOLO: A fine-tuned YOLO model detects hands/fingerspelling and segments signing into recognizable units, extracting letters and word tokens from video in real time.
  • Phrase context with Gemini: Recognized tokens/glosses flow into a Gemini prompt that connects, disambiguates, and rephrases them into natural, context-aware English (e.g., resolving pronouns, idioms, and domain terms).
  • Continuous speech queue: Finalized sentences are enqueued and played strictly in order—no manual “next.”
  • “Integrated mic” output: The voice stream can be routed into Zoom, Discord, OBS, etc., so anywhere that accepts a microphone can “hear” ASL or text.
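
The pieces above compose into a simple loop. The sketch below is a hypothetical glue loop, not our exact code: detect_tokens, ends_utterance, and phrase_with_gemini are placeholder names for the YOLO recognizer, the utterance-boundary heuristic, and the Gemini phrasing step.

```python
# Hypothetical glue loop: webcam frames -> YOLO tokens -> Gemini phrasing -> speech queue.
# All function names here are placeholders, not the project's actual API.

def run_pipeline(camera, speech_queue):
    tokens = []                                    # recognized letters/glosses awaiting phrasing
    while True:
        frame = camera.read_frame()                # grab the next webcam frame
        tokens += detect_tokens(frame)             # YOLO: letters and word tokens
        if tokens and ends_utterance(tokens):      # pause / end-of-sign heuristic
            sentence = phrase_with_gemini(tokens)  # natural, context-aware English
            speech_queue.enqueue(sentence)         # spoken strictly in order
            tokens.clear()
```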

How We Built It (high level)

  • Queue-first design so every sentence is spoken to completion before the next clip begins.
  • Background TTS worker generates audio files atomically and signals the player when each is ready (see the sketch after this list).
  • Playback track stitches clips together gaplessly and deletes each one after playback; it emits clean silence while idle.
  • Frontend keeps it simple: type/paste text or enable camera capture, select a voice, and view queue status.
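
A minimal sketch of the queue-first pipeline above, assuming Python's standard queue and threading modules; synthesize(), play(), and emit_silence() stand in for the TTS engine and audio output we actually wired up.

```python
import os
import queue
import threading
import uuid

sentences = queue.Queue()    # finalized sentences, strictly FIFO
ready_clips = queue.Queue()  # file paths handed to the player in order

def tts_worker():
    """Synthesize each sentence in the background, write the clip atomically,
    then signal the player. synthesize() is a placeholder for the TTS engine."""
    while True:
        text = sentences.get()
        tmp = f"/tmp/asloud-{uuid.uuid4()}.wav.part"
        final = tmp[:-len(".part")]
        with open(tmp, "wb") as f:
            f.write(synthesize(text))   # placeholder TTS call returning audio bytes
        os.replace(tmp, final)          # atomic rename: the player never sees a half-written file
        ready_clips.put(final)

def player():
    """Play clips back-to-back, delete each after playback, and emit silence while idle."""
    while True:
        try:
            clip = ready_clips.get(timeout=0.1)
        except queue.Empty:
            emit_silence()              # keeps the virtual mic "alive" between sentences
            continue
        play(clip)                      # blocking playback preserves strict order
        os.remove(clip)

threading.Thread(target=tts_worker, daemon=True).start()
threading.Thread(target=player, daemon=True).start()
```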

Challenges (especially ML accuracy)

Data & Labels

  • Limited labeled ASL data and inconsistent gloss conventions made generalization hard.
  • Annotation noise (different labeling styles for the same sign) introduced instability.
  • Class imbalance: common signs dominated; rare or domain-specific signs underperformed.

What helped

  • Normalizing gloss conventions and adding a small mapping layer to reduce label drift.
  • Targeted augmentation for webcam realities (crop, blur, lighting jitter, tempo changes).
  • Oversampling rare classes and curriculum-style training schedules.
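
For the class-imbalance point, one standard recipe is inverse-frequency oversampling with PyTorch's WeightedRandomSampler. This is a sketch of the idea, assuming a dataset that exposes a labels list, not the exact schedule we trained with.

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

# dataset.labels is assumed: one class id per training clip.
counts = Counter(dataset.labels)
weights = [1.0 / counts[y] for y in dataset.labels]   # inverse frequency: rare signs get drawn more often

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```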

Signer & Environment Variability

  • Different signers (hand shapes, pace, motion range) and recording conditions (webcams, lighting, angles) created domain shifts.
  • Non-manual markers (facial expression, mouth movements) carry meaning but are easy to miss.

What helped

  • Augmentations tuned to low-light, motion blur, and off-axis angles (see the sketch after this list).
  • Fusing hand/pose features with face cues where available.
  • Domain-adaptive fine-tuning on a small, diverse validation set.
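
One way to express the webcam-focused augmentations above is an albumentations pipeline like the sketch below; the transforms and parameters are illustrative, not the values we settled on.

```python
import albumentations as A

train_aug = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.5),  # low light / exposure drift
    A.MotionBlur(blur_limit=7, p=0.3),                                            # fast hands on cheap webcams
    A.Perspective(scale=(0.02, 0.08), p=0.3),                                     # off-axis camera angles
    A.GaussNoise(p=0.2),                                                          # sensor noise
])
```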

Temporal Modeling & Coarticulation

  • Signs blend together; deciding boundaries between signs and fingerspelling is non-trivial.
  • Fast, partially occluded fingerspelling led to letters being confused with nearby signs.

What helped

  • Sliding-window inference with overlap to reduce boundary errors (sketched after this list).
  • Lightweight sequence models for temporal context and rescoring.
  • A dedicated fingerspelling branch and dictionary constraints for names/OOV words.
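
The sliding-window idea looks roughly like the sketch below: overlapping windows are scored and per-frame scores are averaged, which smooths decisions near sign boundaries. model(window) returning per-frame class scores and model.num_classes are assumptions for illustration.

```python
import numpy as np

def sliding_window_scores(frames, model, win=16, stride=8):
    """Score overlapping windows and average per-frame class scores.
    Overlap reduces errors right at sign/fingerspelling boundaries."""
    scores = np.zeros((len(frames), model.num_classes))
    hits = np.zeros(len(frames))
    for start in range(0, max(1, len(frames) - win + 1), stride):
        window = frames[start:start + win]
        scores[start:start + len(window)] += model(window)   # (len(window), num_classes) per-frame scores
        hits[start:start + len(window)] += 1
    return scores / np.maximum(hits, 1)[:, None]             # overlap-aware average
```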

From Recognized Signs to Natural English

  • Literal gloss sequences don’t read like English; early outputs felt robotic or wrong.
  • ASL grammar doesn’t map 1:1 with English word order or pronouns.

What helped

  • Using Gemini as a contextualizer to transform tokens/glosses into fluent phrases with minimal latency (see the sketch after this list).
  • Rule-based post-processing for names, numbers, and common idioms.
  • Confidence-weighted merging of recognizer output before phrasing to avoid cascading errors.
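
A minimal sketch of the Gemini contextualizer step, assuming the google-generativeai Python SDK; the model name and prompt wording are illustrative rather than our production prompt.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")              # key handling omitted for brevity
model = genai.GenerativeModel("gemini-1.5-flash")    # model name is a placeholder

def phrase_glosses(glosses, context=""):
    """Turn a raw gloss sequence into one fluent English sentence."""
    prompt = (
        "You convert ASL gloss sequences into natural, concise English.\n"
        f"Conversation context so far: {context}\n"
        f"Gloss sequence: {' '.join(glosses)}\n"
        "Return exactly one fluent English sentence, resolving pronouns and idioms."
    )
    return model.generate_content(prompt).text.strip()

# e.g. phrase_glosses(["YESTERDAY", "ME", "GO", "STORE"]) might yield "I went to the store yesterday."
```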

Real-Time Constraints

  • Latency vs. accuracy was a constant trade-off: bigger models helped accuracy but hurt responsiveness.
  • On-device vs. cloud affected both performance and privacy.

What helped

  • Distilling larger models into smaller students for inference.
  • Early-exit heuristics on high-confidence frames; deferring heavy passes when confidence dips (sketched after this list).
  • Batching TTS requests without introducing audible gaps.
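
The distillation plus early-exit combination can be sketched as a confidence-gated two-pass classifier: trust the small student when it is sure, and fall back to the larger model only when confidence dips. The threshold and model handles are illustrative.

```python
def classify(clip, student, teacher, threshold=0.85):
    """Fast pass with the distilled student; escalate to the teacher only on low confidence.
    Both models are assumed to return a 1-D array of class probabilities."""
    probs = student(clip)                       # cheap pass, keeps latency low
    if probs.max() >= threshold:
        return int(probs.argmax()), float(probs.max())
    probs = teacher(clip)                       # slower, more accurate pass
    return int(probs.argmax()), float(probs.max())
```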

What We Learned

  • Accuracy is holistic: data consistency, signer diversity, and temporal handling matter as much as model choice.
  • Small UX details unlock adoption: atomic writes, ordered filenames, and silence frames kept the mic “alive” and trustworthy.
  • Human-in-the-loop helps: quick correction tools (for names/new terms) dramatically improved downstream phrasing quality.

What’s Next

  • Expand signer diversity and environments; add targeted active-learning loops.
  • Improve fingerspelling with specialized detectors and better camera guidance in the UI.
  • Offer user dictionaries and domain packs (classroom, medical, gaming) for context-aware phrasing.
  • Optional on-device inference paths for privacy-sensitive users.

Tagline: ASLoud — From hands to voice, live.
