Inspiration
Sci-fi movies, MIT AlterEgo, speech restoration devices.
What it does
It reads your lips and converts the movements into text. It then takes that text and synthesises your voice using ElevenLabs voice cloning. The goal is voice restoration for people who have had laryngectomies or other speech impairments that render speech painful or impossible.
How we built it
We feed the output of the AutoAVSR visual speech recognition model into Claude with domain-specific prompt engineering (viseme-to-phoneme correction grounded in linguistics and conversational context), and use the conversation history to intelligently correct the user's output across multiple stages (first phonetically, then grammatically and contextually). Finally, we use ElevenLabs for real-time voice synthesis.
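To illustrate the viseme-to-phoneme ambiguity the correction stage has to resolve: several phonemes look identical on the lips (e.g. the bilabials p/b/m), so a lip reader alone cannot distinguish "pat" from "bat" from "mat". Here is a minimal grapheme-level sketch of that idea; the class groupings are simplified assumptions for illustration, not the actual mapping used in our pipeline, which operates on phonemes and is resolved by Claude using context rather than a lookup table.

```python
# Approximate viseme classes (grapheme-level, illustrative only):
# letters in the same class produce near-identical lip shapes.
VISEME_CLASS = {
    'p': 'A', 'b': 'A', 'm': 'A',   # bilabials
    'f': 'B', 'v': 'B',             # labiodentals
    't': 'C', 'd': 'C', 's': 'C', 'z': 'C', 'n': 'C',  # alveolars
    'k': 'D', 'g': 'D',             # velars
}

def viseme_key(word: str) -> str:
    """Collapse a word to its viseme sequence; unlisted letters map to themselves."""
    return ''.join(VISEME_CLASS.get(ch, ch) for ch in word.lower())

def confusable(w1: str, w2: str) -> bool:
    """True if two words are visually indistinguishable under this simplified mapping."""
    return viseme_key(w1) == viseme_key(w2)
```

Under this mapping, `confusable("pat", "bat")` is true while `confusable("pat", "cat")` is false, which is exactly the kind of ambiguity that conversational context is needed to break.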
Challenges we ran into
Non-native English speakers' mouth movements can be harder for the model to interpret, because the original training dataset consisted mostly of native speakers.
Accomplishments that we're proud of
A fully working, real-time pipeline running end to end on a live, deployed website with strong real-world performance, built within a day.
A novel method to achieve real-time performance: we removed the deep learning model's default language model and used Claude in its place. Claude's strong reasoning capabilities compensate for the loss of the language model, achieving a lower word error rate (higher accuracy) while being faster.
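Word error rate, the metric behind the accuracy claim above, is the word-level edit distance between the reference transcript and the hypothesis, divided by the reference length. A minimal sketch (any standard WER implementation, e.g. the jiwer library, computes the same quantity):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a six-word reference gives a WER of 1/6 ≈ 0.167; lower is better.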
What we learned
Lip reading still has its challenges and isn't perfect in the real world. However, in limited domains and under the right conditions, it can be genuinely useful.
What's next for Voice Keeper
Public release of the demo!