Inspiration
Have you ever, even just once in your life, wanted to be like Mr. Beast? Popular in North America for his over-the-top, expensive videos, Mr. Beast has also risen to fame in other parts of the world, largely because he runs multiple channels with dubbed versions of his videos. He can do this because he is wealthy enough to hire professional voice actors, which lets people watch his videos in their native language, a feat that is out of reach for most other content creators. Our team recognized smaller content creators' need for an affordable way to reach wider audiences and developed a tool to help them grow their channels. In addition, some of the best educational videos (by The Organic Chemistry Tutor, TED-Ed, and 3Blue1Brown) are available only in English.
What it does
Linguistify is an AI-powered translation tool that converts a video's audio into other languages. It lets content creators, big or small, translate their videos efficiently without spending a fortune on voice actors. Viewers can also use it to watch content in their preferred language.
How we built it
The tool first uses Deepgram's speech-to-text AI to transcribe the video, then uses DeepL to translate the text. Next, it uses prompt engineering with the GPT API to reword each translated utterance so that it can be spoken within its allocated duration. Finally, it uses ElevenLabs' text-to-speech API to generate the translated audio and deliver it back to the user. Our team developed a custom algorithm that uses timestamps and speech durations to control the AI's speaking rate, keeping the translated audio synchronized with the original video.
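The order of the four stages can be sketched as a small orchestration function. This is only an illustration under assumptions: `Segment`, `dub_audio`, and the four stage callables (`transcribe`, `translate`, `fit_to_duration`, `synthesize`) are hypothetical names standing in for the Deepgram, DeepL, GPT, and ElevenLabs steps, and no real API calls from those services are shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    """One transcribed utterance with its timing in the source video."""
    text: str
    start: float  # seconds
    end: float    # seconds


def dub_audio(
    audio: bytes,
    target_lang: str,
    transcribe: Callable[[bytes], List[Segment]],   # stand-in for speech-to-text
    translate: Callable[[str, str], str],           # stand-in for text translation
    fit_to_duration: Callable[[str, float], str],   # stand-in for the rewording step
    synthesize: Callable[[str], bytes],             # stand-in for text-to-speech
) -> List[Tuple[Segment, bytes]]:
    """Run the four stages in order; return (translated segment, speech) pairs."""
    dubbed = []
    for seg in transcribe(audio):
        translated = translate(seg.text, target_lang)
        # Reword the line so it can be spoken in the original time slot.
        fitted = fit_to_duration(translated, seg.end - seg.start)
        speech = synthesize(fitted)
        # Keep the original timestamps so the speech can be placed later.
        dubbed.append((Segment(fitted, seg.start, seg.end), speech))
    return dubbed
```

Passing the stages in as callables keeps the sketch independent of any particular vendor SDK, so each stage can be swapped or stubbed out while testing the others.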
Challenges we ran into
Video dubbing is difficult because the translated text must be spoken in the same amount of time as the original. If the translated text is spoken too quickly, it leaves large, awkward gaps; if it is spoken too slowly, the audio and video fall out of sync. Additionally, speakers often pause between and within their sentences, and those pauses need to be accounted for during speech-to-text. Voice actors fix these problems by controlling their pace or rewording sentences to make them shorter or longer, but replicating this decision-making process in code is much harder. We also had to chain many different AI/ML tools into a single pipeline.
Accomplishments that we're proud of
To address pauses and gaps between utterances, we store each utterance as a separate element of an array. We keep track of when each utterance begins and ends and calculate the pause duration between consecutive utterances. We then insert those pauses into the generated audio.
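The pause bookkeeping described above might look roughly like the following. This is a minimal sketch, not the project's actual code: the `Utterance` class and `pauses_before` function are hypothetical names, and timestamps are assumed to be in seconds measured from the start of the video.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    text: str
    start: float  # seconds from the start of the video (assumed unit)
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def pauses_before(utterances: List[Utterance]) -> List[float]:
    """Return the silent gap, in seconds, that precedes each utterance.

    The first entry is the silence between the start of the video and the
    first utterance. Overlapping timestamps are treated as a zero-length gap.
    """
    gaps = []
    previous_end = 0.0
    for utt in utterances:
        gaps.append(max(0.0, utt.start - previous_end))
        previous_end = utt.end
    return gaps
```

For example, utterances spanning 0.5 to 2.0 s, 2.75 to 4.0 s, and 4.0 to 5.5 s yield gaps of 0.5 s, 0.75 s, and 0 s.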
Another challenge we overcame was making sure each utterance is spoken within its allocated duration so that the audio syncs with the video. The initial translated text can be too long, so we pass it to GPT along with the proper context and durations to make it more concise. We then calculate the speed factor needed for a final adjustment to each utterance, and combine the utterances (along with the pauses) into a single .wav file.
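One way to picture the final two steps is sketched below, using only Python's standard library. Several details are assumptions rather than anything the write-up states: the function names, the clamping range for the speed factor, and the audio format (16-bit mono PCM at 16 kHz). The sketch also assumes each clip has already been re-timed by its speed factor before being written.

```python
import wave
from typing import List


def speed_factor(synth_seconds: float, slot_seconds: float,
                 min_factor: float = 0.8, max_factor: float = 1.25) -> float:
    """Playback-rate multiplier that fits a synthesized clip into its slot.

    A value above 1 means the clip must be sped up. The clamp limits are
    illustrative assumptions, chosen to avoid unnatural-sounding speech.
    """
    if synth_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    raw = synth_seconds / slot_seconds
    return min(max_factor, max(min_factor, raw))


def silence(seconds: float, sample_rate: int) -> bytes:
    """Zero-valued 16-bit mono PCM samples lasting `seconds`."""
    return b"\x00\x00" * int(round(seconds * sample_rate))


def write_track(path: str, clips: List[bytes], gaps: List[float],
                sample_rate: int = 16000) -> None:
    """Write each gap of silence followed by its clip into one .wav file.

    `clips` are raw 16-bit mono PCM byte strings; `gaps[i]` is the pause in
    seconds that precedes `clips[i]`.
    """
    if len(clips) != len(gaps):
        raise ValueError("need exactly one gap per clip")
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        out.setframerate(sample_rate)
        for gap, clip in zip(gaps, clips):
            out.writeframes(silence(gap, sample_rate))
            out.writeframes(clip)
```

With these definitions, a 3.0 s synthesized clip that must fit a 2.5 s slot gets a factor of 1.2, while a 10 s clip for a 2 s slot is capped at 1.25, which signals that the text should be shortened further rather than sped up more.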
What we learned
What's next for Linguistify
- Connect the program to YouTube so that viewers can easily select different language tracks for videos. We could also let users upload MP4s directly to the web app to view the videos.
- Increase the number of languages supported, for both input and output, to further improve accessibility for people around the world.
- Continue to improve how we distinguish voice from sound effects, background noise, and music in videos, since voice removal is important for keeping the dubbed language tracks smooth.