Inspiration
Vocaloids sound better than people because you don't have to talk to them, and you do have to talk to people.
What it does
First, it separates the original song into vocal and instrumental parts. Then we transcribe the vocals to text with timestamps, and extract the frequencies and notes from them using a homophonic-transcription deep learning model. We feed that text to Google Text-to-Speech to generate a new voice, use the timestamps to align the new vocals with the original, and shift the pitch to match, resulting in a newly layered, completely customizable vocal track for any song.
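For a sense of the transcription-with-timestamps step, here is a minimal sketch (not our exact code) of pulling lyrics plus word-level time offsets out of Google Cloud Speech-to-Text; the file name, sample rate, and the v2 client-library calls are assumptions:

```python
# Sketch: transcribe separated vocals and collect per-word timestamps,
# assuming the google-cloud-speech v2-style Python client.
from google.cloud import speech

client = speech.SpeechClient()

with open("vocals.wav", "rb") as f:  # placeholder: the separated vocal stem
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,          # placeholder rate
    language_code="en-US",
    enable_word_time_offsets=True,    # gives a start/end time for every word
)

response = client.recognize(config=config, audio=audio)

# Each word comes back with start/end offsets we can later use to
# align the synthesized TTS vocal against the original track.
for result in response.results:
    for word in result.alternatives[0].words:
        print(word.word,
              word.start_time.total_seconds(),
              word.end_time.total_seconds())
```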
How we built it
We chose Python as our primary language. We used the Google Cloud Speech-to-Text and Text-to-Speech APIs to extract lyrics and generate the new voice. We used PyTorch to build a U-Net for vocal/instrumental separation, TensorFlow and a convolutional neural network for homophonic transcription, and librosa with signal processing for shifting frequencies. Sorry for all the buzzwords.
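As an illustration of the pitch-shifting step, here is a rough sketch using librosa (the file names and the 3-semitone shift are placeholders; the real system would derive the shift from the notes produced by the transcription model):

```python
# Sketch: pitch-shift a synthesized TTS vocal toward the transcribed melody,
# assuming librosa and soundfile are installed.
import librosa
import soundfile as sf

# Load the synthesized TTS vocal (mono) at its native sample rate.
y, sr = librosa.load("tts_vocal.wav", sr=None, mono=True)

# Shift by n_steps semitones; 3 is a made-up value standing in for the
# offset between the TTS pitch and the note detected in the original vocal.
n_steps = 3
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

sf.write("tts_vocal_shifted.wav", shifted, sr)
```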
Challenges we ran into
It didn't work.
Accomplishments that we're proud of
It kind of sort of works.
What we learned
How to make it work
What's next for Vocaloid
To take over the world