Inspiration
Recent advances in music playback devices have made it possible to listen to almost any kind of music anywhere. If visuals that closely match the music and its lyrics could be generated as well, listeners could enjoy a richer experience: not only hearing the music, but also watching imagery they can relate to.
What it does
- Using AssemblyAI's audio transcription, VibeAI processes any input audio and produces a transcript of the lyrics.
- The transcript is fed, word for word, into Stability AI's Stable Diffusion, a latent text-to-image diffusion model that generates photorealistic images from text prompts.
- The generated images are combined frame by frame with the original audio into an audio-visual experience and output as a video (see the sketch after this list).
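A minimal sketch of the first two steps, assuming the `assemblyai` and `replicate` Python clients; the API key, the phrase-grouping heuristic, and the Stable Diffusion model identifier are placeholders rather than our exact production settings.

```python
import assemblyai as aai
import replicate

# Placeholder key; in practice this would come from the environment.
aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"

# Step 1: transcribe the input audio into a lyrics transcript.
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("song.mp3")

# Group transcribed words into short phrases so each image gets a usable
# prompt. The phrase length of 5 is an arbitrary illustrative choice.
phrases, current = [], []
for word in transcript.words:
    current.append(word.text)
    if len(current) == 5:
        phrases.append(" ".join(current))
        current = []
if current:
    phrases.append(" ".join(current))

# Step 2: turn each phrase into an image with Stable Diffusion via Replicate.
# The model reference below is a placeholder; a pinned version is normally used.
image_urls = []
for phrase in phrases:
    output = replicate.run(
        "stability-ai/stable-diffusion",
        input={"prompt": phrase},
    )
    image_urls.append(output[0])  # URL of the generated image
```

Grouping words into short phrases keeps the prompts descriptive enough for Stable Diffusion while still following the lyrics closely.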
How we built it
- AssemblyAI
- StabilityAI
- Python
- Docker
- Streamlit
Challenges we ran into
- Processing the generated images and the corresponding audio together so that the resulting video stays properly in sync (a rough sketch of this step follows).
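A rough sketch of the syncing step, assuming the moviepy 1.x API and per-phrase start/end times taken from the transcript's word timestamps; the file names and timings below are illustrative.

```python
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

# Assumed input: one image per lyric phrase plus the phrase's start/end
# times in milliseconds, derived from the transcript's word timestamps.
segments = [
    ("frame_000.png", 0, 2400),
    ("frame_001.png", 2400, 5100),
]

# Each image is shown for exactly as long as its phrase is sung.
clips = []
for path, start_ms, end_ms in segments:
    duration = (end_ms - start_ms) / 1000.0  # ms -> seconds
    clips.append(ImageClip(path).set_duration(duration))

# Concatenate the image clips and attach the original audio track.
video = concatenate_videoclips(clips, method="compose")
video = video.set_audio(AudioFileClip("song.mp3"))
video.write_videofile("vibe.mp4", fps=24)
```

Driving each image's duration directly from the transcript timestamps is what keeps the visuals aligned with the audio.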
Accomplishments that we're proud of
- We learned about different use cases of audio transcription and how it can be applied creatively to solve problems.
What's next for VibeAI
- We plan to add support for Spotify and YouTube Music audio in the future.
- Apply the same pipeline to audiobooks to give children and adults a visual experience of the book and engage them more deeply.
- Add transition effects between the generated images.
- Extract information from the music (BPM, instruments used, mood, genre) to improve the quality of the video (a minimal sketch follows).
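As a starting point for that music-analysis idea, a minimal tempo-estimation sketch assuming librosa; instrument, mood, and genre detection would require additional models.

```python
import librosa

# Load the audio and estimate its tempo. This covers only the BPM part of
# the planned analysis; the other attributes would need separate models.
y, sr = librosa.load("song.mp3")
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print(f"Estimated tempo: {float(tempo):.1f} BPM")
```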
Built With
- assemblyai
- python
- replicate
- stabilityai
- streamlit
