Inspiration
YouTube is a goldmine of knowledge, but it's a passive experience. We often wish we could pause a video and ask the creator a question — especially during tutorials, lectures, or podcasts. VoxTube was born from this simple idea: What if you could talk to the YouTuber at the exact moment you pause the video — and in their voice?
What it does
VoxTube lets users pause any YouTube video and have a voice-based conversation with the creator, powered by ElevenLabs. The AI understands the video’s content at the exact timestamp and answers user questions contextually — in the creator’s cloned voice. It makes learning and exploring content deeply engaging and interactive.
How we built it
Frontend: A custom video player with pause-and-chat controls Backend: Transcripts are mapped to timestamps using YouTube’s captions or Whisper Voice Cloning: ElevenLabs API to generate the creator’s voice dynamically Conversational AI: LLMs fine-tuned to respond based on paused video context State Management: Ensures responses align with current video timestamp
Challenges we ran into
Synchronizing accurate context with video timestamps Maintaining voice quality and low-latency responses Handling YouTube videos without captions Managing prompt injections and multi-turn queries Voice cloning with limited training data
Accomplishments that we're proud of
Seamless voice conversations with creators mid-video Real-time contextual understanding of paused moments Fully working MVP with ElevenLabs integration A novel UX that blends video consumption with AI-driven dialogue
What we learned
Fine-tuning AI responses based on timestamp context Efficient integration of ElevenLabs for natural-sounding voice output The importance of multi-modal context (text + video) in conversational AI Building usable UX around emerging AI capabilities
What's next for VoxTube
Support for multilingual videos and voices Real-time summarization and Q&A timeline Browser extension to work directly on YouTube Creator opt-in to personalize responses Integration with GPT-4o for audio + video comprehension
Log in or sign up for Devpost to join the conversation.