Inspiration

YouTube is a goldmine of knowledge, but it's a passive experience. We often wish we could pause a video and ask the creator a question — especially during tutorials, lectures, or podcasts. VoxTube was born from this simple idea: What if you could talk to the YouTuber at the exact moment you pause the video — and in their voice?

What it does

VoxTube lets users pause any YouTube video and have a voice-based conversation with the creator, powered by ElevenLabs. The AI understands the video’s content at the exact timestamp and answers user questions contextually — in the creator’s cloned voice. It makes learning and exploring content deeply engaging and interactive.

How we built it

- Frontend: A custom video player with pause-and-chat controls
- Backend: Transcripts are mapped to timestamps using YouTube's captions or Whisper
- Voice Cloning: ElevenLabs API to generate the creator's voice dynamically
- Conversational AI: LLMs fine-tuned to respond based on paused video context
- State Management: Ensures responses align with the current video timestamp
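The timestamp-mapping step can be sketched roughly like this (a minimal illustration, not our production code: the transcript data, `context_at` name, and `window` parameter are hypothetical stand-ins for segments derived from YouTube captions or Whisper):

```python
import bisect

# Hypothetical transcript: (start_time_seconds, text) per segment,
# as could come from YouTube captions or a Whisper transcription.
TRANSCRIPT = [
    (0.0,  "Welcome back! Today we're covering binary search trees."),
    (12.5, "First, let's look at how insertion works."),
    (34.0, "Notice that every left child is smaller than its parent."),
    (58.2, "Now let's analyze the time complexity."),
]

def context_at(transcript, paused_at, window=2):
    """Return the transcript segments surrounding the paused timestamp.

    Binary-searches segment start times to find the segment playing at
    `paused_at`, then includes up to `window` preceding segments so the
    LLM sees what the creator just said, not only the current sentence.
    """
    starts = [start for start, _ in transcript]
    # Index of the last segment whose start time is <= paused_at.
    idx = max(bisect.bisect_right(starts, paused_at) - 1, 0)
    lo = max(idx - window, 0)
    return [text for _, text in transcript[lo:idx + 1]]

# User pauses at 0:40 -> current segment plus two before it.
print(context_at(TRANSCRIPT, 40.0))
```

The returned snippet is then fed into the LLM prompt as context before the user's question, which is what keeps answers aligned with the paused moment.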

Challenges we ran into

- Synchronizing accurate context with video timestamps
- Maintaining voice quality and low-latency responses
- Handling YouTube videos without captions
- Managing prompt injections and multi-turn queries
- Voice cloning with limited training data

Accomplishments that we're proud of

- Seamless voice conversations with creators mid-video
- Real-time contextual understanding of paused moments
- A fully working MVP with ElevenLabs integration
- A novel UX that blends video consumption with AI-driven dialogue

What we learned

- Fine-tuning AI responses based on timestamp context
- Efficient integration of ElevenLabs for natural-sounding voice output
- The importance of multi-modal context (text + video) in conversational AI
- Building usable UX around emerging AI capabilities

What's next for VoxTube

- Support for multilingual videos and voices
- Real-time summarization and a Q&A timeline
- A browser extension to work directly on YouTube
- Creator opt-in to personalize responses
- Integration with GPT-4o for audio + video comprehension
