Inspiration: When you’re deep in a project, you spend more time thinking about intent than typing syntax. Yet our tools still force us to poke at the keyboard for every little change. That gap got us thinking: what if you could talk to your code the same way you chat with a teammate? With the rapid progress in high-quality voice AI, especially from ElevenLabs, we saw an opportunity to push for a truly voice-first coding experience: not just converting speech to text, but running a real, flowing conversation with your editor.

What VoxDiff Does:

VoxDiff is a voice-first Visual Studio Code extension that lets you:

  1. Speak natural-language commands like “Add a null check here”
  2. Have patches applied automatically, without manual confirmations
  3. Hear natural AI voice explanations of what changed
  4. Keep talking in a hands-free, conversational workflow

The aim is to make coding feel less like issuing commands and more like having a dialogue with your editor.

How We Built It:

VoxDiff sits on three tightly integrated layers:

  1. VS Code Extension (Frontend)

Built with the VS Code Extension API, the frontend uses browser-based speech recognition for speech-to-text. It maintains chat history and the current code-selection context, applies patches automatically without extra confirmation, and plays AI-generated voice responses directly in the editor.

  2. Backend (FastAPI)

The backend, built with FastAPI, handles everything behind the scenes once a voice command is given. It receives the spoken instructions along with the selected code and uses Google Gemini for structured code understanding and precise patch generation. To ensure safety and consistency, the backend enforces strict JSON outputs, making every edit deterministic and reversible. Essentially, it orchestrates the entire process, transforming user intent into clean, reliable code modifications.
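A minimal sketch of what strict, reversible JSON patches can look like. The schema, field names, and helper functions here are our own illustrative assumptions, not VoxDiff’s actual API; the real backend wraps this logic in a FastAPI endpoint and gets the raw output from Gemini.

```python
import json

# Hypothetical patch schema: the model must return exactly these fields,
# so every change is deterministic and can be inverted for undo.
REQUIRED_FIELDS = {"find", "replace", "explanation"}

def parse_patch(model_output: str) -> dict:
    """Reject anything that is not a well-formed patch object."""
    patch = json.loads(model_output)  # raises on non-JSON model output
    if set(patch) != REQUIRED_FIELDS:
        raise ValueError(f"unexpected fields: {set(patch)}")
    return patch

def apply_patch(code: str, patch: dict) -> tuple[str, dict]:
    """Apply the edit once, and return an inverse patch for undo."""
    if patch["find"] not in code:
        raise ValueError("target text not found; refusing to guess")
    inverse = {
        "find": patch["replace"],
        "replace": patch["find"],
        "explanation": "undo: " + patch["explanation"],
    }
    return code.replace(patch["find"], patch["replace"], 1), inverse

# Example: the model proposes adding a null check to the selection
raw = '{"find": "user.name", "replace": "user?.name", "explanation": "Added a null check."}'
new_code, undo = apply_patch("return user.name;", parse_patch(raw))
print(new_code)  # → return user?.name;
```

Validating the shape before touching the editor is what keeps automatic, no-confirmation edits safe: a malformed model response fails loudly instead of silently corrupting code.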

  3. ElevenLabs Voice AI

On the voice side, ElevenLabs brings the assistant to life with natural, expressive speech. It converts the AI’s responses into spoken audio and streams them back to the VS Code extension as base64-encoded audio. This enables real-time spoken feedback after every change, making interactions feel fluid, human, and genuinely conversational rather than robotic.
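The transport step can be sketched like this. Field names are illustrative assumptions; the real audio bytes come from the ElevenLabs API rather than the placeholder used here.

```python
import base64
import json

def wrap_audio_for_json(audio_bytes: bytes, text: str) -> str:
    """Package raw TTS audio as base64 so it can travel in a JSON response."""
    return json.dumps({
        "spoken_text": text,
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
    })

def unwrap_audio(payload: str) -> bytes:
    """What the extension side does before handing the audio to a player."""
    return base64.b64decode(json.loads(payload)["audio_b64"])

# Simulated TTS output (real audio would come from ElevenLabs)
fake_audio = b"\x00\x01\x02fake-audio-frames"
payload = wrap_audio_for_json(fake_audio, "I added the null check.")
assert unwrap_audio(payload) == fake_audio
```

Base64 inflates the payload by roughly a third, but it lets the audio ride inside the same JSON response as the patch, which keeps the extension-backend protocol to a single round trip per command.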

What We Learned:

  1. Voice UX requires much tighter state management than text interfaces
  2. Automatic edits must be predictable, reversible, and safe
  3. Speech latency has a big impact on perceived intelligence
  4. High-quality TTS greatly increases trust in an AI system

We also got hands-on with:

  1. VS Code extension internals
  2. Real-time AI orchestration
  3. Audio streaming and playback
  4. Prompt-safe code generation

Challenges We Faced:

Key hurdles included:

  1. Keeping editor state intact while flipping between voice input and code

What’s Next:

Looking ahead, we’re eyeing improvements like:

  1. Continuous voice conversations without hitting a button
  2. Streaming ElevenLabs audio in real time as edits apply
  3. Voice-driven multi-file refactors
  4. Accessibility-first coding workflows
  5. Shared voice sessions for team collaboration

Final Thought:

VoxDiff isn’t about replacing coding; it’s about changing how we talk to code. With voice as the interface and ElevenLabs providing the speech, VoxDiff turns programming into a conversation.
