Inspiration
Symphandy was born from a moment of awe while scrolling through X (formerly Twitter). I stumbled upon the work of measure_plan, specifically his radial music app, and was captivated by his use of computer vision to create sound from motion. It sparked a question: What if we could take that concept of "air conducting" and turn it into a full-fledged collaborative instrument?
Musical expression often comes with high barriers—expensive equipment and years of practice. Inspired by measure_plan's vision of touchless interfaces, we set out to democratize this experience. We wanted to turn the ubiquitous webcam into a high-fidelity MIDI controller, allowing anyone to "conduct" music as intuitively as waving their hands, blending the physical joy of gesture with the precision of digital synthesis.
What it does
Symphandy turns your browser into a touchless musical interface. By tracking your hand movements in real-time, it allows you to:
- Play Music with Gestures: Your left hand controls the bass, and your right hand controls the melody.
- Intuitive Control: Tilt your hand to change notes, pinch your fingers to control volume, and make fists to switch scales instantly.
- Jam with Friends: The collaborative mode lets two people play together over the internet with ultra-low latency—one on bass, one on melody.
- Compose with AI: Integrated LLMs can generate complex musical sequences based on your chosen mood and scale, acting as an endless source of backing tracks.
- Record & Export: Performances can be recorded as WAV files or exported as MIDI for use in professional DAWs like Ableton or Logic Pro.
How we built it
Symphandy is a complex orchestration of modern web technologies, combining Computer Vision, Real-Time Audio, and Generative AI.
Computer Vision: We used MediaPipe Tasks Vision for high-performance hand tracking. We map 21 3D landmarks per hand to musical parameters using trigonometry.
- Note Selection: We calculate the angle $\theta$ of the hand to map to a musical scale $S$: $$ \text{Note}_{\text{index}} = \left\lfloor \frac{\theta + \pi}{2\pi} \times |S| \right\rfloor \bmod |S| $$
- Volume Control: We use the Euclidean distance $d$ between the thumb ($T$) and index finger ($I$): $$ d(T, I) = \sqrt{(x_T - x_I)^2 + (y_T - y_I)^2} $$
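To make the mapping concrete, here is a minimal TypeScript sketch of both formulas. The MediaPipe landmark indices are standard (wrist = 0, thumb tip = 4, index tip = 8, middle-finger MCP = 9), but the scale array and function names are illustrative assumptions, not Symphandy's actual code.

```typescript
type Landmark = { x: number; y: number; z: number };

const SCALE = [0, 2, 4, 7, 9]; // example pentatonic scale S (semitone offsets)

// Note selection: hand tilt theta -> index into the scale S.
function noteIndexFromHand(landmarks: Landmark[]): number {
  const wrist = landmarks[0];
  const middleMcp = landmarks[9];
  const theta = Math.atan2(middleMcp.y - wrist.y, middleMcp.x - wrist.x); // (-pi, pi]
  return Math.floor(((theta + Math.PI) / (2 * Math.PI)) * SCALE.length) % SCALE.length;
}

// Volume control: thumb-index pinch distance d(T, I), clamped to [0, 1],
// assuming a maximum pinch spread of 0.25 in normalized coordinates.
function volumeFromPinch(landmarks: Landmark[]): number {
  const thumbTip = landmarks[4];
  const indexTip = landmarks[8];
  const d = Math.hypot(thumbTip.x - indexTip.x, thumbTip.y - indexTip.y);
  return Math.min(d / 0.25, 1);
}
```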
Audio Engine: Sound is synthesized entirely in the browser using the Web Audio API and Tone.js. We built a custom dual-oscillator synth engine with dynamic filters and effects (reverb, delay) to ensure rich, studio-quality sound without external samples.
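As a rough illustration of that architecture (not our exact patch), a Tone.js dual-oscillator voice routed through delay and reverb might look like the sketch below; every parameter value here is an assumption.

```typescript
import * as Tone from "tone";

// Effects chain: short feedback delay into a room-sized reverb (values assumed).
const reverb = new Tone.Reverb({ decay: 3, wet: 0.3 });
const delay = new Tone.FeedbackDelay("8n", 0.35);

// DuoSynth layers two slightly detuned oscillators per note for a thicker tone.
const lead = new Tone.DuoSynth({
  harmonicity: 1.01,
  voice0: { oscillator: { type: "sawtooth" } },
  voice1: { oscillator: { type: "square" } },
}).chain(delay, reverb, Tone.Destination);

// Browsers require a user gesture before the audio context can start.
document.addEventListener(
  "click",
  async () => {
    await Tone.start();
    lead.triggerAttackRelease("C4", "8n", Tone.now(), 0.8); // note, duration, time, velocity
  },
  { once: true }
);
```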
AI Integration: We leveraged Groq and the Vercel AI SDK to generate musical compositions. The system prompts an LLM with musical theory constraints, returning JSON data that our sequencer plays back in real-time.
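A hedged sketch of that flow using the AI SDK's `generateObject` with a Groq model is below; the model name, schema fields, and prompt wording are illustrative assumptions rather than our exact configuration.

```typescript
import { generateObject } from "ai";
import { groq } from "@ai-sdk/groq";
import { z } from "zod";

// Shape of the sequence the sequencer plays back (fields are assumptions).
const sequenceSchema = z.object({
  bpm: z.number().min(60).max(180),
  steps: z.array(
    z.object({
      note: z.string(),     // e.g. "E3"
      duration: z.string(), // e.g. "8n"
      time: z.string(),     // transport time, e.g. "0:1:2"
    })
  ),
});

export async function generateBackingTrack(mood: string, scale: string) {
  const { object } = await generateObject({
    model: groq("llama-3.3-70b-versatile"), // example model id
    schema: sequenceSchema,
    prompt:
      `Compose a 16-step backing sequence in ${scale} with a ${mood} mood. ` +
      `Only use notes from that scale.`,
  });
  return object; // JSON the sequencer can schedule directly
}
```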
Real-Time Sync: Multi-user collaboration is powered by WebRTC and Firebase. We established peer-to-peer data channels to transmit gesture data and sequencer state with millisecond precision.
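A minimal sketch of such a gesture channel follows; the Firebase signaling exchange (offers, answers, ICE candidates) is omitted, and the message shape is an assumption.

```typescript
const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
});

// Unordered, no-retransmit delivery: a stale gesture frame is worse than a lost one.
const gestureChannel = pc.createDataChannel("gestures", {
  ordered: false,
  maxRetransmits: 0,
});

gestureChannel.onmessage = (event) => {
  const { hand, note, volume } = JSON.parse(event.data);
  // Feed the remote player's state into the local audio engine and visuals.
};

export function sendGesture(hand: "left" | "right", note: number, volume: number) {
  if (gestureChannel.readyState === "open") {
    gestureChannel.send(JSON.stringify({ hand, note, volume, t: performance.now() }));
  }
}
```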
Challenges we ran into
The Latency Battle: Music requires absolute precision; anything above 30ms feels sluggish. The computer vision pipeline is heavy, so we had to decouple the visual tracking loop from the audio engine, using requestAnimationFrame to interpolate values smoothly so the sound remains responsive even if the camera drops frames.
Background Throttling: Browsers aggressively throttle tabs that aren't in focus, which initially broke our collaborative mode. We solved this by moving our timing logic into a Web Worker, which runs on a separate thread and maintains accurate timing even when the tab is in the background.
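A minimal sketch of that worker-clock idea, assuming a 25 ms tick and illustrative message names:

```typescript
// Inline worker that keeps ticking even when the tab is hidden, unlike
// main-thread setInterval/requestAnimationFrame.
const workerSource = `
  let interval = null;
  onmessage = (e) => {
    if (e.data === "start") {
      interval = setInterval(() => postMessage("tick"), 25);
    } else if (e.data === "stop") {
      clearInterval(interval);
    }
  };
`;

const clock = new Worker(
  URL.createObjectURL(new Blob([workerSource], { type: "application/javascript" }))
);

clock.onmessage = () => {
  // Advance the sequencer / sync state here on every tick.
};

clock.postMessage("start");
```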
Musical Math: Mapping 3D vector space to a 2D musical scale was non-trivial. Hand movements are noisy; we had to implement smoothing algorithms to prevent "jittery" notes while keeping the system responsive enough for fast playing.
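One simple way to tame that jitter is an exponential moving average per parameter; the alpha values below are assumptions for illustration, not our tuned settings.

```typescript
// Returns a stateful smoother: blends each raw reading with the running value.
function makeSmoother(alpha = 0.3) {
  let value: number | null = null;
  return (raw: number): number => {
    value = value === null ? raw : alpha * raw + (1 - alpha) * value;
    return value;
  };
}

const smoothAngle = makeSmoother(0.3);  // light smoothing keeps fast playing responsive
const smoothVolume = makeSmoother(0.2); // heavier smoothing for level changes

// Per video frame:
// const theta = smoothAngle(rawTheta);
// const volume = smoothVolume(rawVolume);
```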
Solo Demonstration: Recording the demo and demonstrating the WebRTC/P2P collaborative features was a unique challenge. Since I was working alone, I had to simulate a multi-user environment using a single webcam across two split screens, carefully managing focus and input to show the real-time sync in action.
Accomplishments that we're proud of
- Seamless Collaboration: Getting two users to jam together over the internet with near-zero latency was a massive technical hurdle, and seeing it work seamlessly is magical.
- The Aesthetic: We didn't just want it to work; we wanted it to feel like a sci-fi artifact. Achieving the retro-futuristic, "synthwave" UI with real-time 3D visualizations (using Three.js and React Three Fiber) running alongside heavy AI and CV tasks without lagging the browser is a point of pride.
- AI as a Co-Pilot: Successfully integrating LLMs not just to generate text, but to write playable music that users can jam along with in real-time.
What we learned
- The Multi-LLM Workflow: This project taught us the power of orchestration across different AI providers. We iterated on the initial core concepts in Google AI Studio, built the bulk of the robust functionality inside Cursor, and used Antigravity for the complex finishing touches and polish. We even designed specific UI screens using Nanobanana and fed those designs back into our coding assistants to implement pixel-perfect styles.
- Technical Mastery: We gained deep hands-on experience with WebRTC signaling and the infinite intricacies of the Tone.js audio graph during the build. We also delved into ElevenLabs, learning how to leverage their API for dynamic voice synthesis to give the application a voice.
- Music Theory: We had to learn more music theory than expected—understanding scales, intervals, and harmonics was essential to translate raw hand angles into something that sounds pleasing rather than chaotic.
- UX for Invisible Interfaces: Designing for "air gestures" requires different thinking than mouse/keyboard. Visual feedback is critical—users need to see their virtual "skeleton" to understand where the invisible buttons are.
What's next for Symphandy
- VR/XR Support: Porting the experience to Vision Pro or Quest for a fully immersive 3D conducting experience.
- Custom Sound Design: Allowing users to build their own synth patches visually.
- Global Jam Rooms: Expanding collaboration from 1-on-1 to full bands (drums, chords, melody) with up to 4 players.
- Vocal Integration: Using the microphone to let performers layer vocals or beatboxing over the hand-controlled melody.