kwindla (@kwindla) / X

kwindla

6,418 posts

kwindla

@kwindla

Infrastructure and developer tools for real-time voice, video, and AI. @trydaily // ᓚᘏᗢ // @pipecat_ai

San Francisco, CA

machine-theory.com

Joined September 2008

kwindla
@kwindla
Jul 23, 2024
Very, very fast voice bots. Llama 3.1 running on @GroqInc. 🚀 500ms voice-to-voice response times
386K
kwindla
@kwindla
Jun 27, 2024
How to build the world's fastest voice AI bot:
- Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster.
- Route audio over the internet using WebRTC and edge networking.
- Configure timings for voice activity detection,
283K
kwindla
@kwindla
Apr 10, 2025
We wrote down everything we've learned building voice AI agents over the past two years. Core technology choices, minimizing latency, managing multimodal context, interruption handling, turn detection, evals, state machines, guardrails, memory, async and realtime function
154K
kwindla
@kwindla
Mar 6, 2025
Open source, native audio turn detection 🎉🎉🎉 Most voice agents today do turn detection by waiting for speech pauses of a specific, short length. That's not how humans do turn detection when we talk to each other! I've been working with some friends on a new turn detection
164K
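The pause-based approach that the post above describes (waiting for a speech pause of a fixed, short length) can be sketched roughly as follows. This is an illustration only, not the author's code or any library's API: the function name `find_end_of_turn`, the 20 ms frame size, and the 800 ms pause threshold are all assumptions, and the per-frame speech flags are assumed to come from some upstream voice activity detector.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    frames: Iterable[bool],
    frame_ms: int = 20,    # assumed frame duration
    pause_ms: int = 800,   # assumed pause threshold, not a value from the post
) -> Optional[int]:
    """Return the index of the frame where the pause threshold is reached.

    `frames` is a sequence of per-frame flags (True = speech detected).
    Silence before the user has said anything is ignored. Returns None
    if no sufficiently long pause follows speech.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0          # any speech resets the pause timer
        elif heard_speech:
            silence_ms += frame_ms  # accumulate trailing silence
            if silence_ms >= pause_ms:
                return i
    return None
```

A fixed timer like this cannot tell a mid-sentence hesitation from a finished thought, which is the limitation the post contrasts with how humans take turns.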
kwindla
@kwindla
Aug 6, 2025
A voice agent powered by gpt-oss. Running locally on my MacBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The
202K
kwindla
@kwindla
May 8, 2024
Llama 2 70B in 20GB! 4-bit quantized, 40% of layers removed, fine-tuning to "heal" after layer removal. Almost no difference on MMLU compared to base Llama 2 70B. This paper, "The Unreasonable Ineffectiveness of the Deeper Layers," was my airplane reading on the way to a
136K
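As a rough sanity check on the 20GB figure in the post above, the arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch under stated assumptions, not the paper's method: it assumes roughly 70 billion parameters, that nearly all of them sit in the transformer layers so that removing 40% of layers removes about 40% of parameters, and it ignores embeddings and quantization overhead such as scales and zero points. The function name is hypothetical.

```python
def quantized_size_gb(
    n_params: float,
    bits_per_param: float,
    fraction_layers_removed: float,
) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    Assumes parameter count scales linearly with the number of layers
    kept, and ignores embeddings and quantization metadata.
    """
    remaining_params = n_params * (1.0 - fraction_layers_removed)
    total_bytes = remaining_params * bits_per_param / 8.0
    return total_bytes / 1e9


# Assumed inputs: ~70e9 parameters, 4-bit weights, 40% of layers removed.
size_gb = quantized_size_gb(70e9, 4, 0.4)
print(f"approx. {size_gb:.0f} GB")
```

Under these assumptions the estimate comes out at about 21 GB, in the same range as the 20GB quoted in the post.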
kwindla
@kwindla
Jul 27, 2025
Local voice AI with a 235 billion parameter LLM. ✅
- smart-turn v2
- MLX Whisper (large-v3-turbo-q4)
- Qwen3-235B-A22B-Instruct-2507-3bit-DWQ
- Kokoro
All models running locally on an M4 Mac. Max RAM usage ~110GB. Voice-to-voice latency is ~950ms. There are a couple of
67K
kwindla
@kwindla
Apr 6, 2025
Llama 4 voice agent starter kit with @GroqInc and @pipecat_ai
➡️ Groq STT (distil-whisper-large-v3)
➡️ Groq Llama 4 (llama-4-scout-17b-16e-instruct)
➡️ Groq TTS (playai-tts)
➡️ Function calling
➡️ Deploy to Pipecat Cloud for production
➡️ Optionally add a @twilio phone number
59K
kwindla
@kwindla
Dec 22, 2024
Better/faster/cheaper voice AI turn detection with Gemini 2.0
The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the
64K
kwindla
@kwindla
Jul 18, 2025
Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on @huggingface, @fal, and @pipecat_ai.
- 3x faster inference (12ms on an L40)
- 14 languages (13 more than v1, which
42K
kwindla
@kwindla
Sep 4, 2024
Voice AI fast response - phrase endpointing
How does a Voice AI bot/agent/process know when it should process input speech and respond? This problem is called phrase endpointing. Getting phrase endpointing right is critical for voice interactions. The Open Source,
53K
kwindla
@kwindla
May 9, 2025
New speech-to-speech model from Amazon! Nova Sonic has a bidirectional streaming API, multiple voices, tool calling, and good performance on standard benchmarks. There's full support for Nova Sonic in the most recent @pipecat_ai release (0.0.67). Link to docs and to Pipecat
30K
kwindla
@kwindla
Oct 7, 2024
OpenAI Realtime client in 75 lines of Python
I've been hacking on an OpenAI Realtime API service for @pipecat_ai and it occurred to me that the core voice-to-voice loop in pseudo-code is quite small. (Which is a nice testament to the API design!)
77K
kwindla
@kwindla
Dec 20, 2024
Gemini Multimodal Live + WebRTC
Build Gemini voice/video apps with WebRTC SDKs for:
- Web
- React
- React Native
- iOS
- Android
- C++
The SDKs support WebRTC, WebSocket, and HTTP network transport options. Change one line of code to switch protocols. Here's a simple Gemini +
53K