kwindla
Daily
6,418 posts
@kwindla
Infrastructure and developer tools for real-time voice, video, and AI. @trydaily // ᓚᘏᗢ // @pipecat_ai
San Francisco, CA
machine-theory.com
Joined September 2008
3,883 Following · 14.6K Followers · 1 Subscription
  • kwindla
    Daily
    @kwindla
    Jul 23, 2024
    Very, very fast voice bots. Llama 3.1 running on @GroqInc. 🚀 500ms voice-to-voice response times
    386K
  • kwindla
    Daily
    @kwindla
    Jun 27, 2024
    How to build the world's fastest voice AI bot: - Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster. - Route audio over the internet using WebRTC and edge networking. - Configure timings for voice activity detection,
    283K
  • kwindla
    Daily
    @kwindla
    Apr 10, 2025
    We wrote down everything we've learned building voice AI agents over the past two years. Core technology choices, minimizing latency, managing multimodal context, interruption handling, turn detection, evals, state machines, guardrails, memory, async and realtime function
    154K
  • kwindla
    Daily
    @kwindla
    Mar 6, 2025
    Open source, native audio turn detection 🎉🎉🎉 Most voice agents today do turn detection by waiting for speech pauses of a specific, short length. That's not how humans do turn detection when we talk to each other! I've been working with some friends on a new turn detection
    164K
  • kwindla
    Daily
    @kwindla
    Aug 6, 2025
    A voice agent powered by gpt-oss. Running locally on my MacBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The
    202K
  • kwindla
    Daily
    @kwindla
    May 8, 2024
    Llama 2 70B in 20GB! 4-bit quantized, 40% of layers removed, fine-tuning to "heal" after layer removal. Almost no difference on MMLU compared to base Llama 2 70B. This paper, "The Unreasonable Ineffectiveness of the Deeper Layers," was my airplane reading on the way to a
    136K
  • kwindla
    Daily
    @kwindla
    Jul 27, 2025
    Local voice AI with a 235 billion parameter LLM. ✅ - smart-turn v2 - MLX Whisper (large-v3-turbo-q4) - Qwen3-235B-A22B-Instruct-2507-3bit-DWQ - Kokoro All models running local on an M4 Mac. Max RAM usage ~110GB. Voice-to-voice latency is ~950ms. There are a couple of
    67K
  • kwindla
    Daily
    @kwindla
    Apr 6, 2025
    Llama 4 voice agent starter kit with @GroqInc and @pipecat_ai ➡️ Groq STT (distil-whisper-large-v3) ➡️ Groq Llama 4 (llama-4-scout-17b-16e-instruct) ➡️ Groq TTS (playai-tts) ➡️ Function calling ➡️ Deploy to Pipecat Cloud for production ➡️ Optionally add a @twilio phone number
    59K
  • kwindla
    Daily
    @kwindla
    Dec 22, 2024
    Better/faster/cheaper voice AI turn detection with Gemini 2.0 The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the
    64K
  • kwindla
    Daily
    @kwindla
    Jul 18, 2025
    Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on @huggingface, @fal, and @pipecat_ai. - 3x faster inference (12ms on an L40) - 14 languages (13 more than v1, which
    42K
  • kwindla
    Daily
    @kwindla
    Sep 4, 2024
    Voice AI fast response - phrase endpointing How does a Voice AI bot/agent/process know when it should process input speech and respond? This problem is called phrase endpointing. Getting phrase endpointing right is critical for voice interactions. The Open Source,
    53K
  • kwindla
    Daily
    @kwindla
    May 9, 2025
    New speech-to-speech model from Amazon! Nova Sonic has a bidirectional streaming API, multiple voices, tool calling, and good performance on standard benchmarks. There's full support for Nova Sonic in the most recent @pipecat_ai release (0.0.67). Link to docs and to Pipecat
    30K
  • kwindla
    Daily
    @kwindla
    Oct 7, 2024
    OpenAI Realtime client in 75 lines of Python I've been hacking on an OpenAI Realtime API service for @pipecat_ai and it occurred to me that the core voice-to-voice loop in pseudo-code is quite small. (Which is a nice testament to the API design!)
    77K
  • kwindla
    Daily
    @kwindla
    Dec 20, 2024
    Gemini Multimodal Live + WebRTC Build Gemini voice/video apps with WebRTC SDKs for: - Web - React - React Native - iOS - Android - C++ The SDKs support WebRTC, WebSocket, and HTTP network transport options. Change one line of code to switch protocols. Here's a simple Gemini +
    53K
