Infrastructure and developer tools for real-time voice, video, and AI. @trydaily // @pipecat_ai

- How to build the world's fastest voice AI bot:
  - Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster.
  - Route audio over the internet using WebRTC and edge networking.
  - Configure timings for voice activity detection,
- We wrote down everything we've learned building voice AI agents over the past two years. Core technology choices, minimizing latency, managing multimodal context, interruption handling, turn detection, evals, state machines, guardrails, memory, async and realtime function
- Open source, native audio turn detection. Most voice agents today do turn detection by waiting for speech pauses of a specific, short length. That's not how humans do turn detection when we talk to each other! I've been working with some friends on a new turn detection
- A voice agent powered by gpt-oss. Running locally on my MacBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The
- Llama 2 70B in 20GB! 4-bit quantized, 40% of layers removed, fine-tuning to "heal" after layer removal. Almost no difference on MMLU compared to base Llama 2 70B. This paper, "The Unreasonable Ineffectiveness of the Deeper Layers," was my airplane reading on the way to a
- Local voice AI with a 235 billion parameter LLM.
  - smart-turn v2
  - MLX Whisper (large-v3-turbo-q4)
  - Qwen3-235B-A22B-Instruct-2507-3bit-DWQ
  - Kokoro

  All models running locally on an M4 Mac. Max RAM usage ~110GB. Voice-to-voice latency is ~950ms. There are a couple of
- Llama 4 voice agent starter kit with @GroqInc and @pipecat_ai
  - Groq STT (distil-whisper-large-v3)
  - Groq Llama 4 (llama-4-scout-17b-16e-instruct)
  - Groq TTS (playai-tts)
  - Function calling
  - Deploy to Pipecat Cloud for production
  - Optionally add a @twilio phone number
- Better/faster/cheaper voice AI turn detection with Gemini 2.0. The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the
- Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on @huggingface, @fal, and @pipecat_ai.
  - 3x faster inference (12ms on an L40)
  - 14 languages (13 more than v1, which
- Voice AI fast response: phrase endpointing. How does a Voice AI bot/agent/process know when it should process input speech and respond? This problem is called phrase endpointing. Getting phrase endpointing right is critical for voice interactions. The Open Source,
- New speech-to-speech model from Amazon! Nova Sonic has a bidirectional streaming API, multiple voices, tool calling, and good performance on standard benchmarks. There's full support for Nova Sonic in the most recent @pipecat_ai release (0.0.67). Link to docs and to Pipecat
- OpenAI Realtime client in 75 lines of Python. I've been hacking on an OpenAI Realtime API service for @pipecat_ai and it occurred to me that the core voice-to-voice loop in pseudo-code is quite small. (Which is a nice testament to the API design!)
- Gemini Multimodal Live + WebRTC. Build Gemini voice/video apps with WebRTC SDKs for:
  - Web
  - React
  - React Native
  - iOS
  - Android
  - C++

  The SDKs support WebRTC, WebSocket, and HTTP network transport options. Change one line of code to switch protocols. Here's a simple Gemini +