Inspiration

This agent was inspired by a call to a retail pharmacy where the automated system worked perfectly, the menus were infinite, and human assistance existed purely as a theoretical concept.

After navigating multiple IVR layers, I waited over 30 minutes to speak to a person, said one sentence, and was placed on hold again for another 10 minutes.

Nothing was resolved — but the system performed exactly as designed.

Try our troll pharmacy phone

💊 Welcome to our totally real pharmacy

Call +1 (619) 848-5106

  • Fight through multiple menus.
  • Endure holding music.
  • Attempt to reach a human.

Success is not guaranteed.

What it does

Waiting on the phone is a tax on people who can least afford it.

Many low-income residents, seniors, and non-native English speakers struggle to access basic community resources—not because they don’t exist, but because everything requires long, stressful phone calls.

Our project helps users navigate phone calls without replacing them. We use AI to prepare call scripts, wait through automated systems, and summarize essential information—while clearly disclosing that it’s an AI assistant and allowing the user to stay in control at all times.

The result: fewer missed resources, less anxiety, and more equitable access to help that already exists.

This is not about replacing humans—it’s about giving people their time and dignity back.

How we built it

We built this by combining Node.js, which handles the high-concurrency telephony I/O, with Python, which runs our custom AI signaling agent. The core challenge was continuously bridging legacy telephony audio (8kHz μ-law) with modern AI streams, which we solved by building a custom bidirectional gRPC framework.

Most of the system runs in real time, with a custom finite state machine (FSM) acting as the control layer that manages interruptions and turn-taking.

Tech Stack

We didn’t stick to a single language; instead we split the stack to use the best ecosystem for each part of the system. Node.js handles Twilio’s high-volume media streams, since it’s well suited to large numbers of concurrent WebSocket connections and async I/O. Python runs the AI “brain,” where tools like PyTorch and Google MediaPipe VAD are most mature and effective. The two services communicate through bidirectional gRPC streams to pass audio and events with very low latency, while a Next.js and XState frontend provides a real-time debug console to visualize conversation state.

Connections

The system works as a live streaming pipeline rather than a simple chain of API calls. Twilio streams raw audio to our Node.js service over WebSocket, which immediately forwards it to Python through gRPC. The Python service processes the audio in memory using Silero VAD, paying attention not just to speech but to silence and interruptions. When the user stops talking, Python signals back to Node.js to trigger the LLM or move the IVR to the next state.
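
A simplified sketch of that Python service is below. The proto messages (AudioChunk, VadEvent) and the end-of-turn threshold are illustrative stand-ins for our real schema, and frames are assumed to arrive as 512-sample windows of 16kHz PCM, Silero’s expected chunk size:

```python
# Hypothetical sketch of the Python side of the bridge. Assumes a proto like
# audio_bridge.proto compiled to audio_bridge_pb2 / audio_bridge_pb2_grpc with
# an AudioChunk request and a VadEvent response; all names are illustrative.
import numpy as np
import torch

import audio_bridge_pb2 as pb
import audio_bridge_pb2_grpc as pb_grpc

# Silero VAD from torch.hub; it scores 512-sample chunks of 16kHz audio.
model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

END_OF_TURN_FRAMES = 15  # ~480ms of silence; tuned against real IVR pacing


class AudioBridge(pb_grpc.AudioBridgeServicer):
    def StreamAudio(self, request_iterator, context):
        """Bidirectional stream: consume audio chunks, emit turn-taking events."""
        silent_frames = 0
        for chunk in request_iterator:
            # chunk.pcm16 is assumed to already be 16kHz linear PCM
            # (the transcoding layer described later runs upstream of this).
            samples = np.frombuffer(chunk.pcm16, dtype=np.int16)
            audio = torch.from_numpy(samples.astype(np.float32) / 32768.0)
            speech_prob = model(audio, 16000).item()

            if speech_prob > 0.5:
                silent_frames = 0
                yield pb.VadEvent(type=pb.VadEvent.SPEECH)
            else:
                silent_frames += 1
                if silent_frames == END_OF_TURN_FRAMES:
                    # The user stopped talking: tell Node.js to trigger the
                    # LLM or advance the IVR to its next state.
                    yield pb.VadEvent(type=pb.VadEvent.END_OF_TURN)
```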

Architecture

We didn’t build everything in Python because handling high-volume audio WebSockets there can get heavy under load. Using Node lets us scale the “listening” layer separately from the AI logic. For the hackathon, we traded a simple one-container setup for better performance and flexibility, which made it easy to swap out the VAD model or LLM later without changing the telephony code.

Challenges we ran into

State Management

We had to keep state in sync across two services and a live phone call. To solve this, we built a custom finite state machine in Python that acts as the single source of truth, tracking stages like welcome, menu, and verification, and carefully controlling when the system listens versus when it responds so the AI never talks over the user.
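
A stripped-down version of the idea (the real machine has more stages and guards):

```python
# Simplified sketch of the call FSM; stage names match the ones above.
from enum import Enum, auto


class Stage(Enum):
    WELCOME = auto()
    MENU = auto()
    VERIFICATION = auto()
    DONE = auto()


class CallFSM:
    """Single source of truth for one call: which stage we're in and
    whether the system may speak right now or must keep listening."""

    TRANSITIONS = {
        (Stage.WELCOME, "greeting_done"): Stage.MENU,
        (Stage.MENU, "option_selected"): Stage.VERIFICATION,
        (Stage.VERIFICATION, "verified"): Stage.DONE,
    }

    def __init__(self):
        self.stage = Stage.WELCOME
        self.user_speaking = False  # flipped by VAD events

    def on_event(self, event: str) -> Stage:
        # Unknown events leave the stage unchanged rather than crashing the call.
        self.stage = self.TRANSITIONS.get((self.stage, event), self.stage)
        return self.stage

    def may_respond(self) -> bool:
        # The AI only speaks when the user isn't: this single check is what
        # keeps the assistant from talking over the caller.
        return not self.user_speaking
```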

Real-time Audio Surgery & Latency

Working with legacy phone systems turned out to be messy. Phone calls use 8kHz μ-law audio, while modern AI models expect 16kHz linear PCM, so we built a custom transcoding layer inside the gRPC stream to decode and resample audio on the fly without adding noticeable latency.
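
The core of that layer fits in a few lines. This sketch uses the standard-library audioop module (deprecated since Python 3.11 and removed in 3.13, where the audioop-lts package is a drop-in replacement); the class is an illustration, not our exact code:

```python
# Sketch: 8kHz mu-law (Twilio) -> 16kHz linear PCM (what the models expect).
import audioop


class Transcoder:
    def __init__(self):
        self._state = None  # ratecv carries resampler state across chunks

    def mulaw_8k_to_pcm_16k(self, mulaw_bytes: bytes) -> bytes:
        pcm_8k = audioop.ulaw2lin(mulaw_bytes, 2)  # decode to 16-bit linear PCM
        pcm_16k, self._state = audioop.ratecv(
            pcm_8k, 2, 1, 8000, 16000, self._state  # width=2 bytes, mono
        )
        return pcm_16k
```

Carrying the resampler state between chunks matters: resetting it on every frame introduces discontinuities at chunk boundaries.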

Natural Turn-Taking

Tuning voice activity detection was our biggest engineering challenge. If it was too sensitive, the AI would cut users off mid-sentence, and if it was too slow, responses felt awkwardly delayed. To balance this, we built VAD and ASR synchronization logic that buffers audio and can cancel a response when the user starts speaking again, allowing natural barge-in.
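
Conceptually, barge-in reduces to making the speaking side cancellable. A minimal asyncio sketch, where play_tts is a placeholder for whatever streams synthesized audio back toward the caller:

```python
# Sketch of barge-in: the current response plays inside a cancellable task,
# and a VAD "speech started" event kills it immediately.
import asyncio


class TurnManager:
    def __init__(self):
        self._speaking: asyncio.Task | None = None

    def start_response(self, text: str) -> None:
        self._speaking = asyncio.create_task(play_tts(text))

    def on_user_speech_start(self) -> None:
        # The instant the user starts talking, stop talking.
        if self._speaking and not self._speaking.done():
            self._speaking.cancel()


async def play_tts(text: str) -> None:
    """Placeholder: stream synthesized audio frames until done or cancelled."""
    ...
```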

Accomplishments that we're proud of

It actually works (and it’s fast)

This isn't just a mocked-up demo video. We genuinely connected the telephone network (Twilio) to a local AI brain. You can call the number right now, and it will respond with sub-second latency. Getting the round-trip latency (Phone -> Node -> Python -> LLM -> Python -> Node -> Phone) low enough to feel like a natural conversation was a huge win.

Impact: Restoring Dignity to the Call Queue

We built an AI assistant that helps people navigate long phone systems and showed that technology can reduce stress and improve access to help without replacing human interaction. Instead of replacing the call center agents, we empowered the callers—giving them back their time and key information.

Custom gRPC Bridge

Instead of using slow HTTP requests, we engineered a bi-directional gRPC streaming service. This allowed us to pipe raw audio data and control events between our Node.js telephony server and Python AI server seamlessly. It was a complex systems engineering challenge that paid off in performance.
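
Wiring a servicer like the one sketched under Connections into a running server is then only a few lines; the module and stub names are again assumptions from a hypothetical audio_bridge.proto:

```python
# Minimal bootstrap for the Python side of the bridge (names illustrative).
from concurrent import futures

import grpc

import audio_bridge_pb2_grpc as pb_grpc
from audio_bridge_service import AudioBridge  # the servicer sketched earlier


def serve(port: int = 50051) -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    pb_grpc.add_AudioBridgeServicer_to_server(AudioBridge(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```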

Handling Real-world Chaos

We successfully handled "Barge-in" (interruption). If you start talking while the AI is speaking, it shuts up immediately. Implementing this required a tight feedback loop between the audio output layer and the VAD input layer, which is something many commercial APIs still struggle with.

What we learned

The beginners on our team learned and shipped with stacks they had never used before. More significantly, we learned that small, supportive uses of AI can make a big difference when they respect user control, transparency, and real-world constraints like existing phone systems.

What's next for CallBuddy for Access

  • Expand the system to handle more real-world call scenarios and edge cases.
  • Make it accessible to more people through easier onboarding and broader language support.
  • Partner with community organizations to test, refine, and deploy it where it can have the most impact.

Built With

grpc, next.js, node.js, python, pytorch, silero-vad, twilio, websockets, xstate
