Inspiration

Developers building GenAI applications often struggle to understand how latency and throughput affect user experience. With the rise of multi-agent systems, performance bottlenecks can come from coordination overhead, model inference delays, or network latency. We wanted to build a tool that helps developers evaluate, debug, and improve their multi-agent systems, starting with performance.

What it does

Captain Perf is a real-time benchmarking and simulation tool for AI agents built using the Agent Development Kit (ADK). It lets developers enter the endpoint of their multi-agent system, select test presets (e.g., chat, code generation, data retrieval), and measure real-world performance under load.

It reports detailed metrics such as token-level latency, time to first token (TTFT), throughput, and agent handoff timing. It also includes a Latency Simulation Chat that mimics how delays affect user experience across multi-agent conversations.
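To make these metrics concrete, here is a minimal sketch of how TTFT, mean inter-token latency, and throughput could be derived from the arrival times of streamed tokens. The names `StreamMetrics` and `summarize_stream` are hypothetical and chosen for illustration; this is not Captain Perf's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamMetrics:
    ttft_s: float              # time to first token, in seconds
    mean_inter_token_s: float  # average gap between consecutive tokens
    tokens_per_s: float        # tokens received per second of total request time


def summarize_stream(request_start: float, token_times: List[float]) -> StreamMetrics:
    """Summarize one streamed response.

    request_start and token_times are timestamps in seconds taken from the
    same clock (assumed here; e.g. time.perf_counter()).
    """
    if not token_times:
        raise ValueError("no tokens received")

    ttft = token_times[0] - request_start

    # Gaps between consecutive tokens give the token-level latency.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0

    # Throughput is measured over the whole request, including the wait for the first token.
    total = token_times[-1] - request_start
    tokens_per_s = len(token_times) / total if total > 0 else float("inf")

    return StreamMetrics(ttft, mean_gap, tokens_per_s)
```

Measuring throughput from the request start rather than from the first token is a design choice: it reflects what the user waits for end to end, at the cost of mixing queueing time into the generation rate.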

How we built it

We built Captain Perf using:

  • Python ADK for agent orchestration
  • Google Cloud Run to host inference services
  • React frontend for UI/UX
  • Chart.js for visualization

Our benchmarking harness sends structured prompts across multiple types of agent workflows and records timing data at each step in the chain.
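Recording timing data at each step in the chain can be sketched with a small context manager. This is an illustrative assumption about how such a harness might be structured; `StepTimer` and the step names used with it are hypothetical and not taken from the project.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List, Tuple


class StepTimer:
    """Collects (step_name, duration_seconds) records for one benchmark run."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        # The clock is injectable so the timer can be tested without sleeping.
        self._clock = clock
        self.records: List[Tuple[str, float]] = []

    @contextmanager
    def step(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised an exception.
            self.records.append((name, self._clock() - start))

    def totals(self) -> Dict[str, float]:
        """Sum durations per step name, e.g. when an agent is called several times."""
        out: Dict[str, float] = {}
        for name, duration in self.records:
            out[name] = out.get(name, 0.0) + duration
        return out
```

In use, each agent call or handoff would be wrapped in `with timer.step("some_agent"):`, and `timer.records` would then hold the ordered timings for that workflow.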

Challenges we ran into

  • Coordinating agent-level benchmarks and aggregating timing metrics
  • Designing a chat interface that accurately mimics delay patterns without confusing users
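
As a rough illustration of the aggregation problem, the sketch below groups per-agent timing samples and reduces them to a count, median, 95th percentile, and maximum. The nearest-rank percentile method and the names `percentile` and `aggregate` are assumptions made for this example, not a description of how Captain Perf aggregates its data.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def aggregate(samples: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (agent_name, duration_seconds) samples by agent and summarize each group."""
    by_agent: Dict[str, List[float]] = defaultdict(list)
    for agent, duration in samples:
        by_agent[agent].append(duration)

    return {
        agent: {
            "count": len(durations),
            "p50": percentile(durations, 50),
            "p95": percentile(durations, 95),
            "max": max(durations),
        }
        for agent, durations in by_agent.items()
    }
```

Reporting percentiles alongside the maximum keeps a single slow outlier from hiding how an agent typically behaves.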

Accomplishments that we're proud of

  • Successfully measured and visualized performance across multi-agent workflows
  • Built a generic system that works with any ADK-based deployment
  • Enabled developers to not only see performance metrics but feel them via simulation

What we learned

  • Latency perception is nonlinear: users tolerate small delays, but stacked agent latencies quickly degrade the experience
  • ADK makes building modular agent systems easy, but surfacing performance telemetry requires planning

What's next for Captain Perf

  • Integrate with the Agent Engine to visualize internal task graphs and bottlenecks
  • Add synthetic user personas to simulate real-world behavior
  • Build presets for SDLC, customer support, and content workflows
