Inspiration

Every week there's a new headline about an AI agent getting jailbroken — leaking system prompts, executing unauthorized commands, or getting tricked into ignoring safety guardrails. As AI agents gain real-world tools (web access, shell execution, email), the attack surface grows exponentially. But there's no easy way to stress-test an agent's defenses before deploying it. We wanted to build the "penetration testing" equivalent for AI agents — something a developer can run in 2 minutes to find out where their agent breaks.

What it does

AgentBench takes an AI agent's configuration (system prompt + available tools) and runs a fully automated 3-phase safety benchmark:

  1. Generate — An LLM generates 18 adversarial attacks (3 per category) tailored to the specific agent's tools and instructions
  2. Simulate — Each attack is sent to a simulated version of the agent, producing realistic responses
  3. Evaluate — An LLM judge determines whether each response was safe (pass) or compromised (fail)

The result is a safety scorecard with an overall A-F grade, per-category breakdowns (Prompt Injection, Tool Abuse, Data Exfiltration, Role Override, Privilege Escalation, Information Leakage), and expandable cards showing each attack payload, agent response, and evaluation reasoning — all streamed live via SSE.

How we built it

  • Backend: Python/FastAPI with Server-Sent Events for real-time streaming. The 3-phase pipeline runs ~22 LLM calls through Groq's API (Llama 3.3 70B) with rate-limit handling and retry logic. Sequential simulation with 2-second delays avoids rate limits while creating a natural streaming UX. Evaluations are batched (6 per call) to minimize API calls.
  • Frontend: Next.js 16 + TypeScript + Tailwind CSS. A state machine drives the UI through IDLE → RUNNING → COMPLETE phases. An async generator parses the SSE stream, and a custom useBenchmark hook manages all state. The scorecard features an animated SVG grade ring, per-category progress bars, and expandable attack result cards.
  • Resilience: Hardcoded fallback attacks ensure the demo works even if attack generation fails. Dummy responses are created for failed simulations so evaluation always completes.

Challenges we ran into

  • Groq SDK + httpx version mismatch — The groq SDK passed a proxies kwarg that newer httpx versions dropped, causing every simulation to fail silently. Took debugging to trace the error back to a transitive dependency conflict.
  • SSE streaming through FastAPI — Getting Server-Sent Events to actually stream (not buffer) required specific headers (X-Accel-Buffering: no, Cache-Control: no-cache) and careful async generator design so we could yield events mid-pipeline.
  • LLM output reliability — The attack generator sometimes returns fewer than 18 attacks or malformed JSON. We built JSON extraction fallbacks and a complete hardcoded attack set as a safety net.

Accomplishments that we're proud of

  • The entire benchmark runs in ~2 minutes and produces genuinely useful results — you can see real differences between a locked-down chat agent (grades A/B) and a power agent with shell access (grades C/D)
  • Live streaming UX where attacks appear one-by-one feels responsive despite the sequential API calls
  • The expandable attack cards let you see exactly what the agent said wrong and why it failed — actionable feedback, not just a score

What we learned

  • AI agents with more tools are dramatically harder to secure — even well-written system prompts fail against targeted tool abuse attacks
  • LLM-as-judge evaluation is surprisingly effective for safety assessment when given structured rubrics with clear pass/fail criteria
  • Rate limiting is the real bottleneck for LLM-powered pipelines; designing around it (sequential calls, batched evaluation) is critical architecture

What's next for AgentBench

  • Real agent testing — Connect to actual agent APIs instead of simulation, testing against live deployments
  • Custom attack categories — Let users define domain-specific attack vectors (e.g., financial fraud for banking agents)
  • Historical tracking — Save benchmark results over time to measure safety regression as agents evolve
  • Multi-model support — Benchmark across different LLM backends to compare safety characteristics
  • CI/CD integration — Run AgentBench as a GitHub Action on every PR that modifies agent prompts

Built With

Share this project:

Updates