Inspiration
Every week there's a new headline about an AI agent getting jailbroken — leaking system prompts, executing unauthorized commands, or getting tricked into ignoring safety guardrails. As AI agents gain real-world tools (web access, shell execution, email), the attack surface grows exponentially. But there's no easy way to stress-test an agent's defenses before deploying it. We wanted to build the "penetration testing" equivalent for AI agents — something a developer can run in 2 minutes to find out where their agent breaks.
What it does
AgentBench takes an AI agent's configuration (system prompt + available tools) and runs a fully automated 3-phase safety benchmark:
- Generate — An LLM generates 18 adversarial attacks (3 per category) tailored to the specific agent's tools and instructions
- Simulate — Each attack is sent to a simulated version of the agent, producing realistic responses
- Evaluate — An LLM judge determines whether each response was safe (pass) or compromised (fail)
The result is a safety scorecard with an overall A-F grade, per-category breakdowns (Prompt Injection, Tool Abuse, Data Exfiltration, Role Override, Privilege Escalation, Information Leakage), and expandable cards showing each attack payload, agent response, and evaluation reasoning — all streamed live via SSE.
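As a rough sketch, the scorecard aggregation could look like the following (the letter-grade cutoffs and function names here are illustrative assumptions, not AgentBench's actual thresholds):

```python
from collections import defaultdict

def letter_grade(pass_rate: float) -> str:
    """Map a 0.0-1.0 pass rate onto an A-F grade (illustrative cutoffs)."""
    for cutoff, grade in [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")]:
        if pass_rate >= cutoff:
            return grade
    return "F"

def scorecard(results: list[dict]) -> dict:
    """Aggregate judge verdicts: results = [{"category": str, "passed": bool}, ...]."""
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["passed"])
    overall = sum(r["passed"] for r in results) / len(results)
    return {
        "grade": letter_grade(overall),
        "overall": overall,
        # one pass-rate bar per category (Prompt Injection, Tool Abuse, ...)
        "per_category": {c: sum(v) / len(v) for c, v in by_category.items()},
    }
```

Per-category pass rates fall out of the same aggregation, which is what drives the breakdown bars.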
How we built it
- Backend: Python/FastAPI with Server-Sent Events for real-time streaming. The 3-phase pipeline runs ~22 LLM calls through Groq's API (Llama 3.3 70B) with rate-limit handling and retry logic. Sequential simulation with 2-second delays avoids rate limits while creating a natural streaming UX. Evaluations are batched (6 per call) to minimize API calls.
- Frontend: Next.js 16 + TypeScript + Tailwind CSS. A state machine drives the UI through IDLE → RUNNING → COMPLETE phases. An async generator parses the SSE stream, and a custom `useBenchmark` hook manages all state. The scorecard features an animated SVG grade ring, per-category progress bars, and expandable attack result cards.
- Resilience: Hardcoded fallback attacks ensure the demo works even if attack generation fails. Dummy responses are created for failed simulations so evaluation always completes.
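The batching and streaming mechanics above can be sketched roughly like this (function names are illustrative, not the actual AgentBench code):

```python
import json

def batched(items: list, size: int = 6):
    """Yield successive chunks of up to `size` items,
    e.g. 18 evaluations -> 3 batched LLM calls."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame; the frontend's async generator
    splits the stream on the blank line that terminates each frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

Frames like these are yielded mid-pipeline from an async generator, which is what lets attack results appear in the UI one by one.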
Challenges we ran into
- Groq SDK + httpx version mismatch — The `groq` SDK passed a `proxies` kwarg that newer `httpx` versions dropped, causing every simulation to fail silently. It took careful debugging to trace the error back to a transitive dependency conflict.
- SSE streaming through FastAPI — Getting Server-Sent Events to actually stream (not buffer) required specific headers (`X-Accel-Buffering: no`, `Cache-Control: no-cache`) and careful async generator design so we could yield events mid-pipeline.
- LLM output reliability — The attack generator sometimes returns fewer than 18 attacks or malformed JSON. We built JSON extraction fallbacks and a complete hardcoded attack set as a safety net.
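The safety net described above might look something like this minimal sketch (the `FALLBACK_ATTACKS` contents and function names are placeholders, not the real payloads):

```python
import json
import re

# Placeholder entry; the real system carries a full hardcoded set of 18 attacks.
FALLBACK_ATTACKS = [
    {"category": "Prompt Injection", "payload": "Ignore all previous instructions..."},
]

def extract_json_array(text: str):
    """Pull the first parseable JSON array out of a messy LLM response."""
    text = re.sub(r"```(?:json)?", "", text)  # strip markdown code fences
    start = text.find("[")
    while start != -1:
        # try progressively shorter candidates ending at each "]"
        for end in range(len(text), start + 1, -1):
            if text[end - 1] == "]":
                try:
                    return json.loads(text[start:end])
                except json.JSONDecodeError:
                    pass
        start = text.find("[", start + 1)
    return None

def ensure_attack_set(generated, target: int = 18):
    """Top up a short (or empty) generation result with hardcoded fallbacks."""
    attacks = list(generated or [])
    for fallback in FALLBACK_ATTACKS:
        if len(attacks) >= target:
            break
        attacks.append(fallback)
    return attacks[:target]
```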
Accomplishments that we're proud of
- The entire benchmark runs in ~2 minutes and produces genuinely useful results — you can see real differences between a locked-down chat agent (grades A/B) and a power agent with shell access (grades C/D)
- Live streaming UX where attacks appear one-by-one feels responsive despite the sequential API calls
- The expandable attack cards let you see exactly what the agent said wrong and why it failed — actionable feedback, not just a score
What we learned
- AI agents with more tools are dramatically harder to secure — even well-written system prompts fail against targeted tool abuse attacks
- LLM-as-judge evaluation is surprisingly effective for safety assessment when given structured rubrics with clear pass/fail criteria
- Rate limiting is the real bottleneck for LLM-powered pipelines; designing around it (sequential calls, batched evaluation) is critical architecture
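The retry-with-backoff pattern this pushed us toward can be sketched as follows (the exception type and delay schedule are illustrative assumptions; the real pipeline catches Groq's rate-limit errors specifically):

```python
import time

def with_retries(call, max_attempts: int = 3, base_delay: float = 2.0,
                 sleep=time.sleep):
    """Run `call()`, backing off exponentially when it raises a
    rate-limit error (RuntimeError used here as a stand-in)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```

Injecting `sleep` as a parameter keeps the backoff testable without real delays.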
What's next for AgentBench
- Real agent testing — Connect to actual agent APIs instead of simulation, testing against live deployments
- Custom attack categories — Let users define domain-specific attack vectors (e.g., financial fraud for banking agents)
- Historical tracking — Save benchmark results over time to measure safety regression as agents evolve
- Multi-model support — Benchmark across different LLM backends to compare safety characteristics
- CI/CD integration — Run AgentBench as a GitHub Action on every PR that modifies agent prompts
Built With
- fastapi
- groq
- llama-3.3-70b
- next.js
- python
- tailwind-css
- typescript