break it before users do
Ashlie Zhang · Cindy Li · Mukund Gaur · Skai Nzeuton
About The Project
Mayhem Monkey is an AI-driven chaos engineer for the vibe coding era. Inspired by Netflix's original Chaos Monkey, which strengthened production systems by deliberately breaking them, Mayhem Monkey automatically probes your application for weak assumptions, brittle dependencies, and hidden failure modes. When speed and automation make it hard to know what you can truly trust, Mayhem Monkey continuously tests resilience so your platform fails safely before users ever notice.
Features
🎙️ Voice-Controlled Chaos Testing
- Speech-to-Text – Speak commands naturally into your microphone. The voice driver transcribes your speech in real time and routes it to the chaos engine, so you never have to touch a keyboard during a test session.
- Text-to-Speech Narration – Every action the agent takes is narrated aloud via ElevenLabs TTS, giving you a live audio play-by-play of what's being tested and what was found.
🧠 Gemini-Powered Multi-Agent Reasoning
- Three-Stage Agent Pipeline – Mayhem Monkey doesn't rely on a single prompt. It runs a coordinated pipeline of three Gemini agents:
- Transcription Agent – Captures raw voice input and converts it to clean text using speech-to-text, ensuring accurate command capture even in noisy environments.
- Formatting Agent – Takes the raw transcript and structures it into a normalized action intent (target URL, attack type, parameters) via Gemini Flash through the Dedalus API, so downstream agents receive consistent, machine-readable instructions.
- Decision Agent – The core reasoning engine. Powered by Gemini 2.5 Pro, it receives live page HTML, a screenshot, and the full conversation history, then autonomously decides what to test next — choosing selectors, crafting payloads, interpreting evidence, and adapting its strategy turn by turn. This agent works directly with Playwright to control the entire browser session end-to-end.
- Full Device Control via Playwright – The decision agent doesn't just suggest actions — it executes them. It clicks buttons, fills forms, submits payloads, reads error responses, and captures JS dialogs, all through a real headed Chromium browser. The agent navigates complex, multi-step web flows autonomously, handling login forms, search pages, dynamic SPAs, and error states without human intervention.
- Adaptive Multi-Turn Loop – After every action, evidence (console errors, reflected content, alert dialogs, URL changes, error banners) is fed back to the decision agent so it can refine its attack strategy in real time — pivoting from SQL injection to XSS to auth bypass as it learns the application's behavior.
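Concretely, the loop has this shape (a minimal Python sketch; `decide_next_action` and `execute_and_observe` are hypothetical stand-ins for the real decision-agent call and the Playwright execution layer):

```python
# Minimal sketch of the adaptive multi-turn loop. decide_next_action and
# execute_and_observe are hypothetical stand-ins for the real agent + runner.
from playwright.sync_api import sync_playwright

def chaos_loop(target_url: str, max_turns: int = 20) -> list:
    findings, history = [], []
    with sync_playwright() as p:
        page = p.chromium.launch(headless=False).new_page()  # headed, watchable
        page.goto(target_url)
        for _ in range(max_turns):
            # The decision agent sees live HTML, a screenshot, and prior evidence.
            action = decide_next_action(page.content(), page.screenshot(), history)
            if action.get("type") == "done":
                break
            evidence = execute_and_observe(page, action)  # click / fill / submit
            history.append({"action": action, "evidence": evidence})
            if evidence.get("vulnerability"):
                findings.append(evidence)
    return findings
```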
🔌 Dedalus API Integration
- LLM Routing via Dedalus – When a Dedalus API key is configured, voice commands are routed through the Dedalus platform using gemini-2.0-flash for fast, intelligent action inference. Falls back to keyword-based routing when unavailable.
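The keyword fallback amounts to simple string matching. A rough sketch (the keyword table and action names are illustrative, not the exact production set):

```python
# Fallback intent router used when no Dedalus API key is configured.
# The keyword table and action names here are illustrative.
KEYWORD_ACTIONS = {
    "click": "click",
    "type": "inject_text",
    "inject": "inject_text",
    "links": "extract_links",
    "scroll": "scroll",
}

def route_command(transcript: str) -> dict:
    """Map a voice transcript to a structured action intent."""
    lowered = transcript.lower()
    for keyword, action in KEYWORD_ACTIONS.items():
        if keyword in lowered:
            return {"action": action, "raw": transcript}
    return {"action": "noop", "raw": transcript}
```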
🌐 Live Browser Automation
- Playwright-Driven – A real Chromium browser runs headed on screen so you can watch every click, injection, and form submission happen live.
- Evidence Capture Pipeline – The runner monitors JS alert/confirm/prompt dialogs, console errors, reflected payloads, URL redirects, and error banners — snapshotting everything the instant after submission, before the page can refresh it away.
🖥️ Overlaid Frontend
- Clean Results – Once a run finishes, findings are gathered into a single report for easy analysis.
- Dancing Monkey – An in-house designed monkey that causes chaos!
🔒 Safety Guardrails
- Navigation Lock – Clicks on links and elements that would navigate away from the target page are automatically blocked; accidental navigations are auto-reverted.
- Read-Only Payloads – The system prompt enforces safe, non-destructive testing — no DROP, DELETE, UPDATE, or denial-of-service payloads.
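One way to enforce the navigation lock in Playwright is to abort top-level navigation requests that leave the target and revert anything that slips through. A hedged sketch (the target URL is hypothetical, and the real guardrails may differ):

```python
# Navigation lock: abort top-level navigations away from the target page;
# anything that slips through (e.g. a JS redirect) is reverted afterward.
TARGET = "http://localhost:3000/"  # hypothetical target under test

def install_navigation_lock(page, target_url: str):
    def guard(route, request):
        if request.is_navigation_request() and not request.url.startswith(target_url):
            route.abort()  # block clicks/links that would leave the page
        else:
            route.continue_()
    page.route("**/*", guard)

def revert_if_navigated(page, target_url: str):
    # Called after each action: undo navigations that slipped through.
    if not page.url.startswith(target_url):
        page.goto(target_url)
```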
Technologies
Gemini: AI Reasoning & Computer Use
Gemini is the brain behind Mayhem Monkey's multi-agent architecture. Rather than a single monolithic prompt, we orchestrate three specialized agents in sequence: a transcription agent that converts voice to text, a formatting agent (Gemini 2.0 Flash via Dedalus) that normalizes raw speech into structured action intents, and a decision agent (Gemini 2.5 Pro) that autonomously drives the entire browser session. The decision agent reads live HTML and screenshots, reasons about what to test, crafts payloads, executes them through Playwright, and interprets the results — all in a multi-turn conversation loop. This means the AI doesn't just analyze web pages — it uses the computer to interact with real, complex web interfaces end-to-end: navigating login flows, probing search forms, injecting into dynamic SPAs, and adapting its strategy based on what the application throws back. It handles exactly the kind of difficult-to-navigate, evasive web pages that resist simple scraping — treating every page as a live adversary to reason about and overcome.
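A single decision turn, stripped to its essentials, might look like the sketch below, using the google-generativeai SDK; the prompt is heavily simplified and the helper name is ours:

```python
# One turn of the decision agent (prompt heavily simplified; helper name is ours).
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")

def decide_next_action(html: str, screenshot_png: bytes, history: list) -> str:
    prompt = (
        "You are a chaos-testing agent. Given the page HTML, a screenshot, and "
        "prior findings, reply with one JSON action: click, fill, submit, or done.\n"
        f"Prior findings: {history}\n"
        f"HTML (truncated): {html[:15000]}"
    )
    response = model.generate_content(
        [prompt, {"mime_type": "image/png", "data": screenshot_png}]
    )
    return response.text  # parsed and validated downstream
```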
Dedalus API
Dedalus provides the LLM gateway that powers voice-to-action routing. When a user speaks a command, it's transcribed and sent through Dedalus to Gemini Flash, which infers the intended chaos action (click, inject text, extract links, scroll, etc.) and returns a structured JSON response. This gives us sub-second natural language understanding without building our own intent classifier.
Playwright
Playwright drives a real headed Chromium browser that the audience can watch live. It handles all page interaction — clicking elements, filling inputs, submitting forms — while our evidence pipeline monitors JS dialogs, console output, URL changes, and DOM mutations to capture proof of vulnerabilities the instant they occur, before page refreshes can wipe them away.
ElevenLabs
ElevenLabs powers the text-to-speech narration. Every action the agent takes is spoken aloud in real time, turning the testing session into a live commentary. The TTS runs fire-and-forget so it never blocks the automation loop.
Figma and Tailwind
Figma was used to design the UI. Tailwind CSS translated those designs into a clean, responsive, and customizable interface that feels both modern and immersive.
Vite & React
React powers the interactive UI. Vite makes development fast and efficient with hot reloading and optimized builds for smooth browser performance.
What We Learned
Multi-Agent Pipelines Need Clear Boundaries
Running three agents in sequence (transcription, formatting, decision) taught us that the handoff between agents matters more than any individual agent's quality. If the formatting agent returned a slightly ambiguous intent, the decision agent would misinterpret it. We learned to enforce strict JSON schemas at each boundary and validate outputs before passing them downstream. Loose coupling between agents sounds good in theory, but in practice you need rigid contracts.
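For example, the formatting-to-decision handoff can be guarded with a strict schema. A sketch using the jsonschema package (field names mirror the writeup, but the exact production schema may differ):

```python
# Rigid contract at the formatting -> decision handoff. Field names and
# enum values are illustrative, mirroring the writeup.
from jsonschema import validate, ValidationError

ACTION_INTENT_SCHEMA = {
    "type": "object",
    "properties": {
        "target_url": {"type": "string"},
        "attack_type": {"type": "string", "enum": ["sql_injection", "xss", "auth_bypass"]},
        "parameters": {"type": "object"},
    },
    "required": ["target_url", "attack_type"],
    "additionalProperties": False,
}

def validate_intent(intent: dict) -> dict:
    try:
        validate(instance=intent, schema=ACTION_INTENT_SCHEMA)
    except ValidationError as err:
        raise ValueError(f"Formatting agent broke the contract: {err.message}")
    return intent
```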

Why We Didn't Use MCP Servers
We initially explored using Model Context Protocol (MCP) servers to give Gemini direct browser control — the idea being that an MCP tool server could expose Playwright actions (click, type, navigate, screenshot) as callable tools, and Gemini would invoke them through structured tool calls. In theory, this is a cleaner architecture than our current approach of parsing raw JSON from the model. In practice, we ran into two issues: first, MCP tool-use adds latency per action since each tool call is a separate round-trip, and our chaos loop already had 3-5 second turns — adding MCP overhead would have made the experience feel sluggish. Second, we needed tight control over what the model could do (blocking navigation, capturing evidence immediately after submission, auto-submitting forms) — logic that lives between the model's decision and the browser's execution. With MCP, the model calls tools directly, which makes it harder to inject that middleware layer.
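That middleware is essentially a thin wrapper around every action, the layer that direct MCP tool calls would have bypassed. A simplified sketch (install_evidence_listeners, auto_submit, and snapshot_evidence are hypothetical names for the hooks described elsewhere in this writeup):

```python
# Middleware between the model's decision and Playwright execution.
# Helper names are hypothetical stand-ins for hooks described in this writeup.
def execute_with_middleware(page, action: dict, target_url: str) -> dict:
    if action["type"] == "navigate" and not action["url"].startswith(target_url):
        return {"blocked": True, "reason": "navigation lock"}
    evidence: dict = {}
    install_evidence_listeners(page, evidence)  # dialog/console hooks go in first
    if action["type"] == "fill":
        page.fill(action["selector"], action["value"])
        auto_submit(page, action["selector"])   # find and click the submit control
    elif action["type"] == "click":
        page.click(action["selector"])
    snapshot_evidence(page, evidence)           # capture state immediately after
    return evidence
```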
The Vibe Coding Security Gap Is Real
Building this project reinforced exactly why we started it. We used AI tools to build Mayhem Monkey itself — and in doing so, we shipped code with the same kinds of assumptions and shortcuts that Mayhem Monkey is designed to catch. It was a meta reminder that speed and security are in constant tension, and tools that bridge that gap aren't just nice to have — they're necessary.

Challenges We Ran Into
TTS Blocking the Automation Loop
Our first integration of ElevenLabs TTS used synchronous playback — the entire chaos loop would freeze for 5-10 seconds while each narration finished playing through the speakers. This meant the AI would identify a vulnerability, start explaining it out loud, and the browser would just sit there doing nothing until the speech ended. For a live demo, it was painful. We had to decouple the narration layer from the automation pipeline entirely, offloading audio generation and playback to an asynchronous background process so the two systems could run in parallel. The challenge was ensuring the narration stayed contextually in sync with the browser actions without one blocking the other — essentially building a producer-consumer handoff where the reasoning engine emits narration events and the audio layer consumes them independently, keeping the demo fluid while still giving the audience a coherent play-by-play.
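The fix is a standard producer-consumer pattern: the reasoning engine puts narration events on a queue, and a daemon thread drains it. A minimal sketch (speak is a hypothetical wrapper around the ElevenLabs TTS call):

```python
# Decoupling narration from the chaos loop: the reasoning engine produces
# narration events; a background thread consumes them and plays audio.
import queue
import threading

narration_queue = queue.Queue()

def narration_worker():
    while True:
        text = narration_queue.get()
        if text is None:  # sentinel to shut the worker down
            break
        speak(text)  # hypothetical wrapper around the ElevenLabs TTS call
        narration_queue.task_done()

threading.Thread(target=narration_worker, daemon=True).start()

# Inside the chaos loop, narration is fire-and-forget:
narration_queue.put("Injecting an XSS payload into the search form...")
```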
Coordinating Between Multiple Models
We run three different AI agents in sequence — transcription, formatting (Gemini Flash via Dedalus), and decision-making (Gemini Pro). Getting them to communicate cleanly was harder than expected. The formatting agent would sometimes return slightly ambiguous JSON, and the decision agent would misinterpret the intent. A voice command like "test the login page" might get formatted as {"action": "navigate", "target": "login"} when the decision agent expected a full URL. We had to add strict JSON schema validation at every handoff point and build fallback logic for when one agent's output didn't match the next agent's expected input. The pipeline only worked reliably once we treated each boundary as a contract, not a suggestion.
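The fallback logic for that exact case, a bare "login" target instead of a full URL, amounts to normalization before the handoff. A sketch (the base URL is hypothetical):

```python
# Normalize underspecified targets before passing them to the decision agent.
# The base URL here is hypothetical.
from urllib.parse import urljoin

def normalize_target(intent: dict, base_url: str) -> dict:
    target = intent.get("target", "")
    if not target.startswith(("http://", "https://")):
        # "login" -> "http://localhost:3000/login"
        intent["target"] = urljoin(base_url, target)
    return intent

normalize_target({"action": "navigate", "target": "login"}, "http://localhost:3000/")
```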
Minimizing the Context Window
Every turn of the chaos loop sends Gemini a screenshot (as a base64 image), up to 15,000 characters of HTML, the conversation history, and the evidence from the last action. That adds up fast — we were hitting context limits within 10-15 steps, causing the model to lose track of what it had already tried and start repeating attacks. We had to get strategic about what we sent: truncating HTML to the first 15K characters, summarizing previous findings instead of sending raw history, and stripping out irrelevant DOM elements (scripts, stylesheets, SVG paths) before sending the page content. Balancing "give the model enough context to reason well" against "don't blow the context window" was a constant tug-of-war throughout the project.
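The stripping step can be as simple as decomposing non-semantic tags before truncating. A sketch using BeautifulSoup (the 15,000-character cap matches the writeup; the exact tag list is illustrative):

```python
# Shrink page content before it reaches Gemini: drop non-semantic elements,
# then truncate to the cap described above.
from bs4 import BeautifulSoup

def compact_html(raw_html: str, limit: int = 15_000) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup.find_all(["script", "style", "svg", "link", "meta"]):
        tag.decompose()  # irrelevant to attack-surface reasoning
    return str(soup)[:limit]
```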
Evidence Disappearing Before We Could Capture It
When the AI successfully triggered an XSS alert or a SQL error page, the evidence would vanish almost instantly — dialogs get auto-dismissed, forms redirect after submission, error pages flash for a single render. Early in development, the AI would inject a payload, the page would reload, and by the time we checked, everything looked normal. We had to build a preemptive evidence pipeline: installing dialog listeners and console monitors before any action, then snapshotting the page state immediately after submission — capturing URL changes, DOM content, error banners, and reflected payloads within milliseconds. It was a race condition we didn't anticipate.
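In Playwright terms, the preemptive pipeline means registering listeners before any action and snapshotting right after it. A simplified sketch of the pattern, not the exact production code:

```python
# Listeners are installed *before* any action so short-lived evidence
# (dialogs, console errors) is caught; a snapshot runs right after submission.
def install_listeners(page, evidence: dict):
    evidence.setdefault("dialogs", [])
    evidence.setdefault("console_errors", [])

    def on_dialog(dialog):
        evidence["dialogs"].append(dialog.message)  # e.g. the XSS alert text
        dialog.dismiss()

    def on_console(msg):
        if msg.type == "error":
            evidence["console_errors"].append(msg.text)

    page.on("dialog", on_dialog)
    page.on("console", on_console)

def snapshot_after_action(page, evidence: dict):
    evidence["url"] = page.url
    evidence["html"] = page.content()
    evidence["screenshot"] = page.screenshot()
```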
Form Submission Detection
After the AI filled a form with a payload, it would look at the HTML again and see... the same form. Input values set via JavaScript don't show up in page.content(), so Gemini thought nothing had changed and would try to fill the form again, creating an infinite loop. We had to build a separate function that queried the DOM for current input values via page.evaluate() and fed that summary back to Gemini alongside the HTML. Even then, getting the AI to actually click "Submit" instead of just re-examining the page required us to build auto-submission logic that programmatically found and clicked the form's submit button after every text injection.
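Both halves run through page.evaluate(): one call reads live input values, another finds and clicks the owning form's submit control. A sketch (the submit heuristic is simplified):

```python
# page.content() doesn't reflect JS-set input values, so we query the live DOM
# and auto-click the form's submit control after every text injection.
def summarize_inputs(page) -> list:
    return page.evaluate(
        """() => Array.from(document.querySelectorAll('input, textarea'))
                 .map(el => ({name: el.name, value: el.value}))"""
    )

def auto_submit(page, input_selector: str):
    # Simplified heuristic: submit the form that owns the filled input.
    page.evaluate(
        """(sel) => {
            const form = document.querySelector(sel)?.closest('form');
            const btn = form?.querySelector('[type=submit], button');
            if (btn) btn.click(); else form?.submit();
        }""",
        input_selector,
    )
```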
Real-Time Frontend-Backend Sync
The frontend polls the Flask backend every 3 seconds to check if the scan is still running and if results are ready. But the scanner subprocess writes results to a JSON file only when it's completely done — meaning the user stares at a progress bar for 3-10 minutes with no feedback about what's actually happening. We initially tried Server-Sent Events (SSE) for real-time streaming but ran into issues with subprocess I/O and had to fall back to polling. Getting the frontend to feel responsive while the backend runs a long, unpredictable process was an ongoing UX challenge we never fully solved.
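On the backend, the polled endpoint just checks whether the scanner subprocess has written its results file yet. A minimal Flask sketch (paths and field names are illustrative):

```python
# Status endpoint the frontend polls every 3 seconds. Path and field
# names are illustrative.
import json
import os
from flask import Flask, jsonify

app = Flask(__name__)
RESULTS_PATH = "results.json"  # written by the scanner subprocess when done

@app.get("/api/status")
def status():
    if os.path.exists(RESULTS_PATH):
        with open(RESULTS_PATH) as f:
            return jsonify({"state": "done", "results": json.load(f)})
    return jsonify({"state": "running"})
```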

Why We Built It
Netflix invented Chaos Monkey because they could afford to — a dedicated team of chaos engineers whose only job was to break things before customers did. That infrastructure-level discipline became legendary, but it was built for a company with thousands of engineers and the budget to match.
We're not Netflix. We're in an era of startups, solo founders, and vibe coding — where a working MVP can go from idea to production in a weekend. Tools like Cursor, Copilot, and v0 let us build faster than ever, but speed comes with a cost: the code ships before anyone really audits it. There's no security team reviewing your PR. There's no QA engineer running penetration tests. There's just you, your AI copilot, and a deploy button.
That gap is exactly what Mayhem Monkey fills. We built it because every indie developer, every hackathon team, every two-person startup deserves the same chaos engineering rigor that Netflix has — without needing a six-figure security consultant to get it. Mayhem Monkey is an AI-powered security engineer that you can point at your app and say "break it." It finds the SQL injections you didn't think about, the XSS holes your copilot introduced, and the auth bypasses hiding in forms you built at 3 AM.
If the vibe coding era means anyone can build anything, then anyone should be able to verify that what they built is safe. Break it before your users do.
Contact
- Ashlie Zhang (frontend, design) - az468@cornell.edu
- Cindy Li (text-to-speech, Gemini reasoning) - cl2674@cornell.edu
- Mukund Gaur (speech-to-text, crawling) - mg2476@cornell.edu
- Skai Nzeuton (Gemini reasoning, security) - san82@cornell.edu
Built With
- dedalus
- elevenlabs
- figma
- gemini
- playwright
- python
- react
- tailwind
- typescript
- vite


