💡 Inspiration
"The fear of GenAI isn't that it won't work... it's that it will work, and you won't know when it fails."
We've all seen it: a company launches a shiny new AI agent. It works perfectly for a week. Then a user asks something unexpected, the model hallucinates, or, worse, it chats politely but forgets to capture the sales lead. In a traditional system, that lost lead goes unnoticed until an engineer manually reviews the logs the following week.
We asked: What if the AI could catch its own mistakes and rewrite its own code to fix them instantly?
Enter Autonomic AI. We didn't just build a chatbot; we built a self-healing swarm that turns the "Black Box" of GenAI into a transparent "Glass Box" using Google Cloud Vertex AI and Datadog.
🚀 What it does
Autonomic AI is an event-driven system where the user-facing agent is continuously judged, refined, and upgraded by a backend swarm of AI workers.
To demonstrate this, we deployed "Car Auto Concierge" (carsalesman101), a sales agent for a car dealership.
- The Mistake: The user asks for the price of a Model X. The agent answers politely but fails to ask for the user's email address, violating a core business rule.
- The Audit: The Auditor Agent catches this breach immediately based on the `upgrade_config` rules.
- The Fix: The Refiner Agent (powered by Gemini 2.5 Flash) analyzes the failure and rewrites the agent's system prompt to enforce email capture.
- The Validation: The Evaluator Agent runs the new prompt in a sandbox against the failed conversation using a strict rubric.
- The Deployment: If the new prompt passes, the system automatically pushes `v1.2` to production.
The entire process happens without human intervention—but is fully observable via our Datadog Control Center.
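In code terms, the loop is small. Here is a minimal Python sketch of one autonomic cycle; the four callables stand in for our Auditor, Refiner, Evaluator, and deploy steps, and their exact signatures are illustrative, not our production code:

```python
from typing import Callable

def autonomic_cycle(
    config: dict,
    transcript: list[dict],
    audit: Callable[[list[dict], list[str]], dict],
    refine: Callable[[list[dict], str, dict], str],
    evaluate: Callable[[str, list[dict], list[str]], bool],
    deploy: Callable[[dict, str], dict],
) -> dict:
    # 1. The Audit: score the finished turn against the agent's rule set.
    verdict = audit(transcript, config["upgrade_config"]["auditor_rules"])
    if verdict["status"] == "PASS":
        return config  # healthy conversation, nothing to heal

    # 2. The Fix: the Refiner turns the failure reason into a "Prompt Patch".
    patch = refine(transcript, verdict["failure_reason"], config)

    # 3. The Validation: replay the failed conversation in a sandbox.
    if evaluate(patch, transcript, config["evaluator_rubric"]):
        # 4. The Deployment: e.g. bump v1.1 -> v1.2 in Firestore.
        return deploy(config, patch)
    return config  # rejected fix: a human gets paged via Datadog
```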
⚙️ How we built it
We utilized an Event-Driven Architecture on Google Cloud Platform:
1. The Gateway (The Body)
- Tech: Python / FastAPI.
- Function: Handles user traffic. After every turn of conversation, it triggers a Google Pub/Sub event containing the chat logs, decoupling the user experience from the heavy lifting of the audit.
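A minimal sketch of that hand-off; the project ID, topic name, payload shape, and `generate_reply` helper are illustrative, and it assumes application-default GCP credentials:

```python
import json
from fastapi import FastAPI
from google.cloud import pubsub_v1

app = FastAPI()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "autonomic-audit-events")

def generate_reply(agent_id: str, message: str) -> str:
    """Stand-in for the live Gemini call."""
    return f"({agent_id}) ..."

@app.post("/chat/{agent_id}")
async def chat(agent_id: str, payload: dict) -> dict:
    reply = generate_reply(agent_id, payload["message"])

    # Fire-and-forget: the swarm audits out-of-band, so the user never waits on it.
    publisher.publish(
        topic_path,
        json.dumps(
            {"agent_id": agent_id, "history": payload.get("history", []), "reply": reply}
        ).encode("utf-8"),
    )
    return {"reply": reply}
```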
2. The Swarm (The Brain - Vertex AI)
The Pub/Sub trigger activates our backend agents, which all read the same per-agent configuration (keyed by `agent_id`):
- The Auditor: Scores the conversation against a strict rule set.
- Rule Example: "CRITICAL FAIL if the agent DOES NOT explicitly ask for an email address."
- The Refiner: If the Auditor returns a `FAIL`, the Refiner ingests the conversation and the failure reason. It generates a "Prompt Patch" intended to fix the logic.
- The Evaluator: Before deploying, this agent runs the patched prompt against a Rubric in a sandbox environment:
  - Check 1: "Did the agent strictly follow the 'Ask for Email' protocol?"
  - Check 2: "Did the agent offer the 'Incoming' Model X if asked?"
- Firestore: Stores the "DNA" (Prompt Configs) and version history of the agents.
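To give a feel for the Auditor, here is a hedged sketch of the rule-check call, assuming the google-genai SDK pointed at Vertex AI; the project/location, prompt wording, and `run_auditor` helper name are illustrative:

```python
from google import genai

# Assumes Vertex AI access with application-default credentials.
client = genai.Client(vertexai=True, project="my-gcp-project", location="us-central1")

def run_auditor(transcript: str, rules: list[str]) -> str:
    prompt = (
        "You are a strict QA auditor. Apply every rule to the conversation below.\n"
        "Rules:\n- " + "\n- ".join(rules) + "\n\n"
        "Conversation:\n" + transcript + "\n\n"
        "Reply with PASS, or FAIL plus a one-line failure reason."
    )
    response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
    return response.text  # e.g. "FAIL: agent never asked for an email address"
```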
🧬 Anatomy of an Autonomic Agent
Our system isn't hardcoded; it's configuration-driven. Here is the actual JSON configuration for carsalesman101 that powers the swarm logic:
```json
{
  "agent_id": "carsalesman101",
  "model_id": "gemini-2.5-flash",
  "temperature": 0.2,
  "economics": {
    "budget_per_message": "0.5$",
    "input_token_count_prompt": 420
  },
  "upgrade_config": {
    "auditor_rules": [
      "CRITICAL FAIL if the user asks about a vehicle (Price, Specs, Availability) and the agent DOES NOT explicitly ask for an email address.",
      "FAIL if the agent mentions a vehicle model that is NOT listed in the INVENTORY.",
      "FAIL if the agent uses the words 'sorry' or 'apologize' more than once."
    ]
  },
  "evaluator_rubric": [
    "Did the agent strictly follow the 'Ask for Email' protocol?",
    "Did the agent offer the 'Incoming' Model X if asked?",
    "Was the response concise and high-energy?"
  ]
}
```
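Firestore keeps this document plus a version history, which is what makes zero-touch rollbacks possible. Here is a sketch of the version-bump step; the collection and field names are illustrative:

```python
from google.cloud import firestore

db = firestore.Client()  # assumes application-default credentials

def deploy_new_version(agent_id: str, patched_prompt: str) -> int:
    doc_ref = db.collection("agents").document(agent_id)
    current = doc_ref.get().to_dict() or {}
    version = current.get("version", 1)

    # Archive the outgoing prompt so a bad patch can be rolled back.
    doc_ref.collection("history").document(f"v{version}").set(
        {"system_prompt": current.get("system_prompt", "")}
    )
    doc_ref.set({**current, "system_prompt": patched_prompt, "version": version + 1})
    return version + 1  # surfaced as autonomic.agent.current_version in Datadog
```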
🐶 Partner Challenge: Datadog (The "Glass Box")
To win the Datadog Challenge, we moved beyond simple monitoring. We implemented an "Autonomic Observability Strategy" where Datadog drives the business logic.
We built a custom "Autonomic AI Ops Center" dashboard that serves as the command center for the swarm.
1. Visualizing the "Thought Process"
We used Datadog Log Streams (`service:autonomic-*`) to visualize the chain of thought between agents. We can see the exact moment the Auditor hands off a failure to the Refiner.
- Current Active Version Widget: Tracks the live `autonomic.agent.current_version` metric, visualized with conditional formatting (purple for v1, warning colors for rollbacks).
- Optimization Rate: A custom formula comparing successful deployments vs. failed fixes (see the sketch below).
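Both widgets are fed by plain DogStatsD metrics. A sketch follows; `autonomic.agent.current_version` is the real metric from our dashboard, while the deploy counters and the rate formula are illustrative:

```python
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)  # local DogStatsD agent

# Gauge behind the "Current Active Version" widget.
statsd.gauge("autonomic.agent.current_version", 2, tags=["agent_id:carsalesman101"])

# Counters behind the "Optimization Rate" widget; the dashboard formula is roughly
#   autonomic.deploy.success / (autonomic.deploy.success + autonomic.deploy.rejected)
statsd.increment("autonomic.deploy.success", tags=["agent_id:carsalesman101"])
```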
2. Automated Action (The Safety Net)
We configured a Detection Rule for "Optimization Failure".
- Trigger: If the Refiner fails to fix the agent after 3 attempts (or if the Evaluator rejects the fix).
- Action: A Datadog Workflow automatically opens a Datadog Case.
- Context: The Case is populated with the `Chat_ID` and `Failure_Reason`, alerting a human engineer only when the AI cannot fix itself.
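The detection side can be expressed as an ordinary metric monitor. Here is a sketch using the `datadog-api-client` package; the metric name, window, and threshold are illustrative, and the Workflow/Case wiring itself lives in the Datadog UI:

```python
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

# Fires when the Evaluator has rejected 3 Refiner patches in 30 minutes.
monitor = Monitor(
    name="Autonomic: Optimization Failure",
    type=MonitorType.QUERY_ALERT,
    query="sum(last_30m):sum:autonomic.refiner.failure{agent_id:carsalesman101}.as_count() >= 3",
    message="Refiner cannot fix the agent; the Workflow opens a Case with Chat_ID and Failure_Reason.",
)

with ApiClient(Configuration()) as api_client:
    MonitorsApi(api_client).create_monitor(body=monitor)
```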
3. Economics & SLOs
We track the "Economics" of our agents in real-time to ensure the self-healing process doesn't bankrupt us.
- Budget Breach Monitor: Alerts if `autonomic.budget.breach` exceeds $0.10 per interaction.
- Latency Splits: We visualize the "User Facing Latency" (ms) vs. "Backend Latency" (Refiner/Evaluator time), ensuring the audit process never slows down the user chat.
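The latency split falls out of DogStatsD's built-in timing decorator. A sketch, with illustrative metric names and stubbed function bodies:

```python
from datadog import statsd

@statsd.timed("autonomic.latency.user_facing", tags=["agent_id:carsalesman101"])
def chat_turn(message: str) -> str:
    return "..."  # the live, user-facing Gemini call goes here

@statsd.timed("autonomic.latency.backend", tags=["agent_id:carsalesman101"])
def audit_and_refine(transcript: list[dict]) -> None:
    ...  # Auditor / Refiner / Evaluator work, off the user's critical path
```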
🧠 Challenges we ran into
- Infinite Loops: In early tests, the Refiner would try to fix a problem, fail, and trigger itself again, burning tokens. We used Datadog's Budget Breach monitor to catch this behavior early.
- Prompt Drift: Ensuring the "Refiner" didn't accidentally delete safety rules while fixing logic errors. We solved this by splitting the JSON configuration into `Immutable` (Safety) and `Mutable` (Behavior) blocks, as sketched below.
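A minimal sketch of that split, with illustrative `immutable`/`mutable` key names: the Refiner's patch only ever replaces the mutable block, and the safety rules are re-attached verbatim on every render.

```python
def apply_prompt_patch(config: dict, patched_behavior: str) -> dict:
    patched = dict(config)
    patched["prompt"] = {
        "immutable": config["prompt"]["immutable"],  # safety rules survive every rewrite
        "mutable": patched_behavior,                 # the only part the Refiner may touch
    }
    return patched

def render_system_prompt(config: dict) -> str:
    return config["prompt"]["immutable"] + "\n\n" + config["prompt"]["mutable"]
```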
🏅 Accomplishments that we're proud of
- Zero-Touch Deployment: Watching the Datadog dashboard timeline show a deployment of `v1.2` while we were sleeping, because the agent fixed a bug itself.
- Full Observability: We successfully mapped GenAI metrics (Token Usage, Hallucination Rate) to standard DevOps metrics (Latency, Error Rate) in a single pane of glass.
⏭️ What's next for Autonomic AI
- Multi-Modal Auditing: Using Gemini 1.5 Pro to audit voice and video interactions.
- Integration with Confluent: To handle high-throughput clickstream data for real-time personalization during the chat.