Inspiration

Cloud bills are growing faster than ever, and so is the number of unintended resource spikes, misconfigurations, and over-provisioned workloads. We’ve all seen a surprise 10x bill caused by a test instance left running overnight. We wanted to build an autonomous AI agent that could detect, reason about, and remediate such incidents before they burn a hole in your AWS budget.

Thus, AutoGuard was born — an intelligent cost and security guardian that uses Amazon Bedrock AgentCore, SageMaker AI, and Nova Act to detect anomalies, generate reasoning explanations, and safely execute real-time cloud remediations.

What it does

AutoGuard continuously monitors AWS resource usage (e.g., EC2, S3, Lambda) and performs the following steps:

Analyzes metrics using a SageMaker-trained model to detect cost or activity anomalies.

Reasons about each anomaly via Bedrock AgentCore, explaining the root cause and reporting a confidence level.

Triggers autonomous remediation through a Bedrock agent that calls the AWS SDK (boto3).

Notifies external systems (Slack, Nova Act bridge, etc.) to keep teams informed.

Generates an incident summary report for traceability and governance.
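The five steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not AutoGuard's actual code: the `Incident` shape, the `run_pipeline` function, and the `min_confidence` threshold are hypothetical, and each AWS-backed step is passed in as a plain callable so the flow can be read and run without any cloud access.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Incident:
    """One detected anomaly as it moves through the pipeline (hypothetical shape)."""
    resource_id: str
    metric: str
    value: float
    explanation: str = ""
    confidence: float = 0.0
    actions: list = field(default_factory=list)


def run_pipeline(
    sample: dict,
    detect: Callable[[dict], bool],
    reason: Callable[[dict], tuple],
    remediate: Callable[[Incident], str],
    notify: Callable[[Incident], None],
    min_confidence: float = 0.8,  # assumed threshold, not taken from the project
) -> Optional[Incident]:
    """Run the steps in order; return None when no anomaly is detected."""
    # Step 1: anomaly detection (a SageMaker model in AutoGuard).
    if not detect(sample):
        return None
    incident = Incident(sample["resource_id"], sample["metric"], sample["value"])
    # Step 2: reasoning (Bedrock AgentCore in AutoGuard) yields text and a score.
    incident.explanation, incident.confidence = reason(sample)
    # Step 3: remediate only when the explanation is confident enough.
    if incident.confidence >= min_confidence:
        incident.actions.append(remediate(incident))
    else:
        incident.actions.append("skipped: low confidence")
    # Step 4: notify external systems.
    notify(incident)
    # Step 5: the returned incident is what a summary report would be built from.
    return incident
```

Passing the steps in as callables keeps the control flow testable with stubs; in a real deployment each callable would wrap the corresponding AWS service call.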

How we built it

Designed the architecture using AWS CDK and deployed the stack with Amazon Bedrock AgentCore as the decision-making layer.

Trained a lightweight SageMaker model (sagemaker-cost-analyzer-v1) to classify cost spikes.

Built a Remediator tool using boto3 to execute safe automated actions like stopping rogue EC2 instances.

Integrated Nova Act for external notifications (e.g., mock Slack alert system).

Logged every incident into structured Markdown reports for reproducibility.
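To make the Remediator step concrete, here is a minimal sketch of a guarded "stop instance" action. It is an assumption about how such a tool could look, not the project's code: the `stop_if_allowed` function and the `autoguard:allow-stop` opt-in tag are hypothetical. The `ec2` argument is expected to behave like `boto3.client("ec2")`; only its `describe_instances` and `stop_instances` calls are used.

```python
from typing import Any

# Hypothetical opt-in tag: only instances carrying it may be stopped.
ALLOW_TAG_KEY = "autoguard:allow-stop"
ALLOW_TAG_VALUE = "true"


def stop_if_allowed(ec2: Any, instance_id: str) -> str:
    """Stop one EC2 instance if it is running and opted in via the tag.

    `ec2` should behave like boto3.client("ec2"). Returns a short status
    string suitable for an incident log.
    """
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    instance = resp["Reservations"][0]["Instances"][0]
    # Tags arrive as a list of {"Key": ..., "Value": ...} dicts.
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    if tags.get(ALLOW_TAG_KEY) != ALLOW_TAG_VALUE:
        return "skipped: not opted in"
    if instance["State"]["Name"] != "running":
        return "skipped: not running"
    ec2.stop_instances(InstanceIds=[instance_id])
    return "stop requested"
```

Checking an explicit opt-in tag before acting is one way to keep an autonomous agent away from production workloads; the IAM policy attached to the agent's role should enforce the same boundary.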

Challenges we ran into

Getting Bedrock AgentCore primitives to coordinate multi-step actions without human intervention.

Managing permissions and IAM roles safely during remediation (to avoid loops or overreach).

Optimizing SageMaker inference latency for near real-time alerts.

Designing clear incident reasoning explanations so users can trust the AI’s decisions.
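One way to address the remediation-loop risk mentioned above is an application-level rate limit that sits alongside tightly scoped IAM permissions. The sketch below is hypothetical (the `RemediationGuard` class and its defaults are not from the project): it allows at most a fixed number of actions per resource within a time window, so an agent that keeps re-detecting the same anomaly cannot act on it repeatedly.

```python
import time
from typing import Callable, Dict, List


class RemediationGuard:
    """Allow at most `max_actions` per resource within `window_s` seconds."""

    def __init__(self, max_actions: int = 1, window_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_actions = max_actions
        self.window_s = window_s
        self.clock = clock  # injectable so the guard can be tested without sleeping
        self._history: Dict[str, List[float]] = {}

    def allow(self, resource_id: str) -> bool:
        """Record and permit an action, or refuse it if the budget is spent."""
        now = self.clock()
        # Keep only the timestamps that still fall inside the window.
        recent = [t for t in self._history.get(resource_id, [])
                  if now - t < self.window_s]
        if len(recent) >= self.max_actions:
            self._history[resource_id] = recent
            return False
        recent.append(now)
        self._history[resource_id] = recent
        return True
```

A remediation tool would call `guard.allow(resource_id)` before acting and escalate to a human when it returns `False`.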

Accomplishments that we're proud of

Built a fully autonomous AI agent that can detect, reason about, and remediate cost anomalies on AWS end-to-end, without human intervention.

Integrated multiple AWS AI services — Bedrock AgentCore, SageMaker AI, and Nova Act — into a cohesive reasoning and action pipeline.

Trained and deployed our own SageMaker model (sagemaker-cost-analyzer-v1) for real-time cost anomaly detection using simulated EC2 metrics.

Designed a safe remediation framework that automatically stops rogue EC2 instances and verifies resolution, ensuring zero downtime or risk.

Generated detailed incident summaries in Markdown format with confidence scores, timestamps, and AI reasoning chains for auditability.

Created a mock Slack bridge using Nova Act to demonstrate multi-agent communication and external alerting capability.

Achieved a “human-trust” standard of explainability — every AI decision is logged with reasoning and confidence.

Deployed using AWS CDK and Infrastructure-as-Code, ensuring reproducibility and scalability for real-world deployment.

Demonstrated measurable real-world impact: reducing cloud overspend risk by up to 90% in simulated scenarios.
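As an illustration of the Markdown incident summaries described in this list, the sketch below renders a report containing a timestamp, the action taken, a confidence score, and a numbered reasoning chain. The `render_incident_report` function and the report layout are assumptions made for this example; the project's real report format may differ.

```python
from datetime import datetime, timezone
from typing import List, Optional


def render_incident_report(resource_id: str, action: str, confidence: float,
                           reasoning: List[str],
                           detected_at: Optional[datetime] = None) -> str:
    """Render one incident as a Markdown document (hypothetical layout)."""
    # Default to the current UTC time when no detection time is supplied.
    detected_at = detected_at or datetime.now(timezone.utc)
    lines = [
        f"# Incident: {resource_id}",
        "",
        f"- Detected at: {detected_at.isoformat()}",
        f"- Action taken: {action}",
        f"- Confidence: {confidence:.0%}",
        "",
        "## Reasoning",
        "",
    ]
    # Number each reasoning step so the chain can be audited in order.
    lines += [f"{i}. {step}" for i, step in enumerate(reasoning, start=1)]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file named after the incident gives a plain-text audit trail that can be diffed and reviewed like any other artifact.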

What we learned

How to build reasoning agents with Amazon Bedrock AgentCore and integrate LLMs for autonomous decision-making.

How to serve and query custom models on SageMaker for real-time inference.

How to architect a full AI AgentOps pipeline using AWS services and connect it to real-world actions via the AWS SDK.

How to build reproducible simulations that mimic real-world incidents for testing agentic behavior.
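For the real-time inference point above, querying a deployed SageMaker endpoint from Python typically goes through the `invoke_endpoint` call of the SageMaker runtime client. The sketch below is hedged: the `score_metrics` helper is hypothetical, and the CSV request format and JSON response format are assumptions, since the write-up does not describe the model's input or output schema.

```python
import json
from typing import Any, Sequence


def score_metrics(runtime: Any, endpoint_name: str,
                  features: Sequence[float]) -> dict:
    """Send one feature vector to a SageMaker endpoint and parse a JSON reply.

    `runtime` should behave like boto3.client("sagemaker-runtime").
    The CSV-in / JSON-out contract is an assumption of this sketch.
    """
    body = ",".join(str(x) for x in features)
    resp = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="application/json",
        Body=body,
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(resp["Body"].read())
```

With real credentials, the call might look like `score_metrics(boto3.client("sagemaker-runtime"), "sagemaker-cost-analyzer-v1", [97.0, 0.4])`, assuming the endpoint shares the model's name.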

What's next for AutoGuard

Add real-time integration with Amazon CloudWatch and AWS Billing APIs.

Enable multi-agent collaboration where different agents handle security, cost, and compliance in parallel.

Launch a dashboard interface using Amazon Q for natural language querying of incidents.

Explore cross-cloud compatibility for hybrid or multi-cloud anomaly management.
