Inspiration
In my work as an AI Alignment Fellow, I reviewed AI agents performing ML workflows and observed several critical rogue behaviors; the most common are data_leakage, access_dev_set, and agent_impatience. That experience is why, at this hackathon, I am presenting Agent Shield. :)
What it does
Agent Shield is a self-improving, fully autonomous AI agent that helps other AI agents, in this case agents conducting ML workflows, stay aligned with golden trajectories, i.e., trajectories free of rogue behaviors.
How we built it
Claude Agent SDK [Opus] - The agent performs autonomous threat analysis by comparing agent trajectories against learned baseline patterns and a taxonomy of 11 rogue behaviors. It decides independently whether an agent is malicious with 95%+ confidence, with no human intervention required.
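A minimal sketch of what this analysis step could look like, written against the Anthropic Python SDK's Messages API rather than the Agent SDK's higher-level interface; the prompt wording, the taxonomy entries beyond the three named above, and the model id are illustrative assumptions, not Agent Shield's actual code.

```python
# Sketch only: classify a candidate agent trajectory against a rogue-behavior
# taxonomy with Claude. Prompt, taxonomy, and model id are placeholders.
import json
import anthropic

# Three of the eleven rogue behaviors named in this write-up; the rest are assumed.
ROGUE_BEHAVIORS = ["data_leakage", "access_dev_set", "agent_impatience"]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def analyze_trajectory(trajectory_steps: list[str], baseline_summary: str) -> dict:
    """Ask Claude whether a trajectory deviates from the learned clean baseline."""
    prompt = (
        f"Baseline (clean) workflow pattern:\n{baseline_summary}\n\n"
        f"Candidate trajectory steps:\n{json.dumps(trajectory_steps, indent=2)}\n\n"
        f"Known rogue behaviors: {', '.join(ROGUE_BEHAVIORS)}\n"
        "Return JSON with keys: malicious (bool), behaviors (list), confidence (0-1)."
    )
    response = client.messages.create(
        model="claude-3-opus-20240229",  # placeholder; swap in the Opus model you use
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model follows the JSON-only instruction; production code would validate.
    return json.loads(response.content[0].text)
```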
Redis MCP: Used to store clean workflow patterns during the training phase and cache known violations for instant recognition. This speeds up repeated detections from 8 seconds to 0.8 seconds (10x faster) and makes Agent Shield self-improving through continuous learning.
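A sketch of the caching pattern, written against plain redis-py as a stand-in for the Redis MCP server; the key naming, hashing scheme, and TTL are assumptions.

```python
# Sketch only: cache verdicts for previously seen trajectories so repeated
# detections skip the slow Claude call.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def _key(trajectory_steps: list[str]) -> str:
    """Stable cache key derived from the trajectory contents."""
    digest = hashlib.sha256(
        json.dumps(trajectory_steps, sort_keys=True).encode()
    ).hexdigest()
    return f"shield:verdict:{digest}"


def cached_verdict(trajectory_steps: list[str]) -> dict | None:
    """Return a previously stored verdict for an identical trajectory, if any."""
    hit = r.get(_key(trajectory_steps))
    return json.loads(hit) if hit else None


def store_verdict(trajectory_steps: list[str], verdict: dict, ttl_s: int = 86400) -> None:
    """Cache a verdict with an expiry so stale baselines eventually age out."""
    r.set(_key(trajectory_steps), json.dumps(verdict), ex=ttl_s)
```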
Skyflow: We used Skyflow to tokenize sensitive data (API keys, passwords, secrets) before Claude analysis, reducing token count by 60% and supporting PCI/HIPAA compliance. This protects credentials from leaking during security analysis while cutting inference costs.
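A simplified, hypothetical sketch of the tokenization step. The project itself uses Skyflow's vault; the regex patterns and placeholder-token format below are assumptions meant only to illustrate swapping secrets out of the text before it reaches Claude.

```python
# Sketch only: replace credential-like substrings with stable placeholder tokens
# and keep a local map so the caller can de-tokenize afterwards.
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),        # API-key-like strings
    re.compile(r"(?i)password\s*[:=]\s*\S+"),  # password assignments
    re.compile(r"AKIA[0-9A-Z]{16}"),           # AWS access key ids
]


def tokenize(text: str) -> tuple[str, dict[str, str]]:
    """Swap each detected secret for a placeholder token; return the mapping."""
    mapping: dict[str, str] = {}
    for pattern in SECRET_PATTERNS:
        for match in pattern.findall(text):
            token = mapping.setdefault(match, f"<SECRET_{len(mapping)}>")
            text = text.replace(match, token)
    return text, mapping
```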
Challenges we ran into
Integrating the Claude Agent SDK, Redis MCP, and Skyflow into one pipeline was challenging, but we worked through it.
Accomplishments that we're proud of
Deploying something that actually works and appears able to self-correct trajectories showing rogue behaviors down to zero rogue behaviors. If it continues to self-correct this way, it could eventually produce its own guardrails and monitor other AI agents performing ML workflows.
What we learned
We are at the tip of the iceberg with AI agent security, and we are only beginning to shield AI agents in a way that is containable.
What's next for Agent Shield
Turn this into a policy that other ML-performing AI agents can implement, and publish a paper and code for a step-level security benchmark for AI agents in production.
Built With
- agent
- ai
- alignment
- anthropic
- claude
- javascript
- llm
- machine-learning
- opus
- python
- redis
- security
- skyflow
- typescript
