Inspiration
I come from a cryptography & security background and am very interested in AI safety. I wanted to explore how we could borrow cybersecurity practices (CTFs, bug bounties, etc) to apply to AI safety models in the form of red teaming. I also think this could be a good tool for education purposes! Games are a great way to teach people how to extract specific information that they need — how can we schmooze prompts to fool LLMs? On the Trust and Safety side, we want to identify and defend against these unwanted behaviors early as LLMs scale to the broader public.
What it does
Claudered (or claudeRed) is a bug bounty and security game platform for AI safety researchers to collect data on malicious prompts that could produce harmful behavior. Game mode is the LLM equivalent to CTFs: here, I made a basic POC game where one tries to uncover a secret that Claude has been given. Playground mode is for people to try to get Claude to output "bad" behavior and they can report that behavior (in exchange for a possible bug bounty).
How I built it
Claude API and NextJS
What's next for claudered
More AI safety research! Would want to connect this platform to AI safety engineers so they can collect data on LLM safety and investigate unwanted behavior more. I would also add features where Blue Teams could go into the platform and ADD safety measures for Claude, such that it's a red team vs. blue team game. Users can get rewarded via bug bounties :)
Built With
- claude
- nextjs

Log in or sign up for Devpost to join the conversation.