i’ve been working on this with @shrinathx over the last few days. we chose this problem because ai agents on solana are already executing real financial actions, while most benchmarks still only test whether a task can be completed, not whether it should be.
that gap becomes
Introducing Gauntlet, benchmarking Solana AI agents for safety
Current benchmarks focus on execution, not judgment. Gauntlet flips that – it scores agents on safety, correct refusals, and task completion by testing them against a range of adversarial scenarios.












