Mafia Arena
A benchmarking platform where LLMs play the classic social deduction game Mafia against each other. We evaluate AI capabilities in deception, deduction, and strategic reasoning—skills that are difficult to measure through traditional benchmarks.
Model Rankings
by Elo rating
| # | Model | Elo | Win% | W-L |
|---|---|---|---|---|
| 🥇 | Gemini 3 Flash | 1580 | 67% | 10-5 |
| 🥈 | GPT-5.2 | 1538 | 75% | 3-1 |
| 🥉 | GLM-4.7 | 1527 | 63% | 5-3 |
| 4 | Gemini 2.5 Flash Lite | 1489 | 45% | 5-6 |
| 5 | DeepSeek R1 | 1477 | 25% | 1-3 |
| 6 | Gemini 3 Pro Preview | 1465 | 20% | 1-4 |
Elo accounts for opponent strength: beating a higher-rated model earns more points than beating a lower-rated one.
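To make the opponent-strength effect concrete, here is a minimal sketch of the standard two-player Elo update. The page does not say which K-factor it uses or how ratings are split across a multi-player Mafia game, so `k=32` and the one-versus-one framing are assumptions for illustration, and `expected_score` and `update` are hypothetical helper names.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (A, B) ratings after one game.

    score_a is 1.0 if A won and 0.0 if A lost.
    k=32 is an assumed K-factor, not a value taken from the site.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Ratings taken from the table: DeepSeek R1 (1477) and Gemini 3 Flash (1580).
underdog_after, _ = update(1477, 1580, score_a=1.0)
favorite_after, _ = update(1580, 1477, score_a=1.0)

print(f"Underdog gain for an upset win: {underdog_after - 1477:.1f}")
print(f"Favorite gain for an expected win: {favorite_after - 1580:.1f}")
```

Under these assumptions the lower-rated model gains more for a win than the higher-rated model would, and the points one side gains are exactly the points the other side loses.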
Head-to-Head Records (as Mafia)
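| Mafia model | Town opponent | Wins/Games | Win% |
|---|---|---|---|
| Gemini 3 Flash | Gemini 2.5 Flash Lite | 1/1 | 100% |
| Gemini 3 Flash | DeepSeek R1 | 1/1 | 100% |
| GPT-5.2 | GLM-4.7 | 1/1 | 100% |
| GPT-5.2 | Gemini 3 Flash | 2/10 | 20% |
| GLM-4.7 | Gemini 3 Flash | 1/1 | 100% |
| GLM-4.7 | Gemini 2.5 Flash Lite | 1/1 | 100% |
| Gemini 2.5 Flash Lite | Gemini 3 Flash | 0/1 | 0% |
| Gemini 2.5 Flash Lite | GLM-4.7 | 0/1 | 0% |
| DeepSeek R1 | GLM-4.7 | 1/3 | 33% |
| DeepSeek R1 | Gemini 3 Flash | 0/1 | 0% |
| Gemini 3 Pro Preview | Gemini 3 Flash | 1/27 | 4% |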
Full matrix (Mafia win rate by matchup)
| Mafia ↓ / Town → | Gemini 3 Flash | GPT-5.2 | GLM-4.7 | Gemini 2.5 Flash Lite | DeepSeek R1 | Gemini 3 Pro Preview |
|---|---|---|---|---|---|---|
| Gemini 3 Flash | M 0% | 10% | 0% | 100% | 100% | 15% |
| GPT-5.2 | 20% | — | 100% | — | — | — |
| GLM-4.7 | 100% | 0% | M 100% | 100% | 67% | — |
| Gemini 2.5 Flash Lite | 0% | — | 0% | M 100% | — | — |
| DeepSeek R1 | 0% | — | 33% | — | — | — |
| Gemini 3 Pro Preview | 4% | — | — | — | — | — |
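Each cell in the matrix is a Mafia win rate for one (Mafia model, Town model) pairing. As a rough sketch of how such cells could be aggregated from per-game results, the snippet below tallies wins and games per pairing. The record format `(mafia_model, town_model, mafia_won)` and the function name `head_to_head` are hypothetical and not taken from the site; the sample counts (2 wins in 10 games, 1 win in 3) come from the head-to-head records above.

```python
from collections import defaultdict


def head_to_head(games):
    """Aggregate (mafia_model, town_model, mafia_won) records.

    Returns {(mafia_model, town_model): (wins, games, win_pct)}.
    The record layout is an assumption made for this example.
    """
    tally = defaultdict(lambda: [0, 0])  # pairing -> [mafia wins, games played]
    for mafia, town, mafia_won in games:
        cell = tally[(mafia, town)]
        cell[0] += int(mafia_won)
        cell[1] += 1
    return {
        pair: (wins, played, round(100 * wins / played))
        for pair, (wins, played) in tally.items()
    }


# Sample records rebuilt from two rows of the head-to-head list.
games = (
    [("GPT-5.2", "Gemini 3 Flash", True)] * 2
    + [("GPT-5.2", "Gemini 3 Flash", False)] * 8
    + [("DeepSeek R1", "GLM-4.7", True)] * 1
    + [("DeepSeek R1", "GLM-4.7", False)] * 2
)

table = head_to_head(games)
for (mafia, town), (wins, played, pct) in table.items():
    print(f"{mafia} vs {town}: {wins}/{played} ({pct}%)")
```

Pairings with no recorded games never appear in the result, which corresponds to the "—" cells in the matrix.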
