Evaluating Language Model Reasoning with Puzzle Duels
Inspired by 16th-century Italian mathematical duels, The Token Games (TTG) is a benchmark where LLMs challenge each other by creating and solving programming puzzles with no human-authored problems required.
In a duel, two models take turns as proposer and solver. The proposer crafts a Python function (`mystery(x)`) and provides a secret sample solution. The solver must find any input that makes the function return True; solutions are verified by simply running the code. The proposer loses the round if its own sample solution is wrong, and wins if the solver fails to find a solution. If the solver succeeds, the round ends in a draw. Proposers can see their own history and the outcomes of past rounds, so they can adapt puzzle difficulty as the duel progresses.
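A minimal sketch of what a round looks like under these rules, using a made-up puzzle and a hypothetical `verify` helper (neither is taken from the benchmark's actual harness):

```python
# Illustrative puzzle in TTG's format: the solver must find any input x
# for which mystery(x) returns True. (This puzzle is made up for
# illustration, not taken from the benchmark.)
def mystery(x: str) -> bool:
    # a 6-character lowercase string whose ASCII codes sum to 654
    return isinstance(x, str) and len(x) == 6 and x.islower() and sum(map(ord, x)) == 654

def verify(candidate) -> bool:
    """Check a candidate by simply running the puzzle; any exception counts as a failure."""
    try:
        return mystery(candidate) is True
    except Exception:
        return False

# Both the proposer's secret sample solution and the solver's answer are
# checked the same way, by executing the code.
assert verify("mmmmmm")      # 6 * ord('m') == 654
assert not verify("hello")   # wrong length, fails the constraint
```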
Unlike static benchmarks, TTG cannot be saturated: stronger models can always design harder puzzles that current models cannot solve. We find that designing hard puzzles is itself a very hard task, with even the strongest models often failing to provide a valid solution to their own puzzle, or to create puzzles that are challenging for other models. Designing puzzles demands novelty (recycling known problems is suboptimal, since opponents may know them too) and tests self-calibration: a model should propose the hardest puzzle it can still safely produce a sample solution for, because if the puzzle is too hard, it may end up hallucinating that solution.
TTG uses Programming Puzzles (Python functions that return a boolean) as a universal format for encoding reasoning challenges. This format is flexible enough to represent everything from simple string constraints to NP-complete problems.
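For illustration, two hypothetical puzzles at opposite ends of that range; neither is from the benchmark itself:

```python
# A simple string constraint: easy to solve by inspection (e.g. "abcba").
def mystery_easy(x: str) -> bool:
    return x.startswith("ab") and x.endswith("ba") and len(x) == 5

# An NP-complete problem (subset sum) in the same format: the solver must
# return the indices of a subset of `nums` that sums to 103.
# e.g. indices [0, 6] works: 37 + 66 == 103.
def mystery_hard(indices: list) -> bool:
    nums = [37, 114, -9, 52, 81, -43, 66, 5, 120, -77]
    return len(set(indices)) == len(indices) and sum(nums[i] for i in indices) == 103
```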
Each duel consists of 10 rounds. In each round, one model proposes a puzzle and the other tries to solve it, then they swap. A proposer scores if it provides a valid puzzle, backed by a correct sample solution, that the opponent fails to solve. If the proposer's own solution is wrong, the solver scores instead. If the solver succeeds, the round is a draw.
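A minimal sketch of that per-round scoring (the function name and point convention are ours, not the benchmark's):

```python
def score_round(proposer_solution_ok: bool, solver_solution_ok: bool):
    """Return (proposer_points, solver_points) for one round, per the rules above."""
    if not proposer_solution_ok:
        return (0, 1)   # penalty: the proposer's own sample solution was wrong
    if solver_solution_ok:
        return (0, 0)   # draw: the solver cracked the puzzle
    return (1, 0)       # the proposer stumped the opponent with a valid puzzle
```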
We ran all 90 ordered pairings of 10 frontier models, with each duel lasting 10 rounds. From the duel outcomes we computed Elo ratings using the Bradley-Terry model, yielding a ranking that closely matches expert-authored benchmarks like HLE and GPQA Diamond at a fraction of the cost.
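A sketch of how duel outcomes could be turned into such ratings, fitting Bradley-Terry strengths with the standard MM iteration and mapping them onto the Elo scale; the pseudo-count, draw handling, and scale anchoring here are our assumptions, not necessarily TTG's exact procedure:

```python
import math
from collections import defaultdict

def bradley_terry_elo(rounds, iters=500):
    """Fit Bradley-Terry strengths from per-round results and map them onto an
    Elo-like scale. `rounds` is a list of (model_a, model_b, score_a), where
    score_a is 1.0 if model_a scored, 0.0 if model_b scored, 0.5 for a draw."""
    wins = defaultdict(lambda: 0.25)    # small pseudo-count keeps strengths positive
    pair_games = defaultdict(float)
    models = set()
    for a, b, score_a in rounds:
        wins[a] += score_a              # a draw counts as half a win for each side
        wins[b] += 1.0 - score_a
        pair_games[frozenset((a, b))] += 1.0
        models.update((a, b))
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(pair_games[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in models if j != i)
            new[i] = wins[i] / denom
        mean = sum(new.values()) / len(new)
        strength = {m: s / mean for m, s in new.items()}   # fix the overall scale
    # Bradley-Terry strengths map to the Elo scale via 400 * log10
    return {m: 1500 + 400 * math.log10(s) for m, s in strength.items()}
```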
Performance of 10 frontier models on TTG. Solv% = fraction of opponents' puzzles the model solved as solver. Prop% (unsolved) = fraction of proposer rounds where the model produced a valid puzzle that the opponent failed to solve. Penalty% = fraction of proposer rounds where the model's own sample solution was wrong.
| # | Model | Solv% | Prop% (unsolved) | Penalty% |
|---|---|---|---|---|
| 1 | GPT-5.2 Pro | 100.0% | 50.6% | 14.4% |
| 2 | Gemini 3 Pro | 93.2% | 32.8% | 32.2% |
| 3 | Grok-4 | 91.9% | 11.9% | 25.6% |
| 4 | GPT-5 Mini | 89.1% | 18.2% | 26.7% |
| 5 | Claude Opus 4.5 | 84.9% | 15.2% | 12.2% |
| 6 | DeepSeek Reasoner | 77.4% | 24.6% | 27.8% |
| 7 | Gemini 2.5 Pro | 75.4% | 11.1% | 90.0% |
| 8 | Gemini 2.5 Flash | 73.8% | 0.0% | 52.2% |
| 9 | Claude Sonnet 4.5 | 68.3% | 4.1% | 18.9% |
| 10 | GPT-5.2 | 52.6% | 0.0% | 97.8% |
Are strong solvers also good proposers? We find a strong correlation (ρ = 0.85), but proposing is far harder: even GPT-5.2 Pro, which solved every puzzle, only stumped opponents 50.6% of the time as proposer.
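As a rough illustration, the rank correlation can be recomputed from the Solv% and Prop% columns of the table above (the paper's ρ = 0.85 may use a different aggregation):

```python
# Rank correlation between solver strength and proposer strength, using the
# Solv% and Prop% columns from the results table above.
from scipy.stats import spearmanr

solv = [100.0, 93.2, 91.9, 89.1, 84.9, 77.4, 75.4, 73.8, 68.3, 52.6]
prop = [50.6, 32.8, 11.9, 18.2, 15.2, 24.6, 11.1, 0.0, 4.1, 0.0]
rho, _ = spearmanr(solv, prop)
print(f"Spearman rho = {rho:.2f}")
```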
When a proposer fails to score, it's either because the puzzle was too easy (the opponent solved it) or too ambitious (the proposer's own solution was wrong, incurring a penalty). The balance between these failure modes varies dramatically across models.
GPT-5.2 is extraordinarily overconfident, failing on its own solution 97.8% of the time. Claude Opus 4.5 errs in the other direction — opponents solve its puzzles 74.4% of the time.
Proposers can see the full history of the duel. Do they use it? Yes — puzzles created in later rounds are measurably harder. When GPT-5.2 and GPT-5 Mini attempt all puzzles independently, solve rates drop steadily from round 1 to round 10.
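A minimal sketch of that per-round aggregation, assuming a hypothetical list of (round, solved) records rather than the paper's actual data format:

```python
from collections import defaultdict

def solve_rate_by_round(records):
    """Aggregate independent solve attempts by round number.
    `records` is a list of (round_number, solved) pairs."""
    attempts, solved = defaultdict(int), defaultdict(int)
    for rnd, ok in records:
        attempts[rnd] += 1
        solved[rnd] += int(ok)
    # A downward trend from round 1 to round 10 indicates escalating difficulty.
    return {rnd: solved[rnd] / attempts[rnd] for rnd in sorted(attempts)}
```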
Browse all 90 duels and their puzzles in our interactive duel viewer. Here are some highlights from the paper:
| Puzzle | Proposer | Solver | Outcome |
|---|---|---|---|
| String constraints with modular product | Claude Opus 4.5 | Claude Sonnet 4.5 | Solved |
| 8-digit number with 7 constraints | Claude Opus 4.5 | Claude Sonnet 4.5 | Solver Failed |
| MD5 hash + number theory + XOR | Gemini 2.5 Pro | Claude Opus 4.5 | Sample Solution Wrong |
| Prime year + Friday the 13th date puzzle | DeepSeek Reasoner | Claude Opus 4.5 | Solved |
| Reverse == 4x palindrome | Claude Sonnet 4.5 | Claude Opus 4.5 | Solved |
| Brainfuck VM with SHA-256 gate | Gemini 2.5 Pro | GPT-5.2 | Sample Solution Wrong |
| 12-char string with 13 constraints | GPT-5.2 Pro | Claude Opus 4.5 | Solver Failed |
| Weighted sum + symmetry + XOR chain | Claude Opus 4.5 | GPT-5 Mini | Solver Failed |
| ASCII sum perfect square (trivial) | Claude Sonnet 4.5 | Grok-4 | Solved |
| 8-digit palindrome with digit product | Claude Opus 4.5 | Gemini 2.5 Pro | Solved |
| Hallucinated hex + broken XOR + SHA-256 | GPT-5.2 | Gemini 2.5 Pro | Sample Solution Wrong |
If you use The Token Games in your research, please cite: