Evaluating Language Model Reasoning with Puzzle Duels
Inspired by 16th-century Italian mathematical duels, The Token Games (TTG) is a benchmark where LLMs challenge each other by creating and solving programming puzzles with no human-authored problems required.
In a duel, two models take turns as proposer and solver. The proposer crafts a Python function (`mystery(x)`) and provides a secret sample solution. The solver must find any input that makes the function return True; solutions are verified by simply running the code. The proposer loses the round if its own sample solution is wrong, and wins if the solver fails to find a solution. If the solver succeeds, the round ends in a draw. Proposers can see their own history and the outcomes of past rounds, so they can adapt puzzle difficulty as the duel progresses.
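A minimal sketch of what a round looks like under these rules, using a made-up puzzle and a hypothetical `verify` helper (neither is taken from the benchmark's actual harness):

```python
# Illustrative puzzle in TTG's format: the solver must find any input x
# for which mystery(x) returns True. (This puzzle is made up for
# illustration, not taken from the benchmark.)
def mystery(x: str) -> bool:
    # a 6-character lowercase string whose ASCII codes sum to 654
    return isinstance(x, str) and len(x) == 6 and x.islower() and sum(map(ord, x)) == 654

def verify(candidate) -> bool:
    """Check a candidate by simply running the puzzle; any exception counts as a failure."""
    try:
        return mystery(candidate) is True
    except Exception:
        return False

# Both the proposer's secret sample solution and the solver's answer are
# checked the same way, by executing the code.
assert verify("mmmmmm")      # 6 * ord('m') == 654
assert not verify("hello")   # wrong length, fails the constraint
```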
Unlike static benchmarks, TTG cannot be saturated: stronger models can always design harder puzzles that current models cannot solve. We find that designing hard puzzles is itself a very hard task, with even the strongest models often failing to provide a valid solution to their own puzzle, or to create puzzles that are challenging for other models. Designing puzzles demands novelty (recycling known problems is suboptimal, since opponents may know them too) and tests self-calibration: a model should propose the hardest puzzle it can still safely produce a sample solution for, because if the puzzle is too hard, it may end up hallucinating that solution.
TTG uses Programming Puzzles (Python functions that return a boolean) as a universal format for encoding reasoning challenges. This format is flexible enough to represent everything from simple string constraints to NP-complete problems.
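For illustration, two hypothetical puzzles at opposite ends of that range; neither is from the benchmark itself:

```python
# A simple string constraint: easy to solve by inspection (e.g. "abcba").
def mystery_easy(x: str) -> bool:
    return x.startswith("ab") and x.endswith("ba") and len(x) == 5

# An NP-complete problem (subset sum) in the same format: the solver must
# return the indices of a subset of `nums` that sums to 103.
# e.g. indices [0, 6] works: 37 + 66 == 103.
def mystery_hard(indices: list) -> bool:
    nums = [37, 114, -9, 52, 81, -43, 66, 5, 120, -77]
    return len(set(indices)) == len(indices) and sum(nums[i] for i in indices) == 103
```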
Each duel consists of 10 rounds. In each round, one model proposes a puzzle and the other tries to solve it, then they swap. A proposer scores if it provides a valid puzzle, backed by a correct sample solution, that the opponent fails to solve. If the proposer's own solution is wrong, the solver scores instead. If the solver succeeds, the round is a draw.
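A minimal sketch of that per-round scoring (the function name and point convention are ours, not the benchmark's):

```python
def score_round(proposer_solution_ok: bool, solver_solution_ok: bool):
    """Return (proposer_points, solver_points) for one round, per the rules above."""
    if not proposer_solution_ok:
        return (0, 1)   # penalty: the proposer's own sample solution was wrong
    if solver_solution_ok:
        return (0, 0)   # draw: the solver cracked the puzzle
    return (1, 0)       # the proposer stumped the opponent with a valid puzzle
```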
We ran all 90 ordered pairings of 10 frontier models, with each duel lasting 10 rounds. From the duel outcomes we computed Elo ratings using the Bradley-Terry model, yielding a ranking that closely matches expert-authored benchmarks like HLE and GPQA Diamond at a fraction of the cost.
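A sketch of how duel outcomes could be turned into such ratings, fitting Bradley-Terry strengths with the standard MM iteration and mapping them onto the Elo scale; the pseudo-count, draw handling, and scale anchoring here are our assumptions, not necessarily TTG's exact procedure:

```python
import math
from collections import defaultdict

def bradley_terry_elo(rounds, iters=500):
    """Fit Bradley-Terry strengths from per-round results and map them onto an
    Elo-like scale. `rounds` is a list of (model_a, model_b, score_a), where
    score_a is 1.0 if model_a scored, 0.0 if model_b scored, 0.5 for a draw."""
    wins = defaultdict(lambda: 0.25)    # small pseudo-count keeps strengths positive
    pair_games = defaultdict(float)
    models = set()
    for a, b, score_a in rounds:
        wins[a] += score_a              # a draw counts as half a win for each side
        wins[b] += 1.0 - score_a
        pair_games[frozenset((a, b))] += 1.0
        models.update((a, b))
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(pair_games[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in models if j != i)
            new[i] = wins[i] / denom
        mean = sum(new.values()) / len(new)
        strength = {m: s / mean for m, s in new.items()}   # fix the overall scale
    # Bradley-Terry strengths map to the Elo scale via 400 * log10
    return {m: 1500 + 400 * math.log10(s) for m, s in strength.items()}
```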
Performance of 10 frontier models on TTG. Solv% = fraction of opponents' puzzles the model solved as solver. Prop% (unsolved) = fraction of proposer rounds where the model produced a valid puzzle that the opponent failed to solve. Penalty% = fraction of proposer rounds where the model's own sample solution was wrong.
| # | Model | Solv% | Prop% (unsolved) | Penalty% |
|---|---|---|---|---|
| 1 | GPT-5.2 Pro | 100.0% | 50.6% | 14.4% |
| 2 | Gemini 3 Pro | 93.2% | 32.8% | 32.2% |
| 3 | Grok-4 | 91.9% | 11.9% | 25.6% |
| 4 | GPT-5 Mini | 89.1% | 18.2% | 26.7% |
| 5 | Claude Opus 4.5 | 84.9% | 15.2% | 12.2% |
| 6 | DeepSeek Reasoner | 77.4% | 24.6% | 27.8% |
| 7 | Gemini 2.5 Pro | 75.4% | 11.1% | 90.0% |
| 8 | Gemini 2.5 Flash | 73.8% | 0.0% | 52.2% |
| 9 | Claude Sonnet 4.5 | 68.3% | 4.1% | 18.9% |
| 10 | GPT-5.2 | 52.6% | 0.0% | 97.8% |
Are strong solvers also good proposers? We find a strong correlation (ρ = 0.85), but proposing is far harder: even GPT-5.2 Pro, which solved every puzzle, only stumped opponents 50.6% of the time as proposer.
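As a rough illustration, the rank correlation can be recomputed from the Solv% and Prop% columns of the table above (the paper's ρ = 0.85 may use a different aggregation):

```python
# Rank correlation between solver strength and proposer strength, using the
# Solv% and Prop% columns from the results table above.
from scipy.stats import spearmanr

solv = [100.0, 93.2, 91.9, 89.1, 84.9, 77.4, 75.4, 73.8, 68.3, 52.6]
prop = [50.6, 32.8, 11.9, 18.2, 15.2, 24.6, 11.1, 0.0, 4.1, 0.0]
rho, _ = spearmanr(solv, prop)
print(f"Spearman rho = {rho:.2f}")
```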
When a proposer fails to score, it's either because the puzzle was too easy (the opponent solved it) or too ambitious (the proposer's own solution was wrong, incurring a penalty). The balance between these failure modes varies dramatically across models.
GPT-5.2 is extraordinarily overconfident, failing on its own solution 97.8% of the time. Claude Opus 4.5 errs in the other direction — opponents solve its puzzles 74.4% of the time.
Proposers can see the full history of the duel. Do they use it? Yes — puzzles created in later rounds are measurably harder. When GPT-5.2 and GPT-5 Mini attempt all puzzles independently, solve rates drop steadily from round 1 to round 10.
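A minimal sketch of that per-round aggregation, assuming a hypothetical list of (round, solved) records rather than the paper's actual data format:

```python
from collections import defaultdict

def solve_rate_by_round(records):
    """Aggregate independent solve attempts by round number.
    `records` is a list of (round_number, solved) pairs."""
    attempts, solved = defaultdict(int), defaultdict(int)
    for rnd, ok in records:
        attempts[rnd] += 1
        solved[rnd] += int(ok)
    # A downward trend from round 1 to round 10 indicates escalating difficulty.
    return {rnd: solved[rnd] / attempts[rnd] for rnd in sorted(attempts)}
```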
Browse all 90 duels and their puzzles in our interactive duel viewer. Here are some highlights from the paper:
| Puzzle | Proposer | Solver | Outcome |
|---|---|---|---|
| String constraints with modular product | Claude Opus 4.5 | Claude Sonnet 4.5 | Solved |
| 8-digit number with 7 constraints | Claude Opus 4.5 | Claude Sonnet 4.5 | Solver Failed |
| MD5 hash + number theory + XOR | Gemini 2.5 Pro | Claude Opus 4.5 | Sample Solution Wrong |
| Prime year + Friday the 13th date puzzle | DeepSeek Reasoner | Claude Opus 4.5 | Solved |
| Reverse == 4x palindrome | Claude Sonnet 4.5 | Claude Opus 4.5 | Solved |
| Brainfuck VM with SHA-256 gate | Gemini 2.5 Pro | GPT-5.2 | Sample Solution Wrong |
| 12-char string with 13 constraints | GPT-5.2 Pro | Claude Opus 4.5 | Solver Failed |
| Weighted sum + symmetry + XOR chain | Claude Opus 4.5 | GPT-5 Mini | Solver Failed |
| ASCII sum perfect square (trivial) | Claude Sonnet 4.5 | Grok-4 | Solved |
| 8-digit palindrome with digit product | Claude Opus 4.5 | Gemini 2.5 Pro | Solved |
| Hallucinated hex + broken XOR + SHA-256 | GPT-5.2 | Gemini 2.5 Pro | Sample Solution Wrong |
If you use The Token Games in your research, please cite: