- 2026-03-29: We update the public leaderboard with the latest evaluation results on the current 100-item pure-visual subset.
- 2025-10-17: MathSticks is accepted by the NeurIPS MATH-AI Workshop, 2025.
- 2025-09-30: We open-source the benchmark, the dataset construction pipeline, and the paper.
MathSticks is designed to be deceptively small but structurally demanding:
- Easy for humans, still hard for frontier models. Tasks that feel almost trivial to people can still expose major weaknesses in current multimodal reasoning systems.
- Vision alone is not enough. The model must correctly perceive lit vs. unlit segments and identify valid move locations.
- Symbolic manipulation alone is not enough. The moved sticks must still form legible digits and a valid arithmetic equation.
- Strict output parsing matters. The answer must be expressed in a boxed `Move(...)` format, so hand-wavy reasoning does not count.
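The strict-parsing requirement can be illustrated with a small regex-based extractor. This is a sketch only; the exact grammar accepted by `cal_score.py` may differ, and the letter-plus-digit cell labels are inferred from the dataset examples.

```python
import re

# Matches a single move such as "Move(A2, G0)". Cell labels are
# assumed to be one uppercase letter followed by one digit.
MOVE_RE = re.compile(r"Move\(\s*([A-Z]\d)\s*,\s*([A-Z]\d)\s*\)")

def parse_moves(answer: str):
    """Extract (pick, place) pairs from a model's answer string."""
    return [(m.group(1), m.group(2)) for m in MOVE_RE.finditer(answer)]

print(parse_moves(r"\boxed{Move(A5, C5) Move(C0, C6)}"))
# [('A5', 'C5'), ('C0', 'C6')]
```

Anything that does not match this shape (e.g. prose like "move a stick from the 8 to the 3") parses to an empty list and scores zero, which is the point of the strict format.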
Evaluations across 14 VLMs reveal substantial limitations: closed-source models only become reliable on simpler cases, open-source models struggle in the pure-visual regime, and humans still maintain a large advantage. This makes MathSticks a compact but diagnostic stress test for multimodal reasoning.
- Full benchmark: `MathSticks_bench_400.jsonl`, with the corresponding rendered images under `image/`.
- Public leaderboard subset: `MathSticks_bench_100.jsonl`, a fixed 100-example subset with 25 samples from each level.
- Detailed baseline outputs: `baseline_eval_results_100_subset/`, including per-example outputs, score summaries, and the pure-visual public leaderboard snapshot.
- Evaluation scripts: `eval.py`, `cal_score.py`, `eval_api.py`, and `run_api_eval.sh`.
Results. Model performance across task regimes.
Coverage. Difficulty, move complexity, multiplicity, and operator-flip statistics.
The table below reports the current public snapshot on `MathSticks_bench_100.jsonl`. This release uses the pure-visual / image-only setting: the model is shown the puzzle image, but the equation string is not provided. The task therefore includes OCR-like reading of the sticks in addition to symbolic reasoning. Each row also has a corresponding detailed JSON file in `baseline_eval_results_100_subset/`, so the released results are fully inspectable.
Three quick observations stand out:
- Human performance remains clearly ahead at roughly 92/100 on the public subset projection, showing that the benchmark still leaves substantial room above the best current model results.
- Among released model baselines, Gemini 3.1 Pro Preview and GPT-5.4-high are tied at the top with 83/100, though their level-wise profiles differ.
- Harder pure-visual cases remain brittle for most models, with many systems still collapsing to near-zero accuracy under strict move-format scoring.
The repo includes both a generic evaluator and the public leaderboard runner.
- Run generic evaluation:

```shell
python eval.py \
  --input MathSticks_bench_400.jsonl \
  --image-dir ./image \
  --output predictions.jsonl
```

- Score predictions:

```shell
python cal_score.py \
  --pred predictions.jsonl \
  --label MathSticks_bench_400.jsonl \
  --output score.json
```

- Reproduce the public 100-item leaderboard snapshot:

```shell
API_KEY=xxx BASE_URL=https://your-proxy.example/v1 ./run_api_eval.sh
```

This uses the committed `MathSticks_bench_100.jsonl` subset and writes detailed outputs under `baseline_eval_results_100_subset/`. The default public snapshot is pure-visual, so the model only receives the rendered image.
We welcome community submissions of new model results.
Please:
- Evaluate on `MathSticks_bench_100.jsonl` so submissions stay comparable.
- Add both the detailed per-example output file (`<model>.json`) and the summary score file (`<model>.score.json`) under `baseline_eval_results_100_subset/`.
- Refresh `baseline_eval_results_100_subset/leaderboard.json` and `baseline_eval_results_100_subset/leaderboard.md`.
- Mention the provider/API and any non-default evaluation settings in the PR description.
Benchmark JSONL format
Each line in the benchmark JSONL contains one puzzle with the following fields:
- `id`: unique sample identifier, e.g. `"00075585"`.
- `level`: difficulty level in 1-4.
- `image`: image path relative to the repo root, e.g. `level1/00075585_8-9=3.png`.
- `problem`: the displayed equation string.
- `solution_num`: `[one_move_count, two_move_count]`.
- `mode_1_solution`: canonical one-move solutions.
- `mode_2_solution`: canonical two-move solutions.
- `option_answer`: order-invariant move representation for robust parsing.
Example:

```json
{
  "id": "00075585",
  "level": 1,
  "problem": "8-9=3",
  "image": "level1/00075585_8-9=3.png",
  "solution_num": [0, 4],
  "mode_1_solution": [],
  "mode_2_solution": [
    {"solution": "8 - 6 = 2", "moves": ["Move(B2, B5)", "Move(C3, C5)"]},
    {"solution": "9 - 9 = 0", "moves": ["Move(A5, C5)", "Move(C0, C6)"]},
    {"solution": "6 + 3 = 9", "moves": ["Move(A2, G0)", "Move(B6, C6)"]},
    {"solution": "9 - 0 = 9", "moves": ["Move(A5, B5)", "Move(B0, C6)"]}
  ],
  "option_answer": {
    "mode_1": [],
    "mode_2": [
      {"pick": ["B2", "C3"], "place": ["B5", "C5"]},
      {"pick": ["A5", "C0"], "place": ["C5", "C6"]},
      {"pick": ["A2", "B6"], "place": ["G0", "C6"]},
      {"pick": ["A5", "B0"], "place": ["B5", "C6"]}
    ]
  }
}
```

Evaluation protocol
- Input can be pure-visual or text-guided.
- The model must output a boxed `Move(...)` or a pair of moves in the specified format.
- Scoring checks both semantic validity and exact move-format parsing.
- Results can be broken down by level, move budget, solution multiplicity, and operator variation.
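For illustration, the order-invariant comparison against one `option_answer` entry can be sketched as follows. The official scorer is `cal_score.py`; the aligned `pick[i]`/`place[i]` pairing is inferred from the example record above.

```python
def moves_match(pred_moves, option) -> bool:
    """Check a predicted set of (pick, place) moves against one
    canonical option, ignoring the order the moves are reported in.

    `option` follows the benchmark's option_answer entries:
    {"pick": [...], "place": [...]} with aligned indices.
    """
    gold = sorted(zip(option["pick"], option["place"]))
    return sorted(pred_moves) == gold

option = {"pick": ["A5", "C0"], "place": ["C5", "C6"]}
print(moves_match([("C0", "C6"), ("A5", "C5")], option))
# True: same moves, different report order
```

A full check would accept a prediction if it matches any entry in `option_answer["mode_1"]` or `option_answer["mode_2"]`, e.g. `any(moves_match(pred, opt) for opt in ex["option_answer"]["mode_2"])`.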
If MathSticks, the public leaderboard, or the evaluation pipeline helps your work, please cite:
```bibtex
@article{mathsticks2025,
  title   = {MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles},
  author  = {Ji, Yuheng and Tan, Huajie and Chi, Cheng and Xu, Yijie and Zhao, Yuting and Zhou, Enshen and Lyu, Huaihai and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang and Zheng, Xiaolong},
  journal = {arXiv preprint arXiv:2510.00483},
  year    = {2025}
}
```

