
MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles

Bridging visual perception, symbolic manipulation, and arithmetic consistency.

arXiv · Dataset · 400-item benchmark · 100-item public subset · Pure visual only · Leaderboard · Detailed JSONs

MathSticks is a benchmark for Visual Symbolic Compositional Reasoning (VSCR) that jointly tests visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation in a seven-segment style, and the model must move exactly one or two sticks to repair it under strict conservation and digit-legibility constraints.

  • Two evaluation regimes. Text-guided and pure-visual settings diagnose whether failure comes from reading the puzzle or reasoning over it.
  • Systematic coverage. The benchmark spans Levels 1-4, one-stick vs. two-stick edits, solution multiplicity, and operator changes.
  • Two release scales. This repo keeps the full 400-item benchmark and also provides a fixed 100-item public subset with 25 samples per level for open leaderboard submissions.
  • Important setting. The current public 100-item snapshot is pure-visual only: the model sees the rendered image but is not given the equation string directly.

The released benchmark remains the curated MathSticks_bench_400.jsonl. The public leaderboard in this repo is a lighter-weight, fixed subset based on MathSticks_bench_100.jsonl, intended for faster iteration and lower-cost reruns rather than replacing the full benchmark.

In the pure-visual setting, the public snapshot therefore couples OCR-like perception with symbolic reasoning: the model must first read the equation from the image and then decide which sticks to move.

Perhaps the most striking part of MathSticks is the human-model difficulty gap: these puzzles often feel simple and intuitive to people, but even today's strongest frontier multimodal models still make surprisingly frequent mistakes. On the current public pure-visual snapshot, the best released model baselines reach 83/100, while the human reference is about 92/100.

MathSticks example
Example. Input puzzle, reasoning trace, and move-format prediction.

🔥 News

  • 2026-03-29: We updated the public leaderboard with the latest evaluation results on the current 100-item pure-visual subset.
  • 2025-10-17: MathSticks was accepted to the NeurIPS MATH-AI Workshop, 2025.
  • 2025-09-30: We open-sourced the benchmark, the dataset construction pipeline, and the paper.

Why MathSticks?

MathSticks is designed to be deceptively small but structurally demanding:

  • Easy for humans, still hard for frontier models. Tasks that feel almost trivial to people can still expose major weaknesses in current multimodal reasoning systems.
  • Vision alone is not enough. The model must correctly perceive lit vs. unlit segments and identify valid move locations.
  • Symbolic manipulation alone is not enough. The moved sticks must still form legible digits and a valid arithmetic equation.
  • Strict output parsing matters. The answer must be expressed in a boxed Move(...) format, so hand-wavy reasoning does not count.
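Because scoring enforces the exact boxed Move(...) syntax, a strict parser is the natural first stage of any evaluator. A minimal sketch, assuming a `\boxed{...}` wrapper and segment-cell names like `A2`/`G0` as seen in the examples below (the exact grammar used by `cal_score.py` may differ):

```python
import re

# Matches Move(src, dst) calls with cell names like A2 or G0,
# e.g. \boxed{Move(A2, G0), Move(B6, C6)}.
MOVE_RE = re.compile(r"Move\(\s*([A-G]\d+)\s*,\s*([A-G]\d+)\s*\)")

def parse_moves(answer: str):
    """Extract (pick, place) pairs from a boxed answer.

    Returns a list of 1 or 2 moves, or None when the response has no
    \boxed{...} span or the move count is outside the allowed budget.
    """
    boxed = re.search(r"\\boxed\{(.*?)\}", answer, flags=re.S)
    if not boxed:
        return None
    moves = MOVE_RE.findall(boxed.group(1))
    return moves if 1 <= len(moves) <= 2 else None
```

Under this strict regime, a response that describes the right fix in prose but never emits the boxed move string parses to `None` and scores zero.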

Evaluations across 14 VLMs reveal substantial limitations: closed-source models only become reliable on simpler cases, open-source models struggle in the pure-visual regime, and humans still maintain a large advantage. This makes MathSticks a compact but diagnostic stress test for multimodal reasoning.

What Is Released In This Repo?

MathSticks results summary
Results. Model performance across task regimes.
MathSticks statistics
Coverage. Difficulty, move complexity, multiplicity, and operator-flip statistics.

🏆 Leaderboard Snapshot

The table below reports the current public snapshot on MathSticks_bench_100.jsonl. This release uses the pure-visual / image-only setting: the model is shown the puzzle image, but the equation string is not provided. So the task includes OCR-like reading of the sticks in addition to symbolic reasoning. Each row also has a corresponding detailed JSON file in baseline_eval_results_100_subset/, so the released results are fully inspectable.

Three quick observations stand out:

  1. Human performance remains clearly ahead at roughly 92/100 on the public subset projection, showing that the benchmark still leaves substantial room above the best current model results.
  2. Among released model baselines, Gemini 3.1 Pro Preview and GPT-5.4-high are tied at the top with 83/100, though their level-wise profiles differ.
  3. Harder pure-visual cases remain brittle for most models, with many systems still collapsing to near-zero accuracy under strict move-format scoring.
| Rank | Provider | Model | Accuracy | Correct | L1 | L2 | L3 | L4 |
|---|---|---|---|---|---|---|---|---|
| 👤 Ref. | Human | Human | 92.00% | 92/100 | 96.00% | 100.00% | 84.00% | 88.00% |
| 🥇 1 | Google Gemini | gemini-3.1-pro-preview | 83.00% | 83/100 | 96.00% | 76.00% | 72.00% | 88.00% |
| 🥈 2 | OpenAI GPT | gpt-5.4-high | 83.00% | 83/100 | 88.00% | 84.00% | 76.00% | 84.00% |
| 🥉 3 | Qwen | qwen3.5-397b-a17b | 55.00% | 55/100 | 76.00% | 36.00% | 56.00% | 52.00% |
| 4 | Qwen | qwen3.5-plus | 40.00% | 40/100 | 76.00% | 24.00% | 24.00% | 36.00% |
| 5 | Doubao | doubao-seed-2-0-pro | 17.00% | 17/100 | 48.00% | 4.00% | 4.00% | 12.00% |
| 6 | Kimi | kimi-k2.5 | 2.00% | 2/100 | 8.00% | 0.00% | 0.00% | 0.00% |
| 7 | OpenAI GPT | gpt-5.4 | 1.00% | 1/100 | 4.00% | 0.00% | 0.00% | 0.00% |
| 8 | DeepSeek | deepseek-v3.2 | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 9 | GLM | glm-5 | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 10 | OpenAI GPT | gpt-5.2 | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 11 | Grok | grok-4-1-fast-non-reasoning | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 12 | Grok | grok-4-1-fast-reasoning | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 13 | Llama | llama-4-maverick | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |

Evaluate Your Own Model

The repo includes both a generic evaluator and the public leaderboard runner.

  1. Run generic evaluation:

```bash
python eval.py \
  --input MathSticks_bench_400.jsonl \
  --image-dir ./image \
  --output predictions.jsonl
```

  2. Score predictions:

```bash
python cal_score.py \
  --pred predictions.jsonl \
  --label MathSticks_bench_400.jsonl \
  --output score.json
```

  3. Reproduce the public 100-item leaderboard snapshot:

```bash
API_KEY=xxx BASE_URL=https://your-proxy.example/v1 ./run_api_eval.sh
```

This uses the committed MathSticks_bench_100.jsonl subset and writes detailed outputs under baseline_eval_results_100_subset/. The default public snapshot is pure-visual, so the model only receives the rendered image.
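Since `run_api_eval.sh` targets an OpenAI-compatible endpoint (note the `BASE_URL` variable above), the pure-visual request can be sketched as a chat payload that carries only the rendered image and a task prompt. This is an illustrative sketch: the prompt wording and payload shape follow the common chat-completions image convention, not necessarily the repo's exact request.

```python
import base64

def build_visual_request(image_bytes: bytes, model: str) -> dict:
    """Build an OpenAI-style chat payload for the pure-visual setting:
    the puzzle is sent only as an image, never as an equation string."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    prompt = (
        "Fix the matchstick equation by moving exactly one or two sticks. "
        "Answer only with \\boxed{Move(...)}."
    )
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

Keeping the equation string out of the payload is what makes the snapshot image-only: any transcription error the model makes while reading the sticks propagates into its symbolic reasoning.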

Collaboration and Open Evaluation

We welcome community submissions of new model results.

Please:

  1. Evaluate on MathSticks_bench_100.jsonl so submissions stay comparable.
  2. Add both the detailed per-example output file (<model>.json) and the summary score file (<model>.score.json) under baseline_eval_results_100_subset/.
  3. Refresh baseline_eval_results_100_subset/leaderboard.json and baseline_eval_results_100_subset/leaderboard.md.
  4. Mention the provider/API and any non-default evaluation settings in the PR description.

Benchmark format

Benchmark JSONL format

Each line in the benchmark JSONL contains one puzzle with the following fields:

  • id: unique sample identifier, e.g. "00075585".
  • level: difficulty level in 1-4.
  • image: image path relative to repo root, e.g. level1/00075585_8-9=3.png.
  • problem: the displayed equation string.
  • solution_num: [one_move_count, two_move_count].
  • mode_1_solution: canonical one-move solutions.
  • mode_2_solution: canonical two-move solutions.
  • option_answer: order-invariant move representation for robust parsing.

Example:

```json
{
  "id": "00075585",
  "level": 1,
  "problem": "8-9=3",
  "image": "level1/00075585_8-9=3.png",
  "solution_num": [0, 4],
  "mode_1_solution": [],
  "mode_2_solution": [
    {"solution": "8 - 6 = 2", "moves": ["Move(B2, B5)", "Move(C3, C5)"]},
    {"solution": "9 - 9 = 0", "moves": ["Move(A5, C5)", "Move(C0, C6)"]},
    {"solution": "6 + 3 = 9", "moves": ["Move(A2, G0)", "Move(B6, C6)"]},
    {"solution": "9 - 0 = 9", "moves": ["Move(A5, B5)", "Move(B0, C6)"]}
  ],
  "option_answer": {
    "mode_1": [],
    "mode_2": [
      {"pick": ["B2", "C3"], "place": ["B5", "C5"]},
      {"pick": ["A5", "C0"], "place": ["C5", "C6"]},
      {"pick": ["A2", "B6"], "place": ["G0", "C6"]},
      {"pick": ["A5", "B0"], "place": ["B5", "C6"]}
    ]
  }
}
```
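The pick/place lists in `option_answer` make correctness checking order-invariant: a predicted move pair matches a reference solution when the picked cells and the placed cells agree as sets, regardless of how the two moves are ordered. A minimal sketch under that reading (function names are illustrative, not the repo's API):

```python
def matches_reference(pred_moves, reference):
    """pred_moves: list of (pick, place) tuples; reference: one option_answer entry.

    Compares pick cells and place cells as sets, so move order is irrelevant.
    """
    return (set(p for p, _ in pred_moves) == set(reference["pick"])
            and set(q for _, q in pred_moves) == set(reference["place"]))

def is_correct(pred_moves, option_answer):
    """A prediction is correct if it matches any canonical solution
    with the same move budget (mode_1 = one stick, mode_2 = two sticks)."""
    mode = "mode_1" if len(pred_moves) == 1 else "mode_2"
    return any(matches_reference(pred_moves, ref) for ref in option_answer[mode])
```

For the sample above, `[("B6", "C6"), ("A2", "G0")]` matches the `6 + 3 = 9` solution even though the two moves are listed in the opposite order.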
Evaluation protocol
  • Input can be pure-visual or text-guided.
  • The model must output a boxed Move(...) or a pair of moves in the specified format.
  • Scoring checks both semantic validity and exact move-format parsing.
  • Results can be broken down by level, move budget, solution multiplicity, and operator variation.
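The level-wise columns in the leaderboard follow directly from such a breakdown. A minimal aggregation sketch, assuming per-example records carry a `level` field (1-4, as in the benchmark JSONL) and a boolean `correct` flag (a hypothetical field name for illustration):

```python
from collections import defaultdict

def accuracy_by_level(records):
    """Compute per-level accuracy from scored per-example records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["level"]] += 1
        hits[r["level"]] += int(r["correct"])
    return {lvl: hits[lvl] / totals[lvl] for lvl in sorted(totals)}
```

The same grouping works for any of the other axes (move budget, solution multiplicity, operator variation) by swapping the key field.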

Citation

If MathSticks, the public leaderboard, or the evaluation pipeline helps your work, please cite:

```bibtex
@article{mathsticks2025,
  title   = {MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles},
  author  = {Ji, Yuheng and Tan, Huajie and Chi, Cheng and Xu, Yijie and Zhao, Yuting and Zhou, Enshen and Lyu, Huaihai and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang and Zheng, Xiaolong},
  journal = {arXiv preprint arXiv:2510.00483},
  year    = {2025}
}
```

About

[NeurIPS 2025 Workshop] Official implementation of MathSticks
