
MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles

Bridging visual perception, symbolic manipulation, and arithmetic consistency.

arXiv · Dataset · 400-item benchmark · 100-item public subset · Pure visual only · Leaderboard · Detailed JSONs

MathSticks is a benchmark for Visual Symbolic Compositional Reasoning (VSCR) that jointly tests visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation in a seven-segment style, and the model must move exactly one or two sticks to repair it under strict conservation and digit-legibility constraints.

  • Two evaluation regimes. Text-guided and pure-visual settings diagnose whether failure comes from reading the puzzle or reasoning over it.
  • Systematic coverage. The benchmark spans Levels 1-4, one-stick vs. two-stick edits, solution multiplicity, and operator changes.
  • Two release scales. This repo keeps the full 400-item benchmark and also provides a fixed 100-item public subset with 25 samples per level for open leaderboard submissions.
  • Important setting. The current public 100-item snapshot is pure-visual only: the model sees the rendered image but is not given the equation string directly.

The released benchmark remains the curated MathSticks_bench_400.jsonl. The public leaderboard in this repo is a lighter-weight, fixed subset based on MathSticks_bench_100.jsonl, intended for faster iteration and lower-cost reruns rather than replacing the full benchmark.

In the pure-visual setting, the public snapshot therefore couples OCR-like perception with symbolic reasoning: the model must first read the equation from the image and then decide which sticks to move.

Perhaps the most striking part of MathSticks is the human-model difficulty gap: these puzzles often feel simple and intuitive to people, but even today's strongest frontier multimodal models still make surprisingly frequent mistakes. On the current public pure-visual snapshot, the best released model baselines reach 83/100, while the human reference is about 92/100.

MathSticks example
Example. Input puzzle, reasoning trace, and move-format prediction.

🔥 News

  • 2026-03-29: We updated the public leaderboard with the latest evaluation results on the current 100-item pure-visual subset.
  • 2025-10-17: MathSticks was accepted to the NeurIPS MATH-AI Workshop, 2025.
  • 2025-09-30: We open-sourced the benchmark, the dataset construction pipeline, and the paper.

Why MathSticks?

MathSticks is designed to be deceptively small but structurally demanding:

  • Easy for humans, still hard for frontier models. Tasks that feel almost trivial to people can still expose major weaknesses in current multimodal reasoning systems.
  • Vision alone is not enough. The model must correctly perceive lit vs. unlit segments and identify valid move locations.
  • Symbolic manipulation alone is not enough. The moved sticks must still form legible digits and a valid arithmetic equation.
  • Strict output parsing matters. The answer must be expressed in a boxed Move(...) format, so hand-wavy reasoning does not count.
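Because scoring enforces the exact boxed Move(...) syntax, a strict parser is the natural first stage of any evaluator. A minimal sketch, assuming a `\boxed{...}` wrapper and segment-cell names like `A2`/`G0` as seen in the examples below (the exact grammar used by `cal_score.py` may differ):

```python
import re

# Matches Move(src, dst) calls with cell names like A2 or G0,
# e.g. \boxed{Move(A2, G0), Move(B6, C6)}.
MOVE_RE = re.compile(r"Move\(\s*([A-G]\d+)\s*,\s*([A-G]\d+)\s*\)")

def parse_moves(answer: str):
    """Extract (pick, place) pairs from a boxed answer.

    Returns a list of 1 or 2 moves, or None when the response has no
    \boxed{...} span or the move count is outside the allowed budget.
    """
    boxed = re.search(r"\\boxed\{(.*?)\}", answer, flags=re.S)
    if not boxed:
        return None
    moves = MOVE_RE.findall(boxed.group(1))
    return moves if 1 <= len(moves) <= 2 else None
```

Under this strict regime, a response that describes the right fix in prose but never emits the boxed move string parses to `None` and scores zero.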

Evaluations across 14 VLMs reveal substantial limitations: closed-source models only become reliable on simpler cases, open-source models struggle in the pure-visual regime, and humans still maintain a large advantage. This makes MathSticks a compact but diagnostic stress test for multimodal reasoning.

What Is Released In This Repo?

MathSticks results summary
Results. Model performance across task regimes.
MathSticks statistics
Coverage. Difficulty, move complexity, multiplicity, and operator-flip statistics.

🏆 Leaderboard Snapshot

The table below reports the current public snapshot on MathSticks_bench_100.jsonl. This release uses the pure-visual / image-only setting: the model is shown the puzzle image, but the equation string is not provided. So the task includes OCR-like reading of the sticks in addition to symbolic reasoning. Each row also has a corresponding detailed JSON file in baseline_eval_results_100_subset/, so the released results are fully inspectable.

Three quick observations stand out:

  1. Human performance remains clearly ahead at roughly 92/100 on the public subset projection, showing that the benchmark still leaves substantial room above the best current model results.
  2. Among released model baselines, Gemini 3.1 Pro Preview and GPT-5.4-high are tied at the top with 83/100, though their level-wise profiles differ.
  3. Harder pure-visual cases remain brittle for most models, with many systems still collapsing to near-zero accuracy under strict move-format scoring.
| Rank | Provider | Model | Accuracy | Correct | L1 | L2 | L3 | L4 |
|---|---|---|---|---|---|---|---|---|
| 👤 Ref. | Human | Human | 92.00% | 92/100 | 96.00% | 100.00% | 84.00% | 88.00% |
| 🥇 1 | Google Gemini | gemini-3.1-pro-preview | 83.00% | 83/100 | 96.00% | 76.00% | 72.00% | 88.00% |
| 🥈 2 | OpenAI GPT | gpt-5.4-high | 83.00% | 83/100 | 88.00% | 84.00% | 76.00% | 84.00% |
| 🥉 3 | Qwen | qwen3.5-397b-a17b | 55.00% | 55/100 | 76.00% | 36.00% | 56.00% | 52.00% |
| 4 | Qwen | qwen3.5-plus | 40.00% | 40/100 | 76.00% | 24.00% | 24.00% | 36.00% |
| 5 | Doubao | doubao-seed-2-0-pro | 17.00% | 17/100 | 48.00% | 4.00% | 4.00% | 12.00% |
| 6 | Kimi | kimi-k2.5 | 2.00% | 2/100 | 8.00% | 0.00% | 0.00% | 0.00% |
| 7 | OpenAI GPT | gpt-5.4 | 1.00% | 1/100 | 4.00% | 0.00% | 0.00% | 0.00% |
| 8 | DeepSeek | deepseek-v3.2 | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 9 | GLM | glm-5 | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 10 | OpenAI GPT | gpt-5.2 | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 11 | Grok | grok-4-1-fast-non-reasoning | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 12 | Grok | grok-4-1-fast-reasoning | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |
| 13 | Llama | llama-4-maverick | 0.00% | 0/100 | 0.00% | 0.00% | 0.00% | 0.00% |

Evaluate Your Own Model

The repo includes both a generic evaluator and the public leaderboard runner.

  1. Run generic evaluation:

```bash
python eval.py \
  --input MathSticks_bench_400.jsonl \
  --image-dir ./image \
  --output predictions.jsonl
```

  2. Score predictions:

```bash
python cal_score.py \
  --pred predictions.jsonl \
  --label MathSticks_bench_400.jsonl \
  --output score.json
```

  3. Reproduce the public 100-item leaderboard snapshot:

```bash
API_KEY=xxx BASE_URL=https://your-proxy.example/v1 ./run_api_eval.sh
```

This uses the committed MathSticks_bench_100.jsonl subset and writes detailed outputs under baseline_eval_results_100_subset/. The default public snapshot is pure-visual, so the model only receives the rendered image.
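Since `run_api_eval.sh` targets an OpenAI-compatible endpoint (note the `BASE_URL` variable above), the pure-visual request can be sketched as a chat payload that carries only the rendered image and a task prompt. This is an illustrative sketch: the prompt wording and payload shape follow the common chat-completions image convention, not necessarily the repo's exact request.

```python
import base64

def build_visual_request(image_bytes: bytes, model: str) -> dict:
    """Build an OpenAI-style chat payload for the pure-visual setting:
    the puzzle is sent only as an image, never as an equation string."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    prompt = (
        "Fix the matchstick equation by moving exactly one or two sticks. "
        "Answer only with \\boxed{Move(...)}."
    )
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

Keeping the equation string out of the payload is what makes the snapshot image-only: any transcription error the model makes while reading the sticks propagates into its symbolic reasoning.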

Collaboration and Open Evaluation

We welcome community submissions of new model results.

Please:

  1. Evaluate on MathSticks_bench_100.jsonl so submissions stay comparable.
  2. Add both the detailed per-example output file (<model>.json) and the summary score file (<model>.score.json) under baseline_eval_results_100_subset/.
  3. Refresh baseline_eval_results_100_subset/leaderboard.json and baseline_eval_results_100_subset/leaderboard.md.
  4. Mention the provider/API and any non-default evaluation settings in the PR description.

Benchmark format

Benchmark JSONL format

Each line in the benchmark JSONL contains one puzzle with the following fields:

  • id: unique sample identifier, e.g. "00075585".
  • level: difficulty level in 1-4.
  • image: image path relative to repo root, e.g. level1/00075585_8-9=3.png.
  • problem: the displayed equation string.
  • solution_num: [one_move_count, two_move_count].
  • mode_1_solution: canonical one-move solutions.
  • mode_2_solution: canonical two-move solutions.
  • option_answer: order-invariant move representation for robust parsing.

Example:

```json
{
  "id": "00075585",
  "level": 1,
  "problem": "8-9=3",
  "image": "level1/00075585_8-9=3.png",
  "solution_num": [0, 4],
  "mode_1_solution": [],
  "mode_2_solution": [
    {"solution": "8 - 6 = 2", "moves": ["Move(B2, B5)", "Move(C3, C5)"]},
    {"solution": "9 - 9 = 0", "moves": ["Move(A5, C5)", "Move(C0, C6)"]},
    {"solution": "6 + 3 = 9", "moves": ["Move(A2, G0)", "Move(B6, C6)"]},
    {"solution": "9 - 0 = 9", "moves": ["Move(A5, B5)", "Move(B0, C6)"]}
  ],
  "option_answer": {
    "mode_1": [],
    "mode_2": [
      {"pick": ["B2", "C3"], "place": ["B5", "C5"]},
      {"pick": ["A5", "C0"], "place": ["C5", "C6"]},
      {"pick": ["A2", "B6"], "place": ["G0", "C6"]},
      {"pick": ["A5", "B0"], "place": ["B5", "C6"]}
    ]
  }
}
```
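The pick/place lists in `option_answer` make correctness checking order-invariant: a predicted move pair matches a reference solution when the picked cells and the placed cells agree as sets, regardless of how the two moves are ordered. A minimal sketch under that reading (function names are illustrative, not the repo's API):

```python
def matches_reference(pred_moves, reference):
    """pred_moves: list of (pick, place) tuples; reference: one option_answer entry.

    Compares pick cells and place cells as sets, so move order is irrelevant.
    """
    return (set(p for p, _ in pred_moves) == set(reference["pick"])
            and set(q for _, q in pred_moves) == set(reference["place"]))

def is_correct(pred_moves, option_answer):
    """A prediction is correct if it matches any canonical solution
    with the same move budget (mode_1 = one stick, mode_2 = two sticks)."""
    mode = "mode_1" if len(pred_moves) == 1 else "mode_2"
    return any(matches_reference(pred_moves, ref) for ref in option_answer[mode])
```

For the sample above, `[("B6", "C6"), ("A2", "G0")]` matches the `6 + 3 = 9` solution even though the two moves are listed in the opposite order.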
Evaluation protocol
  • Input can be pure-visual or text-guided.
  • The model must output a boxed Move(...) or a pair of moves in the specified format.
  • Scoring checks both semantic validity and exact move-format parsing.
  • Results can be broken down by level, move budget, solution multiplicity, and operator variation.
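The level-wise columns in the leaderboard follow directly from such a breakdown. A minimal aggregation sketch, assuming per-example records carry a `level` field (1-4, as in the benchmark JSONL) and a boolean `correct` flag (a hypothetical field name for illustration):

```python
from collections import defaultdict

def accuracy_by_level(records):
    """Compute per-level accuracy from scored per-example records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["level"]] += 1
        hits[r["level"]] += int(r["correct"])
    return {lvl: hits[lvl] / totals[lvl] for lvl in sorted(totals)}
```

The same grouping works for any of the other axes (move budget, solution multiplicity, operator variation) by swapping the key field.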

Citation

If MathSticks, the public leaderboard, or the evaluation pipeline helps your work, please cite:

```bibtex
@article{mathsticks2025,
  title   = {MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles},
  author  = {Ji, Yuheng and Tan, Huajie and Chi, Cheng and Xu, Yijie and Zhao, Yuting and Zhou, Enshen and Lyu, Huaihai and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang and Zheng, Xiaolong},
  journal = {arXiv preprint arXiv:2510.00483},
  year    = {2025}
}
```

About

[NeurIPS 2025 Workshop] Official implementation of MathSticks
