HorizonMath

Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Erik Y. Wang*, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath,
Charles London, Kalyan Ramakrishnan, Flaviu Cipcigan, Philip Torr, Alessandro Abate

University of Oxford · Benchmark · Harvard University · Princeton University · Ellison Institute of Technology

*Correspondence: erik.wang@dtc.ox.ac.uk

Setup

uv sync

Some validators require SageMath. Install it separately (brew install sage on macOS, sudo apt-get install -y sagemath on Debian/Ubuntu). If Sage is not on your PATH, set SAGE_CMD=/path/to/sage.
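As a rough illustration of the lookup described above, the sketch below prefers SAGE_CMD and otherwise searches PATH for `sage`. The helper name `resolve_sage_cmd` is hypothetical and is not taken from the repository; the actual validators may resolve Sage differently.

```python
import os
import shutil


def resolve_sage_cmd(env=None):
    """Return the Sage executable to use, or None if none is found.

    Hypothetical helper: prefers SAGE_CMD, then falls back to `sage` on PATH.
    """
    env = os.environ if env is None else env
    explicit = env.get("SAGE_CMD")
    if explicit:
        # An explicit override wins, even if `sage` is also on PATH.
        return explicit
    return shutil.which("sage", path=env.get("PATH"))
```

Passing a plain dict as `env` makes the lookup easy to test without touching the real environment.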

Create a .env file with your API keys:

OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIza...

Problem Taxonomy

The benchmark contains 101 problems across 8 domains, classified by three fields in data/problems_full.json:

solvability (difficulty level): 0 (calibration, 10), 1 (likely solvable, 23), 2 (challenging, 60), 3 (possibly unsolvable, 8)
output_type (artifact type): constant (54), function (5), formula_discovery (3), construction (39)
evaluation_mode (how answers are checked): ground_truth_computable (59), benchmark_best_known (33), new_construction (9)

Solvability 0 problems have known solutions. They serve both as an end-to-end check of the evaluation pipeline and as a calibration set for models.
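The counts above could be reproduced with a short tally over the problem file. This is a sketch under one assumption: that data/problems_full.json holds a JSON list of objects carrying the three field names listed above. The file's real layout may differ, and `load_problems` and `tally` are illustrative names.

```python
import json
from collections import Counter


def load_problems(path="data/problems_full.json"):
    # Assumption: the file is a JSON list of problem objects.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def tally(problems, field):
    """Count problems by the value of one classification field."""
    return Counter(problem[field] for problem in problems)
```

For example, `tally(load_problems(), "evaluation_mode")` would give one count per evaluation mode if the assumed layout holds.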

Running the Benchmark

The benchmark has a two-phase pipeline:

Phase 1 — Generate responses (run_benchmark.py): Prompts models and saves raw responses to responses.jsonl.
Phase 2 — Evaluate responses (evaluate_responses.py): Evaluates saved responses against ground truths, validators, and baselines.

This separation means you can re-evaluate responses without re-prompting models, and long-running generation survives interruptions via --resume.
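To make the resume idea concrete, here is a minimal sketch of how a generation phase could skip problems that already have a saved response. It assumes each line of responses.jsonl is a JSON object with a `problem_id` key; that field name and both function names are assumptions for illustration, not the repository's implementation.

```python
import json
from pathlib import Path


def completed_ids(responses_path):
    """Collect problem IDs already present in a responses.jsonl file."""
    path = Path(responses_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A half-written line from an interrupted run is ignored,
                # so that problem is simply generated again.
                continue
            if isinstance(record, dict) and "problem_id" in record:
                done.add(record["problem_id"])
    return done


def pending(all_ids, responses_path):
    """Return the problem IDs that still need a response, in order."""
    done = completed_ids(responses_path)
    return [pid for pid in all_ids if pid not in done]
```

Appending one JSON line per finished problem is what makes this safe: an interruption can corrupt at most the last line.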

Running with tmux (recommended)

The tmux_run.sh wrapper runs both phases in a detached tmux session, so benchmark runs survive SSH disconnects. Output is logged to results/tmux_run_<timestamp>.log.

# Full benchmark with defaults (OpenRouter gpt-5.2, 5 parallel)
./scripts/tmux_run.sh

# Specify provider and model
./scripts/tmux_run.sh --provider openai --model gpt-5.2-pro

# Single problem
./scripts/tmux_run.sh --problem diff_basis_upper

# Resume an interrupted run
./scripts/tmux_run.sh --resume results/openrouter_openai-gpt-5.2_20260205_143022/

# Run only generation or evaluation
./scripts/tmux_run.sh --phase generate --provider openai --model gpt-5.2
./scripts/tmux_run.sh --phase evaluate --resume results/openrouter_openai-gpt-5.2_20260205_143022/

tmux attach -t openmath       # Attach to see live output
# Ctrl-b d                    # Detach without stopping
tmux kill-session -t openmath # Abort the run

Running each phase manually

Phase 1 — Generate responses:

uv run scripts/run_benchmark.py                                    # Full benchmark (OpenRouter gpt-5.2)
uv run scripts/run_benchmark.py --provider openai --model gpt-5.2-pro  # Use OpenAI directly
uv run scripts/run_benchmark.py --problem w4_watson_integral       # Single problem
uv run scripts/run_benchmark.py --parallel 10                      # Parallel generation
uv run scripts/run_benchmark.py --resume results/<run_dir>/        # Resume interrupted run

Phase 2 — Evaluate responses:

uv run scripts/evaluate_responses.py results/<run_dir>/            # Evaluate all responses
uv run scripts/evaluate_responses.py results/<run_dir>/ --force    # Re-evaluate from scratch

Output Structure

Results are saved to timestamped folders in results/:

results/openai_gpt-5.2_20260205_143022/
├── config.json          # Run configuration
├── prompts.jsonl        # Problem prompts (saved before API calls)
├── responses.jsonl      # Raw LLM responses (Phase 1 output)
├── evaluation.jsonl     # Per-problem evaluation results (Phase 2 output)
└── summary.json         # Detailed statistics

Both scripts support additional options — run with --help for full details.

Aggregating split runs

You can split a benchmark across parallel jobs using --range (0-based inclusive indices):

uv run scripts/run_benchmark.py --range 0-49 --provider openai --model gpt-5.2-pro
uv run scripts/run_benchmark.py --range 50-100 --provider openai --model gpt-5.2-pro

Then merge the result directories into a single report:

uv run scripts/aggregate_results.py results/openai_gpt-5.2-pro_*/ -o results/gpt-5.2-pro_combined/
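If you want more than two jobs, the inclusive 0-based ranges for --range can be computed rather than written by hand. The sketch below is an illustrative helper (`split_ranges` is not part of the repository) that divides the problem indices into contiguous, near-equal chunks.

```python
def split_ranges(total, jobs):
    """Split indices 0..total-1 into contiguous inclusive (start, end) ranges."""
    if total <= 0 or jobs <= 0:
        raise ValueError("total and jobs must be positive")
    jobs = min(jobs, total)  # never create empty ranges
    base, extra = divmod(total, jobs)
    ranges = []
    start = 0
    for i in range(jobs):
        # The first `extra` jobs take one additional problem each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def range_flags(total, jobs):
    """Format each range as a --range argument value, e.g. '0-50'."""
    return [f"{start}-{end}" for start, end in split_ranges(total, jobs)]
```

With 101 problems, `range_flags(101, 2)` yields `['0-50', '51-100']`, which covers the same indices as the 0-49 / 50-100 split shown above.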

Evaluating Individual Solutions

The evaluation script (scripts/evaluate.py) can evaluate a single LLM solution file. The mode is auto-detected from the problem's evaluation_mode:

Numeric (ground_truth_computable): compares returned digits against the ground truth.
Benchmark (benchmark_best_known): validates the construction and compares metrics against baselines.
Construction (new_construction): validates the construction (pass/fail, no baseline comparison).

uv run python scripts/evaluate.py --llm-output solution.txt --problem-index 34
uv run python scripts/evaluate.py --llm-output solution.txt --problem-id diff_basis_upper --json
uv run python scripts/evaluate.py --list-problems --mode benchmark
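For the numeric mode, one plausible way to count agreeing digits is via the relative error between the returned value and the ground truth. The sketch below uses only the standard library's `decimal` module; it is an assumption about how such a comparison could work, not a description of what scripts/evaluate.py actually does, and `matching_digits` is a hypothetical name.

```python
from decimal import Decimal, ROUND_FLOOR, localcontext


def matching_digits(candidate, truth, max_digits=100):
    """Estimate how many leading significant digits of two values agree.

    Both values are passed as strings so no precision is lost on input.
    The estimate is floor(-log10(relative error)), capped at max_digits.
    """
    with localcontext() as ctx:
        ctx.prec = max_digits + 10  # guard digits for the division and log
        c, t = Decimal(candidate), Decimal(truth)
        if c == t:
            return max_digits
        if t == 0:
            # Relative error is undefined; treat any mismatch as zero digits.
            return 0
        rel = abs(c - t) / abs(t)
        digits = (-rel.log10()).to_integral_value(rounding=ROUND_FLOOR)
        return max(0, min(max_digits, int(digits)))
```

A relative-error measure avoids edge cases of character-by-character comparison, such as 0.9999 versus 1.0000, at the cost of being an estimate rather than an exact digit count.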

LLM Output Format

LLMs should output a proposed_solution() function:

def proposed_solution():
    # For numeric problems: return a number
    # For benchmark/construction problems: return a JSON-serializable dict/list
    return {"n": 10, "basis": [0, 1, 2, 6, 9]}

Each validator documents its expected input format in its docstring.
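As an illustration of how such output might be consumed, the sketch below executes a solution's source text, calls `proposed_solution()`, and checks that the result is JSON-serializable. The function `run_proposed_solution` is hypothetical, and the real pipeline may extract and sandbox code quite differently. Note that `exec` runs arbitrary code, so this is only suitable for trusted input.

```python
import json


def run_proposed_solution(source):
    """Execute solution source text and return proposed_solution()'s result.

    Sketch only: exec() runs arbitrary code, so untrusted model output
    should be run in a sandbox or a separate process instead.
    """
    namespace = {}
    exec(source, namespace)
    func = namespace.get("proposed_solution")
    if not callable(func):
        raise ValueError("source does not define proposed_solution()")
    result = func()
    # Raises TypeError if the result is not JSON-serializable.
    json.dumps(result)
    return result
```

Checking serializability up front gives a clear error before the result reaches a validator.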

Contributing Problems

We welcome new problem contributions! To propose a problem, open a GitHub issue with the following:

Problem description — a clear mathematical statement, including the source (paper, Math Stack Exchange, etc.)
Classification — the proposed output_type, domain, evaluation_mode, and solvability level (see Problem Taxonomy)
Full implementation — depending on the evaluation mode, include:

Evaluation Mode	What to provide
`ground_truth_computable`	A numerics script that computes the answer to at least 50 digits of precision, or a reliable academic source providing a pre-computed numerical value
`benchmark_best_known`	A validator script and a baseline value with source citation
`new_construction`	A validator script

You must also justify the numerics or validator script you submit, explaining the method(s) used.

Numerics scripts

A numerics script computes the ground-truth value to high precision. It should be a standalone Python file:

from mpmath import mp
mp.dps = 110  # at least 100 digits of precision

def compute():
    # Your computation here
    return result

if __name__ == "__main__":
    print(str(compute()))

Validators

A validator checks whether a proposed construction is mathematically valid and returns metrics. It should export a single validate(solution) function:

from . import ValidationResult, success, failure

def validate(solution) -> ValidationResult:
    """
    Expected input format:
        {"n": n, "basis": [b0, b1, b2, ...]}
    """
    # 1. Parse the input
    if isinstance(solution, dict) and 'n' in solution and 'basis' in solution:
        n = solution['n']
        basis = solution['basis']
    else:
        return failure("Expected dict with 'n' and 'basis' keys")

    # 2. Check mathematical validity
    #    (all_differences_covered is a problem-specific helper you write yourself)
    if not all_differences_covered(basis, n):
        return failure("Not all differences covered", basis_size=len(basis))

    # 3. Return success with metrics
    return success(
        f"Valid basis of size {len(basis)}",
        basis_size=len(basis),
        ratio=len(basis)**2 / n
    )

success(message, **metrics) and failure(message, **metrics) are the only return types needed.
Document the expected input format in the docstring.
For benchmark problems, return metrics as keyword arguments — one of these is compared against the baseline.
Helper utilities available from the validators package: parse_integer, parse_rational, load_solution, run_sage_script.
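For a version of the pattern that runs on its own, the sketch below defines local stand-ins for `ValidationResult`, `success`, and `failure` (the package's real definitions may have different fields) together with one possible `all_differences_covered` helper. It assumes the problem asks for a set whose pairwise differences cover every integer in [1, n-1], and that the solution supplies `n` alongside `basis`.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    # Stand-in for the package's ValidationResult; real fields may differ.
    ok: bool
    message: str
    metrics: dict = field(default_factory=dict)


def success(message, **metrics):
    return ValidationResult(True, message, metrics)


def failure(message, **metrics):
    return ValidationResult(False, message, metrics)


def all_differences_covered(basis, n):
    """True if every integer in [1, n-1] is a difference of two basis elements."""
    diffs = {abs(a - b) for a in basis for b in basis}
    return all(d in diffs for d in range(1, n))


def validate(solution):
    """
    Expected input format:
        {"n": n, "basis": [b0, b1, b2, ...]}
    """
    # 1. Parse the input and reject malformed shapes early.
    if not (isinstance(solution, dict) and "n" in solution and "basis" in solution):
        return failure("Expected dict with 'n' and 'basis' keys")
    n, basis = solution["n"], solution["basis"]
    if not isinstance(n, int) or n < 2:
        return failure("n must be an integer >= 2")
    if not isinstance(basis, list) or not all(isinstance(b, int) for b in basis):
        return failure("basis must be a list of integers")

    # 2. Check mathematical validity.
    if not all_differences_covered(basis, n):
        return failure("Not all differences covered", basis_size=len(basis))

    # 3. Return success with metrics.
    return success(
        f"Valid basis of size {len(basis)}",
        basis_size=len(basis),
        ratio=len(basis) ** 2 / n,
    )
```

With the example from LLM Output Format, `validate({"n": 10, "basis": [0, 1, 2, 6, 9]})` succeeds with a ratio of 2.5, since the differences of that set cover 1 through 9.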

Baselines (benchmark problems only)

For benchmark_best_known problems, provide a baseline entry for data/baselines.json:

{
  "problem_id": "diff_basis_upper",
  "baseline": {
    "value": "2.6390",
    "direction": "minimize",
    "metric": "ratio |B|^2/n for a difference basis of [1, n-1]",
    "metric_key": "ratio"
  }
}

direction: "minimize" or "maximize" — whether lower or higher values are better.
metric_key: which key from the validator's returned metrics to compare against the baseline.
Include a source citation for the baseline value.
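The way `direction` and `metric_key` fit together can be sketched as a small comparison function. This is an illustration only: `beats_baseline` is a hypothetical name, and the strict inequality (a tie does not count as an improvement) is an assumption about the evaluation rules rather than documented behavior.

```python
def beats_baseline(metrics, baseline):
    """Return True if the validator's metric strictly improves on the baseline.

    `metrics` is the dict of keyword metrics returned by a validator;
    `baseline` is the inner "baseline" object from data/baselines.json.
    """
    key = baseline["metric_key"]
    if key not in metrics:
        raise KeyError(f"validator did not return metric {key!r}")
    achieved = float(metrics[key])
    reference = float(baseline["value"])  # stored as a string in the JSON
    direction = baseline["direction"]
    if direction == "minimize":
        return achieved < reference
    if direction == "maximize":
        return achieved > reference
    raise ValueError(f"unknown direction: {direction!r}")
```

Against the entry above, a validated construction with `ratio=2.5` would count as an improvement on the 2.6390 baseline under this rule.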

Citation

@article{wang2026horizonmath,
  title     = {HorizonMath: Measuring AI Progress Toward Mathematical
               Discovery with Automatic Verification},
  author    = {Wang, Erik Y. and Motwani, Sumeet and Roggeveen, James V.
               and Hodges, Eliot and Jayalath, Dulhan and London, Charles
               and Ramakrishnan, Kalyan and Cipcigan, Flaviu
               and Torr, Philip and Abate, Alessandro},
  year      = {2026},
  note      = {Working Draft}
}
