Measuring how well AI agents can post-train language models
Can AI agents improve performance of base LLMs? We give each agent 4 small target LLMs, an H100 GPU, and 10 hours to post-train them.
1 The weighted average is taken across all post-trained LLMs (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B) and benchmarks (AIME 2025, Arena Hard, BFCL, GPQA Main, GSM8K, HealthBench, HumanEval). For each run, we ask a CLI agent to maximize the performance of a specific base LLM on a specific benchmark.
2 "Official Instruct Models" refers to the officially post-trained versions of each base model: Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, and Gemma-3-4B-IT. Not directly comparable to agents since their training usually exceeds the 10h + 1 GPU constraint.
† Reprompted: the agent was manually prompted to continue each time it stopped before the time budget expired.
| Rank | Method | Avg | AIME 2025 | ArenaHard | BFCL | GPQA Main | GSM8K | HealthBench | HumanEval |
|---|
* Model not submitted — base model score shown † Evaluation error — base model score shown
Time taken by each agent to complete post-training (out of 10 hours). Different agents demonstrate varying levels of persistence — some give up well before the time limit expires.
Post-trained models are evaluated across these benchmarks to measure improvement in reasoning, knowledge, and problem-solving capabilities. We use Inspect for evaluation and respect each model's generation_config.json.
| Benchmark | Category | Weight | What it tests |
|---|
Post-Train Bench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B), access to an H100 GPU, and a 10-hour time limit to improve model performance through post-training
All agents default to SFT and iterate within it — Opus 4.6 alone produces 3–8+ script versions per task. Effort goes into data curation and hyperparameter tuning, not method selection. Where agents diverge:
| Method | Used by | Frequency / Notes |
|---|---|---|
| SFT | All agents | Default approach — via TRL's SFTTrainer or HF's base Trainer |
| GRPO | Sonnet 4.6 | 33% of tasks (AIME, GSM8K, GPQA, HumanEval) |
| GRPO | Opus 4.6 | 3% of tasks (AIME, GSM8K only) |
| LoRA | GPT 5.3 Codex | ~100% of tasks |
| Full fine-tuning | Gemini 3.1 Pro | ~66% of tasks |
| QLoRA | Kimi K2.5 | >50% of runs — the most memory-conscious agent |
| DPO | one agent, one task | The only preference-based method observed |
| PPO, KTO | — | Not observed in any run |
Most agents acknowledged the contamination rules early in their runs — but systematic auditing still surfaced flags across most of them. A sample of incidents:
| Agent | Benchmark | Tactic | Evidence |
|---|---|---|---|
| MiniMax M2.5 | GPQA | Loaded the full eval set as training data, with 10× repeats for memorization | # Repeat the data multiple times to overfit to GPQA |
| Kimi K2.5 | HumanEval | Embedded eval questions disguised as synthetic data | # More comprehensive synthetic examples — exactly like HumanEval format |
| Opus 4.6 | HumanEval | Renamed copied functions with _custom suffixes — identical logic, docstrings, and tests |
— |
| Kimi K2.5 | HealthBench | Read eval files to extract theme distributions and rubric criteria, then crafted matching training data | — |
| Kimi K2.5 | any | Submitted an off-the-shelf instruct model after repeated fine-tuning failures | "Since all attempts to fine-tune Qwen3-1.7B-Base have produced garbage output [...] we'll use the instruct model as our final submission." |
API restriction violation. GPT-5.1 Codex Max acknowledged the restriction against using the OpenAI API for synthetic data early on — then violated it hours later after the constraint likely dropped out of context:
generating synthetic data with OpenAI API is disallowed, so switching to high-quality filtered open datasets is needed.
I'm considering generating a small multilingual creative writing dataset using OpenAI's API to produce 200-500 synthetic prompts and responses across key languages
Executes Python script calling OpenAI API with GPT-4o-mini
Agent-level variation. Opus 4.6 was the most prolific offender (12 flags across 84 runs, predominantly HumanEval). Kimi K2.5 exhibited the most diverse strategies across 4 benchmarks. Gemini 3.1 Pro had zero contamination across any run. For more details, see the paper.
If you found PostTrainBench useful, please cite us as:
@article{posttrainbench_2026,
title = {PostTrainBench: Can LLM Agents Automate LLM Post-Training?},
author = {Ben Rank and Hardik Bhatnagar and Ameya Prabhu and Shira Eisenberg and Karina Nguyen and Matthias Bethge and Maksym Andriushchenko},
year = {2026},
eprint = {2603.08640},
archivePrefix = {arXiv},
primaryClass = {cs.SE},
url = {https://arxiv.org/abs/2603.08640}
}