AutoLab is an open benchmark arena for evaluating AI agents on performance engineering tasks. Unlike traditional benchmarks that test one-shot correctness, AutoLab measures whether models can execute iterative improvement cycles: proposing optimizations, testing them, measuring results, and revising approaches under real constraints.

How many tasks does AutoLab benchmark?

AutoLab benchmarks AI agents across 36 tasks spanning four domains: system optimization, puzzle & challenge, model development, and CUDA. Tasks range from training language models and fine-tuning with LoRA to optimizing sorting algorithms and building concurrent key-value stores.

Which AI models are evaluated on AutoLab?

AutoLab evaluates 11 frontier AI models: Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.6, MiMo V2.5 Pro, GLM-5, DeepSeek V4 Pro, GPT-5.4, Grok 4-20, Hunyuan 3 Preview, MiniMax M2.7, and Qwen 3.6 Plus. Each model is tested under identical compute budgets to ensure a fair comparison.

How is AutoLab different from other AI benchmarks?

AutoLab replaces the exam with the laboratory. Instead of testing static question answering, it evaluates closed-loop resilience — the ability to survive negative empirical feedback, update hypotheses, and restructure approaches. Models must run code, measure real performance metrics, and iterate within fixed compute budgets.

AutoLab — A Benchmark for AI Agents Driving Scientific and Engineering Progress

// leaderboard

Model Leaderboard

11 frontier models ranked by Avg@3 across 36 tasks. Each bar extends to Best@3. Switch tabs to rank by category.

Overall score (chart axis 0 → 0.8)

Model	Avg@3	Best@3
Claude-Opus-4.6	0.68	0.76
Gemini-3.1-Pro	0.50	0.59
Kimi-K2.6	0.46	0.60
MiMo-V2.5-Pro	0.45	0.58
GLM-5	0.43	0.55
DeepSeek-V4-Pro	0.38	0.51
GPT-5.4	0.36	0.53
Grok-4-20	0.35	0.44
Hunyuan-3-Preview	0.31	0.45
MiniMax-M2.7	0.27	0.43
Qwen-3.6-Plus	0.27	0.39

Updated May 10, 2026 · Avg@3 = mean over 3 runs · Best@3 = best of 3 runs


// detail scores

Task Detail Scores

Raw scores per task per model. Shaded against the top score overall — greener = higher, pinker = lower.

Task	Claude-Opus-4.6	Gemini-3.1-Pro	Kimi-K2.6	MiMo-V2.5-Pro	GLM-5	DeepSeek-V4-Pro	GPT-5.4	Grok-4-20	Hunyuan-3-Preview	MiniMax-M2.7	Qwen-3.6-Plus
Model Development (7 tasks)
Data Select Ifeval	0.64	0.49	0.48	0.38	0.28	0.46	0.09	0.49	0.23	0.43	0.28
Flux2 Klein Lora	0.48	0.07	0.00	0.17	0.22	0.00	0.28	0.09	0.00	0.29	0.06
Grpo Multisource	0.84	0.79	0.59	0.81	0.82	0.85	0.83	0.00	0.57	0.93	0.84
Llm Online Serving	0.00	0.00	0.00	0.01	0.03	0.04	0.31	0.34	0.00	0.00	0.28
Moving Mnist World Model	0.68	0.32	0.22	0.30	0.30	0.19	0.16	0.02	0.28	0.19	0.18
Multilingual Ocr	0.89	0.69	0.98	0.88	0.88	0.63	0.44	0.00	0.83	0.70	0.45
Scaling Law	0.85	0.14	0.65	0.78	0.61	0.30	0.33	0.00	0.00	0.43	0.63
Puzzle & Challenge (10 tasks)
Adaptive Compression	0.68	0.54	0.37	0.28	0.23	0.31	0.43	0.14	0.00	0.23	0.49
Adversarial Splay	0.61	0.54	0.56	0.51	0.55	0.00	0.36	0.17	0.58	0.35	0.35
Discover Sorting	1.00	0.95	0.90	0.90	0.85	0.57	0.85	0.85	0.25	0.00	0.57
Fredkin Sort Network	0.96	0.31	0.47	0.21	0.60	0.29	0.33	0.62	0.36	0.00	0.00
Resnet Bit Flip	0.63	0.90	0.32	0.92	0.06	0.60	0.32	0.32	0.26	0.35	0.53
Safety Router	0.99	0.99	0.66	0.96	0.64	0.66	0.00	0.31	0.00	0.31	0.66
Smallest Game Player	0.61	0.00	0.00	0.09	0.00	0.00	0.32	0.00	0.00	0.00	0.00
Stack Machine Golf	1.00	1.00	0.67	0.79	0.48	1.00	0.16	0.96	0.38	0.19	0.00
Toy Isa Opt	0.98	1.00	0.00	0.84	0.97	0.59	0.89	0.97	0.65	0.80	0.00
Vliw Scheduler	1.00	0.94	0.84	0.85	0.53	0.31	0.83	0.99	1.00	0.39	0.00
CUDA (4 tasks)
Huffman Canonical Decode	0.45	0.38	0.20	0.11	0.11	0.00	0.00	0.09	0.02	0.00	0.00
Icp Correspondence Step	0.55	0.52	0.51	0.45	0.50	0.00	0.44	0.28	0.15	0.27	0.00
Msm Pippenger Bls12 381	0.16	0.00	0.10	0.00	0.21	0.00	0.10	0.10	0.00	0.00	0.00
Ntt Butterfly	0.37	0.00	0.17	0.15	0.15	0.00	0.00	0.25	0.11	0.22	0.00
System Optimization (15 tasks)
Aes128 Ctr	0.75	0.69	0.70	0.65	0.70	0.37	0.55	0.52	0.67	0.00	0.03
Agent Tool Routing	0.65	0.43	0.37	0.30	0.45	0.25	0.13	0.22	0.15	0.36	0.38
Bm25 Search Go	0.83	0.54	0.64	0.51	0.51	0.60	0.50	0.51	0.32	0.50	0.56
Bvh Raytracer	0.44	0.39	0.41	0.40	0.28	0.37	0.13	0.36	0.37	0.07	0.11
Concurrent Kv Wal	0.96	0.76	0.56	0.47	0.63	0.83	0.53	0.53	0.32	0.28	0.25
Fft Rust	0.00	0.55	0.39	0.36	0.53	0.57	0.52	0.53	0.37	0.35	0.55
Flash Attention	0.85	0.39	0.65	0.43	0.57	0.33	0.32	0.44	0.48	0.43	0.35
Gaussian Blur	0.71	0.49	0.64	0.54	0.20	0.49	0.34	0.28	0.28	0.09	0.00
Hash Join	0.81	0.66	0.70	0.66	0.68	0.70	0.61	0.58	0.65	0.58	0.67
Levenshtein Distance	0.58	0.53	0.52	0.08	0.12	0.49	0.47	0.00	0.04	0.08	0.01
Radix Sort	0.71	0.64	0.46	0.67	0.62	0.69	0.54	0.63	0.66	0.22	0.68
Regex Engine	1.00	0.41	0.37	0.02	0.12	0.00	0.19	0.08	0.07	0.08	0.00
Sha256 Throughput	0.46	0.35	0.44	0.26	0.18	0.13	0.14	0.15	0.09	0.14	0.13
Sstable Compaction Rs	0.83	0.27	0.70	0.40	0.61	0.72	0.18	0.59	0.62	0.39	0.52
Z Order Range Scan	0.45	0.20	0.31	0.17	0.41	0.31	0.32	0.26	0.35	0.15	0.10
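
The per-category rankings can be reproduced from the rows above. The sketch below assumes a category score is the plain mean of its task scores; the page does not state this, although Claude-Opus-4.6's 36 task scores do average to roughly 0.68, matching its overall leaderboard entry. The CUDA rows are copied from the table, with task names written in their identifier form; `category_means` is a hypothetical helper, not part of AutoLab.

```python
from statistics import mean

# Column order follows the table header.
MODELS = [
    "Claude-Opus-4.6", "Gemini-3.1-Pro", "Kimi-K2.6", "MiMo-V2.5-Pro",
    "GLM-5", "DeepSeek-V4-Pro", "GPT-5.4", "Grok-4-20",
    "Hunyuan-3-Preview", "MiniMax-M2.7", "Qwen-3.6-Plus",
]

# The four CUDA rows from the detail table, whitespace-separated.
CUDA_ROWS = """
huffman_canonical_decode 0.45 0.38 0.20 0.11 0.11 0.00 0.00 0.09 0.02 0.00 0.00
icp_correspondence_step  0.55 0.52 0.51 0.45 0.50 0.00 0.44 0.28 0.15 0.27 0.00
msm_pippenger_bls12_381  0.16 0.00 0.10 0.00 0.21 0.00 0.10 0.10 0.00 0.00 0.00
ntt_butterfly            0.37 0.00 0.17 0.15 0.15 0.00 0.00 0.25 0.11 0.22 0.00
"""


def category_means(rows: str, models: list[str]) -> dict[str, float]:
    """Average each model's scores over all task rows (assumed aggregation)."""
    per_model: dict[str, list[float]] = {m: [] for m in models}
    for line in rows.strip().splitlines():
        _task, *scores = line.split()
        for model, score in zip(models, scores):
            per_model[model].append(float(score))
    return {model: mean(values) for model, values in per_model.items()}


cuda = category_means(CUDA_ROWS, MODELS)
ranking = sorted(cuda, key=cuda.get, reverse=True)
for model in ranking[:3]:
    print(f"{model:18s} {cuda[model]:.3f}")
```

Under this assumption Claude-Opus-4.6 leads the CUDA category, followed closely by Kimi-K2.6 and GLM-5.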

// what we measure

36 Tasks Across Four Categories

Every task is a real engineering or research problem with hours-long horizons, scored on a continuous scale rather than pass/fail. Categories span a spectrum of scope — from a single sharp insight to an end-to-end LLM pipeline.

Architectural

System Optimization

15 tasks

Low-level performance engineering of systems primitives — kernels, sorting, hashing, search, compression, regex, and cryptography in C, Rust, Go, and Python.

example tasks

flash_attention, bm25_search_go, concurrent_kv_wal, regex_engine, aes128_ctr

Single insight

Puzzle & Challenge

10 tasks

Algorithmic problems built around one key insight — combinatorial reductions, sorting networks, ISA-level scheduling, adversarial constructions, and adaptive coding.

example tasks

discover_sorting, fredkin_sort_network, stack_machine_golf, toy_isa_opt, vliw_scheduler

Full pipeline

Model Development

7 tasks

The full LLM pipeline — pretraining scaling laws, RL post-training, SFT data selection, parameter-efficient fine-tuning, world-model training, and online serving.

example tasks

scaling_law, grpo_multisource, data_select_ifeval, multilingual_ocr, llm_online_serving

GPU kernels

CUDA

4 tasks

GPU kernel optimization for cryptographic primitives, point-cloud registration, and compression.

example tasks

huffman_canonical_decode, icp_correspondence_step, msm_pippenger_bls12_381, ntt_butterfly

How tasks are scored

Baseline: 0.0 · Reference: 0.5 · Optimum: 1.0

Scores are anchored to a working-but-unoptimized baseline (0.0) and a strong human reference (0.5), approaching 1.0 near the practical optimum. Speed and throughput tasks use a log-stretch (gains span orders of magnitude); bounded-quality tasks like accuracy and perplexity use linear interpolation. A correctness gate must pass before any optimization score is recorded.
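
The anchoring described above can be sketched as a piecewise interpolation through the three anchor points. This is only an illustration under stated assumptions: the exact AutoLab formula is not given here, the function name `anchored_score` and its parameters are hypothetical, the "log-stretch" is read as interpolation in log space, and the result is clamped to [0, 1]. A failed correctness gate returns `None`, meaning no score is recorded; how such runs enter the averages is not specified.

```python
import math
from typing import Optional


def anchored_score(
    value: float,
    baseline: float,
    reference: float,
    optimum: float,
    *,
    log_scale: bool,
    passed_gate: bool = True,
) -> Optional[float]:
    """Map a raw metric to [0, 1] with baseline -> 0.0, reference -> 0.5, optimum -> 1.0.

    Works for lower-is-better metrics (runtime, perplexity) and higher-is-better
    metrics (accuracy) alike, because only the anchor ordering matters.
    """
    if not passed_gate:
        return None  # correctness gate failed: no optimization score recorded

    # Assumption: speed/throughput metrics are interpolated in log space,
    # bounded-quality metrics linearly.
    transform = math.log if log_scale else (lambda x: x)
    v, b, r, o = (transform(x) for x in (value, baseline, reference, optimum))

    if (v - r) * (o - r) <= 0:
        # Value lies on the baseline side of the reference: segment 0.0 .. 0.5.
        score = 0.5 * (v - b) / (r - b)
    else:
        # Value lies between the reference and the optimum (or beyond): 0.5 .. 1.0.
        score = 0.5 + 0.5 * (v - r) / (o - r)
    return min(1.0, max(0.0, score))


# Hypothetical runtime task: baseline 8.0 s, reference 2.0 s, optimum 0.5 s.
print(anchored_score(4.0, 8.0, 2.0, 0.5, log_scale=True))   # about 0.25
print(anchored_score(1.0, 8.0, 2.0, 0.5, log_scale=True))   # about 0.75
# Hypothetical accuracy task: baseline 0.2, reference 0.6, optimum 0.9.
print(anchored_score(0.4, 0.2, 0.6, 0.9, log_scale=False))  # about 0.25
```

With the log scale, each halving of runtime in the hypothetical example earns the same score increment, which is why gains spanning orders of magnitude do not saturate the scale immediately.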

Avg@3: mean of 3 runs · Best@3: best of 3 runs · Dominance: head-to-head win rate
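
A minimal sketch of these three aggregates follows. Avg@3 and Best@3 follow directly from their definitions; the run scores used for them are hypothetical. Dominance is only described as a head-to-head win rate, so the version here is an assumption: the fraction of (opponent, task) comparisons a model wins, with ties counted as half a win. The small score dictionary reuses two CUDA rows from the detail table.

```python
from statistics import mean


def avg_at_k(run_scores: list[float]) -> float:
    """Mean score over the runs of one task (Avg@3 when given 3 runs)."""
    return mean(run_scores)


def best_at_k(run_scores: list[float]) -> float:
    """Best score over the runs of one task (Best@3 when given 3 runs)."""
    return max(run_scores)


def dominance(scores: dict[str, dict[str, float]], model: str) -> float:
    """Assumed definition: share of (opponent, task) pairs that `model` wins.

    Ties count as half a win; the actual AutoLab tie rule is not stated.
    """
    wins = 0.0
    comparisons = 0
    for opponent, opponent_scores in scores.items():
        if opponent == model:
            continue
        for task, own in scores[model].items():
            other = opponent_scores[task]
            wins += 1.0 if own > other else 0.5 if own == other else 0.0
            comparisons += 1
    return wins / comparisons


runs = [0.5, 0.75, 1.0]  # hypothetical scores from three runs of one task
print(avg_at_k(runs), best_at_k(runs))

# Two CUDA tasks for three models, taken from the detail table.
table = {
    "Claude-Opus-4.6": {"huffman_canonical_decode": 0.45, "ntt_butterfly": 0.37},
    "Gemini-3.1-Pro": {"huffman_canonical_decode": 0.38, "ntt_butterfly": 0.00},
    "Kimi-K2.6": {"huffman_canonical_decode": 0.20, "ntt_butterfly": 0.17},
}
for name in table:
    print(name, dominance(table, name))
```

On this subset Claude-Opus-4.6 wins every comparison, while Gemini-3.1-Pro and Kimi-K2.6 each win one of four.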

// tasks

Task Examples

Each task presents a working but slow program. The agent must optimize it as far as possible.

scaling_law

Python · medium · perplexity

Train a language model from scratch on WikiText-103 (118M tokens) using LitGPT to minimize test perplexity within a fixed compute budget on an H100 GPU.

95.0 → 23.0

litgpt, transformer, scaling-law, perplexity, compute-optimal

grpo_multisource

Python · medium · score

Fine-tune Qwen2.5-VL-7B with GRPO on multi-source visual math data (Geometry3K, MathVision, ChartQA) to maximize MathVista accuracy.

0.20 → 0.56

grpo, rl, vision-language, mathvista, qwen2.5-vl, reward-engineering

flash_attention

hard · runtime

Compute scaled dot-product attention for n=4096, d=64, float32. The naive baseline allocates a full 128MB score matrix. Key optimization: Flash Attention tiling with online softmax.

0.75s → 0.10s

attention, tiling, AVX2, online-softmax

concurrent_kv_wal

hard · runtime

Optimize a WAL-backed in-memory key-value store in Go running a deterministic multi-phase workload with 4 concurrent goroutines.

9.5s → 1.1s

go, concurrency, kv-store, wal, lock-contention

radix_sort

medium · runtime

Sort 50 million random 32-bit unsigned integers in C as fast as possible. Replace the stdlib qsort baseline with a 2-pass LSD radix sort.

4.5s → 0.35s

sorting, radix-sort, cache, memory-bandwidth

fft_rust

Rust · medium · runtime

Implement a fast FFT in Rust for a length-32768 real signal. Replace the naive O(n²) DFT with an iterative Cooley-Tukey FFT with precomputed twiddle factors.

10.0s → 0.001s

rust, fft, cooley-tukey, butterfly, simd
