█████╗ ██╗   ██╗████████╗ ██████╗ ██╗      █████╗ ██████╗
██╔══██╗██║   ██║╚══██╔══╝██╔═══██╗██║     ██╔══██╗██╔══██╗
███████║██║   ██║   ██║   ██║   ██║██║     ███████║██████╔╝
██╔══██║██║   ██║   ██║   ██║   ██║██║     ██╔══██║██╔══██╗
██║  ██║╚██████╔╝   ██║   ╚██████╔╝███████╗██║  ██║██████╔╝
╚═╝  ╚═╝ ╚═════╝    ╚═╝    ╚═════╝ ╚══════╝╚═╝  ╚═╝╚═════╝

Can models participate in the experimental loops that drive scientific and engineering progress? AutoLab benchmarks auto research capability across system optimization and LLM development tasks.

Joint research with: Stanford, MIT, UW, UCSD, UCSB, Princeton, Notre Dame, Waterloo, NUS, BakeLab, Google, Nvidia, IBM
// leaderboard

Model Leaderboard

11 frontier models ranked by Avg@3 across 36 tasks. Each bar extends to Best@3. Switch tabs to rank by category.

#   Model              Avg@3
1   Claude-Opus-4.6    0.68
2   Gemini-3.1-Pro     0.50
3   Kimi-K2.6          0.46
4   MiMo-V2.5-Pro      0.45
5   GLM-5              0.43
6   DeepSeek-V4-Pro    0.38
7   GPT-5.4            0.36
8   Grok-4-20          0.35
9   Hunyuan-3-Preview  0.31
10  MiniMax-M2.7       0.27
11  Qwen-3.6-Plus      0.27
Updated May 10, 2026 · Avg@3 = mean over 3 runs · Best@3 = best of 3 runs
// detail scores

Task Detail Scores

Raw scores per task per model. Shaded against the top score overall — greener = higher, pinker = lower.

Task                      Claude     Gemini     Kimi       MiMo       GLM        DeepSeek   GPT        Grok       Hunyuan    MiniMax    Qwen
                          Opus-4.6   3.1-Pro    K2.6       V2.5-Pro   5          V4-Pro     5.4        4-20       3-Preview  M2.7       3.6-Plus

Model Development (7 tasks)
Data Select Ifeval        0.64       0.49       0.48       0.38       0.28       0.46       0.09       0.49       0.23       0.43       0.28
Flux2 Klein Lora          0.48       0.07       0.00       0.17       0.22       0.00       0.28       0.09       0.00       0.29       0.06
Grpo Multisource          0.84       0.79       0.59       0.81       0.82       0.85       0.83       0.00       0.57       0.93       0.84
Llm Online Serving        0.00       0.00       0.00       0.01       0.03       0.04       0.31       0.34       0.00       0.00       0.28
Moving Mnist World Model  0.68       0.32       0.22       0.30       0.30       0.19       0.16       0.02       0.28       0.19       0.18
Multilingual Ocr          0.89       0.69       0.98       0.88       0.88       0.63       0.44       0.00       0.83       0.70       0.45
Scaling Law               0.85       0.14       0.65       0.78       0.61       0.30       0.33       0.00       0.00       0.43       0.63

Puzzle & Challenge (10 tasks)
Adaptive Compression      0.68       0.54       0.37       0.28       0.23       0.31       0.43       0.14       0.00       0.23       0.49
Adversarial Splay         0.61       0.54       0.56       0.51       0.55       0.00       0.36       0.17       0.58       0.35       0.35
Discover Sorting          1.00       0.95       0.90       0.90       0.85       0.57       0.85       0.85       0.25       0.00       0.57
Fredkin Sort Network      0.96       0.31       0.47       0.21       0.60       0.29       0.33       0.62       0.36       0.00       0.00
Resnet Bit Flip           0.63       0.90       0.32       0.92       0.06       0.60       0.32       0.32       0.26       0.35       0.53
Safety Router             0.99       0.99       0.66       0.96       0.64       0.66       0.00       0.31       0.00       0.31       0.66
Smallest Game Player      0.61       0.00       0.00       0.09       0.00       0.00       0.32       0.00       0.00       0.00       0.00
Stack Machine Golf        1.00       1.00       0.67       0.79       0.48       1.00       0.16       0.96       0.38       0.19       0.00
Toy Isa Opt               0.98       1.00       0.00       0.84       0.97       0.59       0.89       0.97       0.65       0.80       0.00
Vliw Scheduler            1.00       0.94       0.84       0.85       0.53       0.31       0.83       0.99       1.00       0.39       0.00

CUDA (4 tasks)
Huffman Canonical Decode  0.45       0.38       0.20       0.11       0.11       0.00       0.00       0.09       0.02       0.00       0.00
Icp Correspondence Step   0.55       0.52       0.51       0.45       0.50       0.00       0.44       0.28       0.15       0.27       0.00
Msm Pippenger Bls12 381   0.16       0.00       0.10       0.00       0.21       0.00       0.10       0.10       0.00       0.00       0.00
Ntt Butterfly             0.37       0.00       0.17       0.15       0.15       0.00       0.00       0.25       0.11       0.22       0.00

System Optimization (15 tasks)
Aes128 Ctr                0.75       0.69       0.70       0.65       0.70       0.37       0.55       0.52       0.67       0.00       0.03
Agent Tool Routing        0.65       0.43       0.37       0.30       0.45       0.25       0.13       0.22       0.15       0.36       0.38
Bm25 Search Go            0.83       0.54       0.64       0.51       0.51       0.60       0.50       0.51       0.32       0.50       0.56
Bvh Raytracer             0.44       0.39       0.41       0.40       0.28       0.37       0.13       0.36       0.37       0.07       0.11
Concurrent Kv Wal         0.96       0.76       0.56       0.47       0.63       0.83       0.53       0.53       0.32       0.28       0.25
Fft Rust                  0.00       0.55       0.39       0.36       0.53       0.57       0.52       0.53       0.37       0.35       0.55
Flash Attention           0.85       0.39       0.65       0.43       0.57       0.33       0.32       0.44       0.48       0.43       0.35
Gaussian Blur             0.71       0.49       0.64       0.54       0.20       0.49       0.34       0.28       0.28       0.09       0.00
Hash Join                 0.81       0.66       0.70       0.66       0.68       0.70       0.61       0.58       0.65       0.58       0.67
Levenshtein Distance      0.58       0.53       0.52       0.08       0.12       0.49       0.47       0.00       0.04       0.08       0.01
Radix Sort                0.71       0.64       0.46       0.67       0.62       0.69       0.54       0.63       0.66       0.22       0.68
Regex Engine              1.00       0.41       0.37       0.02       0.12       0.00       0.19       0.08       0.07       0.08       0.00
Sha256 Throughput         0.46       0.35       0.44       0.26       0.18       0.13       0.14       0.15       0.09       0.14       0.13
Sstable Compaction Rs     0.83       0.27       0.70       0.40       0.61       0.72       0.18       0.59       0.62       0.39       0.52
Z Order Range Scan        0.45       0.20       0.31       0.17       0.41       0.31       0.32       0.26       0.35       0.15       0.10
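
The page does not say how a leaderboard value is derived from these rows. Assuming it is the unweighted mean of the 36 per-task scores, the short check below reproduces the 0.68 listed for Claude-Opus-4.6. The values are copied from that model's column; the variable names are illustrative only.

```python
# Per-task scores for Claude-Opus-4.6, copied from the table above,
# grouped by category in table order.
claude_opus_4_6 = {
    "Model Development": [0.64, 0.48, 0.84, 0.00, 0.68, 0.89, 0.85],
    "Puzzle & Challenge": [0.68, 0.61, 1.00, 0.96, 0.63, 0.99, 0.61, 1.00, 0.98, 1.00],
    "CUDA": [0.45, 0.55, 0.16, 0.37],
    "System Optimization": [0.75, 0.65, 0.83, 0.44, 0.96, 0.00, 0.85, 0.71,
                            0.81, 0.58, 0.71, 1.00, 0.46, 0.83, 0.45],
}

# Flatten all categories into one list of task scores.
all_scores = [s for scores in claude_opus_4_6.values() for s in scores]

# Assumption: the leaderboard figure is the plain mean over all tasks,
# not a mean of per-category means.
overall = sum(all_scores) / len(all_scores)

# Per-category means, as a category tab might rank them (also an assumption).
per_category = {cat: sum(v) / len(v) for cat, v in claude_opus_4_6.items()}

print(len(all_scores), round(overall, 2))
```

A mean of the four category means would weight the 4 CUDA tasks as heavily as the 15 System Optimization tasks and would give a different number, so the two aggregations should not be confused.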
// what we measure

36 Tasks Across Four Categories

Every task is a real engineering or research problem with hours-long horizons, scored on a continuous scale rather than pass/fail. Categories span a spectrum of scope — from a single sharp insight to an end-to-end LLM pipeline.

01
Architectural

System Optimization

15 tasks

Low-level performance engineering of systems primitives — kernels, sorting, hashing, search, compression, regex, and cryptography in C, Rust, Go, and Python.

example tasks
flash_attention, bm25_search_go, concurrent_kv_wal, regex_engine, aes128_ctr
02
Single insight

Puzzle & Challenge

10 tasks

Algorithmic problems built around one key insight — combinatorial reductions, sorting networks, ISA-level scheduling, adversarial constructions, and adaptive coding.

example tasks
discover_sorting, fredkin_sort_network, stack_machine_golf, toy_isa_opt, vliw_scheduler
03
Full pipeline

Model Development

7 tasks

The full LLM pipeline — pretraining scaling laws, RL post-training, SFT data selection, parameter-efficient fine-tuning, world-model training, and online serving.

example tasks
scaling_law, grpo_multisource, data_select_ifeval, multilingual_ocr, llm_online_serving
04
GPU kernels

CUDA

4 tasks

GPU kernel optimization for cryptographic primitives, point-cloud registration, and compression.

example tasks
huffman_canonical_decode, icp_correspondence_step, msm_pippenger_bls12_381, ntt_butterfly
How tasks are scored
Baseline 0.0 · Reference 0.5 · Optimum 1.0

Scores are anchored to a working-but-unoptimized baseline (0.0) and a strong human reference (0.5), approaching 1.0 near the practical optimum. Speed and throughput tasks use a log-stretch (gains span orders of magnitude); bounded-quality tasks like accuracy and perplexity use linear interpolation. A correctness gate must pass before any optimization score is recorded.
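
The anchoring described above can be sketched as a small function. This is a minimal illustration, not AutoLab's scoring code. The page only states the anchors and that speed tasks are stretched in log space, so everything else here is an assumption: the name `anchored_score`, its parameters, the piecewise-linear shape between the three anchors, and recording 0.0 when the correctness gate fails.

```python
import math


def anchored_score(value, baseline, reference, optimum, *, log_scale, passed_gate=True):
    """Map a raw metric onto [0, 1] with baseline -> 0.0, reference -> 0.5, optimum -> 1.0.

    Hypothetical helper. Works for higher-is-better metrics (throughput) and
    lower-is-better metrics (perplexity, latency), as long as the three
    anchors are ordered consistently. With log_scale=True all inputs must be > 0.
    """
    if not passed_gate:
        # Assumption: a failed correctness gate records the baseline score.
        return 0.0

    # Speed/throughput tasks: interpolate in log space, so each order of
    # magnitude of improvement is worth the same score increment.
    transform = math.log if log_scale else (lambda x: x)
    v, b, r, o = (transform(x) for x in (value, baseline, reference, optimum))

    # Position of the value on the baseline -> reference segment.
    toward_reference = (v - b) / (r - b)
    if toward_reference <= 1.0:
        score = 0.5 * toward_reference
    else:
        # Beyond the human reference: interpolate reference -> optimum.
        score = 0.5 + 0.5 * (v - r) / (o - r)

    # Clamp: regressions below baseline score 0.0, results past the optimum 1.0.
    return min(1.0, max(0.0, score))


# Throughput task (log-stretch): baseline 100 ops/s, reference 1000, optimum 10000.
throughput_score = anchored_score(1000.0, 100.0, 1000.0, 10000.0, log_scale=True)

# Accuracy task (linear): baseline 0.60, reference 0.80, optimum 0.90.
accuracy_score = anchored_score(0.70, 0.60, 0.80, 0.90, log_scale=False)
```

Under these assumptions a solution that merely matches the human reference scores 0.5, and the top half of the scale is reserved for going beyond it.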

Avg@3: mean of 3 runs · Best@3: best of 3 runs · Dominance: head-to-head win rate
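
The three aggregates can be sketched in a few lines. Avg@3 and Best@3 follow directly from the definitions above. The page does not define Dominance beyond "head-to-head win rate", so the version below is one plausible reading: the fraction of (opponent, task) comparisons a model wins, with ties counted as half a win. The function names and the three-run example values are hypothetical; the two-task score table reuses the Regex Engine and Fft Rust rows from the detail scores.

```python
from statistics import mean


def avg_at_k(run_scores):
    """Avg@k: mean score over k independent runs of one task."""
    return mean(run_scores)


def best_at_k(run_scores):
    """Best@k: highest score among k independent runs of one task."""
    return max(run_scores)


def dominance(scores, model):
    """Head-to-head win rate of `model` against every other model.

    `scores` maps model name -> {task name -> score}.
    Assumption: ties count as half a win; tasks an opponent lacks are skipped.
    """
    wins, comparisons = 0.0, 0
    for opponent, opponent_scores in scores.items():
        if opponent == model:
            continue
        for task, own in scores[model].items():
            if task not in opponent_scores:
                continue
            other = opponent_scores[task]
            wins += 1.0 if own > other else 0.5 if own == other else 0.0
            comparisons += 1
    return wins / comparisons if comparisons else 0.0


# Hypothetical scores from three runs of a single task.
runs = [0.6, 0.7, 0.8]
print(avg_at_k(runs), best_at_k(runs))

# Two rows from the detail table, three models.
table = {
    "Claude-Opus-4.6": {"regex_engine": 1.00, "fft_rust": 0.00},
    "Gemini-3.1-Pro": {"regex_engine": 0.41, "fft_rust": 0.55},
    "Kimi-K2.6": {"regex_engine": 0.37, "fft_rust": 0.39},
}
print({name: dominance(table, name) for name in table})
```

On this two-task slice the model with the highest single score does not have the highest dominance, which is why a win-rate view can rank models differently from a mean.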
// tasks

Task Examples

Each task presents a working but slow program. The agent must optimize it as far as possible.

// contribute

Add Your Model & Harness

We welcome new models and harnesses. Open a PR, and we'll benchmark your submission on all 36 tasks.
