Model Leaderboard
11 frontier models ranked by Avg@3 across 36 tasks. In the leaderboard chart, each bar extends to Best@3.
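The page does not define Avg@3 or Best@3. A minimal sketch, assuming each task is attempted three times, Avg@3 is the per-task mean of those runs averaged over tasks, and Best@3 is the per-task maximum averaged over tasks. The function names, task names, and scores below are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean


def avg_at_k(task_runs):
    """Average over tasks of the mean score across each task's runs (assumed definition)."""
    return mean(mean(runs) for runs in task_runs.values())


def best_at_k(task_runs):
    """Average over tasks of the best score across each task's runs (assumed definition)."""
    return mean(max(runs) for runs in task_runs.values())


# Hypothetical data: three runs on each of two made-up tasks.
example_runs = {
    "task_a": [0.25, 0.5, 0.75],
    "task_b": [0.0, 0.5, 1.0],
}

avg3 = avg_at_k(example_runs)
best3 = best_at_k(example_runs)
```

Under these assumptions Best@3 can never be lower than Avg@3, which is consistent with the description of a bar extending from one to the other.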
Task Detail Scores
Raw scores per task per model. Shaded against the top score overall — greener = higher, pinker = lower.
| Task | ClaudeOpus-4.6 | Gemini3.1-Pro | KimiK2.6 | MiMoV2.5-Pro | GLM5 | DeepSeekV4-Pro | GPT5.4 | Grok4-20 | Hunyuan3-Preview | MiniMaxM2.7 | Qwen3.6-Plus |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Model Development (7 tasks) | | | | | | | | | | | |
| Data Select Ifeval | 0.64 | 0.49 | 0.48 | 0.38 | 0.28 | 0.46 | 0.09 | 0.49 | 0.23 | 0.43 | 0.28 |
| Flux2 Klein Lora | 0.48 | 0.07 | 0.00 | 0.17 | 0.22 | 0.00 | 0.28 | 0.09 | 0.00 | 0.29 | 0.06 |
| Grpo Multisource | 0.84 | 0.79 | 0.59 | 0.81 | 0.82 | 0.85 | 0.83 | 0.00 | 0.57 | 0.93 | 0.84 |
| Llm Online Serving | 0.00 | 0.00 | 0.00 | 0.01 | 0.03 | 0.04 | 0.31 | 0.34 | 0.00 | 0.00 | 0.28 |
| Moving Mnist World Model | 0.68 | 0.32 | 0.22 | 0.30 | 0.30 | 0.19 | 0.16 | 0.02 | 0.28 | 0.19 | 0.18 |
| Multilingual Ocr | 0.89 | 0.69 | 0.98 | 0.88 | 0.88 | 0.63 | 0.44 | 0.00 | 0.83 | 0.70 | 0.45 |
| Scaling Law | 0.85 | 0.14 | 0.65 | 0.78 | 0.61 | 0.30 | 0.33 | 0.00 | 0.00 | 0.43 | 0.63 |
| Puzzle & Challenge (10 tasks) | | | | | | | | | | | |
| Adaptive Compression | 0.68 | 0.54 | 0.37 | 0.28 | 0.23 | 0.31 | 0.43 | 0.14 | 0.00 | 0.23 | 0.49 |
| Adversarial Splay | 0.61 | 0.54 | 0.56 | 0.51 | 0.55 | 0.00 | 0.36 | 0.17 | 0.58 | 0.35 | 0.35 |
| Discover Sorting | 1.00 | 0.95 | 0.90 | 0.90 | 0.85 | 0.57 | 0.85 | 0.85 | 0.25 | 0.00 | 0.57 |
| Fredkin Sort Network | 0.96 | 0.31 | 0.47 | 0.21 | 0.60 | 0.29 | 0.33 | 0.62 | 0.36 | 0.00 | 0.00 |
| Resnet Bit Flip | 0.63 | 0.90 | 0.32 | 0.92 | 0.06 | 0.60 | 0.32 | 0.32 | 0.26 | 0.35 | 0.53 |
| Safety Router | 0.99 | 0.99 | 0.66 | 0.96 | 0.64 | 0.66 | 0.00 | 0.31 | 0.00 | 0.31 | 0.66 |
| Smallest Game Player | 0.61 | 0.00 | 0.00 | 0.09 | 0.00 | 0.00 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 |
| Stack Machine Golf | 1.00 | 1.00 | 0.67 | 0.79 | 0.48 | 1.00 | 0.16 | 0.96 | 0.38 | 0.19 | 0.00 |
| Toy Isa Opt | 0.98 | 1.00 | 0.00 | 0.84 | 0.97 | 0.59 | 0.89 | 0.97 | 0.65 | 0.80 | 0.00 |
| Vliw Scheduler | 1.00 | 0.94 | 0.84 | 0.85 | 0.53 | 0.31 | 0.83 | 0.99 | 1.00 | 0.39 | 0.00 |
| CUDA (4 tasks) | | | | | | | | | | | |
| Huffman Canonical Decode | 0.45 | 0.38 | 0.20 | 0.11 | 0.11 | 0.00 | 0.00 | 0.09 | 0.02 | 0.00 | 0.00 |
| Icp Correspondence Step | 0.55 | 0.52 | 0.51 | 0.45 | 0.50 | 0.00 | 0.44 | 0.28 | 0.15 | 0.27 | 0.00 |
| Msm Pippenger Bls12 381 | 0.16 | 0.00 | 0.10 | 0.00 | 0.21 | 0.00 | 0.10 | 0.10 | 0.00 | 0.00 | 0.00 |
| Ntt Butterfly | 0.37 | 0.00 | 0.17 | 0.15 | 0.15 | 0.00 | 0.00 | 0.25 | 0.11 | 0.22 | 0.00 |
| System Optimization (15 tasks) | | | | | | | | | | | |
| Aes128 Ctr | 0.75 | 0.69 | 0.70 | 0.65 | 0.70 | 0.37 | 0.55 | 0.52 | 0.67 | 0.00 | 0.03 |
| Agent Tool Routing | 0.65 | 0.43 | 0.37 | 0.30 | 0.45 | 0.25 | 0.13 | 0.22 | 0.15 | 0.36 | 0.38 |
| Bm25 Search Go | 0.83 | 0.54 | 0.64 | 0.51 | 0.51 | 0.60 | 0.50 | 0.51 | 0.32 | 0.50 | 0.56 |
| Bvh Raytracer | 0.44 | 0.39 | 0.41 | 0.40 | 0.28 | 0.37 | 0.13 | 0.36 | 0.37 | 0.07 | 0.11 |
| Concurrent Kv Wal | 0.96 | 0.76 | 0.56 | 0.47 | 0.63 | 0.83 | 0.53 | 0.53 | 0.32 | 0.28 | 0.25 |
| Fft Rust | 0.00 | 0.55 | 0.39 | 0.36 | 0.53 | 0.57 | 0.52 | 0.53 | 0.37 | 0.35 | 0.55 |
| Flash Attention | 0.85 | 0.39 | 0.65 | 0.43 | 0.57 | 0.33 | 0.32 | 0.44 | 0.48 | 0.43 | 0.35 |
| Gaussian Blur | 0.71 | 0.49 | 0.64 | 0.54 | 0.20 | 0.49 | 0.34 | 0.28 | 0.28 | 0.09 | 0.00 |
| Hash Join | 0.81 | 0.66 | 0.70 | 0.66 | 0.68 | 0.70 | 0.61 | 0.58 | 0.65 | 0.58 | 0.67 |
| Levenshtein Distance | 0.58 | 0.53 | 0.52 | 0.08 | 0.12 | 0.49 | 0.47 | 0.00 | 0.04 | 0.08 | 0.01 |
| Radix Sort | 0.71 | 0.64 | 0.46 | 0.67 | 0.62 | 0.69 | 0.54 | 0.63 | 0.66 | 0.22 | 0.68 |
| Regex Engine | 1.00 | 0.41 | 0.37 | 0.02 | 0.12 | 0.00 | 0.19 | 0.08 | 0.07 | 0.08 | 0.00 |
| Sha256 Throughput | 0.46 | 0.35 | 0.44 | 0.26 | 0.18 | 0.13 | 0.14 | 0.15 | 0.09 | 0.14 | 0.13 |
| Sstable Compaction Rs | 0.83 | 0.27 | 0.70 | 0.40 | 0.61 | 0.72 | 0.18 | 0.59 | 0.62 | 0.39 | 0.52 |
| Z Order Range Scan | 0.45 | 0.20 | 0.31 | 0.17 | 0.41 | 0.31 | 0.32 | 0.26 | 0.35 | 0.15 | 0.10 |
36 Tasks Across Four Categories
Every task is a real engineering or research problem with hours-long horizons, scored on a continuous scale rather than pass/fail. Categories span a spectrum of scope — from a single sharp insight to an end-to-end LLM pipeline.
System Optimization
Low-level performance engineering of systems primitives — kernels, sorting, hashing, search, compression, regex, and cryptography in C, Rust, Go, and Python.
Puzzle & Challenge
Algorithmic problems built around one key insight — combinatorial reductions, sorting networks, ISA-level scheduling, adversarial constructions, and adaptive coding.
Model Development
The full LLM pipeline — pretraining scaling laws, RL post-training, SFT data selection, parameter-efficient fine-tuning, world-model training, and online serving.
CUDA
GPU kernel optimization for cryptographic primitives, point-cloud registration, and compression.
Scores are anchored to a working-but-unoptimized baseline (0.0) and a strong human reference (0.5), approaching 1.0 near the practical optimum. Speed and throughput tasks use a log-stretch (gains span orders of magnitude); bounded-quality tasks like accuracy and perplexity use linear interpolation. A correctness gate must pass before any optimization score is recorded.
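The anchoring described above can be sketched as follows. The source gives the anchor values (0.0 at the baseline, 0.5 at the human reference, approaching 1.0 near the practical optimum) but not the formula, so the piecewise interpolation, the clamping to [0, 1], the explicit `optimum` argument, and the choice to return 0.0 when the correctness gate fails are all assumptions made for illustration. `anchored_score` is a hypothetical name.

```python
import math


def anchored_score(value, baseline, reference, optimum, log_scale, passed_gate=True):
    """Map a raw metric onto [0, 1] through three anchors (illustrative sketch only).

    baseline -> 0.0, reference -> 0.5, optimum -> 1.0.
    log_scale=True interpolates in log space (speed/throughput metrics);
    log_scale=False interpolates linearly (bounded metrics such as accuracy).
    Works for higher-is-better and lower-is-better metrics, as long as the
    three anchors are ordered consistently and are pairwise distinct.
    """
    if not passed_gate:
        return 0.0  # assumption: a failed correctness gate is reported as 0.0

    f = math.log if log_scale else float
    v, b, r, o = f(value), f(baseline), f(reference), f(optimum)

    # Position along the baseline -> reference segment (0 at baseline, 1 at reference).
    t = (v - b) / (r - b)
    if t <= 1.0:
        return 0.5 * max(t, 0.0)  # clamp anything worse than baseline to 0.0

    # Position along the reference -> optimum segment, clamped at the optimum.
    u = (v - r) / (o - r)
    return 0.5 + 0.5 * min(u, 1.0)
```

For a hypothetical throughput task with anchors at 1x, 10x, and 100x speedup, a 10x result maps to 0.5 in this sketch, and each further factor of 10 is worth the same score increment, which is the motivation for stretching in log space when gains span orders of magnitude.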
Task Examples
Each task presents a working but slow program. The agent must optimize it as far as possible.
scaling_law
Train a language model from scratch on WikiText-103 (118M tokens) using LitGPT to minimize test perplexity within a fixed compute budget on an H100 GPU.
grpo_multisource
Fine-tune Qwen2.5-VL-7B with GRPO on multi-source visual math data (Geometry3K, MathVision, ChartQA) to maximize MathVista accuracy.
flash_attention
Compute scaled dot-product attention for n=4096, d=64, float32. The naive baseline allocates a full 128MB score matrix. Key optimization: Flash Attention tiling with online softmax.
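The online-softmax idea named above can be illustrated for a single query row. This is a minimal pure-Python sketch, not the task's reference solution: it processes keys and values in tiles and keeps only a running maximum, a running denominator, and a running weighted sum, so the full row of scores is never stored. The function names and the `tile` parameter are hypothetical, and a real implementation would tile over queries as well and run in compiled or GPU code.

```python
import math


def naive_attention_row(q, K, V):
    """Reference: materialize every score for one query, then softmax and mix values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    z = sum(p)
    return [sum(p[j] * V[j][c] for j in range(len(V))) / z for c in range(len(V[0]))]


def tiled_attention_row(q, K, V, tile=2):
    """Same result, computed tile by tile with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")          # running maximum of the scores seen so far
    l = 0.0                    # running softmax denominator, relative to m
    acc = [0.0] * len(V[0])    # running sum of exp(score - m) * value

    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        m_new = max(m, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile m is -inf, so the correction is exp(-inf) = 0.0.
        correction = math.exp(m - m_new)
        p = [math.exp(s - m_new) for s in scores]

        l = l * correction + sum(p)
        acc = [a * correction for a in acc]
        for pj, v in zip(p, v_tile):
            for c in range(len(acc)):
                acc[c] += pj * v[c]
        m = m_new

    return [a / l for a in acc]
```

The two functions agree up to floating-point rounding; the tiled version needs working memory proportional to the tile size rather than to the number of keys.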
concurrent_kv_wal
Optimize a WAL-backed in-memory key-value store in Go running a deterministic multi-phase workload with 4 concurrent goroutines.
radix_sort
Sort 50 million random 32-bit unsigned integers in C as fast as possible. Replace the stdlib qsort baseline with a 2-pass LSD radix sort.
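A 2-pass LSD radix sort on 32-bit keys can be sketched as below, assuming the two passes use 16-bit digits (the source does not state the digit width). The task itself is in C; Python is used here only to show the structure, and `radix_sort_u32` is a hypothetical name.

```python
def radix_sort_u32(values):
    """Sort unsigned 32-bit integers with two stable counting passes (16-bit digits)."""
    src = list(values)
    dst = [0] * len(src)

    for shift in (0, 16):  # low 16 bits first, then high 16 bits
        # Histogram of the current digit.
        counts = [0] * 65536
        for x in src:
            counts[(x >> shift) & 0xFFFF] += 1

        # Exclusive prefix sum: counts[d] becomes the first output index for digit d.
        total = 0
        for d in range(65536):
            counts[d], total = total, total + counts[d]

        # Stable scatter into the destination buffer.
        for x in src:
            d = (x >> shift) & 0xFFFF
            dst[counts[d]] = x
            counts[d] += 1

        src, dst = dst, src  # output of this pass is the input of the next

    return src
```

Stability of each pass is what makes the least-significant-digit-first order correct: the second pass preserves the ordering established by the first among keys that share the same high 16 bits.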
fft_rust
Implement a fast FFT in Rust for a length-32768 real signal. Replace the naive O(n²) DFT with an iterative Cooley-Tukey FFT with precomputed twiddle factors.
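The iterative Cooley-Tukey structure with precomputed twiddle factors can be sketched as follows. This is an illustrative radix-2 version in Python rather than the Rust the task calls for, it treats the input as complex instead of exploiting the real-signal symmetry, and it assumes a power-of-two length (32768 is one). The function names are hypothetical.

```python
import cmath


def naive_dft(x):
    """O(n^2) reference: X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_iterative(x):
    """In-place style radix-2 decimation-in-time FFT, O(n log n)."""
    n = len(x)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1

    # Bit-reversal permutation puts the inputs in the order the butterflies expect.
    a = [0j] * n
    for i in range(n):
        a[bit_reverse(i, bits)] = complex(x[i])

    # Precompute the n/2 twiddle factors once; every stage indexes into this table.
    twiddles = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

    size = 2
    while size <= n:
        half = size // 2
        step = n // size  # stride into the twiddle table for this stage
        for start in range(0, n, size):
            for k in range(half):
                u = a[start + k]
                t = twiddles[k * step] * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
        size *= 2
    return a
```

Precomputing the table replaces a trigonometric evaluation per butterfly with a table lookup, which is the point of the optimization the task description names.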
