
BakeLab Public Benchmark
Last updated February 2026
A large-scale evaluation of frontier AI models on artist-curated artworks spanning fine art, photography, and illustration, benchmarking model judgments against domain-expert evaluations on 400 comparison sets.
BakeLab · UW · UCSB · Stanford · Notre Dame · IBM Research
Leaderboard
Sorted by overall score.
Detailed Scores
| # | Model | Top-1 | TB-1 | Artwork Avg. | Illustration Avg. | Photography Avg. |
|---|---|---|---|---|---|---|
| – | Human-Expert | 77.7 | 68.9 | 74.7 | 54.4 | 72.4 |
| 1 | Claude-Sonnet-4.6 | 40.3 | 26.5 | 34.2 | 19.0 | 23.0 |
| 2 | Claude-Opus-4.6 | 35.5 | 20.0 | 23.6 | 17.0 | 18.0 |
| 3 | Gemini-3.1-Pro | 35.0 | 22.3 | 26.1 | 13.0 | 24.5 |
| 4 | Gemini-3-Pro | 35.0 | 22.0 | 29.8 | 14.0 | 18.7 |
| 5 | Claude-Opus-4.5 | 34.3 | 20.3 | 24.2 | 14.0 | 20.1 |
| 6 | Doubao-Seed-2.0-Pro | 33.3 | 23.5 | 32.9 | 6.0 | 25.2 |
| 7 | GPT-5 | 32.3 | 21.8 | 25.5 | 9.0 | 26.6 |
| 8 | Qwen-3.5-Plus | 30.8 | 19.3 | 19.9 | 13.0 | 23.0 |
| 9 | Qwen-3.5-397B | 29.8 | 17.3 | 17.4 | 8.0 | 23.7 |
| 10 | GPT-4.1 | 29.5 | 21.3 | 25.5 | 15.0 | 20.9 |
| 11 | GPT-5.1 | 29.5 | 20.0 | 29.8 | 11.0 | 15.1 |
| 12 | Gemini-3-Flash | 28.0 | 15.8 | 20.5 | 8.0 | 15.8 |
| 13 | Kimi-K2.5 | 26.8 | 15.0 | 18.0 | 14.0 | 12.2 |
| 14 | o4-mini | 25.0 | 21.5 | 23.0 | 7.0 | 30.2 |
| 15 | Claude-Sonnet-4.5 | 24.3 | 14.5 | 19.3 | 8.0 | 13.7 |
| 16 | GPT-5.2 | 24.0 | 15.5 | 14.9 | 8.0 | 21.6 |
| 17 | Qwen3-VL-235B | 20.5 | 14.0 | 19.3 | 8.0 | 12.2 |
| 18 | Claude-Haiku-4.5 | 18.8 | 12.5 | 19.3 | 5.0 | 10.1 |
| 19 | GLM-4.6V | 17.5 | 11.5 | 18.6 | 8.0 | 5.8 |
| 20 | Grok-4.1-Fast | 15.5 | 12.5 | 17.4 | 11.0 | 7.9 |
* All models with extended thinking capability are evaluated with thinking enabled.
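For clarity, here is a minimal sketch of how the two leaderboard metrics might be computed, assuming Top-1 means the model identifies the expert-preferred best work in a comparison set and TB-1 means it identifies both the best and the worst. The data layout and names below are illustrative, not the benchmark's actual harness:

```python
from dataclasses import dataclass

@dataclass
class SetResult:
    """One model response on one comparison set (hypothetical layout)."""
    predicted_best: str   # id of the work the model ranked highest
    predicted_worst: str  # id of the work the model ranked lowest
    expert_best: str      # expert-consensus best work
    expert_worst: str     # expert-consensus worst work

def top1(results: list[SetResult]) -> float:
    """Percent of sets where the model's best pick matches the experts'."""
    hits = sum(r.predicted_best == r.expert_best for r in results)
    return 100.0 * hits / len(results)

def tb1(results: list[SetResult]) -> float:
    """Percent of sets where both the best and the worst picks match."""
    hits = sum(
        r.predicted_best == r.expert_best and r.predicted_worst == r.expert_worst
        for r in results
    )
    return 100.0 * hits / len(results)
```

Under this reading, TB-1 is strictly harder than Top-1, which matches the table: every model's TB-1 is at or below its Top-1.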
Evaluation Framework
Our benchmark evaluates aesthetic judgment across seven core dimensions, each capturing a distinct facet of visual quality assessment:

- Intentional structure, clear hierarchy, and a frame that organizes attention without awkward clutter or emptiness.
- Clean value separation, a coherent palette, and command over highlights, shadows, and atmospheric mood.
- Confident handling of fundamentals: technique that serves the image, never competing with it.
- Whether the viewer instantly finds what matters: a clear focal point, smooth flow, minimal noise.
- A resolved image with no neglected areas: edges, transitions, and surfaces all feel deliberate.
- The felt impact beyond craft: atmosphere, a captured moment, artistic character. Something that lingers.
- The overall call: which work holds together better, and where the other falls short.
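One way to make the rubric concrete is to record each expert or model judgment per dimension. The sketch below is our own illustration; the dimension keys are placeholder names for the seven facets above, not the benchmark's published labels:

```python
from typing import Literal, TypedDict

# Placeholder keys for the seven dimensions described above.
Dimension = Literal[
    "composition", "color_and_light", "technique",
    "readability", "finish", "emotional_impact", "overall",
]

class PairJudgment(TypedDict):
    """A single rater's (or model's) verdict on one matched pair."""
    pair_id: str
    winner: Literal["A", "B"]                        # the overall call
    per_dimension: dict[Dimension, Literal["A", "B", "tie"]]
    rationale: str                                   # where the other work falls short
```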
How It Works
1. Original work is produced as matched sets: intent stays constant, only execution varies. (1,000+ creators · 2,000+ hours)
2. 100+ expert evaluators across domains rate the work; only pairs where judgments converge are kept. (10 experts per pair · 13,000+ ratings)
3. Models pick the best, the worst, and both, so they must produce a coherent aesthetic ordering, not just a single-image preference.
4. Trials are repeated with presentation order reshuffled; both consistency and average accuracy are reported across domains (see the sketch below).
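Here is a minimal sketch of how consistency over reshuffled, repeated trials might be computed, assuming each comparison set is presented several times with the image order permuted and we measure how often the model agrees with its own most common pick. The data layout is assumed, not taken from the benchmark:

```python
from collections import Counter

def consistency(repeats: dict[str, list[str]]) -> float:
    """Average self-agreement of a model across repeated trials.

    `repeats` maps a comparison-set id to the model's best pick on each
    of its reshuffled presentations (hypothetical data layout).
    """
    per_set = []
    for picks in repeats.values():
        # share of trials agreeing with the modal (most common) pick
        top_count = Counter(picks).most_common(1)[0][1]
        per_set.append(top_count / len(picks))
    return 100.0 * sum(per_set) / len(per_set)

# Example: a model asked three times on each of two sets.
print(consistency({
    "set_001": ["img_a", "img_a", "img_b"],  # 2/3 consistent
    "set_002": ["img_c", "img_c", "img_c"],  # fully consistent
}))  # -> 83.33...
```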
Have a frontier model you'd like to evaluate? Submit it for inclusion in the next benchmark round.
Get in Touch