
BakeLab Public Benchmark
Last updated February 2026
A large-scale evaluation of frontier AI models on artist-curated artworks spanning fine art, photography, and illustration, benchmarking model judgments against domain-expert evaluations on 400 comparison sets.
BakeLab · UW · UCSB · Stanford · Notre Dame · IBM Research
Leaderboard
Sorted by overall score.
Detailed Scores
| # | Model | Top-1 | TB-1 | Artwork Avg. | Illustration Avg. | Photography Avg. |
|---|---|---|---|---|---|---|
| – | Human-Expert | 77.7 | 68.9 | 74.7 | 54.4 | 72.4 |
| 1 | Claude-Sonnet-4.6 | 40.3 | 26.5 | 34.2 | 19.0 | 23.0 |
| 2 | Claude-Opus-4.6 | 35.5 | 20.0 | 23.6 | 17.0 | 18.0 |
| 3 | Gemini-3.1-Pro | 35.0 | 22.3 | 26.1 | 13.0 | 24.5 |
| 4 | Gemini-3-Pro | 35.0 | 22.0 | 29.8 | 14.0 | 18.7 |
| 5 | Claude-Opus-4.5 | 34.3 | 20.3 | 24.2 | 14.0 | 20.1 |
| 6 | Doubao-Seed-2.0-Pro | 33.3 | 23.5 | 32.9 | 6.0 | 25.2 |
| 7 | GPT-5 | 32.3 | 21.8 | 25.5 | 9.0 | 26.6 |
| 8 | Qwen-3.5-Plus | 30.8 | 19.3 | 19.9 | 13.0 | 23.0 |
| 9 | Qwen-3.5-397B | 29.8 | 17.3 | 17.4 | 8.0 | 23.7 |
| 10 | GPT-4.1 | 29.5 | 21.3 | 25.5 | 15.0 | 20.9 |
| 11 | GPT-5.1 | 29.5 | 20.0 | 29.8 | 11.0 | 15.1 |
| 12 | Gemini-3-Flash | 28.0 | 15.8 | 20.5 | 8.0 | 15.8 |
| 13 | Kimi-K2.5 | 26.8 | 15.0 | 18.0 | 14.0 | 12.2 |
| 14 | o4-mini | 25.0 | 21.5 | 23.0 | 7.0 | 30.2 |
| 15 | Claude-Sonnet-4.5 | 24.3 | 14.5 | 19.3 | 8.0 | 13.7 |
| 16 | GPT-5.2 | 24.0 | 15.5 | 14.9 | 8.0 | 21.6 |
| 17 | Qwen3-VL-235B | 20.5 | 14.0 | 19.3 | 8.0 | 12.2 |
| 18 | Claude-Haiku-4.5 | 18.8 | 12.5 | 19.3 | 5.0 | 10.1 |
| 19 | GLM-4.6V | 17.5 | 11.5 | 18.6 | 8.0 | 5.8 |
| 20 | Grok-4.1-Fast | 15.5 | 12.5 | 17.4 | 11.0 | 7.9 |
* All models with extended thinking capability are evaluated with thinking enabled.
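For clarity, here is a minimal sketch of how the two leaderboard metrics might be computed, assuming Top-1 means the model identifies the expert-preferred best work in a comparison set and TB-1 means it identifies both the best and the worst. The data layout and names below are illustrative, not the benchmark's actual harness:

```python
from dataclasses import dataclass

@dataclass
class SetResult:
    """One model response on one comparison set (hypothetical layout)."""
    predicted_best: str   # id of the work the model ranked highest
    predicted_worst: str  # id of the work the model ranked lowest
    expert_best: str      # expert-consensus best work
    expert_worst: str     # expert-consensus worst work

def top1(results: list[SetResult]) -> float:
    """Percent of sets where the model's best pick matches the experts'."""
    hits = sum(r.predicted_best == r.expert_best for r in results)
    return 100.0 * hits / len(results)

def tb1(results: list[SetResult]) -> float:
    """Percent of sets where both the best and the worst picks match."""
    hits = sum(
        r.predicted_best == r.expert_best and r.predicted_worst == r.expert_worst
        for r in results
    )
    return 100.0 * hits / len(results)
```

Under this reading, TB-1 is strictly harder than Top-1, which matches the table: every model's TB-1 is at or below its Top-1.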
Evaluation Framework
Our benchmark evaluates aesthetic judgment across seven core dimensions, each capturing a distinct facet of visual quality assessment:

- Intentional structure, clear hierarchy, and a frame that organizes attention without awkward clutter or emptiness.
- Clean value separation, a coherent palette, and command over highlights, shadows, and atmospheric mood.
- Confident handling of fundamentals: technique that serves the image, never competing with it.
- Whether the viewer instantly finds what matters: a clear focal point, smooth flow, minimal noise.
- A resolved image with no neglected areas: edges, transitions, and surfaces all feel deliberate.
- The felt impact beyond craft: atmosphere, a captured moment, artistic character. Something that lingers.
- The overall call: which work holds together better, and where the other falls short.
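One way to make the rubric concrete is to record each expert or model judgment per dimension. The sketch below is our own illustration; the dimension keys are placeholder names for the seven facets above, not the benchmark's published labels:

```python
from typing import Literal, TypedDict

# Placeholder keys for the seven dimensions described above.
Dimension = Literal[
    "composition", "color_and_light", "technique",
    "readability", "finish", "emotional_impact", "overall",
]

class PairJudgment(TypedDict):
    """A single rater's (or model's) verdict on one matched pair."""
    pair_id: str
    winner: Literal["A", "B"]                        # the overall call
    per_dimension: dict[Dimension, Literal["A", "B", "tie"]]
    rationale: str                                   # where the other work falls short
```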
How It Works
1. Original work is produced as matched sets: intent stays constant, only execution varies. (1,000+ creators · 2,000+ hours)
2. 100+ expert evaluators across domains rate the work; only pairs where judgments converge are kept. (10 experts per pair · 13,000+ ratings)
3. Models pick the best, the worst, and both, so they must produce a coherent aesthetic ordering, not just a single-image preference.
4. Trials are repeated with presentation order reshuffled; both consistency and average accuracy are reported across domains (see the sketch below).
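Here is a minimal sketch of how consistency over reshuffled, repeated trials might be computed, assuming each comparison set is presented several times with the image order permuted and we measure how often the model agrees with its own most common pick. The data layout is assumed, not taken from the benchmark:

```python
from collections import Counter

def consistency(repeats: dict[str, list[str]]) -> float:
    """Average self-agreement of a model across repeated trials.

    `repeats` maps a comparison-set id to the model's best pick on each
    of its reshuffled presentations (hypothetical data layout).
    """
    per_set = []
    for picks in repeats.values():
        # share of trials agreeing with the modal (most common) pick
        top_count = Counter(picks).most_common(1)[0][1]
        per_set.append(top_count / len(picks))
    return 100.0 * sum(per_set) / len(per_set)

# Example: a model asked three times on each of two sets.
print(consistency({
    "set_001": ["img_a", "img_a", "img_b"],  # 2/3 consistent
    "set_002": ["img_c", "img_c", "img_c"],  # fully consistent
}))  # -> 83.33...
```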
Have a frontier model you'd like to evaluate? Submit it for inclusion in the next benchmark round.
Get in Touch