SkillsBench: Benchmarking How Well Skills Work Across Diverse Tasks

The first evaluation framework that measures how well skills work, and the first dataset that measures how effectively models use skills on expert-curated tasks across diverse, high-GDP-value domains.


How SkillsBench Works

SkillsBench evaluates AI agents across three abstraction layers, mirroring how traditional computing systems are structured.

Skills (analogous to Applications)

Agent Harness (analogous to Operating Systems)

Models (analogous to CPUs)

Skills Layer

Domain-specific capabilities and workflows that extend agent functionality. Like applications on an OS, skills provide specialized knowledge and tools for particular tasks.

Agent Harness Layer

The execution environment that orchestrates agents, manages tool access, and handles I/O. Analogous to an operating system that mediates between applications and hardware.

Models Layer

The foundational AI models that power reasoning and generation. Like CPUs, they provide the raw computational capability that upper layers build upon.

Agent Performance

Resolution rate vs. mean agent wall-clock time per task (log scale, faster to the right). The no-Skills counterparts are ghosted for context, and dashed lines mark fleet means. OpenHands is the default baseline harness; configs run under another harness are labelled with it.

Capability Over Time

SkillsBench resolution rate vs. model release date, with one dot per model–harness config. Newer models trend up and to the right. Release months are approximate editorial estimates (paper-reported where available). The dashed line marks the fleet mean; the emerald line is a least-squares fit. OpenHands is the default baseline harness; configs run under another harness are labelled with it.
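For readers who want to reproduce a trend line like the least-squares fit mentioned above, the following is a minimal sketch of an ordinary least-squares line. The page does not say how its fit is computed. The function name `least_squares_line`, the helper `months_since`, and its origin month are assumptions for illustration. The demo points are synthetic and exactly linear; they are not benchmark data.

```python
from statistics import fmean


def months_since(year, month, origin=(2025, 1)):
    """Convert a release month to a number (months after an arbitrary origin).

    The origin is an assumption; any fixed month works for fitting a line.
    """
    return (year - origin[0]) * 12 + (month - origin[1])


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Sum of squared deviations in x, and of co-deviations in x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic check: points generated from y = 2x + 30 (NOT leaderboard data).
demo_x = [months_since(2025, m) for m in (1, 4, 7, 10)]  # 0, 3, 6, 9
demo_y = [30.0, 36.0, 42.0, 48.0]
slope, intercept = least_squares_line(demo_x, demo_y)
```

With real data, `xs` would be release months converted by `months_since` and `ys` the with-Skills resolution rates; the slope then reads as percentage points gained per month.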

Agent Leaderboard

Resolution rates across 24 agent–model configurations on SkillsBench (87 tasks, up to 3 trials per task).

dataset: skillsbench@1.1 (v1.1, 87 tasks, registry.json) · recomputed 2026-06-16


#	Model	Harness	Without Skills	With Skills	Δ (pp)	Gain (g)
1	GPT-5.5	OpenHands	51.5%	67.3%	+15.8	32.6%
2	GPT-5.5	Codex	46.8%	66.5%	+19.7	37.0%
3	Opus 4.7	Claude Code	43.0%	61.2%	+18.2	31.9%
4	Gemini 3.1 Pro	Gemini CLI	36.0%	60.8%	+24.8	38.7%
5	GLM 5.1	OpenHands	32.7%	58.4%	+25.7	38.1%
6	Gemini 3 Flash	Gemini CLI	34.2%	54.6%	+20.4	31.0%
7	Opus 4.8	OpenHands	45.7%	54.1%	+8.4	15.5%
8	Kimi K2.6	OpenHands	33.4%	54.0%	+20.6	31.0%
9	Opus 4.7	OpenHands	42.1%	53.1%	+11.1	19.1%
10	MiniMax M3	OpenHands	29.7%	53.0%	+23.3	33.2%
11	Gemini 3.1 Pro	OpenHands	33.8%	52.8%	+19.0	28.7%
12	GPT-5.2	Codex	29.7%	51.7%	+22.0	31.3%
13	Opus 4.6	Claude Code	33.7%	50.2%	+16.5	25.0%
14	DeepSeek V4 Pro	OpenHands	26.9%	50.1%	+23.2	31.8%
15	Opus 4.5	Claude Code	23.8%	49.0%	+25.2	33.1%
16	Gemini 3.5 Flash	OpenHands	41.1%	48.2%	+7.1	12.1%
17	Sonnet 4.6	OpenHands	33.5%	47.2%	+13.6	20.5%
18	DeepSeek V4 Flash	OpenHands	27.5%	44.7%	+17.2	23.7%
19	Grok 4.3	OpenHands	22.8%	41.7%	+18.8	24.4%
20	GPT-5.4 Mini	OpenHands	29.9%	41.4%	+11.5	16.4%
21	Sonnet 4.5	Claude Code	16.7%	36.2%	+19.5	23.4%
22	MiniMax M2.7	OpenHands	18.1%	34.9%	+16.8	20.5%
23	Haiku 4.5	Claude Code	8.8%	30.1%	+21.3	23.4%
24	Gemini 3.1 Flash Lite	OpenHands	16.0%	20.1%	+4.1	4.9%
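The page does not spell out how the Gain (g) column is derived. The published values appear consistent with the standard normalized-gain formula, g = (with − without) / (100 − without): the share of the remaining headroom that Skills recover. The sketch below assumes that formula and checks it against three rows copied from the table. The function name `normalized_gain` is illustrative. A small tolerance is used because the table's inputs are themselves rounded to one decimal.

```python
def normalized_gain(without_pct, with_pct):
    """Normalized gain in percent: the lift divided by the available headroom.

    ASSUMPTION: this formula is inferred from the table, not stated on the page.
    """
    if not 0 <= without_pct < 100:
        raise ValueError("baseline must be in [0, 100) so headroom is positive")
    return 100.0 * (with_pct - without_pct) / (100.0 - without_pct)


# (without %, with Skills %, published g %) copied from leaderboard rows 1, 2, 24.
rows = {
    "GPT-5.5 / OpenHands": (51.5, 67.3, 32.6),
    "GPT-5.5 / Codex": (46.8, 66.5, 37.0),
    "Gemini 3.1 Flash Lite / OpenHands": (16.0, 20.1, 4.9),
}

for name, (without, with_skills, published) in rows.items():
    g = normalized_gain(without, with_skills)
    delta_pp = with_skills - without  # raw lift in percentage points
    # Inputs are rounded to 0.1, so allow a small discrepancy.
    agrees = abs(g - published) < 0.15
    print(f"{name}: delta = {delta_pp:+.1f} pp, g = {g:.1f}% (matches: {agrees})")
```

Normalizing by headroom is what lets a configuration with a high baseline and a modest raw lift be compared with one that starts low and gains many points.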

Gain (g) is the normalized gain; per-row confidence intervals appear in the full leaderboard.

skillsbench@1.1 · 87 tasks · up to 3 trials per task · 95% CIs · series: Without, With Skills

Providers: OpenAI, Anthropic, Google, Z.ai, Moonshot, MiniMax, DeepSeek, xAI

Professional-Domain Profile

Resolution rate across the eight professional domains of the 87-task taxonomy, shown as a radar chart comparing up to 4 agents (rings at 20–100%, with Skills; solid = without Skills, pale = Skill lift).

Software Engineering (16 tasks), without → with Skills:

GPT-5.5 (OpenHands): 52.6 → 63.4%

Opus 4.8 (OpenHands): 55.1 → 65.9%

Gemini 3.5 Flash (OpenHands): 46.4 → 50.9%

Task Registry

3d-scan-calc

Domain: industrial-physical-systems · Difficulty: hard

You need to calculate the mass of a 3D printed part. The input (`/root/scan_data.stl`) is a binary STL, but the 2-byte "Attribute Byte Count" at the end of each triangle record is being used to store the **Material ID** of the object.
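To make the task description concrete, here is a hedged sketch of reading a binary STL and grouping signed volume by the 2-byte attribute field. It relies on the standard binary STL layout: an 80-byte header, a little-endian uint32 triangle count, then 50-byte records of twelve float32 values plus one uint16. Everything else is an assumption, because the excerpt does not give the task's density table, units, or expected answer. That includes the function names, the density mapping, and the synthetic tetrahedron used as a check. Per-material volumes are only meaningful if each material's triangles form a closed, consistently oriented surface.

```python
import struct
from collections import defaultdict

HEADER_BYTES = 80
TRIANGLE = struct.Struct("<12fH")  # normal + 3 vertices (float32), then uint16


def volume_by_material(data):
    """Sum signed tetrahedron volumes per material ID (the attribute field)."""
    (count,) = struct.unpack_from("<I", data, HEADER_BYTES)
    offset = HEADER_BYTES + 4
    if len(data) < offset + count * TRIANGLE.size:
        raise ValueError("file is shorter than its triangle count implies")
    volumes = defaultdict(float)
    for _ in range(count):
        *floats, material_id = TRIANGLE.unpack_from(data, offset)
        offset += TRIANGLE.size
        ax, ay, az, bx, by, bz, cx, cy, cz = floats[3:]  # skip the normal
        # Signed volume of tetrahedron (origin, a, b, c) = a . (b x c) / 6.
        volumes[material_id] += (
            ax * (by * cz - bz * cy)
            + ay * (bz * cx - bx * cz)
            + az * (bx * cy - by * cx)
        ) / 6.0
    return {m: abs(v) for m, v in volumes.items()}


def mass(data, density_by_material):
    """Mass = sum of volume x density; the density table is HYPOTHETICAL."""
    return sum(density_by_material[m] * v
               for m, v in volume_by_material(data).items())


# Synthetic check: a tetrahedron of volume 1/6, offset from the origin,
# with every triangle tagged as (made-up) material 7.
o, a, b, c = (1, 1, 1), (2, 1, 1), (1, 2, 1), (1, 1, 2)
faces = [(o, b, a), (o, a, c), (o, c, b), (a, b, c)]  # outward-facing winding
demo_stl = bytes(HEADER_BYTES) + struct.pack("<I", len(faces)) + b"".join(
    TRIANGLE.pack(0, 0, 0, *p, *q, *r, 7) for p, q, r in faces
)
demo_volume = volume_by_material(demo_stl)[7]
demo_mass = mass(demo_stl, {7: 1.25})  # 1.25 is a placeholder density
```

The signed-volume sum is translation-invariant for a closed mesh, which is why the offset tetrahedron still comes out at 1/6. The real task file at `/root/scan_data.stl` may need extra handling, such as noise filtering or unit conversion, that is not described in this excerpt.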


3d-scan-calc

ada-bathroom-plan-repair

adaptive-cruise-control

azure-bgp-oscillation-route-leak

bike-rebalance

citation-check

civ6-adjacency-optimizer

court-form-filling

crystallographic-wyckoff-position-analysis

dapt-intrusion-detection

data-to-d3

debug-trl-grpo

dialogue-parser

drone-planning-control

dynamic-object-aware-egomotion

earthquake-phase-association

earthquake-plate-calculation

econ-detrending-correlation

edit-pdf

energy-ac-optimal-power-flow