Several benchmarks have emerged to measure how well LLMs perform at AI-assisted software engineering. Android developers face specific challenges that existing benchmarks don't cover, so we created one built around a single goal: high-quality Android development.

Android LLM Leaderboard

Model                    Score (%)   CI range (%)   Date
Gemini 3.1 Pro Preview   72.4        65.3 - 79.8    2026-03-04
Claude Opus 4.6          66.6        58.9 - 73.9    2026-03-04
GPT-5.2-Codex            62.5        54.7 - 70.3    2026-03-04
Claude Opus 4.5          61.9        53.9 - 69.6    2026-03-04
Gemini 3 Pro Preview     60.4        52.6 - 67.8    2026-03-04
Claude Sonnet 4.6        58.4        51.1 - 66.6    2026-03-04
Claude Sonnet 4.5        54.2        45.5 - 62.4    2026-03-04
Gemini 3 Flash Preview   42.0        36.3 - 47.9    2026-03-04
Gemini 2.5 Flash         16.1        10.9 - 21.9    2026-03-04
Latest results as of March 5, 2026. Check back periodically for updates!
Score is the average percentage of 100 test cases successfully resolved across 10 runs for each model.
The confidence interval (CI) is the range within which a model's score is expected to fall, and it indicates the statistical reliability of the result (p-value < 0.05, which corresponds to a 95% confidence level).
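As a rough illustration of how a score and its interval might be derived from raw pass/fail results, here is a minimal Python sketch. The page does not say how the CI is actually computed, so the normal approximation over per-task pass rates used here is an assumption, and the function name `score_and_ci` and the `results` layout are hypothetical. It is not expected to reproduce the published intervals.

```python
import math
from statistics import mean, stdev


def score_and_ci(results, z=1.96):
    """Return (score, ci_low, ci_high) as percentages.

    results[r][t] is True if task t was resolved in run r.
    ASSUMPTION: the CI is a normal approximation over per-task pass
    rates; the benchmark's real method is not stated in the source.
    """
    n_runs = len(results)
    n_tasks = len(results[0])

    # Pass rate of each task, averaged over all runs.
    per_task = [sum(run[t] for run in results) / n_runs for t in range(n_tasks)]

    # With equal-sized runs, this equals the mean of the per-run percentages.
    score = mean(per_task)

    # Standard error of the mean across tasks; z=1.96 gives a 95% interval.
    se = stdev(per_task) / math.sqrt(n_tasks)
    low = max(0.0, score - z * se)
    high = min(1.0, score + z * se)
    return 100 * score, 100 * low, 100 * high


# Toy data: 2 runs of 4 tasks (the benchmark itself uses 10 runs of 100 tasks).
toy_results = [
    [True, True, False, False],
    [True, False, False, True],
]
toy_score, toy_low, toy_high = score_and_ci(toy_results)
print(f"score={toy_score:.1f}%  CI=({toy_low:.1f}%, {toy_high:.1f}%)")
```

Averaging each task over runs before computing the spread treats the task, not the individual attempt, as the unit of sampling. A bootstrap over tasks would be a reasonable alternative under the same assumption.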

Learn more about Android Bench

Learn more about how we created a set of common Android developer tasks.
Many of the tasks are based on our definition of high-quality Android development, which is detailed in our developer documentation.
See the full repo so you can replicate the tests yourself.