ARC Prize (@arcprize) / X

ARC Prize

@arcprize

A North Star for open AGI. Co-founders: @fchollet @mikeknoop. President: @gregkamradt. We're hiring mission-driven builders: arcprize.org/jobs

Earth

Joined March 2024

Pinned
ARC Prize
@arcprize
Mar 25
Announcing ARC-AGI-3
The only unsaturated agentic intelligence benchmark in the world
Humans score 100%, AI <1%
This human-AI gap demonstrates we do not yet have AGI
Most benchmarks test what models already know; ARC-AGI-3 tests how they learn
ARC Prize
@arcprize
Jul 10, 2025
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%
This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA
ARC Prize
@arcprize
Dec 20, 2024
New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4
ARC Prize
@arcprize
Jun 11, 2025
After the o3 price reduction, we retested the o3-2025-04-16 model on ARC-AGI to determine whether its performance had changed. We compared the retest results with the original results and observed no difference in performance.
ARC Prize
@arcprize
Mar 24, 2025
Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems (same relative ease for humans).
Grand Prize: 85%, ~$0.42/task efficiency
Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%
ARC Prize
@arcprize
Sep 16, 2025
New SOTA on ARC-AGI
- V1: 79.6%, $8.42/task
- V2: 29.4%, $30.40/task
Custom submissions by @jeremyberman and @_eric_pang_ are now the best known solutions to ARC-AGI
Both:
* Are open source
* Use Grok 4
* Implement program-synthesis outer loops with test-time adaptation
ARC Prize
@arcprize
Jul 18, 2025
Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI
We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API
Starting scores - Frontier AI: 0%, Humans: 100%
ARC Prize
@arcprize
Aug 15, 2025
Analyzing the Hierarchical Reasoning Model by @makingAGI
We verified scores on hidden tasks, ran ablations, and found that performance comes from an unexpected source
ARC-AGI Semi-Private Scores:
* ARC-AGI-1: 32%
* ARC-AGI-2: 2%
Our 4 findings:
ARC Prize
@arcprize
Oct 9, 2025
New ARC-AGI SOTA: GPT-5 Pro
- ARC-AGI-1: 70.2%, $4.78/task
- ARC-AGI-2: 18.3%, $7.41/task
@OpenAI’s GPT-5 Pro now holds the highest verified frontier LLM score on ARC-AGI’s Semi-Private benchmark
ARC Prize
@arcprize
Apr 16, 2025
Clarifying o3’s ARC-AGI Performance
OpenAI has confirmed:
* The released o3 is a different model from what we tested in December 2024
* All released o3 compute tiers are smaller than the version we tested
* The released o3 was not trained on ARC-AGI data, not even the train
ARC Prize
@arcprize
Jan 21, 2025
Verified DeepSeek performance on ARC-AGI's Public Eval (400 tasks) + Semi-Private (100 tasks)
DeepSeek V3:
* Semi-Private: 7.3% ($0.002)
* Public Eval: 14% ($0.002)
DeepSeek Reasoner:
* Semi-Private: 15.8% ($0.06)
* Public Eval: 20.5% ($0.05)
(Avg $ per task)
ARC Prize
@arcprize
Feb 14, 2025
Introducing SnakeBench, an experimental benchmark side quest
We made 50 LLMs battle each other in head-to-head snake 🐍
2.8K matches showed which models are the best at snake real-time strategy and spatial reasoning
Here’s the top match between o3-mini and DeepSeek-R1 🧵
ARC Prize
@arcprize
Oct 21, 2025
Grok-4 (Fast Reasoning) on ARC-AGI Semi-Private Eval
- ARC-AGI-1: 48.5%, $0.03/task
- ARC-AGI-2: 5.3%, $0.06/task
@xai pushes the frontier of performance efficiency on ARC-AGI
ARC Prize
@arcprize
Mar 27, 2025
Gemini-2.5-Pro Experimental Preview Results
ARC-AGI-1
* Public Eval: 24.3%
* Semi-Private: 12.5%
ARC-AGI-2
* Public Eval: 0.8%
* Semi-Private: 1.3%
These results are on par with DeepSeek's R1