Artificial Analysis (@ArtificialAnlys) / X

Independent analysis of AI

San Francisco

artificialanalysis.ai

Joined January 2024

Pinned
Thanks for the support @AndrewYNg! Completely agree, faster token generation will become increasingly important as a greater proportion of output tokens are consumed by models, such as in multi-step agentic workflows, rather than being read by people.
Andrew Ng
@AndrewYNg
Jul 6, 2024
Shoutout to the team that built artificialanalysis.ai. Really neat site that benchmarks the speed of different LLM API providers to help developers pick which models to use. This nicely complements the LMSYS Chatbot Arena, Hugging Face open LLM leaderboards and Stanford's HELM
480K
xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude
3.4M
DeepSeek takes the lead: DeepSeek V3-0324 is now the highest scoring non-reasoning model. This is the first time an open weights model is the leading non-reasoning model, a milestone for open source. DeepSeek V3-0324 has jumped forward 7 points in Artificial Analysis
487K
May 29, 2025
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader. DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently
621K
Sep 19, 2025
xAI has released Grok 4 Fast - breaking through our intelligence vs cost frontier by achieving Gemini 2.5 Pro level intelligence at a ~25X cheaper cost. Intelligence: @xai shared with us pre-release access to Grok 4 Fast. In reasoning mode, the model scores an impressive 60 on
731K
Jul 17, 2025
🇰🇷 South Korean AI Lab Upstage AI has just launched their first reasoning model - Solar Pro 2! The 31B parameter model demonstrates impressive performance for its size, with intelligence approaching Claude 4 Sonnet in 'Thinking' mode, and is priced very competitively. Key details:
7.5M
MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the Tau2 Bench Telecom agentic benchmark and is potentially the new leading open weights model. Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters
1.4M
Jan 23, 2025
DeepSeek’s first reasoning model has arrived - over 25x cheaper than OpenAI’s o1. Highlights from our initial benchmarking of DeepSeek R1: ➤ Trades blows with OpenAI’s o1 across our eval suite to score the second highest in Artificial Analysis Quality Index ever ➤ Priced on
1.5M
Jul 18, 2024
GPT-4o Mini, announced today, is very impressive for how cheaply it is being offered 👀 With an MMLU score of 82% (reported by TechCrunch), it surpasses the quality of other smaller models including Gemini 1.5 Flash (79%) and Claude 3 Haiku (75%). What is particularly exciting is
1.1M
Independent benchmarks of OpenAI’s gpt-oss models: gpt-oss-120b is the most intelligent American open weights model; it comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits. OpenAI has released two versions of gpt-oss: ➤ gpt-oss-120b (116.8B total
173K
Today’s GPT-4o update is actually big - it leapfrogs Claude 3.7 Sonnet (non-reasoning) and Gemini 2.0 Flash in our Intelligence Index and is now the leading non-reasoning model for coding. This makes GPT-4o the second highest scoring non-reasoning model (excludes o3-mini, Gemini
133K
May 29, 2025
DeepSeek's R1 update consolidates the lead of 🇨🇳 Chinese AI Labs in open weights intelligence
74K
Kimi K2 Thinking is the new leading open weights model: it demonstrates particular strength in agentic contexts but is very verbose, generating the most tokens of any model in completing our Intelligence Index evals. @Kimi_Moonshot's Kimi K2 Thinking achieves a 67 in the
113K
Wait - is the new GPT-4o a smaller and less intelligent model? We have completed running our independent evals on OpenAI’s GPT-4o release yesterday and are consistently measuring materially lower eval scores than the August release of GPT-4o. GPT-4o (Nov) vs GPT-4o (Aug): ➤
189K