Arena.ai (@arena) / X

3,354 posts

Arena.ai

@arena

Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring → arena.ai/jobs

Joined March 2023

Pinned
Arena.ai
@arena
Jun 4
Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image
274K
Arena.ai
@arena
Jun 23
Cosmos 3 Super by @nvidia ranks #8 and #11 among open models in Text-to-Image Arena (#49 and #54 overall).
- #8 Cosmos-3-Super-Text2Image is on par with Flux-2-Klein-9B and Qwen Image Prompt Extend
- #11 Cosmos-3-Super-Text2Image (Agentic) is on par with models
21K
Arena.ai
@arena
Jun 23
Dig into the Text-to-Image Arena leaderboard details at:
Text-to-Image Leaderboard - Best AI Image Generators
From arena.ai
4.6K
Arena.ai
@arena
Jun 22
Arena's leaderboard isn't a static benchmark, it's a living one. Every ranking is driven by real-world tasks from a global community of users, refreshed continuously as new prompts and models arrive. So how does it all work? The team breaks down the full model lifecycle:
13K
Arena.ai
@arena
Jun 22
Read more about Arena's AI evaluations, which we launched back in September (under our former name, LMArena):
AI Evaluations at LMArena
From arena.ai
5.4K
Arena.ai
@arena
Jun 22
Learn more about our latest causal tracing methodology for the newest Agent Arena leaderboard:
Agent Arena: Causal Evaluation of Agents in the Real World
From arena.ai
3.7K
Arena.ai
@arena
Jun 22
Seed 2.1 Pro Preview ranks #8 in Code Arena: Frontend, scoring 1539, on par with Opus 4.6. It performs strongly on React apps and lands in the top 10 for five of seven subcategories. In those areas, only a handful of frontier labs rank above it: @Zai_org's GLM-5.2 and
137K
Arena.ai
@arena
Jun 22
Learn more about Seed 2.1 Pro Preview from ByteDance: seed.bytedance.com/en/blog/seed-2…
5.1K
Arena.ai
@arena
Jun 22
Replying to @arena
See the Code Arena: Frontend leaderboard details at:
WebDev AI Leaderboard - Best AI Models for Web Development
From arena.ai
9.3K
Arena.ai reposted
Arena.ai
@arena
Jun 16
Exciting news: GLM-5.2 (Max) ranks #2 in Code Arena: Frontend, with +29pt over Claude Opus 4.7 (Thinking) and only behind Fable 5! GLM-5.2 is the best open model vs Kimi-K2.6 and Minimax-M3 by a large margin.
- #2 React and #4 HTML sub-leaderboards
- Ranks as the top model in
Arena.ai
@arena
Jun 16
GLM-5.2 (Max) by @Zai_org ranks #10 on the new Agent Arena leaderboard, closely matching Claude-Opus-4.8 (non-thinking) and is the #1 open model by a wide margin! In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of
1.4M
Arena.ai reposted
池建强
@sagacity
Jun 20
Top 10 labs by agent capability: China and the US split five to five
29K
Arena.ai reposted
Kai
@hqmank
Jun 20
Agent Arena AI lab ranking update: Zai moved to #3 after releasing GLM 5.2, ahead of Google. The current top 10 now includes 5 US labs and 5 Chinese labs.
27K
Z.ai
@Zai_org
Jun 16
Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong
687K
Arena.ai
@arena
Jun 19
Curious about GLM-5.2 and haven’t tested it yet? Check out first impressions with @petergostev
8.1K
Arena.ai
@arena
Jun 18
Agent Arena's causal tracing methodology lets us quantify the real value of humans working together with AI agents, and observe a huge range of model behaviors from the same traces. We started with 5 signals: confirmed success, praise vs. complaint, steerability, bash recovery,
Arena.ai
@arena
Jun 17
Agent Arena has been live for 2 weeks, with 10 more models now on the new leaderboard. Two highlights worth mentioning:
- GLM-5.2 (Max) by @Zai_org enters the top 10. The strongest open-weight result we've measured, at +9.4% confirmed success and +14.9% praise-vs-complaint
12K
Arena.ai
@arena
Jun 18
Learn more about how we built the methodology behind Agent Arena:
Agent Arena: Causal Evaluation of Agents in the Real World
From arena.ai
4.2K
Arena.ai
@arena
Jun 4
Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex
54K
Arena.ai
@arena
Jun 17
Learn more about the causal tracing methodology for Agent Arena on our blog:
Agent Arena: Causal Evaluation of Agents in the Real World
From arena.ai
5.8K
Arena.ai
@arena
Jun 17
Head over to the Agent Arena leaderboard to see the data in detail:
Agent Arena | AI Agent Performance Leaderboard
From arena.ai
4.6K
Arena.ai
@arena
Jun 17
Kimi K2.7 Code by @Kimi_Moonshot ranks #19 overall on the new Agent Arena leaderboard, and #6 among open models. In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem,
Kimi.ai
@Kimi_Moonshot
Jun 12
🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!
🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower
33K
Arena.ai
@arena
Jun 17
Replying to @arena
Learn more about the causal tracing methodology for Agent Arena on our blog:
Agent Arena: Causal Evaluation of Agents in the Real World
From arena.ai
4.1K
Arena.ai
@arena
Jun 17
Head over to the Agent Arena leaderboard and filter by open models or view by lab:
Agent Arena | AI Agent Performance Leaderboard
From arena.ai
3.5K