Arena.ai
3,354 posts
Arena.ai
@arena
Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring → arena.ai/jobs
US
arena.ai
Joined March 2023
215
Following
171.1K
Followers
  • Pinned
    user avatar
    Arena.ai
    @arena
    Jun 4
    Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image
    274K
  • user avatar
    Arena.ai
    @arena
    Jun 23
    Cosmos 3 Super by @nvidia is ranked among open models as #8 and #11 for Text-to-Image Arena (ranked #49 & #54 overall). - #8 Cosmos-3-Super-Text2Image is on par with Flux-2-Klein-9B and Qwen Image Prompt Extend - #11 Cosmos-3-Super-Text2Image (Agentic), is on par with models
    21K
    user avatar
    Arena.ai
    @arena
    Jun 23
    Dig into the Text-to-Image Arena leaderboard details at:
    Text-to-Image Leaderboard - Best AI Image Generators
    From arena.ai
    4.6K
  • user avatar
    Arena.ai
    @arena
    Jun 22
    Arena's leaderboard isn't a static benchmark, it's a living one. Every ranking is driven by real-world tasks from a global community of users, refreshed continuously as new prompts and models arrive. So how does it all work? The team breaks down the full model lifecycle:
    13K
    user avatar
    Arena.ai
    @arena
    Jun 22
    Read more about Arena's AI evaluations, which we launched back in September (under our former name, LMArena):
    AI Evaluations at LMArena
    From arena.ai
    5.4K
    user avatar
    Arena.ai
    @arena
    Jun 22
    Learn more about our latest causal tracing methodology for the newest Agent Arena leaderboard:
    Agent Arena: Causal Evaluation of Agents in the Real World
    From arena.ai
    3.7K
  • user avatar
    Arena.ai
    @arena
    Jun 22
    Seed 2.1 Pro Preview ranks #8 in Code Arena: Frontend, scoring 1539, on par with Opus 4.6. It performs strongly on React apps and lands in the top 10 for five of seven subcategories. In those areas, only a handful of frontier labs rank above it: @Zai_org's GLM-5.2 and
    137K
    user avatar
    Arena.ai
    @arena
    Jun 22
    Learn more about Seed 2.1 Pro Preview from ByteDance: seed.bytedance.com/en/blog/seed-2…
    5.1K
  • user avatar
    Arena.ai
    @arena
    Jun 22
    Replying to @arena
    Head over to look into the Code Arena: Frontend leaderboard details at:
    WebDev AI Leaderboard - Best AI Models for Web Development
    From arena.ai
    9.3K
  • Arena.ai reposted
    user avatar
    Arena.ai
    @arena
    Jun 16
    Exciting news: GLM-5.2 (Max) ranks #2 in Code Arena: Frontend, with +29pt over Claude Opus 4.7 (Thinking) and only behind Fable 5! GLM-5.2 is the best open model vs Kimi-K2.6 and Minimax-M3 by a large margin. - #2 React and #4 HTML sub-leaderboards - Ranks as the top model in
    user avatar
    Arena.ai
    @arena
    Jun 16
    GLM-5.2 (Max) by @Zai_org ranks #10 on the new Agent Arena leaderboard, closely matching Claude-Opus-4.8 (non-thinking) and is the #1 open model by a wide margin! In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of
    1.4M
  • Arena.ai reposted
    user avatar
    池建强
    @sagacity
    Jun 20
    Top 10 labs by agent capability: China and the US are split five to five
    29K
  • Arena.ai reposted
    user avatar
    Kai
    @hqmank
    Jun 20
    Agent Arena AI lab ranking update: Zai moved to #3 after releasing GLM 5.2, ahead of Google. The current top 10 now includes 5 US labs and 5 Chinese labs.
    27K
  • user avatar
    Arena.ai
    @arena
    Jun 16
    GLM-5.2 (Max) by @Zai_org ranks #10 on the new Agent Arena leaderboard, closely matching Claude-Opus-4.8 (non-thinking) and is the #1 open model by a wide margin! In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of
    user avatar
    Z.ai
    @Zai_org
    Jun 16
    Introducing GLM-5.2: Frontier Intelligence, Open Weights - Significant improvements in coding and agentic tasks - Strong long-horizon capabilities with a 1M context window - Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong
    687K
    user avatar
    Arena.ai
    @arena
    Jun 19
    Curious about GLM-5.2 and haven’t tested it yet? Check out first impressions with @petergostev
    8.1K
  • user avatar
    Arena.ai
    @arena
    Jun 18
    Agent Arena's causal tracing methodology lets us quantify the real value of humans working together with AI agents, and observe a huge range of model behaviors from the same traces. We started with 5 signals: confirmed success, praise vs. complaint, steerability, bash recovery,
    user avatar
    Arena.ai
    @arena
    Jun 17
    Agent Arena has been live for 2 weeks, with 10 more models now on the new leaderboard. Two highlights worth mentioning: - GLM-5.2 (Max) by @Zai_org enters the top 10. The strongest open-weight result we've measured, at +9.4% confirmed success and +14.9% praise-vs-complaint
    12K
    user avatar
    Arena.ai
    @arena
    Jun 18
    Learn more about how we built the methodology behind Agent Arena:
    Agent Arena: Causal Evaluation of Agents in the Real World
    From arena.ai
    4.2K
  • user avatar
    Arena.ai
    @arena
    Jun 17
    Agent Arena has been live for 2 weeks, with 10 more models now on the new leaderboard. Two highlights worth mentioning: - GLM-5.2 (Max) by @Zai_org enters the top 10. The strongest open-weight result we've measured, at +9.4% confirmed success and +14.9% praise-vs-complaint
    user avatar
    Arena.ai
    @arena
    Jun 4
    Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex
    54K
    user avatar
    Arena.ai
    @arena
    Jun 17
    Learn more about the causal tracing methodology for Agent Arena on our blog:
    Agent Arena: Causal Evaluation of Agents in the Real World
    From arena.ai
    5.8K
    user avatar
    Arena.ai
    @arena
    Jun 17
    Head over to the Agent Arena leaderboard to see the data in detail:
    Agent Arena | AI Agent Performance Leaderboard
    From arena.ai
    4.6K
  • user avatar
    Arena.ai
    @arena
    Jun 17
    Kimi K2.7 Code by @Kimi_Moonshot ranks #19 overall on the new Agent Arena leaderboard, and #6 among open models. In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem,
    user avatar
    Kimi.ai
    @Kimi_Moonshot
    Jun 12
    🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced! 🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite. 🔷 Reasoning efficiency: Less overthinking, with 30% lower
    33K
    user avatar
    Arena.ai
    @arena
    Jun 17
    Replying to @arena
    Learn more about the causal tracing methodology for Agent Arena on our blog:
    Agent Arena: Causal Evaluation of Agents in the Real World
    From arena.ai
    4.1K
    user avatar
    Arena.ai
    @arena
    Jun 17
    Head over to the Agent Arena leaderboard and filter by open models or view by lab:
    Agent Arena | AI Agent Performance Leaderboard
    From arena.ai
    3.5K
