Datacurve (@datacurve) / X

Datacurve

19 posts

Datacurve

@datacurve

Research and data to advance frontier models.

San Francisco

Joined February 2024

Pinned
Datacurve
@datacurve
Jun 19
Claude Fable 5 debuts at #1 on DeepSWE. It outscores the previous best by 3% and sets a new state-of-the-art on our long-horizon coding benchmark.
461K
Datacurve
@datacurve
Jun 20
GLM 5.2 is now on DeepSWE as the top open-source model on our leaderboard. With a pass@1 score of 44% at max effort, GLM 5.2 is the indisputable #1 open-source model, besting Kimi K2.7 Code by 17%.
555K
Datacurve
@datacurve
Jun 20
Our updated leaderboard is at:
DeepSWE
From deepswe.datacurve.ai
13K
Datacurve
@datacurve
Jun 19
Replying to @datacurve
See the full updated leaderboard here:
DeepSWE
From deepswe.datacurve.ai
13K
Datacurve
@datacurve
Jun 19
Replying to @datacurve
Fable 5 scores 70% pass@1 and tracks GPT-5.5 on cost-performance at the default high effort. Kimi K2.7 also joins the leaderboard with a score of 31%.
25K
Datacurve reposted
Artificial Analysis
@ArtificialAnlys
Jun 12
We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top
568K
Datacurve
@datacurve
May 30
Opus 4.8 is now on DeepSWE. On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.
978K
Datacurve
@datacurve
May 30
Opus 4.8 delivers efficiency gains by solving tasks in fewer steps, directly reducing the total number of input tokens required per task.
124K
Datacurve
@datacurve
May 30
Full deep dive coming soon. Check out the full benchmark here →
DeepSWE
From deepswe.datacurve.ai
26K
Datacurve reposted
Matthew Berman
@MatthewBerman
May 27
DeepSWE reflects what I’m hearing from engineers better than any other benchmark. They took the hard path to build a good one.
Serena Ge (Datacurve)
@serenaa_ge
May 26
Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
41K
Datacurve reposted
Garry Tan
@garrytan
May 26
This is the new standard for engineering evals
Serena Ge (Datacurve)
@serenaa_ge
May 26
Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
117K
Datacurve reposted
Serena Ge (Datacurve)
@serenaa_ge
Apr 4, 2024
I presented today at Demo day Day 2 and @TechCrunch featured us @datacurve! Just been reading TC and listening to TC Daily Crunch since high school mornings... a surreal feeling to see us on it. Also, post-demo sadness cuz now YC is coming to an end
31K