Cognition (@cognition) / X

760 posts

Cognition

@cognition

Makers of Devin, the first AI software engineer. We are an applied AI lab building end-to-end software agents. Join us: cognition.ai

San Francisco Bay Area

devin.ai/?utm_source=x&…

Joined January 2024

Pinned
Cognition
@cognition
Jun 8
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is the first to measure: would you actually merge this code?
2.5M
Cognition
@cognition
Jun 9
Replying to @cognition
Read our blog post:
Claude Fable 5 is now available in Devin
From devin.ai
7.1K
Cognition
@cognition
Jun 9
Replying to @cognition
Try out Devin today!
Devin
From devin.ai
8.8K
Cognition
@cognition
Jun 9
Replying to @cognition
You can try Claude Fable 5 as part of Devin Cloud’s Ultra agent. Devin Ultra is our smartest and most capable agent, which excels at long-horizon tasks and debugging. We tuned the harness so Ultra costs only ~40% more than the default Devin agent. Claude Fable 5 is also available
25K
Cognition
@cognition
Jun 9
Claude Fable 5 is now available in Devin. Fable 5 earns the #1 spot on FrontierCode, our benchmark for real-world engineering tasks that grades mergeability and quality:
275K
Cognition reposted
Devin Desktop
@devindesktop
Jun 8
At @tryramp, engineers use Devin Desktop to bring their favorite agents into one place. With Devin Desktop they can dispatch, monitor, and jump between agents from a single surface with shared context across every task.
11K
Cognition
@cognition
Jun 8
Replying to @cognition
You can find full model results and technical implementation details on our blog:
Introducing FrontierCode
From cognition.ai
33K
Cognition
@cognition
Jun 8
Replying to @cognition
FrontierCode has three task sets: Extended (150 tasks), Main (100 tasks) and Diamond (50 tasks). SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on our Diamond task set.
109K
Cognition
@cognition
Jun 8
Replying to @cognition
Tasks in the dataset have a concise problem statement with large solutions that cut across multiple files. FrontierCode’s task set is more diverse than other software engineering evals, measuring ability across a wide range of languages and problem types.
63K
Cognition
@cognition
Jun 8
Replying to @cognition
Rigorous quality control is important, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review. Every task is manually reviewed by a Cognition researcher. This reduces both the false positive and false negative rates for proposed
45K
Cognition
@cognition
Jun 8
Replying to @cognition
FrontierCode was built in close partnership with the expert maintainers of 36 flagship open-source repositories, like @smilingnosrati, CEO & Tech Lead @CeleryOrg (29k stars), and Martin McKeaveney, CTO of @budibase (28k stars). Maintainers invested more than 40 hours per task,
57K
Cognition
@cognition
Jun 8
Replying to @cognition
20+ world-class open-source developers built realistic coding tasks on repos they maintain. They define what “mergeable” means in their repo. What does it take to measure mergeability? We use a mix of unit tests, rubrics and novel verifiers to assess correctness, test quality,
99K
Cognition
@cognition
Jun 5
Replying to @cognition
Thank you to @Lux_Capital for hosting. Full video:
9.2K