Collinear AI (@CollinearAI) / X

124 posts

The AI Simulation Lab

Joined October 2023

Collinear AI
@CollinearAI
4h
Most “hard” problems are useless for training a model. The useful ones sit in a narrow learnable region where the model fails sometimes and succeeds sometimes. Here are some early results where we took 5 trivial and 5 impossible cybersecurity environments and had a model
450
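A minimal sketch of the "learnable region" idea from the post above, assuming each task has a list of pass/fail outcomes from repeated solver attempts. The function names, the task names, and the 0.0/1.0 cutoffs are illustrative assumptions, not Collinear's actual pipeline.

```python
from typing import Dict, List


def solve_rate(outcomes: List[bool]) -> float:
    """Fraction of attempts in which the solver succeeded."""
    if not outcomes:
        raise ValueError("need at least one attempt per task")
    return sum(outcomes) / len(outcomes)


def learnable_tasks(
    attempts: Dict[str, List[bool]],
    low: float = 0.0,   # assumed cutoff: rate must be strictly above this
    high: float = 1.0,  # assumed cutoff: rate must be strictly below this
) -> List[str]:
    """Keep tasks the solver sometimes passes and sometimes fails."""
    return [
        name
        for name, outcomes in attempts.items()
        if low < solve_rate(outcomes) < high
    ]


# Hypothetical attempt logs, four attempts per task.
attempts = {
    "trivial_env": [True, True, True, True],
    "impossible_env": [False, False, False, False],
    "edge_env": [True, False, False, True],
}
kept = learnable_tasks(attempts)
```

With these made-up logs only `edge_env` is kept; tightening `low` and `high` narrows the band further.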
Collinear AI
@CollinearAI
3h
Replying to @CollinearAI
6/7 For an initially hard example, one bug was buried behind confusing arithmetic and misleading “legacy” comments, so the solver never found it. The creator made it legible: a plain value instead of an obfuscated one, plus a hint pointing to the right file. Now the solver sometimes
37
Collinear AI
@CollinearAI
3h
7/7 In summary, with a few rounds of feedback, Opus 4.8 can reshape a task to land at the edge of GPT-5.5’s ability. An open question remains, though: is this due to the iterations of feedback, or because the creator has capabilities that the solver lacks?
33
Collinear AI
@CollinearAI
Jun 21
What are the real bottlenecks to AGI? Let's debate! Collinear HQ, Sunnyvale, June 29th. Join researchers from xAI, Sierra, and Amazon AGI for some hot takes, moderated by our own @nazneenrajani. Invite-only for ACL/ICML authors.
361
Collinear AI
@CollinearAI
Jun 22
Request invite here -->
Samosas and the Race to AGI · Luma
From luma.com
64
Collinear AI reposted
Sachin
@sachpatro97
Jun 21
There’s been a lot of talk about the new models getting scary good. Mythos on cybersecurity. GPT-5.5 Codex on coding. GLM 5.1 as the all-around daily driver. But the biggest AI research question we have is much simpler: Can it do it on a rainy night in Stoke? Little sneak peek
worldcupbench.com
WorldCupBench — Coming soon
An agentic LLM benchmark where models manage national teams through a simulated FIFA World Cup 2026. From Collinear AI.
359
Collinear AI
@CollinearAI
Jun 17
1/n We benchmarked the top-4 models on solving real-life cybersecurity vulnerabilities, using a collection of 221 tasks covering 150+ CWEs, and found some interesting patterns.
2.4K
Collinear AI
@CollinearAI
Jun 17
Replying to @CollinearAI
3/n On the same set of 24 tasks, Opus and Fable have the same pass@1, but Fable has a higher pass@4. This shows that Fable has more variance in its rollouts, potentially making it an interesting candidate for scaling parallel test-time compute!
112
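A small sketch of how two models can share a pass@1 yet differ on pass@4, as in the 3/n post above. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples with c passes. The per-task pass counts are invented for illustration and are not the benchmark's data.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(passes_per_task: List[int], n: int, k: int) -> float:
    """Average pass@k over tasks, each sampled n times."""
    return sum(pass_at_k(n, c, k) for c in passes_per_task) / len(passes_per_task)


# Hypothetical pass counts out of n = 4 rollouts on two tasks.
consistent = [4, 0]  # always solves one task, never the other
varied = [2, 2]      # solves each task half the time

p1_consistent = mean_pass_at_k(consistent, n=4, k=1)
p1_varied = mean_pass_at_k(varied, n=4, k=1)
p4_consistent = mean_pass_at_k(consistent, n=4, k=4)
p4_varied = mean_pass_at_k(varied, n=4, k=4)
```

Both hypothetical models score 0.5 on pass@1, but the one whose successes are spread across tasks reaches 1.0 on pass@4 while the other stays at 0.5. That spread is the kind of rollout variance parallel sampling can exploit.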
Collinear AI
@CollinearAI
Jun 17
4/n On one of the tasks, involving a 100K+ Java codebase with 4 vulnerabilities (CWE94, CWE503, CWE862 and CWE863), both models fixed the three obvious bugs, but the fourth one (CWE863), which is more subtle, was solved only by Fable.
96
Collinear AI
@CollinearAI
Jun 9
We are hiring MTS with backgrounds in research, ML, engineering, product, and customer-facing deployments. Come get your seat on the rocket ship 🚀
Nazneen Rajani
@nazneenrajani
Jun 9
In January of this year, the number of MTS with PhDs @CollinearAI was 2. Today it is 8, with 3 more joining next month. We believe we are building something special that lies on the critical path to AGI. We are not like the other RL env companies. Not every hill is worth
279
Collinear AI
@CollinearAI
Jun 4
We are pleased to see that the latest MAI-Thinking-1 model relies heavily on a synthetic pipeline for RL environments, primarily for agentic MCP tool-use scenarios. Curiously, they especially highlight the FunReason-MT pipeline by Ant Group, which contains a few interesting
5.2K
Collinear AI
@CollinearAI
Jun 1
It's important for the community to reflect on the areas in which open-source labs have closed the gap on frontier capabilities: (1) 1M context. The DeepSeek V4 tech report has sufficiently shown how compression + sparse selection of keys/values in attention can enable 1M context at
MiniMax (official)
@MiniMax_AI
Jun 1
Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas - MiniMax Sparse Attention scales context to 1M -
4.2K
Collinear AI
@CollinearAI
May 31
As agentic RL becomes more important in the research community, the problem of token-vs-text mismatch is now actively studied. Some throwbacks to earlier efforts from our side & frontier labs: - Back in January, when building our on-policy distillation framework Spider, we
clem 🤗
@ClementDelangue
May 29
Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get
7.4K
Collinear AI
@CollinearAI
May 29
In light of Claude Code's Dynamic Workflow rollout, we are reviewing some solid multi-agent research by frontier labs, as it is very informative for the agentic research community. - Anthropic's early "open source recipe" for Workflow, where they use multiple agents to build a
8.8K
Collinear AI
@CollinearAI
May 21
Article
Is your RL environment fair to your agent?
or ensuring that your hillclimbing budget is spent right :) tldr; based on my current understanding of evaluations, RL environments, and the hill-climbing loop: an environment (or evaluation) is fair...
395