Most “hard” problems are useless for training a model.
The useful ones sit in a narrow learnable region where the model fails sometimes and succeeds sometimes.
Here are some early results where we took 5 trivial and 5 impossible cybersecurity environments and had a model
Collinear AI
- Replying to @CollinearAI
  - 6/7 For an initially hard example, one bug was buried behind confusing arithmetic and misleading “legacy” comments, so the solver never found it. The creator made it legible: a plain value instead of an obfuscated one, plus a hint pointing at the right file. Now the solver sometimes
  - 7/7 In summary, with a few rounds of feedback, Opus 4.8 can reshape a task to land at the edge of GPT-5.5’s ability. An open question remains, though: is this due to the multiple rounds of feedback, or because the creator has capabilities that the solver lacks?
- What are the real bottlenecks to AGI? Let's debate! Collinear HQ, Sunnyvale, June 29th. Join researchers from xAI, Sierra and Amazon AGI for some hot takes, moderated by our own @nazneenrajani. Invite-only for ACL/ICML authors. Request invite here -->
- Collinear AI reposted: There’s been a lot of talk about the new models getting scary good. Mythos on cybersecurity. GPT-5.5 Codex on coding. GLM 5.1 as the all-around daily driver. But the biggest AI research question we have is much simpler: Can it do it on a rainy night in Stoke? Little sneak peek
- 1/n We benchmarked the top 4 models on solving real-life cybersecurity vulnerabilities, using a collection of 221 tasks covering more than 150 CWEs, and found some interesting patterns.
  - Replying to @CollinearAI
  - 3/n On the same set of 24 tasks, Opus and Fable have the same pass@1, but Fable has a higher pass@4, showing that Fable has more variance in its rollouts. That potentially makes it interesting for scaling parallel test-time compute!
  - 4/n On one of the tasks, involving a 100K+ Java codebase with 4 vulnerabilities (CWE94, CWE503, CWE862 and CWE863), both models fixed the three obvious bugs, but the fourth one (CWE863), which is more subtle, was solved only by Fable.
- We are hiring MTS with backgrounds in research, ML, engineering, product, and customer-facing deployments. Come get your seat on the rocket ship 🚀
  Quoted post: In January of this year, the number of MTS with PhDs @CollinearAI was 2. Today it is 8, with 3 more joining next month. We believe we are building something special that lies on the critical path to AGI. We are not like the other RL env companies. Not every hill is worth
- We are pleased to see that the latest MAI-Thinking-1 model is strongly supported by a synthetic pipeline for RL environments, primarily for agentic MCP tool-use scenarios. Curiously, they especially highlight the FunReason-MT pipeline by Ant Group, which contains a few interesting
- It's important for the community to reflect on the areas where the open-source labs have closed the gap on frontier capabilities: (1) 1M context. The DeepSeek V4 tech report has sufficiently shown how compression + sparse selection of keys/values in attention can enable 1M context at
  Quoted post: Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities
  - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
  - MiniMax Sparse Attention scales context to 1M
- As agentic RL becomes more important in the research community, the problem of token-vs-text mismatch is now actively studied. Some throwbacks to earlier efforts from our side & frontier labs:
  - Back in January, when building our on-policy distillation framework Spider, we
  Quoted post: Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get
- In light of Claude Code's Dynamic Workflow rollout, we chose to review some solid multi-agent research by frontier labs, as it is very informative for the agentic research community.
  - Anthropic's early "open source recipe" for Workflow, where they use multiple agents to build a
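
The pass@1 versus pass@4 comparison in the benchmark thread, and the "learnable region" idea from the first post, can be illustrated with a small sketch. It assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n rollouts of which c passed; the posts do not say how their numbers were computed, and the rollout counts below are made up for illustration, not taken from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n rollouts sampled, c passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds (n, c) pairs, one per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


def in_learnable_region(n: int, c: int) -> bool:
    """Hypothetical criterion: the model sometimes fails and sometimes succeeds."""
    return 0 < c < n


# Hypothetical data: (rollouts, passes) per task for two imaginary models.
consistent = [(8, 8), (8, 8), (8, 0), (8, 0)]  # always or never solves a task
varied = [(8, 4), (8, 4), (8, 4), (8, 4)]      # solves every task half the time

for name, results in [("consistent", consistent), ("varied", varied)]:
    learnable = sum(in_learnable_region(n, c) for n, c in results)
    print(
        name,
        f"pass@1={mean_pass_at_k(results, 1):.3f}",
        f"pass@4={mean_pass_at_k(results, 4):.3f}",
        f"learnable_tasks={learnable}",
    )
```

Both imaginary models have the same pass@1 here, but only the higher-variance one gains from extra samples, and only its tasks fall in the region where some rollouts pass and some fail.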