Collinear AI
124 posts
Collinear AI
@CollinearAI
The AI Simulation Lab
collinear.ai
Joined October 2023
45
Following
524
Followers
  • Collinear AI
    @CollinearAI
    4h
    Most “hard” problems are useless for training a model. The useful ones sit in a narrow learnable region where the model fails sometimes and succeeds sometimes. Here are some early results where we took 5 trivial and 5 impossible cybersecurity environments and had a model
    450
    Collinear AI
    @CollinearAI
    3h
    Replying to @CollinearAI
    6/7 For an initially hard example, one bug was buried behind confusing arithmetic and misleading “legacy” comments, so the solver never found it. The creator made it legible: a plain value instead of an obfuscated one, plus a hint pointing to the right file. Now the solver sometimes
    37
    Collinear AI
    @CollinearAI
    3h
    7/7 In summary, with a few rounds of feedback, Opus 4.8 can reshape a task to land at the edge of GPT-5.5’s ability. One open question remains: is this due to the iterations of feedback, or because the creator has capabilities that the solver lacks?
    33
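A minimal sketch of the "learnable region" idea from the thread above, not the method Collinear used. It assumes binary pass/fail rollouts per environment and, as a further assumption the thread does not state, a group-mean baseline for the training signal. The function names `classify` and `centered_rewards` are hypothetical.

```python
from statistics import mean


def success_rate(outcomes):
    """Fraction of rollouts that passed (outcomes is a list of booleans)."""
    return mean(1.0 if passed else 0.0 for passed in outcomes)


def classify(outcomes):
    """Label an environment by how often the model solves it."""
    rate = success_rate(outcomes)
    if rate == 0.0:
        return "impossible"  # the model never succeeds
    if rate == 1.0:
        return "trivial"     # the model always succeeds
    return "learnable"       # the model fails sometimes and succeeds sometimes


def centered_rewards(outcomes):
    """Reward minus the group mean (assumed baseline, for illustration only)."""
    rewards = [1.0 if passed else 0.0 for passed in outcomes]
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical rollouts for three environments.
print(classify([True, True, True, True]))      # trivial
print(classify([False, False, False, False]))  # impossible
print(classify([True, False, False, True]))    # learnable
print(centered_rewards([True, True, True, True]))  # [0.0, 0.0, 0.0, 0.0]
```

Under that assumed baseline, an environment the model always or never solves yields all-zero centered rewards, which is one way to read why only the mixed-outcome environments carry a useful signal.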
  • Collinear AI
    @CollinearAI
    Jun 21
    What are the real bottlenecks to AGI? Let's debate! Collinear HQ, Sunnyvale, June 29th. Join researchers from xAI, Sierra, and Amazon AGI for some hot takes, moderated by our own @nazneenrajani. Invite-only for ACL/ICML authors.
    361
    Collinear AI
    @CollinearAI
    Jun 22
    Request invite here -->
    Samosas and the Race to AGI · Luma
    From luma.com
    64
  • Collinear AI reposted
    Sachin
    Collinear AI
    @sachpatro97
    Jun 21
    There’s been a lot of talk about the new models getting scary good. Mythos on cybersecurity. GPT-5.5 Codex on coding. GLM 5.1 as the all-around daily driver. But the biggest AI research question we have is much simpler: Can it do it on a rainy night in Stoke? Little sneak peek
    worldcupbench.com
    WorldCupBench — Coming soon
    An agentic LLM benchmark where models manage national teams through a simulated FIFA World Cup 2026. From Collinear AI.
    359
  • Collinear AI
    @CollinearAI
    Jun 17
    1/n We benchmarked the top 4 models on solving real-life cybersecurity vulnerabilities, using a collection of 221 tasks covering more than 150 CWEs, and found some interesting patterns.
    2.4K
    Collinear AI
    @CollinearAI
    Jun 17
    Replying to @CollinearAI
    3/n On the same set of 24 tasks, Opus and Fable have the same pass@1, but Fable has a higher pass@4. This shows that Fable has more variance in its rollouts, which potentially makes it an interesting candidate for scaling parallel test-time compute!
    112
    Collinear AI
    @CollinearAI
    Jun 17
    4/n On one of the tasks, involving a 100K+ Java codebase with 4 vulnerabilities (CWE94, CWE503, CWE862 and CWE863), both models fixed the three obvious bugs, but the fourth one (CWE863), which is more subtle, was solved only by Fable.
    96
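To illustrate the pass@1 versus pass@4 observation in 3/n, here is a small sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n rollouts of which c passed. The thread does not say how pass@k was computed, so the estimator choice is an assumption, and the per-task counts below are made up for illustration; they are not the benchmark's data.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n rollouts with c passes."""
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: 4 tasks, 4 rollouts each.
steady = [(4, 4), (4, 4), (4, 0), (4, 0)]  # all-or-nothing per task
varied = [(4, 2), (4, 2), (4, 2), (4, 2)]  # mixed outcomes on every task

print(mean_pass_at_k(steady, 1), mean_pass_at_k(varied, 1))  # 0.5 0.5
print(mean_pass_at_k(steady, 4), mean_pass_at_k(varied, 4))  # 0.5 1.0
```

Both hypothetical models have the same pass@1, but the one whose successes are spread across tasks gains from extra samples, which matches the intuition that higher rollout variance can make parallel sampling more worthwhile.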
  • Collinear AI
    @CollinearAI
    Jun 9
    We are hiring MTS with backgrounds in research, ML, engineering, product, and customer-facing deployments. Come get your seat on the rocket ship 🚀
    Nazneen Rajani
    Collinear AI
    @nazneenrajani
    Jun 9
    In January of this year, the number of MTS with PhDs @CollinearAI was 2. Today it is 8, with 3 more joining next month. We believe we are building something special that lies on the critical path to AGI. We are not like the other RL env companies. Not every hill is worth
    279
  • Collinear AI
    @CollinearAI
    Jun 4
    We are pleased to see that the latest MAI-Thinking-1 model relies heavily on a synthetic pipeline for RL environments, primarily for agentic MCP tool-use scenarios. Curiously, they especially highlight the FunReason-MT pipeline by Ant Group, which contains a few interesting
    5.2K
  • Collinear AI
    @CollinearAI
    Jun 1
    It's important for the community to reflect on the areas in which the open-source labs have closed the gap on frontier capabilities: (1) 1M context. The DeepSeek V4 tech report has sufficiently shown how compression + sparse selection of keys/values in attention can enable 1M context at
    MiniMax (official)
    @MiniMax_AI
    Jun 1
    Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas - MiniMax Sparse Attention scales context to 1M -
    4.2K
  • Collinear AI
    @CollinearAI
    May 31
    As agentic RL becomes more important in the research community, the problem of token-vs-text mismatch is now being actively studied. Some throwbacks to earlier efforts from our side & frontier labs: - Back in January, when building our on-policy distillation framework Spider, we
    clem 🤗
    @ClementDelangue
    May 29
    Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get
    7.4K
  • Collinear AI
    @CollinearAI
    May 29
    In light of Claude Code's Dynamic Workflow rollout, we are reviewing some solid multi-agent research by frontier labs, as it is very informative for the agentic research community. - Anthropic's early "open source recipe" for Workflow, where they use multiple agents to build a
    8.8K
  • Collinear AI
    @CollinearAI
    May 21
    Article
    Is your RL environment fair to your agent?
    or ensuring that your hill-climbing budget is spent right :) tl;dr: based on my current understanding of evaluations, RL environments, and the hill-climbing loop: an environment (or evaluation) is fair...
    395