Naman Jain
567 posts
Naman Jain
@StringChaos
Research @cursor_ai | CursorBench, LiveCodeBench, DeepSWE, R2E-Gym, GSO, LMArena Coding | Past: @UCBerkeley @MetaAI @AWS @MSFTResearch @iitbombay
San Francisco, CA
naman-ntc.github.io
Joined March 2018
1,445 Following · 2,871 Followers
  • Pinned
    Naman Jain
    @StringChaos
    Mar 12
    New post: how we do evals at @cursor_ai. Takeaways: 1. Online metrics from real Cursor requests provide construct validity 2. CursorBench: a dynamic offline suite distilled from online learnings 3. Multi-axes evals -- correctness, efficiency, agent interaction behavior
    Cursor
    @cursor_ai
    Mar 12
    We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency:
    39K
  • Naman Jain
    @StringChaos
    Jan 17, 2025
    DeepSeek-R1 (Preview) Results 🔥 We worked with the @deepseek_ai team to evaluate R1 Preview models on LiveCodeBench. The model performs in the vicinity of o1-Medium providing SOTA reasoning performance! Huge kudos to the team and I'm looking forward to the full release!! /1
    174K
  • Naman Jain
    @StringChaos
    Apr 10, 2024
    The new GPT-4-Turbo improves an impressive 4.5 points on LiveCodeBench (comprising competition-style programming problems). These problems are quite challenging for current LLMs and this improvement highlights a considerable improvement in reasoning!! x.com/polynoamial/st…
    Noam Brown
    @polynoamial
    Apr 9, 2024
    GPT-4 reasoning has been further improved
    126K
  • Naman Jain
    @StringChaos
    Feb 24, 2025
    Check out evaluations for QwQ-Max-Preview on LiveCodeBench where it performs on par with o1-medium🚀!!
    Qwen
    @Alibaba_Qwen
    Feb 24, 2025
    <think>...</think> QwQ-Max-Preview Qwen Chat: chat.qwen.ai Blog: qwenlm.github.io/blog/qwq-max-p… 🤔 Today we release "Thinking (QwQ)" in Qwen Chat, backed by our QwQ-Max-Preview, which is a reasoning model based on Qwen2.5-Max. This model is still for preview. It is highly
    76K
  • Naman Jain
    @StringChaos
    Apr 9, 2025
    Excited to release R2E-Gym - 🔥 8.1K executable environments using synthetic data - 🧠 Hybrid verifiers for enhanced inference-time scaling - 📈 51% success-rate on the SWE-Bench Verified - 🤗 Open Source Data + Models + Trajectories 1/
    52K
  • Naman Jain
    @StringChaos
    Mar 14, 2024
    📢📢Excited to introduce our new work LiveCodeBench! 📈 Live evaluations to ensure fairness and reliability 🔍 Holistic evaluations using 4 code-related scenarios 💡Insights from comparing 20+ code models 🚨🚨We use problem release dates to detect and prevent contamination
    83K
  • Naman Jain
    @StringChaos
    Aug 15, 2022
    Super excited to announce that after spending two amazing years @MSFTResearch India, I am starting my Ph.D. at @Berkeley_EECS! Grateful to all the advisors, collaborators, friends, and family that made this possible. Look forward to doing exciting work in the ML ↔️ PL space
  • Naman Jain
    @StringChaos
    Nov 29, 2023
    Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets ensuring correctness with an oracle equivalence checker. arxiv.org/abs/2311.14904 (1/N)
    41K
  • Naman Jain
    @StringChaos
    Feb 27, 2025
    Exciting LiveCodeBench update! A new model from @Kimi_Moonshot Kimi-1.6-IoI-High optimized for (algorithmic) coding now ranks first on the leaderboard!
    45K
  • Naman Jain
    @StringChaos
    May 30, 2025
    Can SWE-Agents aid in High-Performance Software development? ⚡️🤔 Introducing GSO: A Challenging Code Optimization Benchmark 🔍 Unlike simple bug fixes, this combines algorithmic reasoning with systems programming 📊 Results: Current agents struggle with <5% success rate!
    Manish Shetty
    @slimshetty_
    May 30, 2025
    ✨ NEW SWE-Agents BENCHMARK ✨ Introducing GSO: The Global Software Optimization Benchmark - 👩🏻‍💻 100+ challenging software optimization tasks - 🛣️ a long-horizon task w/ precise specification - 🐘 large code changes in Py, C, C++, ... - 📉 SOTA models get < 5% success! 1/
    14K
  • Naman Jain
    @StringChaos
    Jan 15, 2025
    📢 Excited to share the 5th update for LiveCodeBench. We have added 167 new problems this time and collected 880 problems overall, an over two-fold increase from 400 problems in v1. Leaderboard ⬇️ - 🥇 open and closed reasoning models (o1, Gemini-Flash, QwQ, and upcoming R1!)
    48K
  • Naman Jain
    @StringChaos
    Dec 26, 2024
    DeepSeek-V3 Released - 🥇 "non-reasoning" model on LiveCodeBench right behind the O1 models (open-source!) - "reasoning" distillation from the R1 model (trading off COT tokens for performance) Beyond algorithmic-coding -- AIME 39 (MATH500 90 ☠️) and SWEBench-Verified (w.
    DeepSeek
    @deepseek_ai
    Dec 26, 2024
    🚀 Introducing DeepSeek-V3! Biggest leap forward yet: ⚡ 60 tokens/second (3x faster than V2!) 💪 Enhanced capabilities 🛠 API compatibility intact 🌍 Fully open-source models & papers 🐋 1/n
    8.4K
  • Naman Jain
    @StringChaos
    Apr 22, 2025
    Heading to ICLR to present LiveCodeBench at the Friday afternoon poster session and Challenges and Paths Towards AI For SWE at various ICLR workshops. Looking forward to connecting with folks working on Code and LLM Evaluation! (DMs open) PS: Check out our recent v6 update 🚀
    12K
  • Naman Jain
    @StringChaos
    Jul 2, 2025
    🚀 Introducing DeepSWE: Open-Source SWE Agent We're excited to release DeepSWE, our fully open-source software engineering agent trained with pure reinforcement learning on Qwen3-32B. 📊 The results: 59% on SWE-Bench-Verified with test-time scaling (42.2% Pass@1) - new SOTA
    Agentica Project
    @Agentica_
    Jul 2, 2025
    🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. 💪DeepSWE
    6.4K
