Naman Jain (@StringChaos) / X

@StringChaos

Research @cursor_ai | CursorBench, LiveCodeBench, DeepSWE, R2E-Gym, GSO, LMArena Coding | Past: @UCBerkeley @MetaAI @AWS @MSFTResearch @iitbombay

San Francisco, CA

Joined March 2018

Naman Jain
@StringChaos
Mar 12
New post: how we do evals at @cursor_ai. Takeaways: 1. Online metrics from real Cursor requests provide construct validity 2. CursorBench: a dynamic offline suite distilled from online learnings 3. Multi-axes evals -- correctness, efficiency, agent interaction behavior
Cursor
@cursor_ai
Mar 12
We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency:
Naman Jain
@StringChaos
Jan 17, 2025
DeepSeek-R1 (Preview) Results 🔥 We worked with the @deepseek_ai team to evaluate R1 Preview models on LiveCodeBench. The model performs in the vicinity of o1-Medium, providing SOTA reasoning performance! Huge kudos to the team and I'm looking forward to the full release!! /1
Naman Jain
@StringChaos
Apr 10, 2024
The new GPT-4-Turbo gains an impressive 4.5 points on LiveCodeBench (comprising competition-style programming problems). These problems are quite challenging for current LLMs, and the gain points to a considerable improvement in reasoning!! x.com/polynoamial/st…
Noam Brown
@polynoamial
Apr 9, 2024
GPT-4 reasoning has been further improved
Naman Jain
@StringChaos
Feb 24, 2025
Check out evaluations for QwQ-Max-Preview on LiveCodeBench where it performs on par with o1-medium🚀!!
Qwen
@Alibaba_Qwen
Feb 24, 2025
<think>...</think> QwQ-Max-Preview Qwen Chat: chat.qwen.ai Blog: qwenlm.github.io/blog/qwq-max-p… 🤔 Today we release "Thinking (QwQ)" in Qwen Chat, backed by our QwQ-Max-Preview, which is a reasoning model based on Qwen2.5-Max. This model is still for preview. It is highly
Naman Jain
@StringChaos
Apr 9, 2025
Excited to release R2E-Gym - 🔥 8.1K executable environments using synthetic data - 🧠 Hybrid verifiers for enhanced inference-time scaling - 📈 51% success rate on SWE-Bench Verified - 🤗 Open Source Data + Models + Trajectories 1/
Naman Jain
@StringChaos
Mar 14, 2024
📢📢Excited to introduce our new work LiveCodeBench! 📈 Live evaluations to ensure fairness and reliability 🔍 Holistic evaluations using 4 code-related scenarios 💡Insights from comparing 20+ code models 🚨🚨We use problem release dates to detect and prevent contamination
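The release-date idea in this post can be sketched as follows. This is a minimal illustration, not LiveCodeBench's actual code: keep only problems published after a model's training cutoff, so the model cannot have seen them during training. The `Problem` record, its field names, and all dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; a real benchmark schema will differ.
    problem_id: str
    release_date: date


def uncontaminated(problems, training_cutoff):
    """Return the problems released strictly after the training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


# Made-up problems and cutoff, for illustration only.
problems = [
    Problem("a", date(2023, 6, 1)),
    Problem("b", date(2023, 11, 15)),
    Problem("c", date(2024, 2, 3)),
]
fresh = uncontaminated(problems, training_cutoff=date(2023, 9, 30))
print([p.problem_id for p in fresh])
```

Comparing a model's score on problems before and after its cutoff is one way such a filter could also be used to detect contamination, assuming the cutoff date is known.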
Naman Jain
@StringChaos
Aug 15, 2022
Super excited to announce that after spending two amazing years @MSFTResearch India, I am starting my Ph.D. at @Berkeley_EECS! Grateful to all the advisors, collaborators, friends, and family that made this possible. Look forward to doing exciting work in the ML ↔️ PL space
Naman Jain
@StringChaos
Nov 29, 2023
Thrilled to share our recent work on leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker. arxiv.org/abs/2311.14904 (1/N)
Naman Jain
@StringChaos
Feb 27, 2025
Exciting LiveCodeBench update! A new model from @Kimi_Moonshot Kimi-1.6-IoI-High optimized for (algorithmic) coding now ranks first on the leaderboard!
Naman Jain
@StringChaos
May 30, 2025
Can SWE-Agents aid in High-Performance Software development? ⚡️🤔 Introducing GSO: A Challenging Code Optimization Benchmark 🔍 Unlike simple bug fixes, this combines algorithmic reasoning with systems programming 📊 Results: Current agents struggle with <5% success rate!
Manish Shetty
@slimshetty_
May 30, 2025
✨ NEW SWE-Agents BENCHMARK ✨ Introducing GSO: The Global Software Optimization Benchmark - 👩🏻‍💻 100+ challenging software optimization tasks - 🛣️ a long-horizon task w/ precise specification - 🐘 large code changes in Py, C, C++, ... - 📉 SOTA models get < 5% success! 1/
Naman Jain
@StringChaos
Jan 15, 2025
📢 Excited to share the 5th update for LiveCodeBench! We have added 167 new problems this time, for 880 problems overall, more than double the 400 problems in v1. Leaderboard ⬇️ - 🥇 open and closed reasoning models (o1, Gemini-Flash, QwQ, and upcoming R1!)
Naman Jain
@StringChaos
Dec 26, 2024
DeepSeek-V3 Released - 🥇 "non-reasoning" model on LiveCodeBench right behind the O1 models (open-source!) - "reasoning" distillation from the R1 model (trading off COT tokens for performance) Beyond algorithmic-coding -- AIME 39 (MATH500 90 ☠️) and SWEBench-Verified (w.
DeepSeek
@deepseek_ai
Dec 26, 2024
🚀 Introducing DeepSeek-V3! Biggest leap forward yet: ⚡ 60 tokens/second (3x faster than V2!) 💪 Enhanced capabilities 🛠 API compatibility intact 🌍 Fully open-source models & papers 🐋 1/n
Naman Jain
@StringChaos
Apr 22, 2025
Heading to ICLR to present LiveCodeBench at the Friday afternoon poster session and Challenges and Paths Towards AI For SWE at various ICLR workshops. Looking forward to connecting with folks working on Code and LLM Evaluation! (DMs open) PS: Check out our recent v6 update 🚀
Naman Jain
@StringChaos
Jul 2, 2025
🚀 Introducing DeepSWE: Open-Source SWE Agent We're excited to release DeepSWE, our fully open-source software engineering agent trained with pure reinforcement learning on Qwen3-32B. 📊 The results: 59% on SWE-Bench-Verified with test-time scaling (42.2% Pass@1) - new SOTA
Agentica Project
@Agentica_
Jul 2, 2025
🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. 💪DeepSWE
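For readers unfamiliar with the Pass@1 figure quoted above, here is a minimal sketch of the unbiased pass@k estimator commonly used in code-generation evaluation. Whether DeepSWE's evaluation computes it exactly this way is an assumption, and the sample counts below are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one task from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 sampled attempts, 4 of them pass the tests.
print(pass_at_k(n=10, c=4, k=1))
print(pass_at_k(n=10, c=4, k=5))
```

With k=1 the estimate reduces to c/n, the fraction of single attempts that succeed. A benchmark score averages this value over all tasks.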