Junxian He
289 posts
Junxian He
@junxian_he
Assist. Prof @hkust. NLP/ML PhD @LTIatCMU.
Hong Kong
jxhe.github.io
Joined September 2017
707 Following
6,335 Followers
  • Junxian He
    @junxian_he
    Jan 25, 2025
    We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong. 🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the
    923K
  • Junxian He
    @junxian_he
    Feb 15, 2025
    Thank you for sharing our work! My first time co-authoring with the great DeepSeek 🐳
    Nathan Lambert
    @natolambert
    Feb 15, 2025
    Glad to see DeepSeek team members writing more papers than just their fun tech reports :) (maybe I just missed them in the past)
    101K
  • Junxian He
    @junxian_he
    Mar 25, 2025
    Two months ago, we open-sourced the first R1-like zero RL training project on math with the Qwen2.5-Math model. Since then, many great works have performed successful zero RL training, mostly based on Qwen2.5 models. 🚀Now, we introduce SimpleRL-Zoo, a deep investigation of zero RL
    77K
  • Junxian He
    @junxian_he
    Mar 6, 2025
    When we were working on the "compression represents intelligence linearly" project (arxiv.org/abs/2404.09937) months ago, we found that models' compression efficiency on some documents is more reflective of their downstream abilities than on other text. Since then, we have been
    KaShun SHUM
    @ksshumab_
    Mar 6, 2025
    🚀Excited to introduce Predictive Data Selection (PreSelect): The Data That Predicts Is the Data That Teaches🚀 We find that data on which model losses are predictive of downstream abilities also contribute effectively to learning. Then we further propose predictive data
    35K
  • Junxian He
    @junxian_he
    Jan 25, 2025
    Replying to @junxian_he
    Around step 44, the "aha moment" emerges
    61K
  • Junxian He
    @junxian_he
    Jan 25, 2025
    Although the blog already indicates this, I guess I need to clarify that we did not spend just five days replicating DS-R1 training. We have been running these experiments with the simple RL recipe for two months, and most results were already there before DS-R1's release. We
    Junxian He
    @junxian_he
    Jan 25, 2025
    We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong. 🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the
    23K
  • Junxian He
    @junxian_he
    Sep 3, 2025
    Mirage or method? We re-assess a series of RL observations such as spurious reward, one-shot RL, test-time RL, and negative-sample training. 🧐These approaches were all originally demonstrated on the Qwen+Math combination, but do they work in other settings? If not, under which
    19K
  • Junxian He
    @junxian_he
    Oct 30, 2025
    🚀We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use. ⭐️32 applications and 600+ tools based on real-world software environments ⭐️Execution-based, reliable evaluation ⭐️Realistic, covering
    44K
  • Junxian He
    @junxian_he
    Jan 25, 2025
    Replying to @junxian_he
    Main results below. Qwen2.5-7B-SimpleRL-Zero is pure PPO on the Qwen2.5-Math-7B base model, using only 8K examples from the MATH dataset. Qwen2.5-7B-SimpleRL does Long CoT SFT first as a cold start and then RL. Across both approaches, we only utilize the same 8K MATH examples.
    31K
  • Junxian He
    @junxian_he
    Feb 12, 2020
    Our #ICLR2020 spotlight paper presents a deep generative model for unsupervised text style transfer. It hypothesizes a parallel latent (unobserved) sequence that generates each observed sequence, and it achieves good results on several benchmarks: arxiv.org/abs/2002.03912
  • Junxian He
    @junxian_he
    Feb 12, 2025
    I am always wondering what kinda reasoning data can transfer and generalize to improve diverse reasoning tasks beyond math and coding. What is the more generalizable way to express the reasoning patterns in the pretraining data? 🚀We find that learning to predict code execution
    Junlong Li
    @lockonlvange
    Feb 12, 2025
    Introducing CodeI/O (codei-o.github.io), a systematic way to condense diverse reasoning patterns via code input-output prediction to build massive training data for more reasoning tasks beyond commonly focused math problem-solving and code generation, which usually suffer
    17K
  • Junxian He
    @junxian_he
    Jun 2, 2025
    We studied both rule-based and model-based verifiers and found that each has unique limitations. Rule-based verifiers are often unreliable, even in math, and are unavailable in many domains. Model-based verifiers can be easily hacked. In our paper, we construct simple
    Yuzhen Huang
    @yuzhenh17
    May 29, 2025
    🔍 Are Verifiers Trustworthy in RLVR? Our paper, Pitfalls of Rule- and Model-based Verifiers, exposes the critical flaws in reinforcement learning verification for mathematical reasoning. 🔑 Key findings: 1️⃣ Rule-based verifiers miss correct answers, especially when presented in
    16K
  • Junxian He
    @junxian_he
    May 29, 2025
    I really like this paper. I'd like to echo the point that RL-related conclusions should be drawn cautiously when using only Qwen models solely on math tasks. Our SimpleRL-Zoo paper is one of the few that actually conducts RLVR across diverse models: arxiv.org/abs/2503.18892
    Stella Li
    @StellaLisy
    May 27, 2025
    🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: + 28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
    10K
  • Junxian He
    @junxian_he
    Jan 25, 2025
    Replying to @junxian_he
    Evaluation accuracy and length dynamics on 8 benchmarks. We found that the Qwen2.5-Math base model generates long responses because it produces a lot of code mixed with text to solve the problem. Throughout RL training, this pattern is gradually removed, reflected by a length
    22K
