289 posts

Junxian He

@junxian_he

Assist. Prof @hkust. NLP/ML PhD @LTIatCMU.

Hong Kong

jxhe.github.io

Joined September 2017

Junxian He
@junxian_he
Jan 25, 2025
We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong. 🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the
923K
Junxian He
@junxian_he
Feb 15, 2025
Thank you for sharing our work! My first time co-authoring with the great DeepSeek 🐳
Nathan Lambert
@natolambert
Feb 15, 2025
Glad to see DeepSeek team members writing more papers than just their fun tech reports :) (maybe I just missed them in the past)
101K
Junxian He
@junxian_he
Mar 25, 2025
Two months ago, we open-sourced the first R1-like zero RL training project on math with the Qwen2.5-Math model. Since then, many great works have performed successful zero RL training, mostly based on Qwen2.5 models. 🚀Now, we introduce SimpleRL-Zoo, a deep investigation of zero RL
77K
Junxian He
@junxian_he
Mar 6, 2025
When we were working on the "compression represents intelligence linearly" project (arxiv.org/abs/2404.09937) months ago, we found that models' compression efficiency on some documents is more reflective of their downstream abilities than on other text. Since then, we have been
KaShun SHUM
@ksshumab_
Mar 6, 2025
🚀Excited to introduce Predictive Data Selection (PreSelect): The Data That Predicts Is the Data That Teaches🚀 We find that data on which model losses are predictive of downstream abilities also contribute effectively to learning. Then we further propose predictive data
35K
Junxian He
@junxian_he
Jan 25, 2025
Replying to @junxian_he
Around step 44, the "aha moment" emerges
61K
Junxian He
@junxian_he
Jan 25, 2025
While this is already indicated in the blog, I guess I need to clarify that we did not spend just five days replicating DS-R1 training. We have been running these experiments with the simple RL recipe for two months, and most results were already there before DS-R1's release. We
Junxian He
@junxian_he
Jan 25, 2025
We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong. 🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the
23K
Junxian He
@junxian_he
Sep 3, 2025
Mirage or method? We re-assess a series of RL observations such as spurious reward, one-shot RL, test-time RL, and negative-sample training. 🧐These approaches were all originally demonstrated on the Qwen+Math combination, but do they work in other settings? If not, under which
19K
Junxian He
@junxian_he
Oct 30, 2025
🚀We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use. ⭐️32 applications and 600+ tools based on real-world software environments ⭐️Execution-based, reliable evaluation ⭐️Realistic, covering
44K
Junxian He
@junxian_he
Jan 25, 2025
Replying to @junxian_he
Main results below. Qwen2.5-7B-SimpleRL-Zero is pure PPO on the Qwen2.5-Math-7B base model, using only 8K examples from the MATH dataset. Qwen2.5-7B-SimpleRL does Long CoT SFT first as a cold start and then RL. Across both approaches, we only utilize the same 8K MATH examples.
31K
Junxian He
@junxian_he
Feb 12, 2020
Our #ICLR2020 spotlight paper presents a deep generative model for unsupervised text style transfer. It hypothesizes a parallel latent (unobserved) sequence that generates each observed sequence, and achieves good results on several benchmarks: arxiv.org/abs/2002.03912
Junxian He
@junxian_he
Feb 12, 2025
I always wonder what kind of reasoning data can transfer and generalize to improve diverse reasoning tasks beyond math and coding. What is a more generalizable way to express the reasoning patterns in the pretraining data? 🚀We find that learning to predict code execution
Junlong Li
@lockonlvange
Feb 12, 2025
Introducing CodeI/O (codei-o.github.io), a systematic way to condense diverse reasoning patterns via code input-output prediction to build massive training data for more reasoning tasks beyond commonly focused math problem-solving and code generation, which usually suffer
17K
Junxian He
@junxian_he
Jun 2, 2025
We studied both rule-based and model-based verifiers and found that each has unique limitations. Rule-based verifiers are often unreliable, even in math, and are unavailable in many domains. Model-based verifiers can be easily hacked. In our paper, we construct simple
Yuzhen Huang
@yuzhenh17
May 29, 2025
🔍 Are Verifiers Trustworthy in RLVR? Our paper, Pitfalls of Rule- and Model-based Verifiers, exposes the critical flaws in reinforcement learning verification for mathematical reasoning. 🔑 Key findings: 1️⃣ Rule-based verifiers miss correct answers, especially when presented in
16K
Junxian He
@junxian_he
May 29, 2025
I really like this paper. I'd like to echo the point that RL-related conclusions should be drawn cautiously when using only Qwen models solely on math tasks. Our SimpleRL-Zoo paper is one of the few that actually conducts RLVR across diverse models: arxiv.org/abs/2503.18892
Stella Li
@StellaLisy
May 27, 2025
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: + 28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
10K
Junxian He
@junxian_he
Jan 25, 2025
Replying to @junxian_he
Evaluation accuracy and length dynamics on 8 benchmarks. We found that the Qwen2.5-Math base model generates long responses because it produces a lot of code mixed with text to solve the problem. Throughout RL training, this pattern is gradually removed, as reflected by a length
22K