We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong.
🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the
Junxian He
- Thank you for sharing our work! My first time co-authoring with the great DeepSeek 🐳
  Quoted post: Glad to see DeepSeek team members writing more papers than just their fun tech reports :) (maybe I just missed them in the past)
- Two months ago, we open-sourced the first R1-like zero RL training project on math with the Qwen2.5-Math model. Since then, many great works have performed successful zero RL training, mostly based on Qwen2.5 models. 🚀 Now, we introduce SimpleRL-Zoo, a deep investigation of zero RL
- When we were working on the "compression represents intelligence linearly" project (arxiv.org/abs/2404.09937) months ago, we found that models' compression efficiency on some documents is more reflective of their downstream abilities than on other text. Since then, we have been
  Quoted post: 🚀 Excited to introduce Predictive Data Selection (PreSelect): The Data That Predicts Is the Data That Teaches 🚀 We find that data on which model losses are predictive of downstream abilities also contribute effectively to learning. Then we further propose predictive data
- While this is already indicated in the blog, I guess I need to clarify that we did not just spend five days replicating DS-R1 training. We have been running these experiments with the simple RL recipe for two months, and most results were already there before DS-R1's release. We
  Quoted post: We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong. 🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the
- Mirage or method? We re-assess a series of RL observations such as spurious reward, one-shot RL, test-time RL, and negative-sample training. 🧐 These approaches were all originally demonstrated on the Qwen+Math combination, but do they work in other settings? If not, under which
- 🚀We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use. ⭐️32 applications and 600+ tools based on real-world software environments ⭐️Execution-based, reliable evaluation ⭐️Realistic, covering
- Replying to @junxian_he: Main results below. Qwen2.5-7B-SimpleRL-Zero is pure PPO on the Qwen2.5-Math-7B base model, using only 8K examples from the MATH dataset. Qwen2.5-7B-SimpleRL does Long CoT SFT first as a cold start and then RL. Across both approaches, we only utilize the same 8K MATH examples.
- Our #ICLR2020 spotlight paper presents a deep generative model for unsupervised text style transfer, by hypothesizing a parallel latent (unobserved) sequence to generate each observed sequence, good results achieved on several benchmarks: arxiv.org/abs/2002.03912
- I am always wondering what kind of reasoning data can transfer and generalize to improve diverse reasoning tasks beyond math and coding. What is the more generalizable way to express the reasoning patterns in the pretraining data? 🚀 We find that learning to predict code execution
  Quoted post: Introducing CodeI/O (codei-o.github.io), a systematic way to condense diverse reasoning patterns via code input-output prediction to build massive training data for more reasoning tasks beyond commonly focused math problem-solving and code generation, which usually suffer
- We studied both rule-based and model-based verifiers and found that each has unique limitations. Rule-based verifiers are often unreliable, even in math, and are unavailable in many domains. Model-based verifiers can be easily hacked. In our paper, we construct simple
  Quoted post: 🔍 Are Verifiers Trustworthy in RLVR? Our paper, Pitfalls of Rule- and Model-based Verifiers, exposes the critical flaws in reinforcement learning verification for mathematical reasoning. 🔑 Key findings: 1️⃣ Rule-based verifiers miss correct answers, especially when presented in
- I really like this paper. I'd like to echo the point that RL-related conclusions should be drawn cautiously when they rest only on Qwen models and math tasks. Our SimpleRL-Zoo paper is one of the few that actually conducts RLVR across diverse models: arxiv.org/abs/2503.18892
  Quoted post: 🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: +28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
- Replying to @junxian_he: Evaluation accuracy and length dynamics on 8 benchmarks. We found that the Qwen2.5-Math base model generates long responses since it produces a lot of code mixed with text to solve the problem. Throughout RL training, this pattern is gradually removed, reflected by a length
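
Several posts above mention RL with "no reward model, just 8K MATH examples for verification" and note that rule-based verifiers can miss correct answers. The posts do not show the reward function itself, so the following is only a minimal sketch of what a rule-based verifier reward might look like. The function names (`extract_boxed`, `normalize`, `rule_based_reward`), the reliance on a final `\boxed{...}` answer, and the exact-string comparison are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    # Hypothetical helper: return the content of the last \boxed{...},
    # tracking brace depth so nested braces such as \frac{1}{2} survive.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def normalize(answer: str) -> str:
    # Assumed normalization: trim whitespace, surrounding "$",
    # a trailing period, and internal spaces. Nothing smarter.
    return answer.strip().strip("$").rstrip(".").replace(" ", "")


def rule_based_reward(response: str, reference: str) -> float:
    # Reward 1.0 only when the boxed answer string-matches the reference.
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

With this sketch, `rule_based_reward(r"so x = \boxed{\frac{1}{2}}", r"\frac{1}{2}")` gives a reward of 1.0, while an equivalent answer written as `\boxed{0.5}` gets 0.0 against the same reference. That is one concrete way an exact-match rule can reject a correct answer.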