Rulin Shao (@RulinShao) / X

Rulin Shao

479 posts

@RulinShao

PhD @UWNLP, visiting researcher @Meta.

Joined April 2022

Pinned
Rulin Shao
@RulinShao
May 25
DR Tulu has been accepted for an oral presentation at #ICML2026 🙏 Updated paper: arxiv.org/abs/2511.19399 📥 We added more ablations, including using Qwen3-8B as the rubric generator and judge (showing that evolving rubrics work with a weak model too), a spurious-rewards sanity check, etc.
Rulin Shao
@RulinShao
May 1
Happy to share that DR Tulu has been accepted to ICML as a ✨Spotlight✨! We believe that co-evolving the agent and its reward metric can lead to more capable intelligence. DR Tulu is a team effort. Huge thanks and congrats to all my amazing collaborators and mentors!
arxiv.org
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via...
27K
Rulin Shao
@RulinShao
Jul 17, 2024
🔥We release the first open-source 1.4T-token RAG datastore and present a scaling study for RAG on perplexity and downstream tasks! We show LM+RAG scales better than LM alone, with better performance for the same training compute (pretraining+indexing) retrievalscaling.github.io 🧵
118K
Rulin Shao
@RulinShao
Oct 10, 2023
Introducing LightSeq for long-context LLM training:
- Highly optimized for decoder models
- Smarter checkpointing
- Better support for models with fewer heads
Up to 2x faster, 2-8x longer sequences vs Megatron-LM. arxiv.org/abs/2310.03294
125K
Rulin Shao
@RulinShao
May 1, 2025
Meet ReasonIR-8B✨the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA🔥
64K
Rulin Shao
@RulinShao
Jun 13, 2025
🎉Our Spurious Rewards paper is available on arXiv! We added experiments on:
- More prompts/steps/models/analysis...
- Spurious Prompts! Surprisingly, we obtained 19.4% gains when replacing prompts with LaTeX placeholder text (\lipsum) 😶‍🌫️
Check out our 2nd blog: tinyurl.com/spurious-prompt
Stella Li ✈️ ICML🇰🇷
@StellaLisy
May 27, 2025
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
29K
Rulin Shao
@RulinShao
Sep 26, 2024
Happy to share that our work on RAG scaling has been accepted by @NeurIPSConf 🥳 Some new thoughts on this work: (1) Retrieving from a web-scale datastore is another way to do test-time scaling. It doesn't add much to the training cost, leading to better compute-optimal scaling curves. 🔎🧵
Rulin Shao
@RulinShao
Jul 17, 2024
🔥We release the first open-source 1.4T-token RAG datastore and present a scaling study for RAG on perplexity and downstream tasks! We show LM+RAG scales better than LM alone, with better performance for the same training compute (pretraining+indexing) retrievalscaling.github.io 🧵
36K
Rulin Shao
@RulinShao
Feb 21, 2025
New features added to MassiveDS-pipe to make it painless to build and serve a trillion-token datastore:
1. Distributed API serving (<30ms latency);
2. Efficient indices: IVF-Flat, IVF-PQ;
3. Memory-free fast passage loading.
It has been adopted by AI2 OpenScholar and Meta EWE 🥳
25K
Rulin Shao
@RulinShao
Jul 8, 2025
Happy to share that ReasonIR is accepted by @COLM_conf! Synthetic data & test-time scaling are powerful tools to enable new capabilities for challenging tasks. I’m impressed by how quickly smaller retrievers and better rerankers have been developed with ReasonIR data! #COLM2025
Rulin Shao
@RulinShao
May 1, 2025
Meet ReasonIR-8B✨the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA🔥
12K
Rulin Shao
@RulinShao
May 27, 2025
One more fun thing! RLVR can elicit existing behaviors like code reasoning. But what if your model is not good at code but thinks it is?
- RLVR w/ spurious rewards lets Olmo use more code, but perf decreases (Fig 6)
- When we discourage it from using code, perf goes up!🤣 (Fig 9)
Stella Li ✈️ ICML🇰🇷
@StellaLisy
May 27, 2025
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
14K
Rulin Shao
@RulinShao
Dec 9, 2024
I'll be presenting MassiveDS and DCLM at #NeurIPS2024! Drop by or DM me to catch up! Happy to chat about anything--RAG, reasoning, synthetic data, model architecture design, etc.! MassiveDS: Wed 11-2 pm, #7203 (calendar: tinyurl.com/massiveds) DCLM: Fri 4:30-7:30pm, #5109
13K
Rulin Shao
@RulinShao
May 18, 2025
Accepted by #ACL2025! Congrats @mingdachen and the team🥳 Several cool ideas: - Maintain an explicit editable working memory during generation; - Actively integrate external feedback (factual check w/ VeriScore); A smart LM learns to memorize, a smarter LM learns to forget too!
Aran Komatsuzaki
@arankomatsuzaki
Dec 25, 2024
Meta presents Improving Factuality with Explicit Working Memory Presents EWE, a novel approach that enhances factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources EWE outperforms strong baselines on four
11K
Rulin Shao
@RulinShao
Oct 8, 2025
#COLM2025 Please drop by our ReasonIR poster at Poster3 #967 (11:00am - 1:00pm Wed) by @varsha_kishore_ 🥰 Happy to answer questions or chat online--feel free to DM! I've been exploring deep research training lately to empower reasoning+search for complex tasks💪 Stay tuned!
Rulin Shao
@RulinShao
May 1, 2025
Meet ReasonIR-8B✨the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA🔥
10K
Rulin Shao
@RulinShao
Jul 10, 2024
Happy to share LightSeq is accepted by @COLM_conf 🥳 LightSeq supports efficient long-context Transformer training, where the supported context length grows with the number of nodes. We are excited about the innovative applications it will enable, such as long-context LLM/VLM! 🚀
Rulin Shao
@RulinShao
Oct 10, 2023
Introducing LightSeq for long-context LLM training:
- Highly optimized for decoder models
- Smarter checkpointing
- Better support for models with fewer heads
Up to 2x faster, 2-8x longer sequences vs Megatron-LM. arxiv.org/abs/2310.03294
17K
Rulin Shao
@RulinShao
Aug 8, 2025
Factuality and logical reasoning (e.g., math, code) favor different sets of reasoning patterns. 🧑‍🍳 A fresh RL recipe to improve factuality is here — crafted by the amazing @ccsasuke!
Jason Weston
@jaseweston
Aug 8, 2025
...is today a good day for new paper posts? 🤖Learning to Reason for Factuality 🤖 📝: arxiv.org/abs/2508.05618 - New reward func for GRPO training of long CoTs for *factuality* - Design stops reward hacking by favoring precision, detail AND quality - Improves base model across
7.7K