We release the strongest public 1.3B and 3B models so far – the Sheared-LLaMA series.
Structured pruning from a large model to a small one is far more cost-effective (only 3%!) than pre-training them from scratch!
Check out our paper and models at: xiamengzhou.github.io/sheared-llama/
[1/n]
Mengzhou Xia
- Lots of instruction tuning data out there...but how to best adapt LLMs for specific queries? Don’t use ALL of the data, use LESS! 5% beats the full dataset. Can even use one small model to select data for others! Paper: arxiv.org/abs/2402.04333 Code: github.com/princeton-nlp/… [1/n]
- How do language models of different sizes learn during the course of pre-training? We study the training trajectories using checkpoints of language models from 125M to 175B parameters for a better understanding! Check out our new paper 📜: arxiv.org/abs/2212.09803 (1/N)
- I am honored to receive the Apple Scholars in AIML fellowship! Very grateful to my advisor, mentors and collaborators along the way :) Excited to keep exploring the Pareto-frontier of capabilities and efficiency of foundation models!
  Quoted post: Congrats to @xiamengzhou on receiving an Apple Scholars in AIML fellowship! 🎉🍏 The fellowship recognizes graduate students doing innovative and cutting-edge research in machine learning. Xia is part of @princeton_nlp, advised by @danqi_chen. bit.ly/3wZz78q
- I am excited to attend #NeurIPS2024 🤩! I’ll be presenting SimPO and CharXiv, and would love to catch up and chat about: - RLHF, reasoning, high-quality data synthesis and generally about AI! - And.. also about the academic and industry job markets!
- We train and evaluate extensively with various offline preference optimization algorithms, including DPO, KTO, ORPO, RDPO, and more. Hyperparameter tuning significantly impacts algorithm effectiveness. DPO performs consistently well, but SimPO is better!
  Quoted post: Introducing SimPO: Simpler & more effective Preference Optimization!🎉 Significantly outperforms DPO w/o a reference model!📈 Llama-3-8B-SimPO ranked among top on leaderboards!💪 ✅44.7% LC win rate on AlpacaEval 2 ✅33.8% win rate on Arena-Hard arxiv.org/abs/2405.14734 🧵[1/n]
- 🌟 Exciting update! Gemma2-9b + SimPO ranks at the top of AlpacaEval 2 (❗LC 72.4) and leads the WildBench leaderboard among similar-sized models 🚀 SimPO is at least as competitive as (and often outperforms) DPO across all benchmarks, despite its simplicity. ✨ Recipe: on-policy
- Surprisingly, we find training only with incorrect traces leads to strong performance 🤯 Even more interesting: it improves model diversity and test-time scaling—while correct traces do the opposite. Check out the 🧵👇🔥
  Quoted post: The debate’s been wild: How does the reward in RLVR actually improve LLM reasoning?🤔 🚀Introducing our new paper👇 💡TL;DR: Just penalizing incorrect rollouts❌ — no positive reward needed — can boost LLM reasoning, and sometimes better than PPO/GRPO! 🧵[1/n]
- 🌟We release the code for training Sheared-LLaMA here at github.com/princeton-nlp/…. We're excited to see even stronger sheared models emerging in the future! 🤩 For more details, check out our preprint at arxiv.org/abs/2310.06694.
- Check out our #acl2022 paper on CoFi☕️! Structured pruning is competitive compared to knowledge distillation but requires much less training time and zero unlabeled data. Joint work w/ @ZexuanZhong, @danqi_chen Paper: arxiv.org/pdf/2204.00408… Code: github.com/princeton-nlp/… (1/5)
- Check out our preprint on Prompting ELECTRA! We show that discriminative models like ELECTRA outperform generative MLMs like BERT and RoBERTa on zero-shot and few-shot prompting. Joint work w/ @artetxem, @JefferyDuu, @danqi_chen, @vesko_st Paper: arxiv.org/pdf/2205.15223…
- I'm pleased and honored to receive the fellowship and thanks to @TechAtBloomberg for supporting my research 😀
  Quoted post: Congratulations to @PrincetonCS + @princeton_nlp's @xiamengzhou on being named one of the 2022-2023 @Bloomberg #DataScience Ph.D. Fellows! Learn more about her research focus and the other Fellows in our newest cohort: bloom.bg/3BROsru #AI #ML #NLProc
- Our LLM trajectory paper got accepted to #ACL2023 😊! Code and results are at github.com/xiamengzhou/tr… Looking forward to future work to analyze trajectories not only in pre-training but also in the more accessible yet mysterious process of instruction tuning with human feedback.
- This is my first time attending #NeurIPS 🥳 I’d love to chat about efficient approaches for LLMs, learning dynamics/trajectories and more! DM me to grab a coffee together :)