We are excited to share our recent results on the dynamics of model training from a mechanistic interpretability perspective. It appears that Grokking, Double Descent, and Emergent Ability might share the same underlying dynamics! #LLM
arxiv.org/abs/2402.15175
Shengding Hu
41 posts
PhD @ Tsinghua University @deepseek_ai
- Llama 3 8B is trained on 15T tokens! 😱 This is in accordance with the recent scaling law in our #MiniCPM paper (arxiv.org/abs/2404.06395): the compute-optimal data size should be 200 times larger than the model size 🤩 Chinchilla Optimal is dead! lol #llama3 #scaling #minicpm
- Love COLM, and it is my great honor to present here! Quoted post: Really enjoyed this presentation at COLM. Such a dense set of experiments. arxiv.org/abs/2404.06395
- Thanks for discovering our paper! It seems that there is a trend! I just planned to write a blog post to connect these highly similar papers, but I've been too busy recently. Autoregressive conditional block attention is all we need for unified modalities 🤣 Quoted post: Okay, interesting. I saw a very similar idea in Vision now. They do experiments on images and videos. Even the title is almost the same. 😅😅😅 arxiv.org/abs/2412.07720
- Thrilled to discover that both Huggingface and Megatron have incorporated the WSD scheduler! Everyone is welcome to try it out! Finally, I've made a small contribution to LLMs.
- arxiv.org/pdf/2405.18392 A cool paper that studies WSD and compares it against various schedulers, including SFO (Schedule Free Optimizer), and other techniques such as Stochastic Weight Averaging (SWA). Thanks @eliebakouch for the detailed study, which greatly updated my knowledge.
- Thanks for introducing our work! Check out our new observations on Mamba! Great work with @DonnyChan123. tbh, the name is one of the best that I came up with among all my papers 🤣 Quoted post: this is a great paper (with a great name) - clever exps on the state capacity and long context ability of SSMs arxiv.org/abs/2410.07145… strikingly, for every state size M there's a phase transition at some training context len >= K where SSMs will length-generalize robustly this
- Replying to @DeanHu11: But tbh, I think even block language diffusion still has a long way to go in terms of both performance and batch-serving efficiency before it can match autoregressive models.
- [1/3] How do Google's Gemma 7B & 2B models stack up against our MiniCPM-2B? 🚀 Quick comparison: 👇 1. MiniCPM-2B leads over Gemma-2B in English tasks (AvgScore: 56.6 vs. 46.4). It even competes with Gemma-7B in several tasks. Chinese scores also surpass Gemma-7B & 2B noticeably.
- Another good point from the Llama3 blog is the following sentence: "The model knows how to produce the right answer, but it does not know how to select it." This sentence makes me think of the claim in our ICLR 2024 paper (See you soon in Vienna 😉) arxiv.org/abs/2310.03262
- Excellent work on understanding pretraining loss! Quoted post: 🔍 How does pretraining loss evolve under different LR schedules? 🌟 Meet our Multi-Power Law: predicts the full loss curve for various schedules! 🌟 Accurate enough to optimize LR schedules directly. 🌟 Result? A WSD-like schedule that outperforms the rest! 🔥 Accepted at #ICLR2025
- Arrived in Vienna for #ICLR2024! I will be at the poster session on Tue 7 May, 4:30 p.m. to 6:30 p.m. CEST. Looking forward to discussing any topics with you!
- Very glad to hear that! Why the decay stage works well might also have some implications for the training dynamics ... Quoted post: Replying to @eliebakouch @huggingface and 4 others. TL;DR: WSD works as well as cosine schedule. It seems to become the standard for open-source models and might (already?) be used in closed-source models 👀 Also big thanks to @LoubnaBenAllal1 and @lvwerra for helping me with these experiments 🤗