Shengding Hu (@DeanHu11) / X

@DeanHu11

PhD @ Tsinghua University @deepseek_ai

Beijing, China

shengdinghu.github.io

Joined October 2018

Shengding Hu
@DeanHu11
Feb 26, 2024
We are excited to share our recent results on the dynamics of model training from a mechanistic interpretability perspective. It appears that Grokking, Double Descent, and Emergent Ability might share the same underlying dynamics! #LLM arxiv.org/abs/2402.15175
Shengding Hu
@DeanHu11
Apr 19, 2024
Llama 3 8B is trained on 15T tokens! 😱 This is in accordance with the recent scaling law in our #MiniCPM paper (arxiv.org/abs/2404.06395): the compute-optimal data size should be 200 times larger than the model size 🤩 Chinchilla Optimal is dead! lol #llama3 #scaling #minicpm
Shengding Hu
@DeanHu11
Oct 8, 2024
Love COLM, and it is my great honor to present here!
Sasha Rush
@srush_nlp
Oct 8, 2024
Really enjoyed this presentation at COLM. Such a dense set of experiments. arxiv.org/abs/2404.06395
Shengding Hu
@DeanHu11
Mar 24, 2025
Thanks for discovering our paper! It seems there is a trend! I had planned to write a blog post connecting these highly similar papers, but I've been too busy recently. Autoregressive conditional block attention is all we need for unified modalities 🤣
Tao HU
@vtaohu
Mar 16, 2025
Okay, interesting. I saw a very similar idea in Vision now. They do experiments on images and videos. Even the title is almost the same. 😅😅😅 arxiv.org/abs/2412.07720
Shengding Hu
@DeanHu11
Aug 6, 2024
Thrilled to discover that both Huggingface and Megatron have incorporated the WSD scheduler! Everyone is welcome to try it out! Finally, I've made a small contribution to LLMs.
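For readers unfamiliar with the scheduler mentioned in this post: WSD stands for Warmup-Stable-Decay, a learning-rate schedule with a warmup phase, a long constant phase, and a short final decay. The sketch below is only an illustration under stated assumptions. The function name `wsd_lr`, its parameters, and the linear shape of the decay phase are hypothetical choices made here; this is not the Hugging Face or Megatron implementation, and the decay shape used in the MiniCPM paper may differ.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at a given step under a Warmup-Stable-Decay schedule.

    Assumptions (illustrative only): linear warmup, constant stable
    phase, and a linear decay from peak_lr down to min_lr.
    """
    # Phase 1: linear warmup that reaches peak_lr on the last warmup step.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Phase 2: hold the learning rate constant at its peak.
    if step < warmup_steps + stable_steps:
        return peak_lr

    # Phase 3: decay linearly, then stay at min_lr once the decay is over.
    decay_step = step - warmup_steps - stable_steps
    if decay_step >= decay_steps:
        return min_lr
    frac = decay_step / decay_steps
    return peak_lr + (min_lr - peak_lr) * frac


if __name__ == "__main__":
    # Print the schedule at a few steps for a toy configuration.
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_lr(s, peak_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=20))
```

One practical appeal of this shape is that the stable phase does not depend on the total number of training steps, so a run can be extended and the decay applied only when a checkpoint is needed.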
Shengding Hu
@DeanHu11
May 31, 2024
arxiv.org/pdf/2405.18392 A cool paper that studies WSD and compares it against various schedulers, including SFO (Schedule-Free Optimizer), and other techniques such as Stochastic Weight Averaging (SWA). Thanks @eliebakouch for the detailed study, which greatly updates my knowledge.
Shengding Hu
@DeanHu11
Oct 31, 2024
Thanks for introducing our work! Check out our new observations on Mamba! Great work with @DonnyChan123. tbh, the name is one of the best that I have come up with among all my papers 🤣
Albert Gu
@_albertgu
Oct 31, 2024
this is a great paper (with a great name) - clever exps on the state capacity and long context ability of SSMs arxiv.org/abs/2410.07145… strikingly, for every state size M there's a phase transition at some training context len >= K where SSMs will length-generalize robustly this
Shengding Hu
@DeanHu11
Mar 24, 2025
Replying to @DeanHu11
But tbh, I think even block language diffusion still has a long way to go in terms of both performance and batch-serving efficiency before it can match autoregressive models.
Shengding Hu
@DeanHu11
Feb 23, 2024
[1/3] How do Google's Gemma 7B & 2B models stack up against our MiniCPM-2B? 🚀 Quick comparison: 👇 1. MiniCPM-2B leads over Gemma-2B in English tasks (AvgScore: 56.6 vs. 46.4). It even competes with Gemma-7B in several tasks. Chinese scores also surpass Gemma-7B & 2B noticeably.
Shengding Hu
@DeanHu11
Apr 19, 2024
Another good point from the Llama 3 blog is the following sentence: "The model knows how to produce the right answer, but it does not know how to select it." This sentence makes me think of our claim in our ICLR 2024 paper (See you soon in Vienna 😉) arxiv.org/abs/2310.03262
Shengding Hu
@DeanHu11
Mar 19, 2025
Excellent work on understanding pretraining loss!
Kairong Luo
@openhonor
Mar 18, 2025
🔍How does pretraining loss evolve under different LR schedules? 🌟Meet our Multi-Power Law: predicts the full loss curve for various schedules! 🌟Accurate enough to optimize LR schedules directly. 🌟Result? A WSD-like schedule that outperforms the rest! 🔥Accepted at #ICLR2025
Shengding Hu
@DeanHu11
May 7, 2024
Arrived in Vienna for #ICLR2024! I will be at the poster session on Tue 7 May, 4:30 p.m. to 6:30 p.m. CEST. Looking forward to discussing any topic with you!
Shengding Hu
@DeanHu11
Feb 23, 2024
Replying to @DeanHu11
[3/3] 2024 is crucial for LLM applications. Our joint vision with Google Gemma for democratizing AI aligns perfectly. More exploration ahead in hardware, architecture & algorithms! 🔥#AI #LLM #EdgeLLM #Gemma #minicpm
Shengding Hu
@DeanHu11
May 16, 2024
Very glad to hear that! Why the decay stage works well might also have some implications on the training dynamics ...
elie
@eliebakouch
May 15, 2024
Replying to @eliebakouch @huggingface and 4 others
TL;DR: WSD works as well as cosine schedule. It seems to become the standard for open-source models and might (already?) be used in closed-source models 👀 Also big thanks to @LoubnaBenAllal1 and @lvwerra for helping me with these experiments 🤗