Shengding Hu
41 posts
Shengding Hu
@DeanHu11
PhD @ Tsinghua University @deepseek_ai
Beijing, China
shengdinghu.github.io
Joined October 2018
142
Following
1,084
Followers
  • Shengding Hu
    @DeanHu11
    Feb 26, 2024
    We are excited to share our recent results on the dynamics of model training from a mechanistic interpretability perspective. It appears that Grokking, Double Descent, and Emergent Ability might share the same underlying dynamics! #LLM arxiv.org/abs/2402.15175
    18K
  • Shengding Hu
    @DeanHu11
    Apr 19, 2024
    Llama 3 8B is trained on 15T tokens! 😱 This is in accordance with our recent scaling law in the #MiniCPM paper (arxiv.org/abs/2404.06395): the compute-optimal data size should be 200 times larger than the model size 🤩 Chinchilla Optimal is dead! lol #llama3 #scaling #minicpm
    11K
  • Shengding Hu
    @DeanHu11
    Oct 8, 2024
    Love COLM, and it is my great honor to present here!
    Sasha Rush
    @srush_nlp
    Oct 8, 2024
    Really enjoyed this presentation at COLM. Such a dense set of experiments. arxiv.org/abs/2404.06395
    14K
  • Shengding Hu
    @DeanHu11
    Mar 24, 2025
    Thanks for discovering our paper! It seems there is a trend! I had planned to write a blog post connecting these highly similar papers, but I have been too busy recently. Autoregressive conditional block attention is all we need for unified modalities🤣
    Tao HU
    @vtaohu
    Mar 16, 2025
    Okay, interesting. I saw a very similar idea in Vision now. They do experiments on images and videos. Even the title is almost the same. 😅😅😅 arxiv.org/abs/2412.07720
    18K
  • Shengding Hu
    @DeanHu11
    Aug 6, 2024
    Thrilled to discover that both Huggingface and Megatron have incorporated the WSD scheduler! Everyone is welcome to try it out! Finally, I've made a small contribution to LLMs.
    5.7K
  • Shengding Hu
    @DeanHu11
    May 31, 2024
    arxiv.org/pdf/2405.18392 A cool paper that studies WSD and compares it against various schedulers, including SFO (Schedule-Free Optimizer), and other techniques such as Stochastic Weight Averaging (SWA). Thanks @eliebakouch for the detailed study, which greatly updated my knowledge.
    4K
  • Shengding Hu
    @DeanHu11
    Oct 31, 2024
    Thanks for introducing our work! Check out our new observations on Mamba! Great work with @DonnyChan123. tbh, the name is one of the best that I have come up with among all my papers🤣
    Albert Gu
    Cartesia
    @_albertgu
    Oct 31, 2024
    this is a great paper (with a great name) - clever exps on the state capacity and long context ability of SSMs arxiv.org/abs/2410.07145… strikingly, for every state size M there's a phase transition at some training context len >= K where SSMs will length-generalize robustly this
    4.9K
  • Shengding Hu
    @DeanHu11
    Mar 24, 2025
    Replying to @DeanHu11
    But tbh, I think even block language diffusion still has a long way to go in terms of both performance and batch-serving efficiency before it can match autoregressive models.
    2.8K
  • Shengding Hu
    @DeanHu11
    Feb 23, 2024
    [1/3] How do Google's Gemma 7B & 2B models stack up against our MiniCPM-2B? 🚀 Quick comparison: 👇 1. MiniCPM-2B leads over Gemma-2B in English tasks (AvgScore: 56.6 vs. 46.4). It even competes with Gemma-7B in several tasks. Its Chinese scores also noticeably surpass those of Gemma-7B & 2B.
    3.7K
  • Shengding Hu
    @DeanHu11
    Apr 19, 2024
    Another good point from the Llama 3 blog is the following sentence: "The model knows how to produce the right answer, but it does not know how to select it." This sentence reminds me of the claim in our ICLR 2024 paper (see you soon in Vienna😉): arxiv.org/abs/2310.03262
    1.4K
  • Shengding Hu
    @DeanHu11
    Mar 19, 2025
    Excellent work on understanding pretraining loss!
    Kairong Luo
    @openhonor
    Mar 18, 2025
    🔍How does pretraining loss evolve under different LR schedules? 🌟Meet our Multi-Power Law: predicts the full loss curve for various schedules! 🌟Accurate enough to optimize LR schedules directly. 🌟Result? A WSD-like schedule that outperforms the rest! 🔥Accepted at #ICLR2025
    4.1K
  • Shengding Hu
    @DeanHu11
    May 7, 2024
    Arrived in Vienna for #ICLR2024! I will be at the poster session on Tue 7 May, 4:30 p.m. to 6:30 p.m. CEST. Looking forward to discussing any topic with you!
    909
  • Shengding Hu
    @DeanHu11
    Feb 23, 2024
    Replying to @DeanHu11
    [3/3] 2024 is crucial for LLM applications. Our joint vision with Google Gemma for democratizing AI aligns perfectly. More exploration ahead in hardware, architecture & algorithms! 🔥#AI #LLM #EdgeLLM #Gemma #minicpm
    705
  • Shengding Hu
    @DeanHu11
    May 16, 2024
    Very glad to hear that! Why the decay stage works well might also have some implications on the training dynamics ...
    elie
    Prime Intellect
    @eliebakouch
    May 15, 2024
    Replying to @eliebakouch @huggingface and 4 others
    TL;DR: WSD works as well as cosine schedule. It seems to become the standard for open-source models and might (already?) be used in closed-source models 👀 Also big thanks to @LoubnaBenAllal1 and @lvwerra for helping me with these experiments 🤗
    676
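
Several posts above refer to the WSD (Warmup-Stable-Decay) learning-rate scheduler. For readers who have not seen it, here is a minimal sketch of the general idea: the rate ramps up, stays flat at its peak, then decays over a final short stage. The function name `wsd_lr`, its parameters, and the linear decay shape are all assumptions made for illustration; they are not taken from the posts, and the implementations in Hugging Face or Megatron may differ in signature and decay shape.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a warmup-stable-decay schedule.

    Hypothetical helper for illustration only. The decay is linear here,
    which is an assumption; other decay shapes are possible.
    """
    # Stage 1: linear warmup from peak_lr / warmup_steps up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Stage 2: hold the learning rate constant at its peak.
    if step < warmup_steps + stable_steps:
        return peak_lr

    # Stage 3: decay from peak_lr to min_lr, then stay at min_lr.
    if decay_steps <= 0:
        return min_lr
    steps_into_decay = step - warmup_steps - stable_steps
    progress = min(steps_into_decay, decay_steps) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    # Print a few sample points: 10 warmup, 100 stable, 20 decay steps.
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_lr(s, peak_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=20))
```

Because the stable stage does not depend on the total step count, a run under this kind of schedule can in principle be extended by lengthening the stable stage and decaying later, which is one commonly cited reason for preferring it over a cosine schedule with a fixed horizon.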
