Yikang Shen
237 posts
Yikang Shen
@Yikang_Shen
CEO at Learning Machine. ex MTS @xAI. ex Staff RS @IBM. PhD @Mila. Ordered Neurons, Mixture of Attention Heads, JetMoE, stick-breaking attention, DeltaNet.
Palo Alto, CA
Joined September 2012
472 Following · 2,434 Followers
  • Yikang Shen
    @Yikang_Shen
    Sep 20, 2024
    🚨Job alert🚨 1. IBM Foundation Model team is hiring research engineers in India and North Carolina. 2. We are also looking for 2025 summer research interns in Boston. We train large language models and do fundamental research on directions related to LLMs. Please email or DM
    94K
  • Yikang Shen
    @Yikang_Shen
    Dec 12, 2023
    Impressed by the performance of Mamba and believe in RNN? We provide a simple alternative solution! Excited to share Gated Linear Attention (GLA-Transformer). (1/n) arxiv.org/abs/2312.06635
    125K
  • Yikang Shen
    @Yikang_Shen
    Sep 13, 2023
    Announcing the open-source release of MoLM, a collection of MoE-based language models ranging in scale from 4 billion to 8 billion parameters, based on our ModuleFormer architecture. (1/n)
    GitHub - IBM/ModuleFormer: ModuleFormer is a MoE-based architecture that includes two different...
    From github.com
    91K
  • Yikang Shen
    @Yikang_Shen
    Oct 13, 2023
    Our new Sparse Universal Transformer is both parameter-efficient and computation-efficient compared to the Transformer, and it's better at compositional generalization! paper: arxiv.org/abs/2310.07096
    82K
  • Yikang Shen
    @Yikang_Shen
    Oct 27, 2023
    Our team in the MIT-IBM Watson Lab seeks highly motivated summer research interns to work on exciting projects about Foundation Models: pre-training, finetuning, alignment, MoE, agent, etc. (1/4)
    53K
  • Yikang Shen
    @Yikang_Shen
    Oct 24, 2024
    Mixture of Attention Heads (MoA) arxiv.org/abs/2210.05144
    Mod-Squad arxiv.org/abs/2212.08066
    ModuleFormer arxiv.org/abs/2306.04640
    Sparse Universal Transformer (SUT) arxiv.org/abs/2310.07096
    ScatterMoE arxiv.org/abs/2403.08245
    DS-MoE arxiv.org/abs/2404.05567
    JetMoE
    15K
  • Yikang Shen
    @Yikang_Shen
    Jul 4, 2024
    With a few tricks, Llama-3-8B can be continually pre-trained to outperform GPT-4 on medical tasks. For more details, check our paper Efficient Continual Pre-training by Mitigating the Stability Gap (arxiv.org/abs/2406.14833)!
    22K
  • Yikang Shen
    @Yikang_Shen
    Oct 28, 2024
    Stick-Breaking Attention: out-of-the-box length extrapolation, thanks to removing the position embedding; better performance than Softmax+RoPE on almost every task; an efficient implementation similar to Flash Attention. Do we still need Softmax+RoPE for language models?
    38K
  • Yikang Shen
    @Yikang_Shen
    Jun 8, 2023
    Sparse models can do more things than you think! We propose ModuleFormer, a new MoE-based architecture that is more efficient, extendable, and easier to prune than dense language models.[1/4] Arxiv page: arxiv.org/abs/2306.04640
    32K
  • Yikang Shen
    @Yikang_Shen
    Mar 14, 2024
    It often surprises people when I explain that Sparse MoE is actually slower than dense models in practice although it requires less computation. It's caused by two reasons: 1) lack of efficient implementation for MoE model training and inference; 2) MoE models require more
    Aran Komatsuzaki
    @arankomatsuzaki
    Mar 14, 2024
    Scattered Mixture-of-Experts Implementation - Presents ScatterMoE, an implementation of Sparse Mixture-of-Experts on GPU - Enables a higher throughput and lower memory footprint repo: github.com/shawntan/scatt… abs: arxiv.org/abs/2403.08245
    26K
  • Yikang Shen
    @Yikang_Shen
    Aug 27, 2024
    Thanks for posting our work! (1/5) After running thousands of experiments with the WSD learning rate scheduler and μTransfer, we found that the optimal learning rate strongly correlates with the batch size and the number of tokens.
    AK
    @_akhaliq
    Aug 27, 2024
    IBM presents Power Scheduler A Batch Size and Token Number Agnostic Learning Rate Scheduler discuss: huggingface.co/papers/2408.13… Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation
    32K
  • Yikang Shen
    @Yikang_Shen
    Dec 17, 2023
    It’s nice to see people start working on MoE for attention mechanisms! A simple fact is that about 1/3 of transformer parameters and more than 1/3 of computation are in the attention layers, so you need something like MoE to scale it up or make it more efficient. Btw, our work,
    AK
    @_akhaliq
    Dec 14, 2023
    SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention paper page: huggingface.co/papers/2312.07… The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and
    28K
  • Yikang Shen
    @Yikang_Shen
    Oct 21, 2024
    Granite 3.0 is our latest update for the IBM foundation models. The 8B and 2B models outperform strong competitors with similar sizes. The 1B and 3B MoE use only 400M and 800M active parameters to target the on-device use cases. Our technical report provides all the details you
    9.9K
  • Yikang Shen
    @Yikang_Shen
    Apr 4, 2024
    Training LLMs can be much cheaper than your new top-spec Cybertruck! Our new JetMoE project shows that just 0.1 million USD is sufficient for training LLaMA2-level LLMs. Thanks to its more aggressive MoE architecture and 2-phase data mixture strategy, JetMoE-8B could
    42K
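
The Dec 17, 2023 post above states that about 1/3 of transformer parameters sit in the attention layers. A minimal sketch of that arithmetic follows, under assumptions that are not in the post: a standard Transformer block with four square attention projections (Q, K, V, output), a two-matrix feed-forward layer with a 4x hidden expansion, and no biases, embeddings, or normalization weights counted. The function name `block_param_split` and its parameters are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def block_param_split(d_model: int, ffn_mult: int = 4) -> dict:
    """Count weights in one Transformer block (assumed layout, see lead-in)."""
    # Attention: Q, K, V and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # Feed-forward: up-projection (d_model x ffn_mult*d_model) plus
    # down-projection (ffn_mult*d_model x d_model).
    ffn = 2 * ffn_mult * d_model * d_model
    return {
        "attention": attention,
        "ffn": ffn,
        "attention_fraction": Fraction(attention, attention + ffn),
    }


split = block_param_split(4096)
print(split["attention_fraction"])  # 1/3 under the assumed 4x FFN expansion
```

Under these assumptions the ratio does not depend on `d_model`; it shifts only if the feed-forward expansion factor or the attention projection shapes change (for example with gated feed-forward layers or grouped-query attention).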