Yikang Shen

237 posts

@Yikang_Shen

CEO at Learning Machine. ex MTS @xAI. ex Staff RS @IBM. PhD @Mila. Ordered Neurons, Mixture of Attention Heads, JetMoE, stick-breaking attention, DeltaNet.

Palo Alto, CA

Joined September 2012

Yikang Shen
@Yikang_Shen
Sep 20, 2024
🚨Job alert🚨 1. IBM Foundation Model team is hiring research engineers in India and North Carolina. 2. We are also looking for 2025 summer research interns in Boston. We train large language models and do fundamental research on directions related to LLMs. Please email or DM
94K
Yikang Shen
@Yikang_Shen
Dec 12, 2023
Impressed by the performance of Mamba and believe in RNN? We provide a simple alternative solution! Excited to share Gated Linear Attention (GLA-Transformer). (1/n) arxiv.org/abs/2312.06635
125K
Yikang Shen
@Yikang_Shen
Sep 13, 2023
Announcing the open-source release of MoLM, a collection of MoE-based language models ranging in scale from 4 billion to 8 billion parameters, based on our ModuleFormer architecture. (1/n)
GitHub - IBM/ModuleFormer: ModuleFormer is a MoE-based architecture that includes two different...
From github.com
91K
Yikang Shen
@Yikang_Shen
Oct 13, 2023
Our new Sparse Universal Transformer is both parameter-efficient and computation-efficient compared to the Transformer, and it's better at compositional generalization! paper: arxiv.org/abs/2310.07096
82K
Yikang Shen
@Yikang_Shen
Oct 27, 2023
Our team at the MIT-IBM Watson Lab seeks highly motivated summer research interns to work on exciting Foundation Model projects: pre-training, fine-tuning, alignment, MoE, agents, etc. (1/4)
53K
Yikang Shen
@Yikang_Shen
Oct 24, 2024
Mixture of Attention Heads (MoA): arxiv.org/abs/2210.05144
Mod-Squad: arxiv.org/abs/2212.08066
ModuleFormer: arxiv.org/abs/2306.04640
Sparse Universal Transformer (SUT): arxiv.org/abs/2310.07096
ScatterMoE: arxiv.org/abs/2403.08245
DS-MoE: arxiv.org/abs/2404.05567
JetMoE
15K
Yikang Shen
@Yikang_Shen
Jul 4, 2024
With a few tricks, Llama-3-8B can be continually pre-trained to outperform GPT-4 on medical tasks. For more details, check out our paper, Efficient Continual Pre-training by Mitigating the Stability Gap (arxiv.org/abs/2406.14833)!
22K
Yikang Shen
@Yikang_Shen
Oct 28, 2024
Stick-Breaking Attention: out-of-the-box length extrapolation, thanks to removing the position embedding; better performance than Softmax+RoPE on almost every task; an efficient implementation similar to Flash Attention. Do we still need Softmax+RoPE for language models?
38K
Yikang Shen
@Yikang_Shen
Jun 8, 2023
Sparse models can do more things than you think! We propose ModuleFormer, a new MoE-based architecture that is more efficient, extendable, and easier to prune than dense language models.[1/4] Arxiv page: arxiv.org/abs/2306.04640
32K
Yikang Shen
@Yikang_Shen
Mar 14, 2024
It often surprises people when I explain that sparse MoE is actually slower than dense models in practice, even though it requires less computation. There are two reasons: 1) the lack of an efficient implementation for MoE model training and inference; 2) MoE models require more
Aran Komatsuzaki
@arankomatsuzaki
Mar 14, 2024
Scattered Mixture-of-Experts Implementation - Presents ScatterMoE, an implementation of Sparse Mixture-of-Experts on GPU - Enables a higher throughput and lower memory footprint repo: github.com/shawntan/scatt… abs: arxiv.org/abs/2403.08245
26K
Yikang Shen
@Yikang_Shen
Aug 27, 2024
Thanks for posting our work! (1/5) After running thousands of experiments with the WSD learning rate scheduler and μTransfer, we found that the optimal learning rate strongly correlates with the batch size and the number of tokens.
AK
@_akhaliq
Aug 27, 2024
IBM presents Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler. discuss: huggingface.co/papers/2408.13… Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation
32K
Yikang Shen
@Yikang_Shen
Dec 17, 2023
It’s nice to see people start working on MoE for attention mechanisms! A simple fact is that about 1/3 of transformer parameters and more than 1/3 of computation are in the attention layers, so you need something like MoE to scale it up or make it more efficient. Btw, our work,
AK
@_akhaliq
Dec 14, 2023
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention paper page: huggingface.co/papers/2312.07… The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and
28K
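The one-third figure in the post above can be checked with a quick back-of-the-envelope count. The sketch below is not from the post itself: it assumes a standard dense Transformer block with four attention projections (Q, K, V, output) and a feed-forward hidden size of 4x the model width, and it ignores biases, layer norms, and embeddings. The function name `attention_param_fraction` and the `ffn_mult` parameter are hypothetical, chosen only for illustration.

```python
def attention_param_fraction(d_model: int, ffn_mult: int = 4) -> float:
    """Fraction of one Transformer block's weights held by attention.

    Assumes a standard dense block (an assumption, not the post's setup):
    biases, layer norms, and embeddings are ignored.
    """
    # Q, K, V and output projections: four d_model x d_model matrices.
    attn = 4 * d_model * d_model
    # Feed-forward: d_model -> ffn_mult * d_model -> d_model (two matrices).
    ffn = 2 * ffn_mult * d_model * d_model
    return attn / (attn + ffn)


fraction = attention_param_fraction(4096)
print(f"{fraction:.3f}")  # prints 0.333
```

Under these assumptions the ratio is 4d² / 12d² = 1/3 regardless of width. The compute share is larger than the parameter share because the attention score and value-mixing steps add work that grows with sequence length but uses no extra weights.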
Yikang Shen
@Yikang_Shen
Oct 21, 2024
Granite 3.0 is our latest update to the IBM foundation models. The 8B and 2B models outperform strong competitors of similar size. The 1B and 3B MoE models use only 400M and 800M active parameters to target on-device use cases. Our technical report provides all the details you
9.9K
Yikang Shen
@Yikang_Shen
Apr 4, 2024
Training LLMs can be much cheaper than your new top-spec Cybertruck! Our new JetMoE project shows that just 0.1 million USD is sufficient for training LLaMA2-level LLMs. Thanks to its more aggressive MoE architecture and 2-phase data mixture strategy, JetMoE-8B could
42K