Sham Kakade (@ShamKakade6) / X

538 posts

Sham Kakade

@ShamKakade6

Harvard Professor. Full stack ML and AI. Co-director of the Kempner Institute for the Study of Artificial and Natural Intelligence.

Joined December 2018

Sham Kakade
@ShamKakade6
Oct 14, 2025
1/8 Second Order Optimizers like SOAP and Muon have shown impressive performance on LLM optimization. But are we fully utilizing the potential of second order information? New work: we show that a full second order optimizer is much better than existing optimizers in terms of
Sham Kakade
@ShamKakade6
Oct 18, 2021
Super excited to join Harvard with a stellar group of new hires and looking forward to many new collabs with the terrific faculty there! Def sad to be leaving my wonderful UW and MSR colleagues and friends; rest assured, I'll keep up the collabs!
Seven join Harvard computer science faculty
From seas.harvard.edu
Sham Kakade
@ShamKakade6
Sep 18, 2024
1/n Introducing SOAP (ShampoO with Adam in the Preconditioner's eigenbasis): A deep learning optimization algorithm that applies Adam in Shampoo's eigenbasis. SOAP outperforms both AdamW and Shampoo in language model pretraining.
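A rough sketch of what "Adam in Shampoo's eigenbasis" can look like for a single weight matrix $W$ with gradient $G_t$. The notation ($L_t$, $R_t$, $Q_L$, $Q_R$, the primes for rotated quantities, and a shared decay $\beta_2$ for the Shampoo statistics) is assumed here for illustration and is not taken from the post; practical details such as how often the eigenbasis is refreshed are omitted.

```latex
\begin{align*}
L_t &= \beta_2 L_{t-1} + (1-\beta_2)\, G_t G_t^\top,
  & R_t &= \beta_2 R_{t-1} + (1-\beta_2)\, G_t^\top G_t, \\
Q_L &= \text{eigenvectors of } L_t,
  & Q_R &= \text{eigenvectors of } R_t, \\
G'_t &= Q_L^\top G_t\, Q_R,
  & M'_t &= Q_L^\top M_t\, Q_R,
  \quad M_t = \beta_1 M_{t-1} + (1-\beta_1)\, G_t, \\
V'_t &= \beta_2 V'_{t-1} + (1-\beta_2)\, G'_t \odot G'_t, \\
W_{t+1} &= W_t - \eta\, Q_L \left( \frac{M'_t}{\sqrt{V'_t} + \epsilon} \right) Q_R^\top .
\end{align*}
```

Read this way, the Shampoo statistics only supply the rotation $(Q_L, Q_R)$; the elementwise second-moment scaling is ordinary Adam applied to the rotated gradient, and the result is rotated back before updating $W$.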
Sham Kakade
@ShamKakade6
Jul 1, 2020
Thank you so much to the awards committee! Also a huge thanks to the past and current ICML chairs and organizers for all their great work for our community! 👍 It is an honor to receive this 😀😀 with such wonderful co-authors: @arkrause, Matthias, and Niranjan!
Hal Daumé III
@haldaume3
Jul 1, 2020
We are very pleased to announce that the #icml2020 Test of Time award goes to Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design by Niranjan Srinivas, Andreas Krause, Sham Kakade and Matthias Seeger icml.cc/Conferences/20…
Sham Kakade
@ShamKakade6
Dec 7, 2021
Grateful to Priscilla Chan/Mark Zuckerberg (@ChanZuckerberg Initiative) for generous gift. Kempner Natural & Artificial Intelligence Institute @Harvard. Excited to work w/@blsabatini+new colleagues to provide new educational and research opportunities.
New Harvard institute to study natural, artificial intelligence
From news.harvard.edu
Sham Kakade
@ShamKakade6
Nov 15, 2023
Can inductive biases explain mechanistic interpretability? Why do sinusoidal patterns emerge for NNs trained on modular addition? (e.g. @NeelNanda5) New work pins this down! w/ @depen_morwani @EdelmanBen @rosieyzh @costinoncescu! @KempnerInst
Sham Kakade
@ShamKakade6
Oct 15, 2025
1/9 Introducing LOTION (Low-precision optimization via stochastic-noise smoothing), a principled alternative to Quantization-Aware Training (QAT) that explicitly smooths the quantized loss surface while preserving all global minima of the true quantized loss. Details below:
Sham Kakade
@ShamKakade6
Oct 18, 2025
1/6 Introducing Seesaw: a principled batch size scheduling algo. Seesaw achieves theoretically optimal serial run time given a fixed compute budget and also matches the performance of cosine annealing at fixed batch size.
Sham Kakade
@ShamKakade6
Jul 12, 2024
Which optimizer is opt? Our new work compares SGD, Adam, Adafactor (+ momentum), Lion, and, simply, SignSGD on LLM training wrt performance _and_ hyperparameter stability. tldr: Use anything but SGD, the rest are nearly identical:
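For readers unfamiliar with the simplest optimizer in that list, here is a minimal sketch of the textbook SignSGD update next to plain SGD. The toy quadratic, the learning rate, and the helper names (`sign`, `signsgd_step`, `sgd_step`, `grad`) are illustrative assumptions and have nothing to do with the paper's LLM experiments.

```python
def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sgd_step(params, grads, lr):
    """Plain SGD: step size scales with the gradient magnitude."""
    return [p - lr * g for p, g in zip(params, grads)]


def signsgd_step(params, grads, lr):
    """SignSGD: every coordinate moves by exactly lr, using only the gradient sign."""
    return [p - lr * sign(g) for p, g in zip(params, grads)]


# Toy objective f(w) = 0.5 * sum(c_i * w_i^2) with badly mismatched curvatures.
CURVATURE = [100.0, 0.01]


def grad(w):
    return [c * wi for c, wi in zip(CURVATURE, w)]


w_sign = [1.0, 1.0]
w_sgd = [1.0, 1.0]
for _ in range(50):
    w_sign = signsgd_step(w_sign, grad(w_sign), lr=0.01)
    w_sgd = sgd_step(w_sgd, grad(w_sgd), lr=0.01)
```

On this toy problem SignSGD moves both coordinates at the same rate regardless of curvature, while SGD with the same learning rate makes almost no progress on the low-curvature coordinate. That is only a caricature of why sign-based and adaptive methods can behave differently from SGD, not a reproduction of the comparison in the post.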
Sham Kakade
@ShamKakade6
Mar 3, 2020
couldn't reconcile theory/practice with dropout for over a year. New: arxiv.org/abs/2002.12915 w/ @tengyuma & C. Wei. turns out dropout sometimes has an implicit regularization effect! pretty wild. just like small vs. large batch sgd. these plots def surprised us!
Sham Kakade
@ShamKakade6
Oct 10, 2019
What actually constitutes a good representation for reinforcement learning? Lots of sufficient conditions. But what's necessary? New paper: arxiv.org/abs/1910.03016. Surprisingly, good value (or policy) based representations just don't cut it! w/ @SimonShaoleiDu @RuosongW @lyang36
arxiv.org
Is a Good Representation Sufficient for Sample Efficient...
Modern deep learning methods provide effective means to learn good representations. However, is a good representation itself sufficient for sample efficient reinforcement learning? This question...
Sham Kakade
@ShamKakade6
Jun 21, 2025
1/6 Infinite-dim SGD in linear regression is the strawman model for studying scaling laws, critical batch sizes, and LR schedules. We revisit (and simplify) its analysis using just linear algebra, making it easier to derive and reason about. No PSD operators. No tensor calculus.
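As a concrete picture of the setting, here is a small simulation sketch of one-pass minibatch SGD on noiseless linear regression. It is a finite-dimensional stand-in for the infinite-dimensional model, and every specific choice (the power-law spectrum `lam`, the exponent `alpha`, the all-ones target `w_star`, the step size, and the function name `run_sgd`) is an assumption made for illustration, not something stated in the post.

```python
import random


def run_sgd(dim=50, steps=500, batch=4, lr=0.2, alpha=2.0, seed=0):
    """One-pass minibatch SGD on least squares; returns the excess-risk history."""
    rng = random.Random(seed)
    # Assumed power-law covariance spectrum: lambda_i = i^(-alpha).
    lam = [(i + 1) ** (-alpha) for i in range(dim)]
    w_star = [1.0] * dim  # assumed ground-truth weights
    w = [0.0] * dim

    def excess_risk(weights):
        # 0.5 * (w - w*)^T H (w - w*) with diagonal H = diag(lam)
        return 0.5 * sum(l * (wi - ws) ** 2 for l, wi, ws in zip(lam, weights, w_star))

    history = [excess_risk(w)]
    for _ in range(steps):
        g = [0.0] * dim
        for _ in range(batch):
            # Fresh Gaussian sample with independent coordinates of variance lam[i].
            x = [rng.gauss(0.0, l ** 0.5) for l in lam]
            err = sum(xi * (wi - ws) for xi, wi, ws in zip(x, w, w_star))
            for i in range(dim):
                g[i] += err * x[i] / batch
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        history.append(excess_risk(w))
    return history


history = run_sgd()
```

Re-running `run_sgd` with different `batch`, `lr`, or `steps` values and comparing the resulting `history` curves is one simple way to explore the batch-size and learning-rate questions the post refers to.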
Sham Kakade
@ShamKakade6
Oct 19, 2019
Wrapped up at the "Workshop on Theory of Deep Learning: Where next?" at IAS. math.ias.edu/wtdl. The field has moved so much! e.g. Neural Tangent Kernel (NTK) results! A few years ago, understanding DL looked hopeless. Terrific set of talks, too!
Sham Kakade
@ShamKakade6
Nov 22, 2024
(1/n) 💡How can we speed up the serial runtime of long pre-training runs? Enter Critical Batch Size (CBS): the tipping point where the gains of data parallelism balance with diminishing efficiency. Doubling batch size halves the optimization steps—until we hit CBS, beyond which