leloy! (@leloykun) / X

leloy!

7,879 posts

@leloykun

Math @ AdMU • NanoGPT speedrunner • Muon fan 🤍 • prev ML @ XPD • 2x IOI & 2x ICPC WF • admonymous.co/leloy

Joined November 2018

Pinned
leloy!
@leloykun
Apr 6
'Autoresearch', but for theoretical science? I formalized my blog posts on steepest descent convergence bounds and hyperparameter scaling laws in Lean using Codex. This started as an art project, but I ended up having similar (almost exactly the same) results as prior work
leloy!
@leloykun
Mar 23
Exciting results!! I was also working on this direction a few months back but only managed to reach the necessary step before this. I got similar results to @dakovalev1's, but without the need to go back-and-forth between dual norms and the Frobenius norm which makes the
54K
leloy!
@leloykun
Oct 20, 2024
Deep Learning Optimizers from First Principles My attempt at answering these questions: 1. Why do steepest descent in non-Euclidean spaces? 2. Why does adaptive preconditioning work so well in practice? And, 3. Why normalize everything ala nGPT?
270K
leloy!
@leloykun
Dec 10, 2024
Remember @karpathy's llm.c repro of the GPT-2 (124M) training run which took 45 mins on 8xH100s? We're proud to share that we've just breached the 4 min mark! A few years ago, it would've cost you hundreds of thousands of dollars (maybe millions!) to achieve the same result.
156K
leloy!
@leloykun
Nov 17, 2024
Deep Learning Optimizers from First Principles Now with more maths! In this thread, I'll discuss: 1. The difference between 1st order gradient dualization approaches and 2nd order optimization approaches. 2. Preconditioning--how to do it and why. 3. How to derive a couple of
162K
leloy!
@leloykun
Aug 9, 2025
I have so many interests I find it hard to focus on any of them. I wanna study algebraic topology, category theory, optimization on Finsler manifolds, but also, I wanna build. I can build the entire AI infra of an AI SaaS, even the UI. I've done it before, yet here I am,
69K
leloy!
@leloykun
Aug 8, 2025
I managed to train a 1-Lipschitz, 2-layer MLP to grok on the Addition-Modulo-113 task (40-60% train-test split) in just 44 full-batch steps. This is more evidence that "we just need to scale up" is a brainworm and being smart on the choice of geometry to 'place' our weights
130K
leloy!
@leloykun
Jul 25, 2019
Your DL certificates won't leave you tho
Clint Petilla
@Metal_Strix
Jul 25, 2019
Love life first before grades; Relationship first before acads. Your DL certificate won't hug you when you're crying. #TipsForPisayFreshies
leloy!
@leloykun
Feb 11, 2025
GRPO's Main Flaw I've been testing different critic-free RL algos on multi-task environments, and one thing I've noticed is that GRPO seems to slightly underperform normalization-free variants. This tracks with the results in the LOOP paper. Why? Most likely because GRPO's
85K
leloy!
@leloykun
Aug 22, 2025
I've finally solved steepest descent on Finsler-structured (matrix) manifolds more generally. This generalizes work by me, @jxbz, and @Jianlin_S on Muon, Orthogonal Muon, & Stiefel Muon. --- The general solution turned out to be much simpler than I thought. And it should
leloy!
@leloykun
Oct 17, 2024
The Case for Muon 1) We can descend 'faster' in non-Euclidean spaces 2) Adam/Shampoo/SOAP/etc. dynamically learn the preconditioner and, equivalently, the norm & space to descend in 3) Muon saves a lot of compute by simply letting the norm vary within a fixed range
leloy!
@leloykun
Oct 15, 2024
got nerdsniped by @kellerjordan0's Muon and spent time analyzing it instead of preparing for my job interview in <1 hour 😅 tl;dr 1) 3 Newton-Schulz iterations suffice for the non-square matrices 2) I propose a new Newton-Schulz iterator that converges faster and is more stable
200K
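For readers unfamiliar with the Newton-Schulz iteration mentioned in the post above, here is a minimal sketch of the textbook cubic variant, which pushes every singular value of a matrix toward 1. It is not the tuned iterator used in Muon, nor the new iterator the post proposes (the preview is truncated before it gets that far). The function names, the example matrix, and the step count of 30 are all illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the nearest semi-orthogonal matrix to g.

    Textbook cubic iteration X <- 1.5 X - 0.5 X X^T X (an assumption for
    illustration, not Muon's coefficients). Dividing by the Frobenius norm
    first keeps all singular values in (0, 1], where the iteration converges.
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, xxt_x)]
    return x


# Hypothetical non-square (2x3) "gradient" matrix with full row rank.
g = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
q = newton_schulz_orthogonalize(g)
gram = matmul(q, transpose(q))  # should be close to the 2x2 identity
```

The generous step count is only there to make the sketch converge to machine precision; it says nothing about how many iterations a practical optimizer needs.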
leloy!
@leloykun
Jan 26, 2025
(Linear) Attention Mechanisms as Test-Time Regression By now, you've probably already heard of linear attention, in-context learning, test-time scaling, etc... Here, I'll discuss: 1. The unifying framework that ties them all together; 2. How to derive different linear
75K
leloy!
@leloykun
Mar 22, 2025
I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy
Zichen Liu
@zzlccc
Mar 21, 2025
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full
74K
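The N/(N-1) correction in the post above can be checked numerically. The sketch below assumes Dr. GRPO's A_i is the group-mean-centered reward, r_i minus the mean over the group, and compares the scaled version against a leave-one-out baseline. The function names and the reward values are hypothetical.

```python
def mean_baseline_advantages(rewards):
    """Advantage with the full group mean as baseline (assumed Dr. GRPO form)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards):
    """Advantage with the mean of the *other* N-1 rewards as baseline."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical rewards for one group of N = 5 sampled completions.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
n = len(rewards)

# Scale the mean-baseline advantages by the correction term N / (N - 1).
scaled = [n / (n - 1) * a for a in mean_baseline_advantages(rewards)]
loo = leave_one_out_advantages(rewards)
```

The two lists agree because r_i - (S - r_i)/(N-1) = (N r_i - S)/(N-1) = N/(N-1) * (r_i - S/N), where S is the sum of rewards in the group.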
leloy!
@leloykun
Jan 17, 2025
Sub 3-minute NanoGPT Speedrun Record We're proud to share that we've just breached the 3 min mark! This means that with an ephemeral pod of 8xH100s that costs $8/hour, training a GPT-2-ish level model now only costs $0.40! --- What's in the latest record? A 🧵...
103K
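The $0.40 figure in the post above follows from the quoted pod price and run time. A quick check, assuming the $8/hour covers the whole 8xH100 pod and billing is prorated to the minute (variable names are illustrative):

```python
# Figures quoted in the post; prorated per-minute billing is an assumption.
pod_price_per_hour = 8.0   # USD for the whole 8xH100 pod
run_minutes = 3            # the sub-3-minute record, rounded up to 3

run_cost = pod_price_per_hour * run_minutes / 60
print(f"${run_cost:.2f}")
```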
leloy!
@leloykun
Dec 23, 2024
we're all probably overthinking what OpenAI did with O3 here's my best guess, stitched together from tweets from OpenAI employees, their blog posts, and rumors: 1. They collected a ton of process traces from STEM experts; they also likely mixed in synthetic data from tasks with
48K