leloy!
7,879 posts
leloy!
@leloykun
Math @ AdMU • NanoGPT speedrunner • Muon fan 🤍 • prev ML @ XPD • 2x IOI & 2x ICPC WF • admonymous.co/leloy
leloykun.github.io
Joined November 2018
4,840
Following
7,503
Followers

  • Pinned
    leloy!
    @leloykun
    Apr 6
    'Autoresearch', but for theoretical science? I formalized my blog posts on steepest descent convergence bounds and hyperparameter scaling laws in Lean using Codex. This started as an art project, but I ended up having similar (almost exactly the same) results as prior work
    leloy!
    @leloykun
    Mar 23
    Exciting results!! I was also working on this direction a few months back but only managed to reach the necessary step before this. I got similar results to @dakovalev1's, but without the need to go back-and-forth between dual norms and the Frobenius norm which makes the
    54K
  • leloy!
    @leloykun
    Oct 20, 2024
    Deep Learning Optimizers from First Principles. My attempt at answering these questions: 1. Why do steepest descent in non-Euclidean spaces? 2. Why does adaptive preconditioning work so well in practice? And 3. Why normalize everything à la nGPT?
    270K
  • leloy!
    @leloykun
    Dec 10, 2024
    Remember @karpathy's llm.c repro of the GPT-2 (124M) training run which took 45 mins on 8xH100s? We're proud to share that we've just breached the 4 min mark! A few years ago, it would've cost you hundreds of thousands of dollars (maybe millions!) to achieve the same result.
    156K
  • leloy!
    @leloykun
    Nov 17, 2024
    Deep Learning Optimizers from First Principles. Now with more maths! In this thread, I'll discuss: 1. The difference between 1st order gradient dualization approaches and 2nd order optimization approaches. 2. Preconditioning--how to do it and why. 3. How to derive a couple of
    162K
  • leloy!
    @leloykun
    Aug 9, 2025
    I have so many interests that I find it hard to focus on any of them. I wanna study algebraic topology, category theory, and optimization on Finsler manifolds, but I also wanna build. I can build the entire AI infra of an AI SaaS, even the UI. I've done it before, yet here I am,
    69K
  • leloy!
    @leloykun
    Aug 8, 2025
    I managed to train a 1-Lipschitz, 2-layer MLP to grok on the Addition-Modulo-113 task (40-60% train-test split) in just 44 full-batch steps. This is more evidence that "we just need to scale up" is a brainworm and being smart about the choice of geometry to 'place' our weights
    130K
  • leloy!
    @leloykun
    Jul 25, 2019
    Your DL certificates won't leave you tho
    Clint Petilla
    @Metal_Strix
    Jul 25, 2019
    Love life first before grades; Relationship first before acads. Your DL certificate won't hug you when you're already crying. #TipsForPisayFreshies
  • leloy!
    @leloykun
    Feb 11, 2025
    GRPO's Main Flaw: I've been testing different critic-free RL algos on multi-task environments, and one thing I've noticed is that GRPO seems to slightly underperform normalization-free variants. This tracks with the results in the LOOP paper. Why? Most likely because GRPO's
    85K
  • leloy!
    @leloykun
    Aug 22, 2025
    I've finally solved steepest descent on Finsler-structured (matrix) manifolds more generally. This generalizes work by me, @jxbz, and @Jianlin_S on Muon, Orthogonal Muon, & Stiefel Muon. --- The general solution turned out to be much simpler than I thought. And it should
    leloy!
    @leloykun
    Aug 8, 2025
    I managed to train a 1-Lipschitz, 2-layer MLP to grok on the Addition-Modulo-113 task (40-60% train-test split) in just 44 full-batch steps. This is more evidence that "we just need to scale up" is a brainworm and being smart about the choice of geometry to 'place' our weights
    82K
  • leloy!
    @leloykun
    Oct 17, 2024
    The Case for Muon: 1) We can descend 'faster' in non-Euclidean spaces 2) Adam/Shampoo/SOAP/etc. dynamically learn the preconditioner and, equivalently, the norm & space to descend in 3) Muon saves a lot of compute by simply letting the norm vary within a fixed range
    leloy!
    @leloykun
    Oct 15, 2024
    got nerdsniped by @kellerjordan0's Muon and spent time analyzing it instead of preparing for my job interview in <1 hour 😅 tl;dr 1) 3 Newton-Schulz iterations suffice for the non-square matrices 2) I propose a new Newton-Schulz iterator that converges faster and is more stable
    200K
  • leloy!
    @leloykun
    Jan 26, 2025
    (Linear) Attention Mechanisms as Test-Time Regression. By now, you've probably already heard of linear attention, in-context learning, test-time scaling, etc... Here, I'll discuss: 1. The unifying framework that ties them all together; 2. How to derive different linear
    75K
  • leloy!
    @leloykun
    Mar 22, 2025
    I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy
    Zichen Liu
    @zzlccc
    Mar 21, 2025
    🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full
    74K
  • leloy!
    @leloykun
    Jan 17, 2025
    Sub 3-minute NanoGPT Speedrun Record: We're proud to share that we've just breached the 3 min mark! This means that with an ephemeral pod of 8xH100s that costs $8/hour, training a GPT-2-ish level model now only costs $0.40! --- What's in the latest record? A 🧵...
    103K
  • leloy!
    @leloykun
    Dec 23, 2024
    we're all probably overthinking what OpenAI did with O3. here's my best guess, stitched together from tweets from OpenAI employees, their blog posts, and rumors: 1. They collected a ton of process traces from STEM experts; they also likely mixed in synthetic data from tasks with
    48K
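
A minimal sketch of the identity behind the Mar 22, 2025 post on Dr. GRPO above. It assumes that Dr. GRPO's advantage is the group-mean-centered reward, A_i = r_i - mean(r), and that the leave-one-out baseline for sample i is the mean of the other N-1 rewards in its group. The function names and the sample rewards are hypothetical and are not taken from any library or from the papers' code.

```python
from statistics import mean


def dr_grpo_advantages(rewards):
    """Assumed Dr. GRPO form: A_i = r_i - mean(r), with no std normalization."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def leave_one_out_advantages(rewards):
    """Leave-one-out form: A_i = r_i - mean of the other N-1 rewards."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def correction_factor(n):
    """The N/(N-1) term from the post; requires a group size of at least 2."""
    return n / (n - 1)


def corrected_advantages(rewards):
    """Mean-centered advantages rescaled by N/(N-1)."""
    scale = correction_factor(len(rewards))
    return [scale * a for a in dr_grpo_advantages(rewards)]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0, 1.0]  # hypothetical rewards for one group
    loo = leave_one_out_advantages(rewards)
    scaled = corrected_advantages(rewards)
    # The two lists agree up to floating-point rounding.
    assert all(abs(a - b) < 1e-12 for a, b in zip(loo, scaled))
    for n in (2, 4, 8, 64):
        print(n, correction_factor(n))
```

The agreement follows from r_i - (S - r_i)/(N-1) = (N/(N-1)) (r_i - S/N), where S is the sum of the group's rewards. The factor equals 2 at N = 2 and approaches 1 as N grows, which matches the post's remark that the bias is more pronounced for small groups.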