Keller Jordan
1,621 posts
Keller Jordan
@kellerjordan0
CIFAR-10 fanatic. Pretraining @OpenAI OpCo LLC.
San Francisco
Joined March 2016
439 Following · 17.5K Followers
  • Pinned
    Keller Jordan
    @kellerjordan0
    Apr 13, 2023
    Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable arxiv.org/abs/2304.01910
    263K
  • Keller Jordan
    @kellerjordan0
    Feb 24, 2025
    Some trivia: In November I interviewed at both OpenAI & xAI. I thought both labs seemed strong, even tho ppl said xAI was a noncontender back then. But in the end, which to join was an easy choice, because-- the xAI guys told me all my ideas must be wrong & rejected me ¯\_(ツ)_/¯
    361K
  • Keller Jordan
    @kellerjordan0
    Feb 4, 2025
    Unfortunately, it is hard to trust *claims* in 2025. What’s easier to trust is *incentives*. So here’s an incentive: I’ll pay a $3,000 bounty to the first person who uses this method to improve either the NanoGPT or CIFAR-10 speedruns
    David D. Baek
    @dbaek__
    Feb 4, 2025
    1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨 We introduce harmonic loss as alternative to the standard CE loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
    290K
  • Keller Jordan
    @kellerjordan0
    Oct 4, 2024
    New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens
    Previous record: 5B tokens
    Changelog: new optimizer
    1/8
    283K
  • Keller Jordan
    @kellerjordan0
    Jun 6, 2024
    Here's a variant of @karpathy's NanoGPT which trains twice as fast, reaching GPT-2 level quality in 5B tokens instead of the original 10B. It uses rotary embeddings and an improved lr schedule.
    GitHub - KellerJordan/modded-nanogpt: NanoGPT (124M) in 90 seconds
    From github.com
    129K
  • Keller Jordan
    @kellerjordan0
    Feb 13, 2025
    The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.
    312K
  • Keller Jordan
    @kellerjordan0
    Feb 10, 2025
    Big GPU doesn't want you to know this but you can actually learn even more about the nature of neural network training from speedrunning CIFAR-10 than NanoGPT, since the experiments are so fast. I've personally trained 15 million CIFAR-10 models.
    91K
  • Keller Jordan
    @kellerjordan0
    Oct 31, 2025
    TIL that Muon is in PyTorch stable now. Pretty cool.
    66K
  • Keller Jordan
    @kellerjordan0
    Jun 15, 2025
    There have been hundreds of optimizer papers published. But the SOTA has only improved a few times. Therefore we can conclude that almost all optimizer papers are fake. If you're gonna write another fake optimizer paper, please don't cite Muon. I don't want your citation
    132K
  • Keller Jordan
    @kellerjordan0
    Jan 30, 2025
    Btw, I joined the OpenAI OpCo, LLC. Excited to do some science and contribute stuff into some big training runs. Yep, and shout-out to my amazing mentors @adamlerer and @mobav0!
    83K
  • Keller Jordan
    @kellerjordan0
    Oct 16, 2025
    Theorem: The maximum possible duration of the computational singularity is 470 years. Proof: The FLOPs capacity of all computers which existed in the year 1986 is estimated to be at most 4.5e14 (Hilbert et al. 2011). Based on public Nvidia revenue and GPU specs, this capacity
    345K
  • Keller Jordan
    @kellerjordan0
    Nov 11, 2025
    x.com/SRCHicks/statu…
    Stephen R. C. Hicks
    Peterson Academy
    @SRCHicks
    Nov 10, 2025
    “The only student of mine I was ever intimidated by. He was so quick. There was a seminar for advanced students in Zürich that I was teaching and von Neumann was in the class. I came to a certain theorem, and I said it is not proved and it may be difficult. Von Neumann didn't say
    79K
  • Keller Jordan
    @kellerjordan0
    Oct 14, 2024
    New NanoGPT training speed record: 3.28 Fineweb validation loss in 15.2 minutes
    Previous record: 22.3 minutes
    Changelog:
    - pad embedding to nearest 64
    - switch from GELU to ReLU²
    - zero-init projection layers
    - QKNorm
    All four changes driven by @Grad62304977
    1/8
    143K
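Two of the changelog items in the Oct 14, 2024 post can be illustrated with a minimal sketch. This is not the speedrun's code: the helper names are hypothetical, "nearest 64" is read here as rounding up to the next multiple of 64, and the GPT-2 vocabulary size of 50257 is assumed as the embedding size being padded.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple of `multiple` (hypothetical helper)."""
    return ((n + multiple - 1) // multiple) * multiple


def relu_squared(x: float) -> float:
    """ReLU² activation on a scalar: max(x, 0) squared."""
    r = max(x, 0.0)
    return r * r


# Assumption: the embedding table has one row per GPT-2 token (50257 tokens).
GPT2_VOCAB_SIZE = 50257
padded_vocab = pad_to_multiple(GPT2_VOCAB_SIZE)

print(padded_vocab)                            # padded embedding rows
print(relu_squared(-2.0), relu_squared(3.0))   # negative inputs map to zero
```

Padding a dimension to a multiple of 64 is a common way to get better-aligned matrix shapes on GPUs; whether that is the exact motivation in this record is not stated in the post.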
  • Keller Jordan
    @kellerjordan0
    Jul 17, 2025
    New NanoGPT training speed record: 3.28 FineWeb val loss in 2.966 minutes on 8xH100
    New record-holder: Vishal Agrawal (@.vagrawal on GitHub)
    Previous record: 2.979 minutes
    Changelog: Replaced gradient all_reduce with reduce_scatter, other efficiency tweaks
    35K
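The difference between the two collectives named in the Jul 17, 2025 changelog can be sketched with a pure-Python simulation. This is not the record's code and uses no real distributed library: the function names mirror the collectives only for readability, and the "ranks" are plain lists.

```python
from typing import List


def all_reduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulated all_reduce: every rank ends up with the full elementwise sum."""
    total = [sum(vals) for vals in zip(*grads)]
    return [list(total) for _ in grads]


def reduce_scatter(grads: List[List[float]]) -> List[List[float]]:
    """Simulated reduce_scatter: rank r keeps only the r-th shard of the sum."""
    world_size = len(grads)
    n = len(grads[0])
    assert n % world_size == 0, "sketch assumes the length divides evenly"
    shard = n // world_size
    total = [sum(vals) for vals in zip(*grads)]
    return [total[r * shard:(r + 1) * shard] for r in range(world_size)]


# Two simulated ranks, four gradient values each.
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
full = all_reduce(grads)
shards = reduce_scatter(grads)

print(full[0])   # each rank holds the whole summed gradient
print(shards)    # each rank holds one slice of the summed gradient
```

With reduce_scatter each rank receives less data than with all_reduce. That only helps if each rank needs just its own slice, for example when the optimizer update is sharded across ranks. That reading is an assumption; the post does not explain the change further.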
