Keller Jordan (@kellerjordan0) / X

1,621 posts

Keller Jordan

@kellerjordan0

CIFAR-10 fanatic Pretraining @OpenAI OpCo LLC.

San Francisco

Joined March 2016

Pinned
Keller Jordan
@kellerjordan0
Apr 13, 2023
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable arxiv.org/abs/2304.01910
263K
Keller Jordan
@kellerjordan0
Feb 24, 2025
Some trivia: In November I interviewed at both OpenAI & xAI. I thought both labs seemed strong, even tho ppl said xAI was a noncontender back then. But in the end, which to join was an easy choice, because-- the xAI guys told me all my ideas must be wrong & rejected me ¯\_(ツ)_/¯
361K
Keller Jordan
@kellerjordan0
Feb 4, 2025
Unfortunately, it is hard to trust *claims* in 2025. What’s easier to trust is *incentives*. So here’s an incentive: I’ll pay a $3,000 bounty to the first person who uses this method to improve either the NanoGPT or CIFAR-10 speedruns
David D. Baek
@dbaek__
Feb 4, 2025
1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨 We introduce harmonic loss as alternative to the standard CE loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
290K
Keller Jordan
@kellerjordan0
Oct 4, 2024
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens
Previous record: 5B tokens
Changelog: new optimizer
1/8
283K
Keller Jordan
@kellerjordan0
Jun 6, 2024
Here's a variant of @karpathy's NanoGPT which trains twice as fast, reaching GPT-2 level quality in 5B tokens instead of the original 10B. It uses rotary embeddings and an improved lr schedule.
GitHub - KellerJordan/modded-nanogpt: NanoGPT (124M) in 90 seconds
From github.com
129K
Keller Jordan
@kellerjordan0
Feb 13, 2025
The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.
312K
Keller Jordan
@kellerjordan0
Feb 10, 2025
Big GPU doesn't want you to know this but you can actually learn even more about the nature of neural network training from speedrunning CIFAR-10 than NanoGPT, since the experiments are so fast. I've personally trained 15 million CIFAR-10 models.
91K
Keller Jordan
@kellerjordan0
Oct 31, 2025
TIL that Muon is in PyTorch stable now. Pretty cool.
66K
Keller Jordan
@kellerjordan0
Jun 15, 2025
There have been hundreds of optimizer papers published. But the SOTA has only improved a few times. Therefore we can conclude that almost all optimizer papers are fake. If you're gonna write another fake optimizer paper, please don't cite Muon. I don't want your citation
132K
Keller Jordan
@kellerjordan0
Jan 30, 2025
Btw, I joined the OpenAI OpCo, LLC. Excited to do some science and contribute stuff into some big training runs. Yep, and shout-out to my amazing mentors @adamlerer and @mobav0!
83K
Keller Jordan
@kellerjordan0
Oct 16, 2025
Theorem: The maximum possible duration of the computational singularity is 470 years. Proof: The FLOPs capacity of all computers which existed in the year 1986 is estimated to be at most 4.5e14 (Hilbert et al. 2011). Based on public Nvidia revenue and GPU specs, this capacity
345K
Keller Jordan
@kellerjordan0
Nov 11, 2025
x.com/SRCHicks/statu…
Stephen R. C. Hicks
@SRCHicks
Nov 10, 2025
“The only student of mine I was ever intimidated by. He was so quick. There was a seminar for advanced students in Zürich that I was teaching and von Neumann was in the class. I came to a certain theorem, and I said it is not proved and it may be difficult. Von Neumann didn't say
79K
Keller Jordan
@kellerjordan0
Oct 14, 2024
New NanoGPT training speed record: 3.28 Fineweb validation loss in 15.2 minutes
Previous record: 22.3 minutes
Changelog:
- pad embedding to nearest 64
- switch from GELU to ReLU²
- zero-init projection layers
- QKNorm
All four changes driven by @Grad62304977
1/8
143K
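Two of the changelog items in the Oct 14, 2024 record post can be sketched in plain Python. This is an illustration only, not code from the speedrun: the helper names `pad_to_multiple` and `relu_squared` are hypothetical, "pad embedding to nearest 64" is read here as rounding the embedding table's row count up to a multiple of 64, and 50257 is assumed as the GPT-2 vocabulary size.

```python
def pad_to_multiple(size: int, multiple: int = 64) -> int:
    """Round `size` up to the nearest multiple of `multiple`."""
    # Ceiling division written with integer arithmetic only.
    return -(-size // multiple) * multiple


def relu_squared(x: float) -> float:
    """ReLU squared: 0 for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2


# Assumption: 50257 is the GPT-2 vocabulary size being padded.
padded_vocab = pad_to_multiple(50257)
print(padded_vocab)        # 50304
print(relu_squared(-2.0))  # 0.0
print(relu_squared(3.0))   # 9.0
```

The extra rows in a padded embedding table are never indexed by real tokens; the padding only changes the tensor shape.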
Keller Jordan
@kellerjordan0
Jul 17, 2025
New NanoGPT training speed record: 3.28 FineWeb val loss in 2.966 minutes on 8xH100
New record-holder: Vishal Agrawal (@.vagrawal on GitHub)
Previous record: 2.979 minutes
Changelog: Replaced gradient all_reduce with reduce_scatter, other efficiency tweaks
35K
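The difference between the two collectives named in the Jul 17, 2025 changelog can be simulated in plain Python. This is a sketch of their semantics only: it does not use `torch.distributed`, it is not the record's code, and the function names and the two-rank example gradients are made up for illustration. It also assumes the gradient length divides evenly by the number of ranks.

```python
from typing import List


def all_reduce(grads: List[List[float]]) -> List[List[float]]:
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*grads)]
    return [list(total) for _ in grads]


def reduce_scatter(grads: List[List[float]]) -> List[List[float]]:
    """Rank r ends up with only the r-th shard of the element-wise sum."""
    world = len(grads)
    total = [sum(vals) for vals in zip(*grads)]
    shard = len(total) // world  # assumes len(total) % world == 0
    return [total[r * shard:(r + 1) * shard] for r in range(world)]


# Hypothetical per-rank gradients for a 2-rank job.
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
full = all_reduce(grads)
shards = reduce_scatter(grads)
print(full)    # [[11.0, 22.0, 33.0, 44.0], [11.0, 22.0, 33.0, 44.0]]
print(shards)  # [[11.0, 22.0], [33.0, 44.0]]
```

With reduce_scatter each rank receives 1/world of the summed gradient rather than a full copy, so each rank holds less output than with all_reduce.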