Alexi Gladstone (@AlexiGlad) / X

353 posts

Alexi Gladstone

@AlexiGlad

Working on Generative Modeling, EBTs/EBMs, World Models, Reasoning. PhD @ UIUC. Prev @flappyairplanes @Meta, @PalantirTech, @UVA

San Francisco

Joined February 2023

Pinned
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the
339K
Alexi Gladstone
@AlexiGlad
Nov 12, 2025
wow... just wow.... this may be the most impactful paper of the year if the results continue to hold! they seem almost too good to be true, but they're not. the fact that loss can now correspond with performance in SSL is absolutely amazing, especially for academic researchers
Randall Balestriero
@randall_balestr
Nov 12, 2025
LeJEPA: a novel pretraining paradigm free of the (many) heuristics we relied on (stop-grad, teacher, ...) - 60+ arch., up to 2B params - 10+ datasets - in-domain training (>DINOv3) - corr(train loss, test perf)=95% Paper: arxiv.org/pdf/2511.08544 Code: github.com/rbalestr-lab/l…
131K
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[3/N] So what exactly are EBMs?💭 The general idea of EBMs is to learn to assign a scalar energy value (verification) denoting the compatibility/unnormalized probability of the input variables, which in this case are the context and prediction pair. Then, EBMs learn to optimize
10K
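For reference, the usual textbook link between a scalar energy and an unnormalized probability can be written as below. This is not quoted from the thread: the symbols x (context), \hat{y} (prediction), E_\theta (the learned energy), and Z_\theta (the normalizer) are assumed notation, and the thread's own formulation may differ.

```latex
p_\theta(\hat{y} \mid x)
  = \frac{\exp\bigl(-E_\theta(x, \hat{y})\bigr)}{Z_\theta(x)},
\qquad
Z_\theta(x) = \int \exp\bigl(-E_\theta(x, y)\bigr)\, dy
```

Under this convention, a lower energy means a more compatible context and prediction pair, and the energy alone gives the probability only up to the factor Z_\theta(x).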
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[12/12] Website: energy-based-transformers.github.io Paper: arxiv.org/abs/2507.02092 Huge thanks to all collaborators including @BeingMIAkashs @du_yilun @i_amanchadha @hyeonjeong_ai @peixuanhakhan @hengjinlp @LiJundong @tiqbal_uva We’re just getting started with EBMs, we see EBMs as a
energy-based-transformers.github.io
Energy-Based Transformers: Outscaling Transformers and Generalizable Reasoning
We outscaled (feed-forward) transformers and generalized reasoning/system 2 thinking to any modality and problem. This is done by training a new class of models called Energy-Based Transformers...
6.1K
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad @BeingMIAkashs and 7 others
P.S. I’m open to collaboration in topics related to EBMs or several other crazy ideas so feel free to reach out! P.P.S. Somehow this paper started receiving lots of support before I even made an official post so thanks to everyone for that, it has been truly amazing to see! You
Paper page - Energy-Based Transformers are Scalable Learners and Thinkers
From huggingface.co
15K
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[2/N] 🤔So how can models learn to think from unsupervised learning? It turns out that there’s a very elegant solution:💡 1) Learn to verify predictions 2) Optimize predictions with respect to this verifier This is exactly how Energy-Based Models (EBMs) work! EBMs enable
9.9K
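The two steps in [2/N] can be sketched with a toy example. This is a minimal illustration, not the EBT implementation: the quadratic `energy` function stands in for a learned verifier, and the names `energy`, `energy_grad`, and `think` as well as the rule "compatible means prediction == 2 * context" are all assumptions made up for this sketch.

```python
# Toy illustration only: a hand-written quadratic "verifier", not a trained EBT.

def energy(context: float, prediction: float) -> float:
    """Scalar energy: low when the prediction is compatible with the context.

    Assumption for this sketch: "compatible" means prediction == 2 * context.
    """
    return (prediction - 2.0 * context) ** 2


def energy_grad(context: float, prediction: float) -> float:
    """Derivative of the energy with respect to the prediction."""
    return 2.0 * (prediction - 2.0 * context)


def think(context: float, initial_prediction: float,
          steps: int = 50, step_size: float = 0.1):
    """Step 2: refine the prediction by gradient descent on the verifier."""
    prediction = initial_prediction
    trace = [energy(context, prediction)]
    for _ in range(steps):
        prediction -= step_size * energy_grad(context, prediction)
        trace.append(energy(context, prediction))
    return prediction, trace


refined, energies = think(context=3.0, initial_prediction=0.0)
print(f"refined prediction: {refined:.4f}")
print(f"energy before: {energies[0]:.4f}, after: {energies[-1]:.8f}")
```

In a real model the energy would be a trained network and the gradient would come from automatic differentiation; the loop structure is the only part this sketch is meant to convey.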
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[1/N] First, how can we generalize reasoning/System 2 Thinking to any problem/modality?🧐 Current approaches generally rely on verifiable rewards, which struggle with scalability and generalizability because they involve human supervision and are not problem/modality agnostic.
10K
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[4/N] So if EBMs are so promising, why are they so uncommon, and why haven’t they been used at scale? EBMs have traditionally struggled to scale due to issues with stability, slow training speed, and parallelization. Therefore, we create specialized Transformers specifically for
8.2K
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[6/N] Of particular note is the data scaling, where we consistently observe EBTs being more data-efficient than the Transformer++ by > 30%. This is especially important because frontier labs are saying (youtube.com/watch?v=6nJZop…) we are now data-constrained and that more
7.4K
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[5/N] We compared autoregressive EBTs against the SOTA recipe (Transformer++ from Llama2) in both discrete (language) and continuous (visual) modalities. In our language modeling experiments, we observe that EBTs consistently scale at a higher rate than the Transformer++ with
7.5K
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[7/N] 🧠We can also investigate the thinking capabilities of EBTs compared to the Transformer++ by increasing the amount of compute given at inference/test time. We observe two key results. First, EBTs when thinking longer and self-verifying can out-generalize the Transformer++
7K
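One way to picture "thinking longer" and "self-verifying" in [7/N] is sketched below. The post is truncated, so this is an interpretation under stated assumptions: thinking longer is modeled as more gradient steps on the energy, and self-verifying as refining several candidates and keeping the one with the lowest energy. The non-convex `energy` function and all names here are hypothetical.

```python
# Toy illustration only: hypothetical 1-D energy with two minima.

def energy(y: float) -> float:
    """Non-convex energy: shallow minimum near y = +1, deeper one near y = -1."""
    return (y * y - 1.0) ** 2 + 0.3 * y


def energy_grad(y: float) -> float:
    """Derivative of the energy with respect to the prediction y."""
    return 4.0 * y * (y * y - 1.0) + 0.3


def refine(y: float, steps: int, step_size: float = 0.05) -> float:
    """'Thinking longer' modeled as more gradient descent steps."""
    for _ in range(steps):
        y -= step_size * energy_grad(y)
    return y


def think_and_verify(initial_guesses, steps: int) -> float:
    """'Self-verifying' modeled as keeping the lowest-energy candidate."""
    candidates = [refine(y0, steps) for y0 in initial_guesses]
    return min(candidates, key=energy)


guesses = [-1.5, -0.5, 0.5, 1.5]
quick = refine(0.5, steps=1)      # one step of thinking
longer = refine(0.5, steps=20)    # more steps from the same start
best = think_and_verify(guesses, steps=20)

print(f"energy after 1 step:   {energy(quick):.3f}")
print(f"energy after 20 steps: {energy(longer):.3f}")
print(f"energy of best of {len(guesses)}:   {energy(best):.3f}")
```

More steps lower the energy of a single candidate, but a candidate that starts in the wrong basin stays there; comparing energies across candidates is what recovers the deeper minimum in this toy setup.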
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[10/N] We also compare EBTs to diffusion models on relatively toy image denoising tasks, where we observe that EBTs outperform diffusion models while using 99% fewer forward passes. EBTs also learn significantly better representations of images than diffusion models, achieving a
5.5K
Alexi Gladstone
@AlexiGlad
Jul 7, 2025
Replying to @AlexiGlad
[11/N] ⛓️‍💥It’s common wisdom that “a chain is only as strong as its weakest link.” Following this wisdom, we believe that each step in a chain of thought should receive sufficient computation to avoid failure “links” that result in bad reasoning. In other words, when EBTs
5.6K
Alexi Gladstone
@AlexiGlad
Oct 30, 2025
Super cool to see what people are doing with EBMs/EBTs! A small energy head is a great tradeoff when you don't have as much training compute but still want benefits at test time
Surya
@sdand
Oct 30, 2025
What if next-token prediction wasn't a single forward pass, but a tiny optimization problem? Introducing: nanoEBM, a tiny transformer that learns to think harder by doing gradient descent on its own predictions. You can start training on your Mac now - it comes < 400 lines
9.2K
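The idea of next-token prediction as "a tiny optimization problem" can be illustrated with a small sketch. This is not nanoEBM's code: the per-token energies, the free-energy objective (expected energy minus entropy), and every function name below are assumptions chosen to make the example self-contained.

```python
import math

# Toy illustration only: refine next-token logits by gradient descent on a
# hypothetical objective, instead of reading them off a single forward pass.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def free_energy(logits, token_energies, temperature=1.0):
    """Expected token energy minus temperature * entropy of the distribution."""
    probs = softmax(logits)
    expected = sum(p * e for p, e in zip(probs, token_energies))
    neg_entropy = sum(p * math.log(p) for p in probs)
    return expected + temperature * neg_entropy


def free_energy_grad(logits, token_energies, temperature=1.0):
    """Gradient of free_energy with respect to the logits (chain rule via softmax)."""
    probs = softmax(logits)
    g = [e + temperature * (math.log(p) + 1.0)
         for p, e in zip(probs, token_energies)]
    mean_g = sum(p * gi for p, gi in zip(probs, g))
    return [p * (gi - mean_g) for p, gi in zip(probs, g)]


def refine_logits(token_energies, steps=2000, step_size=0.5):
    """Start from uniform logits and descend the free energy."""
    logits = [0.0] * len(token_energies)
    for _ in range(steps):
        grad = free_energy_grad(logits, token_energies)
        logits = [z - step_size * d for z, d in zip(logits, grad)]
    return logits


# Hypothetical energies for a 4-token vocabulary (lower = more compatible).
token_energies = [2.0, 0.5, 1.0, 3.0]
final_logits = refine_logits(token_energies)
probs = softmax(final_logits)
print([round(p, 3) for p in probs])
```

With this particular objective and temperature 1, the minimizer is the distribution proportional to exp(-energy), so the refined probabilities concentrate on the lowest-energy token. A trained model would produce the energies itself and would typically rely on automatic differentiation rather than a hand-derived gradient.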