Alexi Gladstone
353 posts
Alexi Gladstone
@AlexiGlad
Working on Generative Modeling, EBTs/EBMs, World Models, Reasoning. PhD @ UIUC. Prev @flappyairplanes @Meta, @PalantirTech, @UVA
San Francisco
alexiglad.github.io
Joined February 2023
680 Following · 3,260 Followers
  • Pinned
    Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the
    339K
  • Alexi Gladstone
    @AlexiGlad
    Nov 12, 2025
    wow... just wow.... this may be the most impactful paper of the year if the results continue to hold! they seem almost too good to be true, but they're not. the fact that loss can now correspond with performance in SSL is absolutely amazing, especially for academic researchers
    Randall Balestriero
    @randall_balestr
    Nov 12, 2025
    LeJEPA: a novel pretraining paradigm free of the (many) heuristics we relied on (stop-grad, teacher, ...) - 60+ arch., up to 2B params - 10+ datasets - in-domain training (>DINOv3) - corr(train loss, test perf)=95% Paper: arxiv.org/pdf/2511.08544 Code: github.com/rbalestr-lab/l…
    131K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [3/N] So what exactly are EBMs?💭 The general idea of EBMs is to learn to assign a scalar energy value (verification) denoting the compatibility/unnormalized probability of the input variables, which in this case are the context and prediction pair. Then, EBMs learn to optimize
    10K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [12/12] Website: energy-based-transformers.github.io Paper: arxiv.org/abs/2507.02092 Huge thanks to all collaborators including @BeingMIAkashs @du_yilun @i_amanchadha @hyeonjeong_ai @peixuanhakhan @hengjinlp @LiJundong @tiqbal_uva We’re just getting started with EBMs, we see EBMs as a
    energy-based-transformers.github.io
    Energy-Based Transformers: Outscaling Transformers and Generalizable Reasoning
    We outscaled (feed-forward) transformers and generalized reasoning/system 2 thinking to any modality and problem. This is done by training a new class of models called Energy-Based Transformers...
    6.1K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad @BeingMIAkashs and 7 others
    P.S. I’m open to collaboration in topics related to EBMs or several other crazy ideas so feel free to reach out! P.P.S. Somehow this paper started receiving lots of support before I even made an official post so thanks to everyone for that, it has been truly amazing to see! You
    Paper page - Energy-Based Transformers are Scalable Learners and Thinkers
    From huggingface.co
    15K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [2/N] 🤔So how can models learn to think from unsupervised learning? It turns out that there’s a very elegant solution:💡 1) Learn to verify predictions 2) Optimize predictions with respect to this verifier This is exactly how Energy-Based Models (EBM) work! EBMs enable
    9.9K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [1/N] First, how can we generalize reasoning/System 2 Thinking to any problem/modality?🧐 Current approaches generally rely on verifiable rewards, which struggle with scalability and generalizability due to involving human supervision and not being problem/modality agnostic.
    10K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [4/N] So if EBMs are so promising, why are they so uncommon, and why haven’t they been used at scale? EBMs have traditionally struggled to scale due to issues with stability, slow training speed, and parallelization. Therefore, we create specialized Transformers specifically for
    8.2K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [6/N] Of particular note is the data scaling, where we consistently observe EBTs being more data-efficient than the Transformer++ by > 30%. This is especially important because frontier labs are saying (youtube.com/watch?v=6nJZop…) we are now data-constrained and that more
    7.4K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [5/N] We compared autoregressive EBTs against the SOTA recipe (Transformer++ from Llama2) in both discrete (language) and continuous (visual) modalities. In our language modeling experiments, we observe that EBTs consistently scale at a higher rate than the Transformer++ with
    7.5K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [7/N] 🧠We can also investigate the thinking capabilities of EBTs compared to the Transformer++ by increasing the amount of compute given at inference/test time. We observe two key results. First, EBTs when thinking longer and self-verifying can out-generalize the Transformer++
    7K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [10/N] We also compare EBTs to diffusion models on relatively toy image denoising tasks, where we observe that EBTs outperform diffusion models while using 99% fewer forward passes. EBTs also learn significantly better representations of images than diffusion models, achieving a
    5.5K
  • Alexi Gladstone
    @AlexiGlad
    Jul 7, 2025
    Replying to @AlexiGlad
    [11/N] ⛓️‍💥It’s common wisdom that “a chain is only as strong as its weakest link.” Following this wisdom, we believe that each step in a chain of thought should receive sufficient computation to avoid failure “links” that result in bad reasoning. In other words, when EBTs
    5.6K
  • Alexi Gladstone
    @AlexiGlad
    Oct 30, 2025
    Super cool to see what people are doing with EBMs/EBTs! A small energy head is a great tradeoff: it doesn't need as much training compute, but still gets benefits at test time
    Surya
    @sdand
    Oct 30, 2025
    What if next-token prediction wasn't a single forward pass, but a tiny optimization problem? Introducing: nanoEBM a tiny transformer that learns to think harder by doing gradient descent on its own predictions. You can start training on your Mac now - it comes < 400 lines
    9.2K
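
The idea running through posts [2/N] and [3/N] and the nanoEBM quote above (score a context/prediction pair with a scalar energy, then refine the prediction by gradient descent on that energy) can be sketched in a few lines. This is a toy illustration only, not the EBT architecture or nanoEBM: the quadratic `energy`, its minimum at `2 * context`, and the names `energy_grad`, `think`, `steps`, and `step_size` are all assumptions invented for the example. A real EBM would learn the energy function and would typically get the gradient from automatic differentiation.

```python
def energy(context: float, prediction: float) -> float:
    """Hypothetical hand-written energy (the 'verifier').

    Lower energy means the prediction is more compatible with the context.
    A real EBM would learn this function; here it is fixed for illustration.
    """
    return (prediction - 2.0 * context) ** 2


def energy_grad(context: float, prediction: float) -> float:
    """Analytic gradient of the toy energy with respect to the prediction."""
    return 2.0 * (prediction - 2.0 * context)


def think(context: float, prediction: float,
          steps: int = 50, step_size: float = 0.1):
    """Refine a prediction by gradient descent on the energy.

    Returns the refined prediction and the energy recorded before the
    first step and after every step.
    """
    trajectory = [energy(context, prediction)]
    for _ in range(steps):
        # Move the prediction downhill on the energy surface.
        prediction -= step_size * energy_grad(context, prediction)
        trajectory.append(energy(context, prediction))
    return prediction, trajectory


if __name__ == "__main__":
    refined, energies = think(context=3.0, prediction=0.0)
    print(f"refined prediction: {refined:.4f}")
    print(f"energy: {energies[0]:.4f} -> {energies[-1]:.6f}")
```

In this toy, giving `think` more `steps` plays the role of extra test-time compute: the energy keeps falling as the prediction is refined.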
