Will
3,823 posts
Will
@_brickner
engineer. destroy the whole world
Chicago, IL
Joined November 2015
879 Following
1,594 Followers
  • Pinned
    Will
    @_brickner
    Dec 24, 2024
    wrote a paper: it lets you *train* in 1.58b! could use 97% less energy, 90% less weight memory. leads to a new model format which can store a 175B model in ~20mb. also, no backprop!
    Image
    970K
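The "1.58b" and "90% less weight memory" figures in the pinned post can be checked with simple arithmetic. This is a minimal sketch that assumes the comparison baseline is 16-bit weights; the post does not state the baseline, and the function names are hypothetical.

```python
import math

# A ternary weight takes one of three values {-1, 0, +1}, so it carries
# log2(3) bits of information, which is where "1.58b" comes from.
BITS_PER_TERNARY = math.log2(3)

def weight_memory_saving(baseline_bits: float) -> float:
    """Fraction of weight memory saved versus a baseline bit width (assumed)."""
    return 1.0 - BITS_PER_TERNARY / baseline_bits

def ternary_weight_bytes(n_params: int) -> float:
    """Bytes needed to pack n_params ternary weights at log2(3) bits each."""
    return n_params * BITS_PER_TERNARY / 8

print(f"{BITS_PER_TERNARY:.3f} bits per ternary weight")
print(f"{weight_memory_saving(16):.1%} saved versus 16-bit weights")
print(f"{ternary_weight_bytes(175_000_000_000) / 1e9:.1f} GB for 175B packed ternary weights")
```

Packing 175B ternary weights directly would still take roughly 35 GB, so the ~20mb figure presumably describes something other than packed weights. The paper is the place to check what the format actually stores.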
  • Will
    @_brickner
    Feb 15, 2025
    Replying to @nearcyan
    behold, new food
    Image
    Image
    16K
  • Will
    @_brickner
    Jul 7, 2021
    Replying to @atomicthumbs
    Image
  • Will
    @_brickner
    Aug 16, 2024
    Replying to @ChazakielDoremi
    i was an 8 year old without specific expectations so i loved it
    20K
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    arXiv mods rejected this paper. they won’t say why. I don’t really care at this point, it took weeks to get approval to submit. I think twitter boys will like it, that’s what matters.
    42K
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    pdf:
    GitHub - wbrickner/noise_step: noise_step: Training in 1.58b With No Gradient Memory (github.com)
    22K
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    the core trick I use comes from `Gradients without Backpropagation`. using the JVP (Jacobian-vector product), you can measure how well random vectors align with the gradient, and reconstruct the gradient from those alignments. only a forward pass!
    Image
    Image
    31K
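The forward-gradient idea in the post above can be sketched in JAX. This is an illustration of the general estimator (directional derivative times random direction, averaged over samples), not the author's noise_step code; the toy `loss`, the Gaussian directions, and the helper name `forward_gradient` are all assumptions made for the example.

```python
import jax
import jax.numpy as jnp

def loss(w):
    # Hypothetical toy loss, chosen so the true gradient is known: 2 * (w - 1).
    return jnp.sum((w - 1.0) ** 2)

def forward_gradient(f, w, key, num_samples):
    """Estimate grad f(w) using only forward-mode JVPs along random directions."""
    def one_sample(k):
        v = jax.random.normal(k, w.shape)       # random direction
        _, alignment = jax.jvp(f, (w,), (v,))   # directional derivative: grad . v
        return alignment * v                    # its expectation over v is the gradient
    keys = jax.random.split(key, num_samples)
    return jnp.mean(jax.vmap(one_sample)(keys), axis=0)

w = jnp.array([0.0, 2.0, -1.0])
estimate = forward_gradient(loss, w, jax.random.PRNGKey(0), 20_000)
exact = jax.grad(loss)(w)  # reverse mode, used here only as a reference
print(estimate)
print(exact)
```

A single sample is very noisy, so the sketch averages many directions. How noise_step itself uses the alignment values is described in the paper and is not reproduced here.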
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    doesn’t this violate information theory? no, it’s probably that the domain in which models are compressible is the correlation of their loss gradient to noise. or something.
    Image
    GIF
    25K
  • Will
    @_brickner
    Jun 16, 2022
    Replying to @chasebratton
    the true mark of wealth
    Image
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    what about bitnet? bitnet does inference in 1.58b, but training uses full-precision weights. basically they quantize weights to ternary {-1,0,1} in the forward pass, and pretend they didn’t in the backward pass.
    Image
    Image
    35K
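The "pretend they didn't" step described above is commonly called a straight-through estimator. Below is a minimal JAX sketch of that pattern, assuming absmean scaling followed by round-and-clip as the ternary quantizer; the function names and toy data are hypothetical and this is not BitNet's actual code.

```python
import jax
import jax.numpy as jnp

def ternarize(w):
    # Scale by the mean absolute value, then round and clip to {-1, 0, 1}.
    scale = jnp.mean(jnp.abs(w)) + 1e-8
    return jnp.clip(jnp.round(w / scale), -1.0, 1.0)

def ternarize_ste(w):
    # The forward value equals ternarize(w). stop_gradient hides the rounding
    # from autodiff, so the backward pass treats this function as the identity.
    return w + jax.lax.stop_gradient(ternarize(w) - w)

def loss(w, x, y):
    # Toy regression loss that uses ternary weights in the forward pass.
    return jnp.mean((x @ ternarize_ste(w) - y) ** 2)

w = jnp.array([0.9, -0.05, -1.3, 0.4])  # full-precision latent weights
x = jnp.ones((2, 4))
y = jnp.zeros(2)
print(ternarize_ste(w))         # ternary values seen by the forward pass
print(jax.grad(loss)(w, x, y))  # nonzero gradient reaches the latent weights
```

Without the `stop_gradient` trick the gradient through `jnp.round` is zero everywhere, which is why the full-precision copy has to be kept around during training.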
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    it’s often said academia is unwell. I was unimpressed with how these people operate. Real Science needs a massive cultural change. More openness, less hostility, less structure. I’m not a real researcher; freely discard my comments.
    Image
    23K
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    actually, one acknowledgement: call me schizo but months ago I was discussing the algorithm loudly at dinner and I swear @DarioAmodei was watching me, grinning just like this. just eating with his wife. perhaps some things are fated.
    Image
    39K
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    I woke up late, here is a cpu implementation
    lk99_proof.ipynb (Colab notebook, colab.research.google.com)
    26K
  • Will
    @_brickner
    Dec 24, 2024
    Replying to @_brickner
    the other thing is distributed training. the steps are tiny, the optimizer is stateless. imagine: a distributed training cluster across the internet with such low traffic it’s undetectable.
    Image
    22K