Will (@_brickner) / X

Will

3,823 posts

@_brickner

engineer. destroy the whole world

Chicago, IL

Joined November 2015

Pinned
Will
@_brickner
Dec 24, 2024
wrote a paper: it lets you *train* in 1.58b! could use 97% less energy, 90% less weight memory. leads to a new model format which can store a 175B model in ~20mb. also, no backprop!
970K
Will
@_brickner
Feb 15, 2025
Replying to @nearcyan
behold, new food
16K
Will
@_brickner
Jul 7, 2021
Replying to @atomicthumbs
Will
@_brickner
Aug 16, 2024
Replying to @ChazakielDoremi
i was an 8 year old without specific expectations so i loved it
20K
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
arxiv mods rejected this paper. they won’t say why. I don’t really care at this point, took weeks to get approval to submit. I think twitter boys will like it, that’s what matters.
42K
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
pdf:
GitHub - wbrickner/noise_step: noise_step: Training in 1.58b With No Gradient Memory
From github.com
22K
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
the core trick I use comes from `Gradients without Backpropagation`. using the JVP, you can find the alignment of random vectors to the gradient, and reconstruct it. only a forward pass!
31K
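A rough sketch of the trick from `Gradients without Backpropagation`, not the noise_step implementation: sample a random direction `v`, get the directional derivative along `v` from one forward-mode pass (the JVP), and use `(grad . v) * v` as a gradient estimate. The `Dual` class, the toy quadratic `loss`, and every function name here are made up for illustration. Gaussian directions are an assumption too; the paper's own noise distribution may differ.

```python
import random

class Dual:
    """A value plus its tangent (directional derivative), for forward-mode AD."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.tan - other.tan)

    def __mul__(self, other):
        other = _lift(other)
        # product rule carries the tangent through the multiplication
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)
    __rmul__ = __mul__

def _lift(x):
    return x if isinstance(x, Dual) else Dual(float(x))

def loss(w, target):
    """Toy quadratic loss (hypothetical stand-in for a real model's loss)."""
    total = 0.0
    for wi, ti in zip(w, target):
        diff = wi - ti
        total = total + 0.5 * diff * diff
    return total

def jvp(f, w, v):
    """One forward pass: returns (f(w), directional derivative of f along v)."""
    out = f([Dual(wi, vi) for wi, vi in zip(w, v)])
    return out.val, out.tan

def forward_gradient(f, w, rng, samples=1):
    """Average (grad . v) * v over random Gaussian directions v."""
    est = [0.0] * len(w)
    for _ in range(samples):
        v = [rng.gauss(0.0, 1.0) for _ in w]
        _, alignment = jvp(f, w, v)  # scalar: how well v aligns with the gradient
        for i, vi in enumerate(v):
            est[i] += alignment * vi / samples
    return est

if __name__ == "__main__":
    target = [0.0, 0.0, 0.0]
    w = [1.0, -2.0, 0.5]  # true gradient of the toy loss equals w - target
    f = lambda params: loss(params, target)
    print(forward_gradient(f, w, random.Random(0), samples=20000))
```

A single sample is a very noisy estimate. Averaging many samples should pull it toward the true gradient, which for this toy loss is just `w - target`.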
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
doesn’t this violate information theory? no, it’s probably that the domain in which models are compressible is the correlation of their loss gradient to noise. or something.
25K
Will
@_brickner
Jun 16, 2022
Replying to @chasebratton
the true mark of wealth
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
what about bitnet? bitnet does inference in 1.58b, but training uses full-precision weights. basically they clamp weights to ternary {-1,0,1} in the forward pass, and pretend they didn’t in the backward pass.
35K
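A minimal sketch of that "pretend they didn't" trick, usually called a straight-through estimator, for a single linear neuron. This is not BitNet's code. The absmean scaling in `ternarize` is an assumption based on how BitNet b1.58 describes its quantizer, and `ste_step`, the squared-error loss, and the learning rate are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Round weights / mean(|weights|) to the nearest of {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return q, scale

def ste_step(latent, x, y, lr=0.1):
    """One training step: ternary forward pass, full-precision update."""
    # forward pass only ever sees the ternary weights (times one shared scale)
    q, scale = ternarize(latent)
    pred = scale * sum(qi * xi for qi, xi in zip(q, x))
    err = pred - y  # derivative of 0.5 * (pred - y)**2 with respect to pred

    # backward pass: act as if the effective weight were the latent weight
    # itself, so the rounding and clamping contribute a derivative of 1
    new_latent = [w - lr * err * xi for w, xi in zip(latent, x)]
    return new_latent, pred

if __name__ == "__main__":
    latent = [0.9, -0.05, -1.2, 0.3]  # full-precision weights kept during training
    print(ternarize(latent)[0])
    latent, pred = ste_step(latent, x=[1.0, 0.0, 0.0, 0.0], y=0.0)
    print(latent, pred)
```

The point is that the full-precision `latent` list has to stay in memory for the whole of training, even though inference only needs the ternary values.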
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
it’s often said academia is unwell. I was unimpressed with how these people operate. Real Science needs a massive cultural change. More openness, less hostility, less structure. I’m not a real researcher; freely discard my comments.
23K
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
actually, one acknowledgement: call me schizo but months ago I was discussing the algorithm loudly at dinner and I swear @DarioAmodei was watching me, grinning just like this. just eating with his wife. perhaps some things are fated.
39K
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
I woke up late, here is a cpu implementation
colab.research.google.com
lk99_proof.ipynb
Colab notebook
26K
Will
@_brickner
Dec 24, 2024
Replying to @_brickner
the other thing is distributed training. the steps are tiny, the optimizer is stateless. imagine: a distributed training cluster across the internet with such low traffic it’s undetectable.
22K
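One way to picture why traffic could be that low. This is a guess at the mechanics, not the paper's protocol: assume every worker can regenerate a step's ternary noise vector from a shared integer seed, so a step travels as just `(seed, sign)` with `sign` in {-1, 0, 1}. The function names, the learning rate, and the message format are all hypothetical.

```python
import random

def perturbation(seed, n):
    """Regenerate the same ternary noise vector on any machine from the seed."""
    rng = random.Random(seed)
    return [rng.choice((-1, 0, 1)) for _ in range(n)]

def apply_step(weights, seed, sign, lr=0.01):
    """Apply one step; needs only the message, no optimizer state."""
    v = perturbation(seed, len(weights))
    return [w - lr * sign * vi for w, vi in zip(weights, v)]

def replay(init, messages, lr=0.01):
    """Rebuild the weights by replaying (seed, sign) messages in order."""
    w = list(init)
    for seed, sign in messages:
        w = apply_step(w, seed, sign, lr)
    return w

if __name__ == "__main__":
    init = [0.0] * 8
    messages = [(1, 1), (2, -1), (3, 0)]  # all that would cross the network
    worker_a = replay(init, messages)
    worker_b = replay(init, messages)
    print(worker_a == worker_b)
```

Under these assumptions the per-step message size does not depend on the number of parameters, and any worker that knows the initial weights and the message log ends up with identical weights.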