THORN is an optimizer for PyTorch.
It is often a little better than Muon, and it is definitely faster, uses less memory, and supports FSDP.
THORN requires PyTorch >= 2.6. Triton is optional but provides a decent speed boost.
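As a quick environment check, the snippet below (a minimal sketch, not part of THORN's API; the variable names are ours) verifies the PyTorch version and reports whether Triton is installed. With those requirements in place, a typical configuration looks like this:

```python
# Minimal environment check (illustrative only; THORN does not require this code).
from importlib.util import find_spec

import torch

# THORN requires PyTorch >= 2.6.
major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (2, 6), "THORN requires PyTorch >= 2.6"

# Triton is optional, but THORN is faster when it is available.
print("Triton available:", find_spec("triton") is not None)
```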
```python
from thorn import THORN

# `model` is assumed to be your `torch.nn.Module`.
optimizer = THORN([
    {
        # Enable `orthogonalize` for matrix parameters.
        'orthogonalize': True,
        'params': [p for p in model.parameters() if p.ndim >= 2 and p.requires_grad],
        'lr': 0.001,
        'betas': (0.95, 0.95),  # First- and second-order momentum betas
        'gram': False,          # Slower when enabled, but might be a little better
        'nesterov': True,       # Best not to touch
        'ns_steps': 5,          # Best not to touch
        'none_grad': True       # Automatically performs `zero_grad(set_to_none=True)` after each step.
    },
    {
        # Use a non-`orthogonalize` group (AdamW) for everything else (bias/norm parameters).
        'orthogonalize': False,
        'params': [p for p in model.parameters() if p.ndim < 2 and p.requires_grad],
        'lr': 0.001,
        'betas': (0.95, 0.98),
        'weight_decay': 0.01,
        'cautious': True,       # Cautious AdamW; results are a tiny bit better
        'none_grad': True       # Automatically performs `zero_grad(set_to_none=True)` after each step.
    }
])
```
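With `'none_grad': True` in both groups, THORN clears gradients itself after each step, so an ordinary training loop only needs `backward()` and `step()`. Below is a minimal sketch; `model`, `dataset`, and the loss computation are placeholders.

```python
# Standard (non-gradient-release) training loop sketch, assuming the THORN
# optimizer configured as above and a `dataset` that yields training items.
for item in dataset:
    loss = model(item)
    loss.backward()
    optimizer.step()  # with 'none_grad': True, this also sets gradients to None,
                      # so no separate optimizer.zero_grad() call is needed
```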
For memory savings, you can limit gradients to one layer at a time with `thorn_gradient_release`. This slows down FSDP significantly, so it is only recommended on single-GPU setups.

Note that gradient release is not compatible with gradient accumulation or with float16 mixed precision (bfloat16 works). You might be able to get most of the effect of gradient accumulation by tuning the LR and betas.
```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

from thorn import THORN, thorn_gradient_release

model = MyModel().to(dtype=torch.bfloat16)
optimizer = THORN(...)

# Enable gradient release mode.
thorn_gradient_release(model, optimizer)

scheduler = CosineAnnealingLR(optimizer, ...)  # optional

for item in dataset:
    loss = model(item)
    loss.backward()  # <- optimization is done here...
    # ...so do not manually step THORN when gradient release is used!
    # optimizer.step()
    # optimizer.zero_grad()

    # But do step the scheduler if you're using one.
    scheduler.step()
```

- Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks.
- Lim, J., Lee, S., Kim, D., Kim, T., Park, E., Lee, J., … Weon, D. (2025). Motif 2 12.7B technical report.
- Li, Z., Liu, L., Liang, C., Chen, W., & Zhao, T. (2025). NorMuon: Making Muon more efficient and scalable.
- Delattre, B., Barthélemy, Q., Araujo, A., & Allauzen, A. (2023). Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram Iteration.
- Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., … Yang, Z. (2025). Muon is Scalable for LLM Training.
- Liang, K., Chen, L., Liu, B., & Liu, Q. (2025). Cautious Optimizers: Improving Training with One Line of Code.
- Pudipeddi, B., Mesmakhosroshahi, M., Xi, J., & Bharadwaj, S. (2020). Training Large Neural Networks with Constant Memory using a New Execution Algorithm.
- Flash-Muon by Tianyang Lin
- optimī by Benjamin Warner