THORN is an optimizer for PyTorch.
It is often a little better than Muon, and it is definitely faster, uses less memory, and supports FSDP.
THORN requires PyTorch >= 2.6. Triton is optional but provides a decent speed boost.
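As a quick environment check, the snippet below (a minimal sketch, not part of THORN's API; the variable names are ours) verifies the PyTorch version and reports whether Triton is installed. With those requirements in place, a typical configuration looks like this:

```python
# Minimal environment check (illustrative only; THORN does not require this code).
from importlib.util import find_spec

import torch

# THORN requires PyTorch >= 2.6.
major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (2, 6), "THORN requires PyTorch >= 2.6"

# Triton is optional, but THORN is faster when it is available.
print("Triton available:", find_spec("triton") is not None)
```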
```python
from thorn import THORN

# `model` is assumed to be your `torch.nn.Module`.
optimizer = THORN([
    {
        # Enable `orthogonalize` for matrix parameters.
        'orthogonalize': True,
        'params': [p for p in model.parameters() if p.ndim >= 2 and p.requires_grad],
        'lr': 0.001,
        'betas': (0.95, 0.95),  # First- and second-order momentum betas
        'gram': False,          # Slower when enabled, but might be a little better
        'nesterov': True,       # Best not to touch
        'ns_steps': 5,          # Best not to touch
        'none_grad': True       # Automatically performs `zero_grad(set_to_none=True)` after each step.
    },
    {
        # Use a non-`orthogonalize` group (AdamW) for everything else (bias/norm parameters).
        'orthogonalize': False,
        'params': [p for p in model.parameters() if p.ndim < 2 and p.requires_grad],
        'lr': 0.001,
        'betas': (0.95, 0.98),
        'weight_decay': 0.01,
        'cautious': True,       # Cautious AdamW; results are a tiny bit better
        'none_grad': True       # Automatically performs `zero_grad(set_to_none=True)` after each step.
    }
])
```
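With `'none_grad': True` in both groups, THORN clears gradients itself after each step, so an ordinary training loop only needs `backward()` and `step()`. Below is a minimal sketch; `model`, `dataset`, and the loss computation are placeholders.

```python
# Standard (non-gradient-release) training loop sketch, assuming the THORN
# optimizer configured as above and a `dataset` that yields training items.
for item in dataset:
    loss = model(item)
    loss.backward()
    optimizer.step()  # with 'none_grad': True, this also sets gradients to None,
                      # so no separate optimizer.zero_grad() call is needed
```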
For memory savings, you can limit gradients to one layer at a time with `thorn_gradient_release`. This slows down FSDP significantly, so it is only recommended on single-GPU setups.

Note that gradient release is not compatible with gradient accumulation or with float16 mixed precision (bfloat16 works). You might be able to get most of the effect of gradient accumulation by tuning the LR and betas.
```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

from thorn import THORN, thorn_gradient_release

model = MyModel().to(dtype=torch.bfloat16)
optimizer = THORN(...)

# Enable gradient release mode.
thorn_gradient_release(model, optimizer)

scheduler = CosineAnnealingLR(optimizer, ...)  # optional

for item in dataset:
    loss = model(item)
    loss.backward()  # <- optimization is done here...
    # ...so do not manually step THORN when gradient release is used!
    # optimizer.step()
    # optimizer.zero_grad()

    # But do step the scheduler if you're using one.
    scheduler.step()
```

- Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks.
- Lim, J., Lee, S., Kim, D., Kim, T., Park, E., Lee, J., … Weon, D. (2025). Motif 2 12.7B technical report.
- Li, Z., Liu, L., Liang, C., Chen, W., & Zhao, T. (2025). NorMuon: Making Muon more efficient and scalable.
- Delattre, B., Barthélemy, Q., Araujo, A., & Allauzen, A. (2023). Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram Iteration.
- Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., … Yang, Z. (2025). Muon is Scalable for LLM Training.
- Liang, K., Chen, L., Liu, B., & Liu, Q. (2025). Cautious Optimizers: Improving Training with One Line of Code.
- Pudipeddi, B., Mesmakhosroshahi, M., Xi, J., & Bharadwaj, S. (2020). Training Large Neural Networks with Constant Memory using a New Execution Algorithm.
- Flash-Muon by Tianyang Lin
- optimī by Benjamin Warner