THORN 🌹

THORN is an optimizer for PyTorch.

It is often a little better than Muon, definitely faster, uses less memory, and supports FSDP.

Usage

Requires PyTorch >= 2.6. Triton is optional but provides a decent speed boost.

from thorn import THORN

optimizer = THORN([
	{
		# Enable `orthogonalize` for matrix parameters.
		'orthogonalize': True,
		'params': [p for p in model.parameters() if p.ndim >= 2 and p.requires_grad],
		'lr': 0.001,
		'betas': (0.95, 0.95), # First- and second-order momentum betas
		'gram': False,     # Slower when enabled, but might be a little better
		'nesterov': True,  # Best not to touch
		'ns_steps': 5,     # Best not to touch
		'none_grad': True  # Automatically performs `zero_grad(set_to_none=True)` after each step.
	},
	{
		# Use a non-`orthogonalize` group (AdamW) for everything else (bias/norm parameters).
		'orthogonalize': False,
		'params': [p for p in model.parameters() if p.ndim < 2 and p.requires_grad],
		'lr': 0.001,
		'betas': (0.95, 0.98),
		'weight_decay': 0.01,
		'cautious': True, # Cautious AdamW; results are a tiny bit better
		'none_grad': True # Automatically performs `zero_grad(set_to_none=True)` after each step.
	}
])
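For reference, a minimal training loop sketch (assuming `model` and `dataset` are defined as usual). Because `none_grad` is enabled above, THORN clears gradients itself after each step, so no explicit `zero_grad()` call is needed:

for item in dataset:
	loss = model(item)
	loss.backward()
	optimizer.step() # with `none_grad: True`, gradients are cleared here automatically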

Gradient release mode

For memory savings, you can keep only one layer's gradients in memory at a time with thorn_gradient_release. This slows down FSDP significantly, so it is only recommended on single-GPU setups.

Note that gradient release is not compatible with gradient accumulation or float16 mixed precision (bfloat16 works fine). You may be able to recover most of the effect of gradient accumulation by tuning the learning rate and betas.

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from thorn import THORN, thorn_gradient_release

model = MyModel().to(dtype=torch.bfloat16)

optimizer = THORN(...)
# enable gradient release mode
thorn_gradient_release(model, optimizer)

scheduler = CosineAnnealingLR(optimizer, ...) # optional

for item in dataset:
	loss = model(item)
	loss.backward() # <- optimization is done here...

	# ...so do not manually step THORN when gradient release is used!
	#optimizer.step()
	#optimizer.zero_grad()

	# but do step the scheduler if you're using one
	scheduler.step()
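For intuition, here is a minimal sketch of the general idea behind gradient release, built on PyTorch's `register_post_accumulate_grad_hook`. This is only an illustration of the technique with a hypothetical helper, not THORN's actual implementation:

import torch

def gradient_release_sketch(model, make_optimizer):
	# Hypothetical helper: one small optimizer per parameter, stepped as soon as
	# that parameter's gradient is ready, so the full set of gradients is never
	# held in memory at once.
	optimizers = {p: make_optimizer([p]) for p in model.parameters() if p.requires_grad}

	def step_and_release(param):
		opt = optimizers[param]
		opt.step()
		opt.zero_grad(set_to_none=True) # free this parameter's gradient immediately

	for p in optimizers:
		p.register_post_accumulate_grad_hook(step_and_release)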

Based on
