That new LFM2.5-350M is super overtrained, right? And everyone was shocked about how far they pushed it?
As it turns out, we have a brand new scaling law for that! 🧵
[1/n]
📉📉 NEW SCALING LAW PHENOMENON 📉📉
We find that knowledge and reasoning exhibit different scaling behaviors!
Super excited to finally tell you all about our paper on the compute optimal scaling of skills:
arxiv.org/pdf/2503.10061
[1/n]
So many new LLM architectures (Mambas🐍, Transformers🤖,🦙,🦔, Hyenas🐺,🦓…), so little GPU time to combine them into hybrid LLMs…
Good news! Today we release Manticore, a system for creating **pretrained hybrids** from pretrained models! 👨‍🌾🦁🦂
arxiv.org/pdf/2406.00894
1/n
Scaling laws are very dependent on the set you compute perplexity on, according to OpenAI
youtu.be/6nJZopACRuQ?t=…
Fun fact, this was one of the conclusions of our recent paper on scaling laws for skills! Check it out!
arxiv.org/abs/2503.10061
... this is exactly what we do!
We measure compute optimal scaling performance on a variety of knowledge and code benchmarks and find that across ALL of them, knowledge benchmarks favor *larger models*, whereas code benchmarks favor models trained on *more data*!
[4/n]
Tired of reading about superconductors?
Check out our new work that just hit arXiv: arxiv.org/abs/2307.12226 about rethinking how we get predictions out of classifiers, and how to incorporate the *geometry* of the labels, i.e., how labels relate to one another! ⚙️📐
[1/n]
Surprisingly, the answer is NO! If we increase the amount of skill-relevant training data, we see that *the knowledge optima change at a faster rate than the reasoning optima*.
This is exciting, and suggests something fundamental about knowledge vs reasoning.
[7/n]
(First some context) Scaling laws can tell you how to use your compute budget.
In compute optimal scaling, you are given a compute budget, and you need to decide how to spend it, often by balancing model size against the amount of training data.
[2/n]
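A minimal sketch of that tradeoff, assuming the common rule of thumb that training costs roughly 6 FLOPs per parameter per token (C ≈ 6·N·D). That approximation, the function name, and the budget are assumptions for illustration and are not taken from the thread or the paper:

```python
def isoflop_candidates(compute_budget, model_sizes):
    """Return (params, tokens) pairs that each spend the whole FLOP budget.

    Assumes C ~= 6 * N * D, a widely used rule of thumb (an assumption here,
    not something stated in the thread).
    """
    return [(n, compute_budget / (6 * n)) for n in model_sizes]


# Hypothetical budget and model sizes, chosen only to make the numbers round.
budget = 6e18
candidates = isoflop_candidates(budget, [1e8, 2e8, 1e9])

for n_params, n_tokens in candidates:
    # Same budget every time: a bigger model means fewer training tokens.
    print(f"{n_params:.0e} params -> {n_tokens:.1e} tokens")
```

Every pair costs the same compute, so the question is which point along this curve gives the best model.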
Time for some key takeaways:
- Knowledge and reasoning scale differently
- This is fundamental to knowledge/reasoning, and is not explained by data selection
- Your choice of validation set and overall goal can hugely impact compute optima, so choose wisely!
[10/n]
ResNet not working? Use our #NeurIPS2021 paper to find what to use in place of convs by searching our space of “XD-operations” containing convs, Fourier neural operators, graph convs, SOTA ops for neural PDE solvers, and infinitely many more: arxiv.org/abs/2103.15798 [1/n]
The best choice of model size and token count is called the "compute optimum," which is usually selected using a validation set...
What if we made decisions about this tradeoff by measuring specific skills of our models, rather than average perf?
[3/n]
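Here is a rough sketch of what picking the optimum per skill could look like. The loss values below are made-up placeholders, not results from the paper, and `compute_optimum` is a hypothetical helper:

```python
def compute_optimum(losses):
    """Return the (params, tokens) config with the lowest validation loss."""
    return min(losses, key=losses.get)


# PLACEHOLDER numbers for three configs on one FLOP budget.
# They are invented for illustration and are NOT measurements.
val_losses = {
    "knowledge": {(1e8, 1e10): 2.9, (2e8, 5e9): 2.7, (1e9, 1e9): 2.6},
    "code": {(1e8, 1e10): 1.8, (2e8, 5e9): 2.0, (1e9, 1e9): 2.3},
}

# One optimum per skill-specific validation set instead of one global optimum.
optima = {skill: compute_optimum(l) for skill, l in val_losses.items()}

for skill, (n_params, n_tokens) in optima.items():
    print(f"{skill}: {n_params:.0e} params, {n_tokens:.0e} tokens")
```

With a single averaged validation set you would get one answer; splitting by skill lets the two optima disagree.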
It turns out that you can shift the compute optima slightly by increasing the proportion of code or knowledge data in the pretraining dataset...
But does this actually explain the phenomenon? E.g., can we shift the optima enough to flip the script?
[6/n]