Nicholas Roberts (@nick11roberts) / X

Nicholas Roberts

592 posts

@nick11roberts

Ph.D. student @WisconsinCS. Working on foundation models and breaking past scaling laws. Previously CMU @mldcmu, UCSD @ucsd_cse, FCC @fresnocity. 🤔🤨🧐 e/hmm

Madison, WI

nick11roberts.science

Joined April 2012

Pinned
Nicholas Roberts
@nick11roberts
Apr 6
That new LFM2.5-350M is super overtrained, right? And everyone was shocked about how far they pushed it? As it turns out, we have a brand new scaling law for that! 🧵 [1/n]
68K
Nicholas Roberts
@nick11roberts
Mar 21, 2025
📉📉NEW SCALING LAW PHENOMENON 📉📉 We find that knowledge and reasoning exhibit different scaling behaviors! Super excited to finally tell you all about our paper on the compute optimal scaling of skills: arxiv.org/pdf/2503.10061 [1/n]
136K
Nicholas Roberts
@nick11roberts
Jun 5, 2024
So many new LLM architectures (Mambas🐍, Transformers🤖,🦙,🦔, Hyenas🐺,🦓…), so little GPU time to combine them into hybrid LLMs… Good news! Today we release Manticore, a system for creating **pretrained hybrids** from pretrained models! 👨‍🌾🦁🦂 arxiv.org/pdf/2406.00894 [1/n]
43K
Nicholas Roberts
@nick11roberts
Apr 27, 2025
Scaling laws are very dependent on the set you compute perplexity on, according to OpenAI youtu.be/6nJZopACRuQ?t=… Fun fact, this was one of the conclusions of our recent paper on scaling laws for skills! Check it out! arxiv.org/abs/2503.10061
12K
Nicholas Roberts
@nick11roberts
Apr 16, 2021
Life update: I'm super excited to share that I'll be starting my Ph.D. at @WisconsinCS in the Fall!
Nicholas Roberts
@nick11roberts
Mar 21, 2025
Replying to @nick11roberts
... this is exactly what we do! We measure compute optimal scaling performance on a variety of knowledge and code benchmarks and find that for ALL of these, knowledge benchmarks favor *larger models,* whereas code benchmarks favor models trained on *more data!* [4/n]
5.4K
Nicholas Roberts
@nick11roberts
Jul 29, 2023
Tired of reading about superconductors? Check out our new work that just hit arXiv: arxiv.org/abs/2307.12226 about rethinking how we get predictions out of classifiers, and how to incorporate the *geometry* of the labels —i.e., how labels relate to one another! ⚙️📐 [1/n]
arxiv.org
Geometry-Aware Adaptation for Pretrained Models
Machine learning models -- including prominent zero-shot models -- are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped...
41K
Nicholas Roberts
@nick11roberts
Mar 21, 2025
Replying to @nick11roberts
Surprisingly, the answer is NO! If we increase the amount of skill-relevant training data, we see that *the knowledge optima change at a faster rate than reasoning optima.* This is exciting, and suggests something fundamental about knowledge vs reasoning [7/n]
3.3K
Nicholas Roberts
@nick11roberts
Mar 21, 2025
Replying to @nick11roberts
(First some context) Scaling laws can tell you how to use your compute budget. In compute optimal scaling, you are given a compute budget, and you need to decide how to use it, often by balancing model size with the amount of training data [2/n]
6K
Nicholas Roberts
@nick11roberts
Mar 21, 2025
Replying to @nick11roberts
Time for some key takeaways:
- Knowledge and reasoning scale differently
- This is fundamental to knowledge/reasoning, and is unexplained by data selection
- Your choice of validation set and overall goal can hugely impact compute optima, so choose wisely! [10/n]
2.6K
Nicholas Roberts
@nick11roberts
Dec 3, 2021
ResNet not working? Use our #NeurIPS2021 paper to find what to use in place of convs by searching our space of “XD-operations” containing convs, Fourier neural operators, graph convs, SOTA ops for neural PDE solvers, and infinitely many more arxiv.org/abs/2103.15798 [1/n]
Nicholas Roberts
@nick11roberts
Mar 21, 2025
Replying to @nick11roberts
The best choice of model size and token count is called the "compute optimum," which is usually selected using a validation set... What if we made decisions about this tradeoff by measuring specific skills of our models, rather than average perf? [3/n]
5K
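For readers new to the idea, here is a minimal sketch of what picking a "compute optimum" can look like. It is not the method from the paper in this thread. It assumes the common approximation that training FLOPs are about 6 x parameters x tokens, and a parametric loss of the form E + A/N^alpha + B/D^beta. The constants are placeholders loosely based on commonly cited published fits, and the names `loss` and `compute_optimum` are hypothetical.

```python
import math


def loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Assumed parametric loss; the constants are illustrative placeholders."""
    return E + A / n_params**alpha + B / n_tokens**beta


def compute_optimum(flops, num_candidates=2000):
    """Grid-search the model size that minimizes loss at a fixed FLOP budget.

    Assumes flops ~= 6 * n_params * n_tokens, so choosing n_params
    determines n_tokens. Returns (n_params, n_tokens, loss).
    """
    best = None
    for i in range(num_candidates):
        # Candidate model sizes, log-spaced from 1e6 to 1e12 parameters.
        n = 10 ** (6 + 6 * i / (num_candidates - 1))
        d = flops / (6 * n)  # tokens affordable at this model size
        candidate = (n, d, loss(n, d))
        if best is None or candidate[2] < best[2]:
            best = candidate
    return best


if __name__ == "__main__":
    n, d, l = compute_optimum(1e21)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}, predicted loss ~ {l:.3f}")
```

If the loss were instead fit on a different validation set (say, a knowledge benchmark versus a code benchmark), the fitted constants would generally differ, and so would the model size and token count this search returns.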
Nicholas Roberts
@nick11roberts
Mar 21, 2025
Replying to @nick11roberts
arxiv.org
Compute Optimal Scaling of Skills: Knowledge vs Reasoning
Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading-off parameter count and dataset...
2.4K
Nicholas Roberts
@nick11roberts
Mar 21, 2025
Replying to @nick11roberts
It turns out that you can manipulate the compute optima, slightly, by increasing the proportion of code or knowledge data in the pretraining dataset... But does this actually explain the phenomenon? E.g. can we manipulate the optima enough to flip the script? [6/n]
3.5K