Nicholas Roberts
592 posts
Nicholas Roberts
@nick11roberts
Ph.D. student @WisconsinCS. Working on foundation models and breaking past scaling laws. Previously CMU @mldcmu, UCSD @ucsd_cse, FCC @fresnocity. 🤔🤨🧐 e/hmm
Madison, WI
nick11roberts.science
Joined April 2012
1,900 Following
1,448 Followers
  • Pinned
    Nicholas Roberts
    @nick11roberts
    Apr 6
    That new LFM2.5-350M is super overtrained, right? And everyone was shocked about how far they pushed it? As it turns out, we have a brand new scaling law for that! 🧵 [1/n]
    68K
  • Nicholas Roberts
    @nick11roberts
    Mar 21, 2025
    📉📉NEW SCALING LAW PHENOMENON 📉📉 We find that knowledge and reasoning exhibit different scaling behaviors! Super excited to finally tell you all about our paper on the compute optimal scaling of skills: arxiv.org/pdf/2503.10061 [1/n]
    136K
  • Nicholas Roberts
    @nick11roberts
    Jun 5, 2024
    So many new LLM architectures (Mambas🐍, Transformers🤖,🦙,🦔, Hyenas🐺,🦓…), so little GPU time to combine them into hybrid LLMs… Good news! Today we release Manticore, a system for creating **pretrained hybrids** from pretrained models! 👨‍🌾🦁🦂 arxiv.org/pdf/2406.00894 1/n
    43K
  • Nicholas Roberts
    @nick11roberts
    Apr 27, 2025
    Scaling laws are very dependent on the set you compute perplexity on, according to OpenAI youtu.be/6nJZopACRuQ?t=… Fun fact, this was one of the conclusions of our recent paper on scaling laws for skills! Check it out! arxiv.org/abs/2503.10061
    12K
  • Nicholas Roberts
    @nick11roberts
    Apr 16, 2021
    Life update: I'm super excited to share that I'll be starting my Ph.D. at @WisconsinCS in the Fall!
  • Nicholas Roberts
    @nick11roberts
    Mar 21, 2025
    Replying to @nick11roberts
    ... this is exactly what we do! We measure compute optimal scaling performance on a variety of knowledge and code benchmarks and find that for ALL of these, knowledge benchmarks favor *larger models,* whereas code benchmarks favor models trained on *more data!* [4/n]
    5.4K
  • Nicholas Roberts
    @nick11roberts
    Jul 29, 2023
    Tired of reading about superconductors? Check out our new work that just hit arXiv: arxiv.org/abs/2307.12226 about rethinking how we get predictions out of classifiers, and how to incorporate the *geometry* of the labels —i.e., how labels relate to one another! ⚙️📐 [1/n]
    arxiv.org
    Geometry-Aware Adaptation for Pretrained Models
    Machine learning models -- including prominent zero-shot models -- are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped...
    41K
  • Nicholas Roberts
    @nick11roberts
    Mar 21, 2025
    Replying to @nick11roberts
    Surprisingly, the answer is NO! If we increase the amount of skill-relevant training data, we see that *the knowledge optima change at a faster rate than reasoning optima.* This is exciting, and suggests something fundamental about knowledge vs reasoning [7/n]
    3.3K
  • Nicholas Roberts
    @nick11roberts
    Mar 21, 2025
    Replying to @nick11roberts
    (First some context) Scaling laws can tell you how to use your compute budget. In compute optimal scaling, you are given a compute budget, and you need to decide how to use it, often by balancing model size with the amount of training data [2/n]
    6K
  • Nicholas Roberts
    @nick11roberts
    Mar 21, 2025
    Replying to @nick11roberts
    Time for some key takeaways:
    - Knowledge and reasoning scale differently
    - This is fundamental to knowledge/reasoning, and is unexplained by data selection
    - Your choice of validation set and overall goal can hugely impact compute optima, so choose wisely! [10/n]
    2.6K
  • Nicholas Roberts
    @nick11roberts
    Dec 3, 2021
    ResNet not working? Use our #NeurIPS2021 paper to find what to use in place of convs by searching our space of “XD-operations” containing convs, Fourier neural operators, graph convs, SOTA ops for neural PDE solvers, and infinitely many more arxiv.org/abs/2103.15798 [1/n]
  • Nicholas Roberts
    @nick11roberts
    Mar 21, 2025
    Replying to @nick11roberts
    The best choice of model size and token count is called the "compute optimum," which is usually selected using a validation set... What if we made decisions about this tradeoff by measuring specific skills of our models, rather than average perf? [3/n]
    5K
  • Nicholas Roberts
    @nick11roberts
    Mar 21, 2025
    Replying to @nick11roberts
    arxiv.org
    Compute Optimal Scaling of Skills: Knowledge vs Reasoning
    Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading-off parameter count and dataset...
    2.4K
  • Nicholas Roberts
    @nick11roberts
    Mar 21, 2025
    Replying to @nick11roberts
    It turns out that you can manipulate the compute optima, slightly, by increasing the proportion of code or knowledge data in the pretraining dataset... But does this actually explain the phenomenon? E.g. can we manipulate the optima enough to flip the script? [6/n]
    3.5K
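
The compute-optimal tradeoff described in posts [2/n] and [3/n] above can be sketched in a few lines of Python. This is a minimal illustration, not the method from the paper. It assumes the common approximation that training compute is about 6 x parameters x tokens, and a Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta. The coefficient values, the function names, and the grid range are placeholders chosen for illustration; they are not taken from the thread or from arxiv.org/abs/2503.10061.

```python
import math

# ASSUMPTION: illustrative coefficients for a parametric loss
# L(N, D) = E + A / N**alpha + B / D**beta. Not fitted values from the thread.
COEFFS = {"E": 1.69, "A": 406.4, "B": 410.7, "alpha": 0.34, "beta": 0.28}


def loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Parametric loss as a function of model size and training tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta


def compute_optimum(budget_flops, coeffs, grid=2000):
    """Grid-search the model size that minimizes loss at a fixed budget.

    ASSUMPTION: compute is approximated as C = 6 * N * D, so choosing N
    fixes D = C / (6 * N). The search range (1e6 to 1e12 parameters) is
    a hypothetical choice for this sketch.
    """
    lo, hi = math.log(1e6), math.log(1e12)
    best = None
    for i in range(grid + 1):
        n = math.exp(lo + (hi - lo) * i / grid)  # log-spaced model sizes
        d = budget_flops / (6.0 * n)             # tokens implied by the budget
        value = loss(n, d, **coeffs)
        if best is None or value < best[2]:
            best = (n, d, value)
    return best  # (n_params, n_tokens, loss)


if __name__ == "__main__":
    n, d, value = compute_optimum(1e21, COEFFS)
    print(f"params={n:.3e} tokens={d:.3e} loss={value:.4f}")
```

Fitting the coefficients on a different validation set (for example a knowledge benchmark versus a code benchmark) changes where the minimum falls, which is the sense in which the choice of validation set moves the compute optimum. In this sketch, raising A relative to B shifts the optimum toward larger models.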
