Pratyush Maini
DatologyAI
801 posts
@pratyushmaini
Data Quality x Memorization | Founding Team @datologyai | PhD @mldcmu | BTech @iitdelhi
pratyushmaini.github.io
Joined November 2019
572
Following
3,406
Followers
  • Pinned
    Pratyush Maini
    DatologyAI
    @pratyushmaini
    Aug 18, 2025
    1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳 - 3B LLMs beat 8B models🚀 - Pareto frontier for performance
    187K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Jun 12, 2024
    1/We've nailed a framework to reliably detect if an LLM was trained on your dataset: LLM Dataset Inference. After over a year of thinking of writing about how hard this is, we had a breakthrough that made me quite literally jump from my seat! 📝: arxiv.org/abs/2406.06443 Long🧵
    90K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Jan 30, 2024
    1/7 Super excited about my Apple Internship work finally coming out: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web! 📝 arxiv.org/abs/2401.16380
    119K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Nov 29, 2022
    1/ML Datasets contain hard examples. Some of them are mislabeled, some rare & some complex. All of them are learned late. In our #NeurIPS2022 paper we separate them using second-split *forgetting time*! 📝 tinyurl.com/ssft1 w/@saurabh_garg67 @zacharylipton @zicokolter 🧵
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Apr 12, 2024
    1/ 🥁Scaling Laws for Data Filtering 🥁 TLDR: Data Curation *cannot* be compute agnostic! In our #CVPR2024 paper, we develop the first scaling laws for heterogeneous & limited web data. w/@goyalsachin007 @zacharylipton @AdtRaghunathan @zicokolter 📝: arxiv.org/abs/2404.07177
    87K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Oct 29, 2025
    1/It's not often that academic projects get industry-wide adoption. I've been fortunate to develop Rephrasing the Web, a synthetic pretraining pipeline that powers pretty much EVERY frontier model today. But probably no one knows: our paper was REJECTED from a workshop 2 yrs ago
    46K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Dec 17, 2024
    Overheard at NeurIPS: "I wanted to apply for a PhD at ___ University, but I couldn't find three faculty who aren't on a startup-ical."
    33K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Sep 17, 2025
    One thing years of memorization research has made clear: unlearning is fundamentally hard. Neurons are polysemantic & concepts are massively distributed. There’s no clean 'delete'. We need architectures that are "unlearnable by design". Introducing, Memorization Sinks 🛁⬇️
    Aditi Raghunathan
    @AdtRaghunathan
    Sep 17, 2025
    There’s been a lot of work on unlearning in LLMs, trying to erase memorization without hurting capabilities, but we haven’t seen much success. ❓What if unlearning is actually doomed from the start? 👇This thread explains why and how *memorization sinks* offer a new way forward.
    21K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Oct 21, 2024
    One of my dreams when I started my PhD was to teach my own course. I am very excited that I'm getting a chance to create & teach a new "gamified" course at @SCSatCMU this Fall. 10-799: Data Privacy, Memorization & Copyright in GenAI starts tomorrow! pratyushmaini.github.io/cmu-10-799 1/🧵
    12K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    May 3, 2024
    Our work on Scaling Laws for Data Filtering won the Best Paper Award 🏆 at the Data Problems for Foundation Models Workshop at ICLR2024!! Join us in Vienna on May 11 to hear more about the work!
    Pratyush Maini
    DatologyAI
    @pratyushmaini
    Apr 12, 2024
    1/ 🥁Scaling Laws for Data Filtering 🥁 TLDR: Data Curation *cannot* be compute agnostic! In our #CVPR2024 paper, we develop the first scaling laws for heterogeneous & limited web data. w/@goyalsachin007 @zacharylipton @AdtRaghunathan @zicokolter 📝: arxiv.org/abs/2404.07177
    23K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Sep 25, 2023
    Phi-1.5: A case of comparing 🍎 to 🍊? There’ve been concerns about Phi-1.5 being trained on benchmark data. I found something more nuanced & concerning. When evaluated on Perplexity, Phi-1.5 is 2x worse than similarly-sized Opt. What is going on? 🧵 📝: pratyushmaini.github.io/phi-1_5
    58K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Apr 25, 2024
    1/What does it mean for an LLM to “memorize” a doc? Exactly regurgitating a NYT article? Of course. Just training on NYT? Harder to say. We take big strides in this discourse w/*Adversarial Compression* w/@A_v_i__S @zhilifeng @zacharylipton @zicokolter 🌐: locuslab.github.io/acr-memorizati… 🧵
    49K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Mar 17, 2025
    1/Being in academia is such a privilege: You get to collaborate with insanely talented & passionate students on their journey to upskill themselves. Very excited to share *OpenUnlearning*: a unified, easily extensible framework for unlearning led by @anmol_mekala @VineethDorna 🧵
    28K
  • Pratyush Maini
    DatologyAI
    @pratyushmaini
    Apr 29, 2021
    1/Are you worried that an ML model may be a stolen copy of your model? We introduce *Dataset Inference* in our #ICLR2021 Spotlight paper to resolve model ownership. Paper: arxiv.org/abs/2104.10706 Blog and Video: cleverhans.io/2021/04/28/is-… w/@MYaghini @NicolasPapernot
