Pratyush Maini (@pratyushmaini) / X

@pratyushmaini

Data Quality x Memorization | Founding Team @datologyai | PhD @mldcmu | BTech @iitdelhi

pratyushmaini.github.io

Joined November 2019

Pinned
Pratyush Maini
@pratyushmaini
Aug 18, 2025
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳 - 3B LLMs beat 8B models🚀 - Pareto frontier for performance
Pratyush Maini
@pratyushmaini
Jun 12, 2024
1/We've nailed a framework to reliably detect if an LLM was trained on your dataset: LLM Dataset Inference. After over a year of thinking of writing about how hard this is, we had a breakthrough that made me quite literally jump from my seat! 📝: arxiv.org/abs/2406.06443 Long🧵
Pratyush Maini
@pratyushmaini
Jan 30, 2024
1/7 Super excited about my Apple Internship work finally coming out: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web! 📝 arxiv.org/abs/2401.16380
Pratyush Maini
@pratyushmaini
Nov 29, 2022
1/ML Datasets contain hard examples. Some of them are mislabeled, some rare & some complex. All of them are learned late. In our #NeurIPS2022 paper we separate them using second-split *forgetting time*! 📝tinyurl.com/ssft1 w/@saurabh_garg67 @zacharylipton @zicokolter 🧵
Pratyush Maini
@pratyushmaini
Apr 12, 2024
1/ 🥁Scaling Laws for Data Filtering 🥁 TLDR: Data Curation *cannot* be compute agnostic! In our #CVPR2024 paper, we develop the first scaling laws for heterogeneous & limited web data. w/@goyalsachin007 @zacharylipton @AdtRaghunathan @zicokolter 📝:arxiv.org/abs/2404.07177
Pratyush Maini
@pratyushmaini
Oct 29, 2025
1/It's not often that academic projects get industry-wide adoption. I've been fortunate to develop Rephrasing the Web, a synthetic pretraining pipeline that powers pretty much EVERY frontier model today. But probably no one knows: our paper was REJECTED from a workshop 2 yrs ago
Pratyush Maini
@pratyushmaini
Dec 17, 2024
Overheard at NeurIPS: "I wanted to apply for a PhD at ___ University, but I couldn't find three faculty who aren't on a startup-ical."
Pratyush Maini
@pratyushmaini
Sep 17, 2025
One thing years of memorization research has made clear: unlearning is fundamentally hard. Neurons are polysemantic & concepts are massively distributed. There’s no clean 'delete'. We need architectures that are "unlearnable by design". Introducing, Memorization Sinks 🛁⬇️
Aditi Raghunathan
@AdtRaghunathan
Sep 17, 2025
There’s been a lot of work on unlearning in LLMs, trying to erase memorization without hurting capabilities — but we haven’t seen much success. ❓What if unlearning is actually doomed from the start? 👇This thread explains why and how *memorization sinks* offer a new way forward.
Pratyush Maini
@pratyushmaini
Oct 21, 2024
One of my dreams when I started my PhD was to teach my own course. I am very excited that I'm getting a chance to create & teach a new "gamified" course at @SCSatCMU this Fall. 10-799: Data Privacy, Memorization & Copyright in GenAI starts tomorrow! pratyushmaini.github.io/cmu-10-799 1/🧵
Pratyush Maini
@pratyushmaini
May 3, 2024
Our work on Scaling Laws for Data Filtering won the Best Paper Award 🏆 at the Data Problems for Foundation Models Workshop at ICLR2024!! Join us in Vienna on May 11 to hear more about the work!
Pratyush Maini
@pratyushmaini
Sep 25, 2023
Phi-1.5: A case of comparing 🍎to🍊? There’ve been concerns about Phi-1.5 being trained on benchmark data. I found something more nuanced & concerning. When evaluated on perplexity, Phi-1.5 is 2x worse than similarly-sized OPT. What is going on? 🧵 📝:pratyushmaini.github.io/phi-1_5
Pratyush Maini
@pratyushmaini
Apr 25, 2024
1/What does it mean for an LLM to “memorize” a doc? Exactly regurgitating a NYT article? Of course. Just training on NYT? Harder to say. We take big strides in this discourse w/*Adversarial Compression* w/@A_v_i__S @zhilifeng @zacharylipton @zicokolter 🌐:locuslab.github.io/acr-memorizati…🧵
Pratyush Maini
@pratyushmaini
Mar 17, 2025
1/Being in academia is such a privilege: You get to collaborate with insanely talented & passionate students on their journey to upskill themselves. Very excited to share *OpenUnlearning*: a unified, easily extensible framework for unlearning led by @anmol_mekala @VineethDorna🧵
Pratyush Maini
@pratyushmaini
Apr 29, 2021
1/Are you worried that an ML model may be a stolen copy of your model? We introduce *Dataset Inference* in our #ICLR2021 Spotlight paper to resolve model ownership. Paper: arxiv.org/abs/2104.10706 Blog and Video: cleverhans.io/2021/04/28/is-… w/@MYaghini @NicolasPapernot