Ari Morcos (@arimorcos) / X

Ari Morcos

@arimorcos

CEO and Co-founder @datologyai working to make it easy for anyone to make the most of their data. Former: RS @AIatMeta (FAIR), RS @DeepMind, PhD @PiN_Harvard.

Bay Area, CA

Joined April 2009

Pinned
Ari Morcos
@arimorcos
Feb 18
"The future is already here – it's just not evenly distributed." - William Gibson
Today we're sharing results from ÜberWeb, our multilingual curation suite, which matches many of the best open models with a fraction of the train compute, all through better data.
Ricardo Monti
@RicardoMonti9
Feb 18
1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyai shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.
Ari Morcos
@arimorcos
Jan 29, 2025
This is deeply misleading -- those 800k samples were synthetic reasoning traces, explicitly not generated by humans.
Alexandr Wang
@alexandr_wang
Jan 29, 2025
What does DeepSeek R1 & v3 mean for LLM data? Contrary to some lazy takes I’ve seen, DeepSeek R1 was trained on a shit ton of human-generated data—in fact, the DeepSeek models are setting records for the disclosed amount of post-training data for open-source models: - 600,000
Ari Morcos
@arimorcos
Feb 22, 2024
I'm incredibly excited to announce our new company, @datologyai! Training models is hard and identifying the right data is the most important and difficult part -- our goal @datologyai is to make optimizing training data at scale easy and automatic across modalities.
Ari Morcos
@arimorcos
Jul 5, 2022
Neural scaling laws are great for predictability, but power law scaling is slow, especially in the large data regime when 10x the data results in small gains. Can we do better? We show that exponential scaling is possible via intelligent data pruning. arxiv.org/abs/2206.14486
Ari Morcos
@arimorcos
Mar 20, 2023
Web-scale data has driven the incredible progress in AI but do we really need all that data? We introduce SemDeDup, an exceedingly simple method to remove semantic duplicates in web data which can reduce the LAION dataset (& train time) by 2x w/ minimal performance loss. 🧵👇
Ari Morcos
@arimorcos
Nov 18, 2021
We know invariance is important for generalization, but what is the source of this invariance? Does it come from the architecture, augmentations, or the data itself? In our #NeurIPS2021 paper led by @marksibrahim and @D_Bouchacourt, we aim to find out. arxiv.org/abs/2106.05121
Ari Morcos
@arimorcos
Mar 26, 2020
Most approaches to learning generalizable representations have focused on constraining the structure of the representation. But what if you instead constrain *how representations can be manipulated*? We introduce latent canonicalization to test this: arxiv.org/abs/2002.11829
Ari Morcos
@arimorcos
Oct 15, 2020
Are all negatives created equal in contrastive instance discrimination? In new work led by Tiffany Cai, we show that only the hardest 5% of negatives per query are both necessary and largely sufficient for self-supervised learning. Tweetprint time! arxiv.org/abs/2010.06682
Ari Morcos
@arimorcos
Apr 9, 2024
Repeat after me: data >>>>>> architecture. Given enough quality data, many different architectures can achieve comparable performance. The secret sauce was, is, and remains the data, not the model.
Samuel L Smith
@SamuelMLSmith
Apr 9, 2024
Announcing RecurrentGemma! github.com/google-deepmin…
- A 2B model with open weights based on Griffin
- Replaces transformer with mix of gated linear recurrences and local attention
- Competitive with Gemma-2B on downstream evals
- Higher throughput when sampling long sequences
Ari Morcos
@arimorcos
Oct 27, 2023
Transformers are great, but I think their importance to the insane progress of the last few years has been massively overstated. The key was and is larger, higher quality datasets.
Yann LeCun
@ylecun
Oct 26, 2023
Compute is all you need. For a given amount of compute, ViT and ConvNets perform the same. Quote from this DeepMind article: "Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs
Ari Morcos
@arimorcos
Mar 21, 2018
Excited to share our blog post on our @ICLR18 paper!
* Easy-to-interpret neurons are no more important than hard-to-interpret neurons
* Generalizing networks are more robust to neuron deletion than memorizing networks
Blog: deepmind.com/blog/understan…
Paper: arxiv.org/abs/1803.06959
Understanding deep learning through neuron deletion
From deepmind.google
Ari Morcos
@arimorcos
Mar 2, 2020
Recent studies have suggested that the earliest iterations of DNN training are especially critical. In our #ICLR2020 paper with @jefrankle and @davidjschwab, we use the lottery ticket framework to rigorously examine this crucial phase of training. arxiv.org/abs/2002.10365
Ari Morcos
@arimorcos
May 30, 2018
Timely paper from @ShibaniSan, Dimitris Tsipras, @andrew_ilyas, and @aleks_madry providing some new insights into why batch norm works. They perform a number of clever experiments to work it out, finding that internal covariate shift is a red herring! arxiv.org/abs/1805.11604
Ari Morcos
@arimorcos
Feb 1, 2025
Unfortunately, this has long been a problem of interpretability methods, so much so that @leavittron and I wrote a perspective about it, focusing on saliency approaches which were very popular a few years ago and had a similar bevy of issues. arxiv.org/abs/2010.12016
KZ
@kzSlider
Feb 1, 2025
Damn, triple-homicide in one day. SAEs really taking a beating recently