Ari Morcos
DatologyAI
@arimorcos
CEO and Co-founder @datologyai working to make it easy for anyone to make the most of their data. Former: RS @AIatMeta (FAIR), RS @DeepMind, PhD @PiN_Harvard.
Bay Area, CA
Joined April 2009
  • Pinned
    "The future is already here – it's just not evenly distributed." - William Gibson Today we're sharing results from ÜberWeb, our multilingual curation suite, which matches many of the best open models with a fraction of the train compute, all through better data.
    1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyai shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.
  • This is deeply misleading -- those 800k samples were synthetic reasoning traces, explicitly not generated by humans.
    What does DeepSeek R1 & v3 mean for LLM data? Contrary to some lazy takes I’ve seen, DeepSeek R1 was trained on a shit ton of human-generated data—in fact, the DeepSeek models are setting records for the disclosed amount of post-training data for open-source models: - 600,000
  • I'm incredibly excited to announce our new company, @datologyai! Training models is hard and identifying the right data is the most important and difficult part -- our goal @datologyai is to make optimizing training data at scale easy and automatic across modalities.
  • Neural scaling laws are great for predictability, but power law scaling is slow, especially in the large data regime when 10x the data results in small gains. Can we do better? We show that exponential scaling is possible via intelligent data pruning. arxiv.org/abs/2206.14486
  • Web-scale data has driven the incredible progress in AI but do we really need all that data? We introduce SemDeDup, an exceedingly simple method to remove semantic duplicates in web data which can reduce the LAION dataset (& train time) by 2x w/ minimal performance loss. 🧵👇
  • We know invariance is important for generalization, but what is the source of this invariance? Does it come from the architecture, augmentations, or the data itself? In our #NeurIPS2021 paper led by @marksibrahim and @D_Bouchacourt, we aim to find out. arxiv.org/abs/2106.05121
  • Most approaches to learning generalizable representations have focused on constraining the structure of the representation. But what if you instead constrain *how representations can be manipulated*? We introduce latent canonicalization to test this: arxiv.org/abs/2002.11829
  • Are all negatives created equal in contrastive instance discrimination? In new work led by Tiffany Cai, we show that only the hardest 5% of negatives per query are both necessary and largely sufficient for self-supervised learning. Tweetprint time! arxiv.org/abs/2010.06682
  • Repeat after me: data >>>>>> architecture. Given enough quality data, many different architectures can achieve comparable performance. The secret sauce was, is, and remains the data, not the model.
    Announcing RecurrentGemma! github.com/google-deepmin… - A 2B model with open weights based on Griffin - Replaces transformer with mix of gated linear recurrences and local attention - Competitive with Gemma-2B on downstream evals - Higher throughput when sampling long sequences
  • Transformers are great, but I think their importance to the insane progress of the last few years has been massively overstated. The key was and is larger, higher quality datasets.
    Compute is all you need. For a given amount of compute, ViT and ConvNets perform the same. Quote from this DeepMind article: "Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs
  • Excited to share our blog post on our @ICLR18 paper! * Easy-to-interpret neurons are no more important than hard-to-interpret neurons * Generalizing networks are more robust to neuron deletion than memorizing networks Blog: deepmind.com/blog/understan… Paper: arxiv.org/abs/1803.06959
  • Recent studies have suggested that the earliest iterations of DNN training are especially critical. In our #ICLR2020 paper with @jefrankle and @davidjschwab, we use the lottery ticket framework to rigorously examine this crucial phase of training. arxiv.org/abs/2002.10365
  • Timely paper from @ShibaniSan, Dimitris Tsipras, @andrew_ilyas, and @aleks_madry providing some new insights into why batch norm works. They perform a number of clever experiments to work it out, finding that internal covariate shift is a red herring! arxiv.org/abs/1805.11604
  • Unfortunately, this has long been a problem of interpretability methods, so much so that @leavittron and I wrote a perspective about it, focusing on saliency approaches which were very popular a few years ago and had a similar bevy of issues. arxiv.org/abs/2010.12016
    Damn, triple-homicide in one day. SAEs really taking a beating recently