Jesse Dodge (@JesseDodge) / X

Research Scientist at Meta. 10-yr test-of-time ACL 22, Best Demo ACL 25, Best Resource Paper ACL 24, Best Theme Paper ACL 24, Best Student Paper NAACL 15 🏳️‍🌈

Joined March 2009

Jesse Dodge
@JesseDodge
Dec 6, 2023
Today Google released Gemini with a 60-page report in which they repeatedly say the training data is key ("We find that data quality is critical to a highly-performing model"), while providing almost no information about how it was made, how it was filtered, or its contents.
Jesse Dodge
@JesseDodge
Dec 9, 2020
GPT-3 won a best paper award at #NeurIPS2020! Congratulations to that team, it truly is an incredible piece of work, and has changed the way many of us think about what massive LMs can do. But we should also talk about inequality in the research community -- that work couldn't...
Jesse Dodge
@JesseDodge
May 10, 2023
Today Google announced PaLM 2. In their 91 page paper they repeatedly say the training data is key ("we find that the data mixture is a critical component of the final model") while providing almost no information about how it was constructed, how it was sourced, or its contents.
Jesse Dodge
@JesseDodge
Feb 18, 2020
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping arxiv.org/abs/2002.06305 We found surprisingly large variance just from random seeds when fine-tuning BERT. Both weight inits and the order of the training data have big impact. 1/n
Jesse Dodge
@JesseDodge
Aug 15, 2025
Personal update: I'm excited to be joining @Meta! I'm deeply grateful for the opportunities I've had at @allen_ai over the past 6 years (including three paper awards in the last two years). Onward to the next chapter! 🥳
Jesse Dodge
@JesseDodge
May 25, 2022
WE WON THE ACL 10-YEAR TEST-OF-TIME AWARD!! Ten thousand congratulations to @mmitchell_ai and our co-authors Amit Goyal, @kotymg, @karlstratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara Berg, and @haldaume3!
ACL 2026
@aclmeeting
May 25, 2022
The second of the #acl2022nlp 10-year test of time awards goes to @mmitchell_ai et al. for their work on generating image descriptions published at EACL 2012 #NLProc aclanthology.org/E12-1076/
Jesse Dodge
@JesseDodge
Dec 6, 2023
Replying to @JesseDodge
This follows the trend of white papers that are written to read like research papers which don't actually contain the necessary information for basic science. This is a product, and they are purposely obscuring the most important information that makes the models work.
Jesse Dodge
@JesseDodge
Jun 22, 2022
How much CO2 is emitted from training common AI models? New FAccT paper! *Partially* training a 6 B. param. transformer emits about as much as the average US home in a year! Smaller models? Only as much as charging a phone. What can you do? A 🧵: arxiv.org/pdf/2206.05229…
Jesse Dodge
@JesseDodge
Aug 14, 2024
Congrats to our team for winning two paper awards at #ACL2024! OLMo won the Best Theme Paper award, and Dolma won a Best Resource Paper award! All the credit goes to the whole team for the massive group effort 🎉🎉
Jesse Dodge
@JesseDodge
May 8, 2020
Successfully defended my Ph.D. under quarantine!
Jesse Dodge
@JesseDodge
Apr 18, 2021
Ever wonder about the web-scale data massive LMs train on? We wrote some docs for C4! cs.cmu.edu/~jessed/data_h… And we indexed it, and built an interactive demo, so you can search too: c4-search.apps.allenai.org find something cool? report it or discuss here: github.com/allenai/c4-doc…
Jesse Dodge
@JesseDodge
Apr 19, 2023
The best way to understand large language models is to understand what they were trained on. Most pretraining datasets have *zero* documentation of their contents! We worked with @nitashatiku and the other WaPo journalists on this piece, check it out!
Nitasha Tiku
@nitashatiku
Apr 19, 2023
Replying to @nitashatiku
Here's our analysis of the 15 million websites in just one highly-filtered CommonCrawl web scrape, used to train models like Google's T5 & Facebook's LLaMA:
- copyright symbol appears >200M times
- pirated sites, 1 for e-books
- half the top 10 = news sites
washingtonpost.com/technology/int…
Jesse Dodge
@JesseDodge
May 10, 2023
Replying to @JesseDodge
Now that LLMs are products (not just research), we are at a turning point: for-profit companies will become less and less transparent *specifically* about the components that are most important. Only if the open source community can organize together can we keep up!
Jesse Dodge
@JesseDodge
Jul 1, 2021
could not be more proud of @MaartenSap, who just *successfully defended* one of the best PhD theses I've seen. he's already had a successful career, and he's only getting started!