Alex Wettig (@_awettig) / X

268 posts

Alex Wettig

@_awettig

composer-ing @cursor_ai

cs.princeton.edu/~awettig/

Joined July 2022

Alex Wettig
@_awettig
Feb 18, 2025
🤔 Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
50K
Alex Wettig
@_awettig
Oct 4, 2024
How to train long-context LMs? (and beat Llama-3.1 🏆) Many takeaways from our new paper!
- Focus on diverse & reliable evaluations (not just perplexity)
- Find good sources of long data and high-quality short data
- ...
A 🧵 on how we produced ProLong, a SoTA 8B 512K model
21K
Alex Wettig
@_awettig
Feb 16, 2024
**QuRating**: We get 4 quality signals from GPT-3.5 for selecting LM training data. We select 30B out of 260B tokens and train 1.3B LMs from scratch. We find that QuRating can improve perplexity and ICL performance ✅ (w/ Aatmik Gupta, Sauma Malik, @danqi_chen)
13K
Alex Wettig
@_awettig
Jul 25, 2024
Stop by the QuRating *spotlight* poster this afternoon to chat about data quality for LMs ⏰: 1:30-3pm CET /📍: Hall C 4-9 #617
6.7K
Alex Wettig
@_awettig
Jul 16, 2025
Presenting two posters at ICML over the next two days:
- Both at 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall # E-2600 in East Exhibition Hall A-B (!)
Tomorrow: WebOrganizer w/ @soldni & @kylelostat
Thursday: MeCo by @gaotianyu1350
12K
Alex Wettig
@_awettig
Apr 2, 2024
Stay tuned for our pre-print next week with lots of insights on how to build good SWE agents 🕵️‍♂️
John Yang
@jyangballin
Apr 2, 2024
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code github.com/princeton-nlp/…
4.4K
Alex Wettig
@_awettig
Jul 22, 2024
Simple strategy:
(1) keep pre-training with HQ mix of long & short documents
(2) quick instruction-tuning with ONLY short UltraChat
We find: avg. performance on our long-context evals keeps improving with increasing continual pre-training budgets
Tianyu Gao
@gaotianyu1350
Jul 22, 2024
Meet ProLong, a Llama-3 based long-context chat model! huggingface.co/princeton-nlp/… (64K here, 512K coming soon) ProLong uses a simple recipe (short/long pre-training data + short UltraChat, no synthetic instructions) and achieves top performance on a series of long-context tasks.
4.9K
Alex Wettig
@_awettig
Sep 4, 2024
Check out the paper for the most comprehensive guide to MoE pre-training! 🧭 It's been amazing to witness the push for open LMs by @Muennighoff and @allen_ai -- The transparency is off the charts: Every ablation figure has a link to the full results on wandb in the caption
Niklas Muennighoff
@Muennighoff
Sep 4, 2024
Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments
📜arxiv.org/abs/2409.02060 🧵1/9
2K
Alex Wettig
@_awettig
Jun 5, 2023
New paper where we train transformer-style models and then debug them as python programs!
Dan Friedman
@danfriedman0
Jun 5, 2023
Learning Transformer Programs We designed a modified Transformer that can be trained to solve a task and then automatically converted into a discrete, human-readable program. With @_awettig and @danqi_chen. Paper: arxiv.org/abs/2306.01128 Code: github.com/princeton-nlp/… [1/12]
2.7K
Alex Wettig
@_awettig
Aug 13, 2024
SWE-bench x OpenAI 👀
OpenAI
@OpenAI
Aug 13, 2024
We're releasing a new iteration of SWE-bench, in collaboration with the original authors, to more reliably evaluate AI models on their ability to solve real-world software issues. openai.com/index/introduc…
1.8K
Alex Wettig
@_awettig
May 7, 2025
Big arrow time! We can make huge progress on open-source SWE agents by scaling up the creation of virtual coding environments 🚀
John Yang
@jyangballin
May 7, 2025
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
2.9K
Alex Wettig
@_awettig
Jun 23, 2025
New paper cutting through the thicket of KV cache eviction methods!
Adithya Bhaskar
@AdithyaNLP
Jun 23, 2025
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7
1.2K
Alex Wettig
@_awettig
Oct 15, 2024
Replying to @kellerjordan0
I would be a little bit more generous towards Tim's intentions and not over-index on this specific aspect of the response. Clearly the bigger challenge is: how can we better extrapolate what works at scale from smaller-scale experiments?
1K
Alex Wettig
@_awettig
Feb 16, 2024
Replying to @_awettig
📜Check out the paper for extensive analysis of the quality ratings, including a discussion of social biases and the wider implications of data selection:
arxiv.org
QuRating: Selecting High-Quality Data for Training Language Models
Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting...
425