🤔 Ever wondered how prevalent some type of web content is during LM pre-training?
In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐
Key takeaway: domains help us curate better pre-training data! 🧵/N
Alex Wettig
composer-ing @cursor_ai
- How to train long-context LMs? (and beat Llama-3.1 🏆) Many takeaways from our new paper!
  - Focus on diverse & reliable evaluations (not just perplexity)
  - Find good sources of long data and high-quality short data
  - ...

  A 🧵 on how we produced ProLong, a SoTA 8B 512K model
- **QuRating**: We get 4 quality signals from GPT-3.5 for selecting LM training data. We select 30B out of 260B tokens and train 1.3B LMs from scratch. We find that QuRating can improve perplexity and ICL performance ✅ (w/ Aatmik Gupta, Sauma Malik, @danqi_chen)
- Stop by the QuRating *spotlight* poster this afternoon to chat about data quality for LMs ⏰: 1:30-3pm CET /📍: Hall C 4-9 #617
- Presenting two posters at ICML over the next two days:
  - Both at 11am to 1:30pm
  - Both about how to improve pre-training with domains
  - Both at stall # E-2600 in East Exhibition Hall A-B (!)

  Tomorrow: WebOrganizer w/ @soldni & @kylelostat
  Thursday: MeCo by @gaotianyu1350
- Stay tuned for our pre-print next week with lots of insights on how to build good SWE agents 🕵️‍♂️

  > SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code github.com/princeton-nlp/…
- Simple strategy: (1) keep pre-training with HQ mix of long & short documents (2) quick instruction-tuning with ONLY short UltraChat. We find: avg. performance on our long-context evals keeps improving with increasing continual pre-training budgets

  > Meet ProLong, a Llama-3 based long-context chat model! huggingface.co/princeton-nlp/… (64K here, 512K coming soon) ProLong uses a simple recipe (short/long pre-training data + short UltraChat, no synthetic instructions) and achieves top performance on a series of long-context tasks.
- Check out the paper for the most comprehensive guide to MoE pre-training! 🧭 It's been amazing to witness the push for open LMs by @Muennighoff and @allen_ai -- The transparency is off the charts: Every ablation figure has a link to the full results on wandb in the caption

  > Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source - 1B active, 7B total params for 5T tokens - Best small LLM & matches more costly ones like Gemma, Llama - Open Model/Data/Code/Logs + lots of analysis & experiments 📜arxiv.org/abs/2409.02060 🧵1/9
- New paper where we train transformer-style models and then debug them as Python programs!

  > Learning Transformer Programs: We designed a modified Transformer that can be trained to solve a task and then automatically converted into a discrete, human-readable program. With @_awettig and @danqi_chen. Paper: arxiv.org/abs/2306.01128 Code: github.com/princeton-nlp/… [1/12]
- SWE-bench x OpenAI 👀

  > We're releasing a new iteration of SWE-bench, in collaboration with the original authors, to more reliably evaluate AI models on their ability to solve real-world software issues. openai.com/index/introduc…
- Big arrow time! We can make huge progress on open-source SWE agents by scaling up the creation of virtual coding environments 🚀

  > 40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
- New paper cutting through the thicket of KV cache eviction methods!

  > There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7
- Replying to @kellerjordan0: I would be a little bit more generous towards Tim's intentions and not over-index on this specific aspect of this response. Clearly the bigger challenge is: how can we better extrapolate what works at scale from smaller-scale experiments?
- Replying to @_awettig: 📜 Check out the paper for extensive analysis of the quality ratings, including a discussion of social biases and the wider implications of data selection: