Luca Soldaini 🎀 (@soldni) / X

Luca Soldaini 🎀

4,973 posts

@soldni

data mines are my passion ⛏️ mts @MicrosoftAI / ex Olmo co-lead @allen_ai / pfp @YanhongLi2062 / thoughts are mine, leave my employer alone / 🌈

Seattle, WA, USA

soldaini.net

Joined September 2013

Luca Soldaini 🎀
@soldni
Sep 25, 2024
Olmo goes multimodal! We are launching Molmo, an open family of multimodal models that rival the best closed VLMs out there 🤯 We spent the last 9 months meticulously curating PixMo, a dataset of (a) high-quality image-caption pairs and (b) multimodal instruction data.
90K
Luca Soldaini 🎀
@soldni
Aug 18, 2023
Announcing Dolma, the dataset for @allen_ai's LLM, OLMo. It's 3+ trillion tokens (web/papers/code/books/wiki). We hope it will facilitate study of LLMs & their behavior! Released on @huggingface w ImpACT license huggingface.co/datasets/allen… Overview/datasheet blog.allenai.org/dolma-3-trilli…
112K
Luca Soldaini 🎀
@soldni
Jan 3, 2025
OLMo 2 tech report is out! We get in the weeds with this one, with 50+ pages on 4 crucial components of the LLM development pipeline:
55K
Luca Soldaini 🎀
@soldni
May 13, 2024
GPT-4o still gets foiled by my favorite tokenization-related question ☺️
40K
Luca Soldaini 🎀
@soldni
Jul 19, 2023
Myself and @kylelostat have just released peS2o 🍃🎓, a collection of 40M open-access papers carefully cleaned for LLM training. V1 has been used by @MosaicML to train MPT, and we have a V2 version! @huggingface page: huggingface.co/datasets/allen… feedback? github.com/allenai/peS2o/…
61K
Luca Soldaini 🎀
@soldni
Nov 4, 2024
Blows my mind that model souping Just Works™️ Same model, same data, train 3-5 times with different seeds, 1-2 extra points on MMLU, Hellaswag, ARC, GSM8k, etc
153K
Luca Soldaini 🎀
@soldni
Feb 25, 2025
So many tokens in PDFs 📜 yet so hard to extract them 🔎 Not anymore! olmOCR gives you a plain text version of any doc you can think of: science papers, old scans, brochures with weird layouts, even handwriting ✍️ Try it today 👇
Ai2
@allen_ai
Feb 25, 2025
Introducing olmOCR, our open-source tool to extract clean plain text from PDFs! Built for scale, olmOCR handles many document types with high throughput. Run it on your own GPU for free—at over 3000 token/s, equivalent to $190 per million pages, or 1/32 the cost of GPT-4o!
44K
Luca Soldaini 🎀
@soldni
Feb 20, 2025
in the upcoming LLMs war, i choose a neutral team
8.6K
Luca Soldaini 🎀
@soldni
Aug 12, 2025
biggest gift to humanity any frontier lab could do is add a ton of uv examples in their posttraining mix 😁 save the people from python dependency management hell!!
25K
Luca Soldaini 🎀
@soldni
Dec 8, 2024
xAI employees burning out faster than the surface of the sun. please take care of yourselves guys 🥺
Nikita Bier
@nikitabier
Dec 8, 2024
Replying to @justindross
I have literally never seen a team work this hard in my entire life. 9-2am everyday, weekends and holidays included
75K
Luca Soldaini 🎀
@soldni
Sep 10, 2024
Selecting pretraining data points based on correlation with downstream tasks is an effective data mixing technique. I love papers that are a simple, elegant idea executed rly well! lovely read from @TristanThrush @ChrisGPotts @tatsu_hashimoto 😊 arxiv.org/abs/2409.05816
28K
Luca Soldaini 🎀
@soldni
Dec 27, 2024
guys Deepseek obviously has more than 2048 H800. that’s just the size of their largest cluster. Deepseek 3 model is amazing but imagine having 130+ researchers on just 2K GPUs lmao
102K
Luca Soldaini 🎀
@soldni
Feb 1, 2024
release day release day! OLMo 1b + 7b out today 🥳 and 65b coming soon... With OLMo, we are really focused on advancing the study of LLMs. We release **everything**, from toolkit to create its training dataset (dolma) to training & inference code. More details in thread 🧵
49K
Luca Soldaini 🎀
@soldni
Dec 9, 2023
multimodal PDF processing is painful but doesn’t have to be! come to our demo at #EMNLP2023 of Papermage, a library for fast manipulation of PDFs (Friday 9/12 @ 9am). we have used it for LLM data cleanup, paper QA, HCI prototypes github.com/allenai/paperm… aclanthology.org/2023.emnlp-dem…
33K