Luca Soldaini 🎀
4,973 posts
Luca Soldaini 🎀
@soldni
data mines are my passion ⛏️ mts @MicrosoftAI / ex Olmo co-lead @allen_ai / pfp @YanhongLi2062 / thoughts are mine, leave my employer alone / 🌈
Seattle, WA, USA
soldaini.net
Joined September 2013
1,274 Following
12.8K Followers
  • Luca Soldaini 🎀
    @soldni
    Sep 25, 2024
    Olmo goes multimodal! We are launching Molmo, an open family of multimodal models that rival the best closed VLMs out there 🤯 We spent the last 9 months meticulously curating PixMo, a dataset of (a) high-quality image-caption pairs and (b) multimodal instruction data.
    Image
    90K
  • Luca Soldaini 🎀
    @soldni
    Aug 18, 2023
    Announcing Dolma, the dataset for @allen_ai's LLM, OLMo. It's 3+ trillion tokens (web/papers/code/books/wiki). We hope it will facilitate study of LLMs & their behavior! Released on @huggingface w ImpACT license huggingface.co/datasets/allen… Overview/datasheet blog.allenai.org/dolma-3-trilli…
    Dolma's logo. It's the word dolma written in a very curvy, 70s style word art. Text is yellow, and background is blue with swoopy lines that almost make out the word OLMo.
    112K
  • Luca Soldaini 🎀
    @soldni
    Jan 3, 2025
    OLMo 2 tech report is out. We get in the weeds with this one, with 50+ pages on 4 crucial components of LLM development pipeline:
    Image
    55K
  • Luca Soldaini 🎀
    @soldni
    May 13, 2024
    GPT-4o still gets foiled by my favorite tokenization-related question ☺️
    When asking 4o to write a list of fruits ending in um, it returns:
    Here is a list of fruits ending in "um":
    1. Persimmon
    2. Durian
    3. Mangosteen
    4. Starfruit (Carambola)
    5. Rambutan
    40K
  • Luca Soldaini 🎀
    @soldni
    Jul 19, 2023
    Myself and @kylelostat have just released peS2o 🍃🎓, a collection of 40M open-access papers carefully cleaned for LLM training. V1 has been used by @MosaicML to train MPT, and we have a V2 version! @huggingface page: huggingface.co/datasets/allen… feedback? github.com/allenai/peS2o/…
    A screenshot of the pes2o page on the huggingface hub
    61K
  • Luca Soldaini 🎀
    @soldni
    Nov 4, 2024
    Blows my mind that model souping Just Works™️. Same model, same data, train 3-5 times with different seeds, 1-2 extra points on MMLU, Hellaswag, ARC, GSM8k, etc.
    153K
  • Luca Soldaini 🎀
    @soldni
    Feb 25, 2025
    So many tokens in PDFs 📜 yet so hard to extract them 🔎 Not anymore! olmOCR gives you plain text version of any doc you can think of: science papers, old scans, brochures with weird layouts, even handwriting ✍️ Try it today 👇
    Image
    Image
    00:14
    Ai2
    @allen_ai
    Feb 25, 2025
    Introducing olmOCR, our open-source tool to extract clean plain text from PDFs! Built for scale, olmOCR handles many document types with high throughput. Run it on your own GPU for free, at over 3000 token/s, equivalent to $190 per million pages, or 1/32 the cost of GPT-4o!
    44K
  • Luca Soldaini 🎀
    @soldni
    Feb 20, 2025
    in the upcoming LLMs war, i choose a neutral team
    Image
    8.6K
  • Luca Soldaini 🎀
    @soldni
    Aug 12, 2025
    biggest gift to humanity any frontier lab could do is add a ton of uv examples in their posttraining mix. save the people from python dependency management hell!!
    25K
  • Luca Soldaini 🎀
    @soldni
    Dec 8, 2024
    xAI employees burning out faster than surface of the sun please take care of yourselves guys 🥺
    Nikita Bier
    @nikitabier
    Dec 8, 2024
    Replying to @justindross
    I have literally never seen a team work this hard in my entire life. 9-2am everyday, weekends and holidays included
    75K
  • Luca Soldaini 🎀
    @soldni
    Sep 10, 2024
    Selecting pretraining data points based on correlation with downstream tasks is an effective data mixing technique. I love papers that are a simple, elegant idea executed rly well! lovely read from @TristanThrush @ChrisGPotts @tatsu_hashimoto 😊 arxiv.org/abs/2409.05816
    screenshot of the abstract linked in the tweet.
    28K
  • Luca Soldaini 🎀
    @soldni
    Dec 27, 2024
    guys Deepseek obviously has more than 2048 H800. that's just the size of their largest cluster. Deepseek 3 model is amazing but imagine having 130+ researchers on just 2K GPUs lmao
    102K
  • Luca Soldaini 🎀
    @soldni
    Feb 1, 2024
    release day release day! OLMo 1b + 7b out today 🥳 and 65b coming soon... With OLMo, we are really focused on advancing the study of LLMs. We release **everything**, from toolkit to create its training dataset (dolma) to training & inference code. More details in thread 🧵
    Image
    GIF
    49K
  • Luca Soldaini 🎀
    @soldni
    Dec 9, 2023
    multimodal PDF processing is painful but doesn't have to! come to our demo at #EMNLP2023 of Papermage, a library for fast manipulation of PDFs (Friday 9/12 @ 9am) we have used it for LLM data cleanup, paper QA, HCI prototypes github.com/allenai/paperm… aclanthology.org/2023.emnlp-dem…
    A picture of kyle in front of our poster
    A screenshot of the first page of papermage demo paper
    33K
