Olga Golovneva
154 posts
@OlgaNLP
Doing research at Meta AI
Joined October 2022
137 Following
1,112 Followers

  • Pinned
    Olga Golovneva
    @OlgaNLP
    Jan 30
    While others claim to do RL in pretraining, we actually did it :) To fix safety, factuality, and hallucinations at *pretraining*, we ensure the model is trained to generate only high-quality, safe tokens, even for unsafe/corrupted prompts.
    Jason Weston
    @jaseweston
    Jan 30
    📈Self-Improving Pretraining 📈 ✍️: arxiv.org/abs/2601.21343 Reinvents pretraining: no more next token prediction! - Uses existing LM from last self-improvement iteration to give rewards to pretrain new model on *sequences* - Large gains in factuality, safety & quality 🧵1/5
    Image
    15K
  • Olga Golovneva
    @OlgaNLP
    May 30, 2024
    Context matters! In our new work we propose context-aware positional encodings. More details in the linked paper!
    Image
    Image
    Jason Weston
    @jaseweston
    May 30, 2024
    🚨 Contextual Position Encoding (CoPE) 🚨 Context matters! CoPE is a new positional encoding method for transformers that takes into account *context*. - Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens. -
    120K
  • Olga Golovneva
    @OlgaNLP
    Oct 6, 2024
    Our team at Meta AI (former FAIR Labs) is hiring 2025 research interns and postdocs. Research areas cover LLM reasoning, alignment, memory, and architectures. DM me if interested in chatting during @COLM_conf!
    47K
  • Olga Golovneva
    @OlgaNLP
    Apr 2, 2025
    We have been cooking! 👨‍🍳 🧵(1/6)
    Jason Weston
    @jaseweston
    Apr 2, 2025
    🚨Multi-Token Attention🚨 📝: arxiv.org/abs/2504.00927 Attention is critical for LLMs, but its weights are computed by single query & key vectors, limiting capability. MTA combines query, key & head operations over multiple tokens, improving performance in terms of PPL, std
    Image
    53K
  • Olga Golovneva
    @OlgaNLP
    Sep 17, 2024
    Today we released the code for Contextual Position Encodings. Please check it out in our GitHub repo: github.com/facebookresear… #opensource
    Jason Weston
    @jaseweston
    May 30, 2024
    🚨 Contextual Position Encoding (CoPE) 🚨 Context matters! CoPE is a new positional encoding method for transformers that takes into account *context*. - Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens. -
    Image
    8.7K
  • Olga Golovneva
    @OlgaNLP
    Apr 27, 2025
    Thanks to all the organizers, it was a pleasure to attend and learn from great speakers!
    Arthur Douillard
    @Ar_Douillard
    Apr 27, 2025
    Replying to @Ar_Douillard @ahmetustun89 and 3 others
    Collaborative and modular training by @OlgaNLP
    Image
    4.1K
  • Olga Golovneva
    @OlgaNLP
    Jul 10, 2024
    Happy to share that I have 2 papers accepted at COLM: Nursing the reversal curse, and Branch-Train-Mix! See you there in October! #COLM #COLM2024
    6.9K
  • Olga Golovneva
    @OlgaNLP
    Apr 2, 2025
    Replying to @OlgaNLP
    MTA Recipe: - The high level goal is to make it possible to use the similarities of multiple vector pairs to determine where attention must focus. - We add convolutions for keys, queries, and attention heads to allow conditioning on neighboring tokens! 🧵(3/6)
    Image
    1.1K
  • Olga Golovneva
    @OlgaNLP
    Apr 2, 2025
    Replying to @OlgaNLP
    Motivation: soft attention looks at two tokens at a time to weigh their importance. But often it’s not enough! Suppose you are reading a history book, and you want to find what happened in Rome in 1417. You need to match both city and date *mentioned together*. 🧵(2/6)
    1.1K
  • Olga Golovneva
    @OlgaNLP
    Aug 6, 2024
    It is very challenging to get the right recipe for synthetic data, but the results speak for themselves.
    Jason Weston
    @jaseweston
    Aug 6, 2024
    🚨New paper!🚨 Self-Taught Evaluators - Llama 3-70B trained w/ synthetic data *only* - Iteratively finds better judgments in training - Best LLM-as-a-Judge model on RewardBench (88.3, 88.7 w/ maj vote) - Outperforms bigger models or human labels arxiv.org/abs/2408.02666 🧵(1/4)
    Image
    2K
  • Olga Golovneva
    @OlgaNLP
    May 2, 2024
    Our large LM (1.4B) finally finished training, and we have updated the paper with more exciting results! TL;DR: mixing training data with random segment reversal not only resolves the reversal curse, but also improves performance on a variety of benchmarks relative to data-matched models!
    Olga Golovneva
    @OlgaNLP
    Mar 21, 2024
    New paper! We propose a simple yet effective data augmentation method for training LLMs that improves model performance and resolves the reversal curse.
    15K
  • Olga Golovneva
    @OlgaNLP
    Mar 21, 2024
    New paper! We propose a simple yet effective data augmentation method for training LLMs that improves model performance and resolves the reversal curse.
    Jason Weston
    @jaseweston
    Mar 21, 2024
    🚨 Reverse Training to Nurse the Reversal Curse🚨 LLMs fail on “B is A” if only trained on "A is B". - Reverse training doubles training tokens by reversing strings - Outperforms data-matched standard baselines - Fixes issues on reversal tasks arxiv.org/pdf/2403.13799… 🧵(1/6)
    Image
    3.5K
  • Olga Golovneva
    @OlgaNLP
    Apr 2, 2025
    Replying to @OlgaNLP
    Finally, we look at convolution kernels, many interesting patterns! This one for example is responsible for matching sequences. But what else did the model encode in kernels during pretraining? Check the paper for more & thanks for paying attention to all these tokens! 🙏 🧵(6/6)
    Image
    1.1K
  • Olga Golovneva
    @OlgaNLP
    Apr 2, 2025
    Replying to @OlgaNLP
    We train an LLM and show that MTA improves… pretty much everywhere, but especially on long-range dependency tasks, where regular attention struggles. 🧵(5/6)
    Image
    1.3K