Mike Lewis
277 posts
@ml_perception
Llama3 pre-training lead. Partially to blame for things like the Cicero Diplomacy bot, BART, RoBERTa, attention sinks, kNN-LM, top-k sampling & Deal Or No Deal.
Seattle
Joined September 2019
252 Following
8,406 Followers
  • Mike Lewis
    @ml_perception
    Nov 22, 2022
    New paper in Science today on playing the classic negotiation game "Diplomacy" at a human level, by connecting language models with strategic reasoning! Our agent engages in intense and lengthy dialogues to persuade other players to follow its plans. This was really hard! 1/5
  • Mike Lewis
    @ml_perception
    Apr 18, 2024
    Excited to share a preview of Llama3, including the release of an 8B and 70B (82 MMLU, should be the best open weights model!), and preliminary results for a 405B model (still training, but already competitive with GPT4). Lots more still to come...
    ai.meta.com
    Introducing Meta Llama 3: The most capable openly available LLM to date
    Today, we’re introducing Meta Llama 3, the next generation of our state-of-the-art open source large language model. In the coming months, we expect to share new capabilities, additional model sizes,...
  • Mike Lewis
    @ml_perception
    Apr 18, 2024
    Yes, both the 8B and 70B are trained way more than is Chinchilla optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was improving even at 15T tokens.
  • Mike Lewis
    @ml_perception
    Nov 22, 2022
    Replying to @ml_perception
    In anonymous blitz Diplomacy games against humans, our agent finished in the top 10%, doubling the average human score. This project was a big collaboration between experts in NLP, game theory and Diplomacy (too many to tag!). Paper here: science.org/doi/10.1126/sc… 5/5
  • Mike Lewis
    @ml_perception
    Jun 29, 2020
    Happy to share MARGE, our new work on rethinking pre-training: given a document, we first retrieve related documents, and then paraphrase these to reconstruct the original. MARGE works well for generation and classification in many languages, sometimes without supervision. (1/6)
  • Mike Lewis
    @ml_perception
    Oct 31, 2019
    Excited to share our work on BART, a method for pre-training seq2seq models by de-noising text. BART outperforms previous work on a bunch of generation tasks (summarization/dialogue/QA), while getting similar performance to RoBERTa on SQuAD/GLUE
  • Mike Lewis
    @ml_perception
    Nov 22, 2022
    It's designed to never intentionally backstab - all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind...
  • Mike Lewis
    @ml_perception
    Nov 22, 2022
    Replying to @ml_perception
    In Diplomacy, 7 players hold private conversations and then make simultaneous moves. The dialogue is used to establish trust and coordinate actions with other players. Here, our agent (green) de-escalates with another player by reassuring them it will not attack them. 2/5
  • Mike Lewis
    @ml_perception
    Nov 22, 2022
    Replying to @ml_perception
    Each game, it sends and receives hundreds of messages, which must be precisely grounded in the game state, dialogue history, and its plans. We developed methods for filtering erroneous messages, letting the agent pass for human in 40 games. Guess which player is AI here... 4/5
  • Mike Lewis
    @ml_perception
    Mar 24, 2020
    BART is now ridiculously easy to use. Great work as always by @huggingface!
    Hugging Face
    @huggingface
    Mar 24, 2020
    Bored at home? Need a new friend? Hang out with BART, the newest model available in transformers (thx @sam_shleifer) , with the hefty 2.6 release (notes: github.com/huggingface/tr…). Now you can get state-of-the-art summarization with a few lines of code: 👇👇👇
  • Mike Lewis
    @ml_perception
    Nov 22, 2022
    Replying to @ml_perception
    We automatically labelled training set messages with predicted moves, and use these as control tokens for the LM. During games, the LM is controlled by actions from a planning system. It tries to be helpful - here it (in blue) suggests mutually beneficial moves a human missed 3/5
  • Mike Lewis
    @ml_perception
    May 15, 2023
    New paper on scaling language models to sequences of a million bytes! MegaByte splits long byte sequences into fixed-size patches (analogous to tokens), then runs a large model between the patches, and a small model to predict each patch byte-by-byte. 1/
    AK
    @_akhaliq
    May 15, 2023
    MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers abs: arxiv.org/abs/2305.07185 paper page: huggingface.co/papers/2305.07…
  • Mike Lewis
    @ml_perception
    Apr 19, 2024
    I'm seeing a lot of questions about the limit of how good you can make a small LLM. tldr; benchmarks saturate, models don't. LLMs will improve logarithmically forever with enough good data.
    Mike Lewis
    @ml_perception
    Apr 18, 2024
    Yes, both the 8B and 70B are trained way more than is Chinchilla optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was improving even at 15T tokens.
  • Mike Lewis
    @ml_perception
    Aug 27, 2021
    Some people still complain about the lack of peer review on arxiv - but I think it’s *great* for science when you can share a method on Thursday, and have independent groups confirm its effectiveness on Friday.
    This Post is from an account that no longer exists. Learn more
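A rough illustration of the "trained way more than is Chinchilla optimal" remark in the Apr 18, 2024 post: the sketch below compares the 8B model at 15T tokens against the commonly quoted rule of thumb of about 20 training tokens per parameter. That ratio is an approximation usually attributed to the Chinchilla scaling-law work, not a figure from these posts, and the function names are hypothetical.

```python
# Assumption: roughly 20 training tokens per parameter is compute-optimal.
# This is a widely used rule of thumb, not a number stated in the posts.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def overtraining_factor(
    n_params: float,
    n_tokens: float,
    optimal_ratio: float = CHINCHILLA_TOKENS_PER_PARAM,
) -> float:
    """How many times more tokens were used than the rule of thumb suggests."""
    return tokens_per_param(n_params, n_tokens) / optimal_ratio


# Figures from the post: an 8B-parameter model still improving at 15T tokens.
ratio_8b = tokens_per_param(8e9, 15e12)
factor_8b = overtraining_factor(8e9, 15e12)

print(f"8B model: {ratio_8b:.0f} tokens per parameter")
print(f"8B model: about {factor_8b:.0f}x the rule-of-thumb token budget")
```

Under that assumed ratio, the 8B model sees on the order of a hundred times the compute-optimal token count, which is the training cost the post describes paying in exchange for a smaller, cheaper model at inference time.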
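The May 15, 2023 post describes MegaByte as splitting long byte sequences into fixed-size patches. The sketch below illustrates only that patching step. The patch size of 4, the zero-byte padding, and the function name are assumptions chosen for illustration; it does not implement the large between-patch model or the small within-patch model.

```python
from typing import List


def split_into_patches(data: bytes, patch_size: int, pad_byte: int = 0) -> List[bytes]:
    """Split a byte sequence into fixed-size patches, padding the last one.

    Hypothetical helper: padding with a constant byte is an assumption,
    not a detail taken from the post.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    remainder = len(data) % patch_size
    if remainder:
        # Pad so the sequence length is a multiple of patch_size.
        data = data + bytes([pad_byte]) * (patch_size - remainder)
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


patches = split_into_patches(b"hello world", patch_size=4)
print(patches)  # [b'hell', b'o wo', b'rld\x00']
```

In the design the post outlines, each patch would play a role analogous to a token for the large model, while the small model predicts the bytes inside a patch one at a time.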