Mike Lewis (@ml_perception) / X

277 posts

Mike Lewis

@ml_perception

Llama3 pre-training lead. Partially to blame for things like the Cicero Diplomacy bot, BART, RoBERTa, attention sinks, kNN-LM, top-k sampling & Deal Or No Deal.

Seattle

Joined September 2019

Mike Lewis
@ml_perception
Nov 22, 2022
New paper in Science today on playing the classic negotiation game "Diplomacy" at a human level, by connecting language models with strategic reasoning! Our agent engages in intense and lengthy dialogues to persuade other players to follow its plans. This was really hard! 1/5
Mike Lewis
@ml_perception
Apr 18, 2024
Excited to share a preview of Llama3, including the release of an 8B and 70B (82 MMLU, should be the best open weights model!), and preliminary results for a 405B model (still training, but already competitive with GPT4). Lots more still to come...
ai.meta.com
Introducing Meta Llama 3: The most capable openly available LLM to date
Today, we’re introducing Meta Llama 3, the next generation of our state-of-the-art open source large language model. In the coming months, we expect to share new capabilities, additional model sizes,...
Mike Lewis
@ml_perception
Apr 18, 2024
Yes, both the 8B and 70B are trained way more than is Chinchilla optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was improving even at 15T tokens.
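To put "way more than is Chinchilla optimal" in numbers, here is a rough back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of about 20 training tokens per parameter for compute-optimal training; that ratio is an approximation brought in for illustration and is not stated in the post. Only the 8B size and the 15T token count come from the post itself.

```python
# Assumption: ~20 tokens per parameter is a widely quoted approximation of
# the Chinchilla compute-optimal ratio, not a figure given in the post.
TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token budget for a model of n_params."""
    return TOKENS_PER_PARAM * n_params


def overtraining_ratio(n_params: float, tokens_trained: float) -> float:
    """How many times past the approximate optimum the model was trained."""
    return tokens_trained / chinchilla_optimal_tokens(n_params)


# Figures from the post: an 8B-parameter model trained on 15T tokens.
optimal_8b = chinchilla_optimal_tokens(8e9)   # about 160B tokens
ratio_8b = overtraining_ratio(8e9, 15e12)     # roughly 94x the rule of thumb

print(f"Approximate optimal budget: {optimal_8b:.3g} tokens")
print(f"Overtraining ratio: {ratio_8b:.2f}x")
```

Under that assumed ratio, the 8B model saw on the order of a hundred times the compute-optimal token budget, which is the trade the post describes: more training cost in exchange for a smaller, cheaper model at inference time.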
Mike Lewis
@ml_perception
Nov 22, 2022
Replying to @ml_perception
In anonymous blitz Diplomacy games against humans, our agent finished in the top 10%, doubling the average human score. This project was a big collaboration between experts in NLP, game theory and Diplomacy (too many to tag!). Paper here: science.org/doi/10.1126/sc… 5/5
Mike Lewis
@ml_perception
Jun 29, 2020
Happy to share MARGE, our new work on rethinking pre-training: given a document, we first retrieve related documents, and then paraphrase these to reconstruct the original. MARGE works well for generation and classification in many languages, sometimes without supervision. (1/6)
Mike Lewis
@ml_perception
Oct 31, 2019
Excited to share our work on BART, a method for pre-training seq2seq models by de-noising text. BART outperforms previous work on a bunch of generation tasks (summarization/dialogue/QA), while getting similar performance to RoBERTa on SQuAD/GLUE
Mike Lewis
@ml_perception
Nov 22, 2022
It's designed to never intentionally backstab - all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind...
Mike Lewis
@ml_perception
Nov 22, 2022
Replying to @ml_perception
In Diplomacy, 7 players hold private conversations and then make simultaneous moves. The dialogue is used to establish trust and coordinate actions with other players. Here, our agent (green) de-escalates with another player by reassuring them it will not attack them. 2/5
Mike Lewis
@ml_perception
Nov 22, 2022
Replying to @ml_perception
Each game, it sends and receives hundreds of messages, which must be precisely grounded in the game state, dialogue history, and its plans. We developed methods for filtering erroneous messages, letting the agent pass for human in 40 games. Guess which player is AI here... 4/5
Mike Lewis
@ml_perception
Mar 24, 2020
BART is now ridiculously easy to use. Great work as always by @huggingface!
Hugging Face
@huggingface
Mar 24, 2020
Bored at home? Need a new friend? Hang out with BART, the newest model available in transformers (thx @sam_shleifer) , with the hefty 2.6 release (notes: github.com/huggingface/tr…). Now you can get state-of-the-art summarization with a few lines of code: 👇👇👇
Mike Lewis
@ml_perception
Nov 22, 2022
Replying to @ml_perception
We automatically labelled training set messages with predicted moves, and used these as control tokens for the LM. During games, the LM is controlled by actions from a planning system. It tries to be helpful - here it (in blue) suggests mutually beneficial moves a human missed 3/5
Mike Lewis
@ml_perception
May 15, 2023
New paper on scaling language models to sequences of a million bytes! MegaByte splits long byte sequences into fixed-size patches (analogous to tokens), then runs a large model between the patches, and a small model to predict each patch byte-by-byte. 1/
AK
@_akhaliq
May 15, 2023
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers abs: arxiv.org/abs/2305.07185 paper page: huggingface.co/papers/2305.07…
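The patching step described in the post can be sketched as follows. This is a minimal illustration of splitting a byte sequence into fixed-size patches only; the function name `split_into_patches`, the patch size of 4, and the zero-byte padding of the final patch are all assumptions made for the example, and the large between-patch model and small within-patch model are not implemented here.

```python
from typing import List


def split_into_patches(data: bytes, patch_size: int, pad_byte: int = 0) -> List[bytes]:
    """Split a byte sequence into fixed-size patches.

    Hypothetical helper for illustration. Padding the last patch with
    pad_byte is an assumption of this sketch, not a detail from the post.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    remainder = len(data) % patch_size
    if remainder:
        # Pad so that every patch has exactly patch_size bytes.
        data = data + bytes([pad_byte]) * (patch_size - remainder)
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


patches = split_into_patches(b"hello world", patch_size=4)
print(patches)  # [b'hell', b'o wo', b'rld\x00']
```

In the setup the post describes, a large model would operate across the sequence of patches, and a small model would predict the bytes inside each patch one at a time.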
Mike Lewis
@ml_perception
Apr 19, 2024
I'm seeing a lot of questions about the limit of how good you can make a small LLM. tldr; benchmarks saturate, models don't. LLMs will improve logarithmically forever with enough good data.
Mike Lewis
@ml_perception
Aug 27, 2021
Some people still complain about the lack of peer review on arxiv - but I think it’s *great* for science when you can share a method on Thursday, and have independent groups confirm its effectiveness on Friday.