Our Llama 3.1 405B is now openly available! After a year of dedicated effort, from project planning to launch reviews, we are thrilled to open-source the Llama 3 herd of models and share our findings through the paper:
🔹Llama 3.1 405B, continuously trained with a 128K context
Aston Zhang
Pre-training @OpenAI
- Llama 3 has been my focus since joining the Llama team last summer. Together, we've been tackling challenges across pre-training and human data, pre-training scaling, long context, post-training, and evaluations. It's been a rigorous yet thrilling journey: 🔹Our largest models
- "Imagine learning a textbook with no figures." Multimodal chain-of-thought (Multimodal-CoT) in Language Models - Outperform GPT-3.5 by 16% (75%->91%) and surpass human performance on ScienceQA - Less than 1B params (so you can train more easily) - Code & model released [1/6]
- Thrilled that our 'Dive into Deep Learning' book, now published by Cambridge University Press, is the Top New Release on Amazon! To ensure accessibility and affordability, we, the authors, have waived our royalties. Plus, it's always available for free at D2L.ai
- 🚀 Exciting internship opportunity! Join the Llama team @AIatMeta and help redefine what's possible with large language models—from pre-training to post-training. Be part of our 2025 research internship and help shape the future of LLMs. Feel free to email or DM me 📩 Learn
- Cheer AI up with "let's think step by step"? More plz. Let’s think not just step by step, but also one by one. We can use more cheers & diversity to SAVE huge manual efforts in chain of thought prompt design, matching or even exceeding performance of manual design on GPT-3 [1/7]
- 🚀 New paper from our Llama team @AIatMeta! We discuss "cross capabilities" and "Law of the Weakest Link" of large language models (LLMs): 🔹 Cross capabilities: the intersection of multiple distinct capabilities across different types of expertise necessary to address complex,
- Don't assign the SAME parameter-efficient fine-tuning strategy to DIFFERENT layers. New tips: - Group layers, SPINDLE pattern (e.g., 4-8-8-4 layers) - Allocate params to layers uniformly - Tune all groups - Adjust tuning strategies for diff groups @AmazonScience @stanfordnlp [1/4]
- #ICML Long Oral! @AmazonScience Out-of-Distribution (OOD) Detection in Long-Tailed Recognition 📉 Existing OOD detection fails when training data is long-tail distributed 📈 Ours: SOTA on long-tailed ImageNet Paper: arxiv.org/pdf/2207.01160… Code: github.com/amazon-researc… 1/
- Although our D2L.ai book is free online, many readers have been requesting hard copies for tired eyes. So excited to announce: ✅ English publication agreement with @CambridgeUP was signed @AmazonScience ✅ Chinese 2nd edition was sent to print. Both in @PyTorch
- If your prompt tuning can't converge easily, make it semi-parametric. 🆕Memory prompt: input-adaptive, but no need for memory prompt tuning ✅Full fine-tuning on 31 tasks -> zero-shot generalization ✅Parameter-efficient fine-tuning on GLUE -> task transferability on SuperGLUE [1/4]