This is a long overdue section of the ML Engineering
Understanding Training Loss Patterns
github.com/stas00/ml-engi…
I warn you that the "Understanding" part is overloaded here since most of the time we don't really understand why certain types of spikes happen. Here
Stas Bekman
Toolmaker. Software creator, optimizer and harmonizer.
Makes ML systems work and fly @ Snowflake.
- PyTorch announced Monarch which is meant to simplify distributed programming—your code looks and feels like a single-machine Python program, but can scale across thousands of GPUs. You can directly use Pythonic constructs—classes, functions, loops, tasks, futures—to express
- The @PyTorch team is working on a new, super important tool: github.com/pytorch-labs/t… This repository implements techniques for per-step fault tolerance so you can keep training when errors occur without interrupting the entire training job. Some big companies already have
- Holy! The Machine Learning Engineering Open Book repo has just crossed 9k stars on GitHub! That's insane, as I started writing it ~6 months ago! github.com/stas00/ml-engi… Thank you so much for your vote of confidence! It's super encouraging to continue investing into this
- A special moment for The Machine Learning Engineering Open Book - it has just hit the magical 11111 stars and 666 forks! And it has been 1 year since I started on this structured brain dump! github.com/stas00/ml-engi… A huge thank you to the readers for your vote of confidence!
- This is the first pass on the new chapter for ML Engineering: The AI Battlefield Engineering - What You Need To Know github.com/stas00/ml-engi… This is a WIP and your feedback for improvement is always welcome.
- As the Machine Learning Engineering book was getting too unstructured I did a massive re-org and I present to you the new layout which hopefully is much more intuitive. github.com/stas00/ml-engi… The re-org work isn't 100% completed but it's mostly there. If you feel something is
- I finally installed filters to remove *.medium.com, towardsdatascience\.com from Google Search. "Pay to read this" is no more! Yay! Why would one write for a for-profit company for **free** when their target audience is forced to pay to read it? Using: addons.mozilla.org/en-CA/firefox/…
- If this is useful for your work, I have just created a <1MB tiny random llama2 model including a tiny 3k tokenizer. huggingface.co/stas/tiny-rand… This is crucial for extremely fast testing/development. You can easily adapt the tiny model maker script to any other model
- This is a fantastic article explaining why you should be paying attention to the emergence of hybrid models and why they are likely to replace self-attention-based models (hint: much faster and lower memory footprint inference). pytorch.org/blog/hybrid-mo… This is from the vLLM folks.
- This is a pretty awesome simple step-by-step guide showing you how to build your own PyTorch (a subset of ops supported) which requires just basic knowledge of C/C++/Python. towardsdatascience.com/recreating-pyt… The reason to walk through it is to better understand how some of the common
- At the @BigscienceW 104B GPT training we finally had a BREAKTHROUGH and the training doesn't diverge! The key change of Experiment 11 was to change --init-method-std to 0.006 from 0.02 - Thank you, BS Team! Details are in: github.com/bigscience-wor… TB is here: huggingface.co/bigscience/tr8…
- The Model Parallelism chapter of the ML Engineering book is now quite complete. github.com/stas00/ml-engi… The future of training LLMs/VLMs is exciting with so many great minds putting their smarts into giving the ML community amazing tools to work with. I will now stop making too many
- I have just added a brief summary of Transformers with Mixture of Experts architectures with pointers to papers and blog posts that you can study for more details. huggingface.co/docs/transform… The diagram is from the Google blog post linked in the summary.









