Abhi Venigalla
942 posts
Abhi Venigalla
@ml_hardware
Researcher @Databricks. Former @MosaicML, @CerebrasSystems. Addicted to all things compute.
San Francisco, CA
Joined October 2018
1,529 Following · 8,449 Followers
  • Abhi Venigalla
    @ml_hardware
    Jun 30, 2023
    Ready for GPU independence weekend? PyTorch 2.0 and LLM Foundry now work out of the box on AMD GPUs! We profiled MPT 1B-13B models on AMD MI250 and saw perf within 80% of A100-40GB, which could go up to 94% with better software. It. Just. Works.
    Training LLMs with AMD MI250 GPUs and MosaicML | Databricks Blog
    From databricks.com
    228K
  • Abhi Venigalla
    @ml_hardware
    May 17, 2023
    CNBC leaks PaLM2-L training config, says it is:
    * 340B params
    * 3.6T tokens
    * 7.3e24 FLOPs using the (6*N*D) approx
    Google's newest A.I. model uses nearly five times more text data for training than its predecessor
    From cnbc.com
    266K
  • Abhi Venigalla
    @ml_hardware
    Jun 30, 2023
    Replying to @ml_hardware
    And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run. It's Christmas in July!🎄
    241K
  • Abhi Venigalla
    @ml_hardware
    Oct 31, 2023
    Back in June we @MosaicML showed that our LLM Foundry training stack runs seamlessly on @AMD MI250 GPUs. Today, I'm happy to share that we've scaled up to 128xMI250, with great multi-node performance!
    119K
  • Abhi Venigalla
    @ml_hardware
    Mar 27, 2024
    We built a new model! 🧱 It's called DBRX 🧱
    * mixture of experts
    * 16 choose 4 experts
    * 36B active, 132B total
    * trained on 12T tokens
    * built e2e in 2 months
    * using 3072xH100
    * served up to 150 tok/s on @Databricks
    * open weights :)
    47K
  • Abhi Venigalla
    @ml_hardware
    Mar 29, 2024
    This is literally my new LK-99 🙏🙏🙏
    Aaron Defazio
    @aaron_defazio
    Mar 29, 2024
    Update: more experimental results rolling in. Here it is against SGD with both the step-wise and cosine schedule (both baselines heavily tuned, no cheating) This is something special indeed!
    81K
  • Abhi Venigalla
    @ml_hardware
    Jan 25, 2023
    We're coming for all the models! This week our Vision team profiled Stable Diffusion on @MosaicML Cloud and found that training from scratch costs <$160k, and can be done in under 2 weeks. mosaicml.com/blog/training-…
    50K
  • Abhi Venigalla
    @ml_hardware
    Feb 4, 2023
    Replying to @karpathy
    The @MosaicML perf team just tried this out and... totally confirmed 🤯 GPT-1.3B MFU went from 49% -> 53%
    127K
  • Abhi Venigalla
    @ml_hardware
    Mar 29, 2024
    If you have apple silicon and > 70GB of RAM, you can run DBRX on your laptop!! Kudos to @awnihannun :)
    mlx-community/dbrx-instruct-4bit · Hugging Face
    From huggingface.co
    20K
  • Abhi Venigalla
    @ml_hardware
    Apr 26, 2023
    Our Vision team is insane. The original Stable Diffusion reportedly cost $600k... and now we've reproduced it for $50k🤯 and it took <1 week to train! All the training code is open-source! And we make it super fast + easy to customize on your own private data @MosaicML
    Jonathan Frankle
    @jefrankle
    Apr 26, 2023
    And now it's < $50k. 🖼️Announcing @MosaicML's diffusion offering 📷We replicated Stable Diffusion 2.0, training from scratch with huge speedup, and we can do it on your data too. Human eval showed the model to be indistinguishable from the original. Blog: mosaicml.com/blog/training-…
    22K
  • Abhi Venigalla
    @ml_hardware
    Mar 19, 2024
    Replying to @francoisfleuret
    The 30x is real and comes from this technical brief, page 15: nvdam.widen.net/s/xqt56dflgh/n… How is 30x possible given GB200 has only ~2.3x increase in memBW and FLOP/s over H100? It involves comparing per-chip generation throughput = output_tokens/s/chip. The two systems compared are
    nvdam.widen.net
    nvidia-blackwell-architecture-technical-brief.pdf
    29K
  • Abhi Venigalla
    @ml_hardware
    Sep 5, 2023
    Replying to @julien_c
    @julien_c Why is the training so slow? Your screenshot shows 25% MFU. Our users on MosaicML get 40%+ for the same workload on H100s.
    Screenshot MFU = 6 * 30e9 * 600e9 / 500 / 10 / 3600 / 24 / 1e15 = 0.25
    Time to train on HF: 10 days
    Time to train on MosaicML: 6.25 days
    97K
  • user avatar
    Abhi Venigalla
    @ml_hardware
    Jan 4, 2024
    New year, new MME 🎉 @dskhudia and I profiled @intel Gaudi2 accelerators for LLM training and inference, and found great performance and perf/$ !
    LLM Training and Inference with Intel Gaudi 2 AI Accelerators | Databricks Blog
    From databricks.com
    45K
  • Abhi Venigalla
    @ml_hardware
    Nov 18, 2023
    i love you all = ilya
    Sam Altman
    OpenAI
    @sama
    Nov 18, 2023
    i love you all. today was a weird experience in many ways. but one unexpected one is that it has been sorta like reading your own eulogy while you’re still alive. the outpouring of love is awesome. one takeaway: go tell your friends how great you think they are.
    41K
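
The back-of-envelope arithmetic in the May 17, 2023 and Sep 5, 2023 posts above can be re-derived with a short script. This is only a sketch: the function names are hypothetical, and reading the Sep 5 formula as 500 GPUs, 10 days, and a peak of 1e15 FLOP/s per GPU is an assumption inferred from the order of the divisors, not something the post states outright.

```python
def train_flops(n_params, n_tokens):
    """Approximate total training compute with the 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens


def mfu(n_params, n_tokens, n_gpus, days, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved FLOP/s per GPU over peak FLOP/s per GPU."""
    seconds = days * 24 * 3600
    achieved_per_gpu = train_flops(n_params, n_tokens) / (n_gpus * seconds)
    return achieved_per_gpu / peak_flops_per_gpu


# May 17, 2023 post: 340B params, 3.6T tokens.
palm2_flops = train_flops(340e9, 3.6e12)

# Sep 5, 2023 post: 30B params, 600B tokens.
# Assumed reading of the divisors: 500 GPUs, 10 days, 1e15 peak FLOP/s per GPU.
screenshot_mfu = mfu(30e9, 600e9, n_gpus=500, days=10, peak_flops_per_gpu=1e15)

# At fixed hardware and workload, training time scales inversely with MFU.
days_at_40_percent = 10 * screenshot_mfu / 0.40

print(f"PaLM2-L estimate: {palm2_flops:.2e} FLOPs")
print(f"Screenshot MFU:   {screenshot_mfu:.2f}")
print(f"Days at 40% MFU:  {days_at_40_percent:.2f}")
```

Under those assumptions the script gives roughly 7.3e24 FLOPs, an MFU of 0.25, and 6.25 days, which agrees with the figures quoted in the posts.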
