Software

Megatron-LM

NVIDIA

NVIDIA's framework for efficient training and inference of large-scale transformer models. Megatron-LM provides state-of-the-art parallelism techniques enabling models with hundreds of billions of parameters to be trained and served across GPU clusters.

My contributions: Built the inference-optimized MoE layer with CUDA-graphed execution, low-latency NVLS (NVLink SHARP) communication kernels, and MXFP8-quantized MoE inference.
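The idea behind MXFP8 can be illustrated with a simplified sketch: values are grouped into blocks of 32 that share a single power-of-two scale, and each scaled element is rounded to an E4M3-like grid (3 mantissa bits, maximum 448). The function names below are illustrative, not Megatron-LM's API, and the model omits subnormals, NaN encodings, and other details of the real format.

```python
import math

E4M3_MAX = 448.0   # largest normal E4M3 value
BLOCK = 32         # MX block size: one shared scale per 32 elements

def round_e4m3(v):
    """Round a scaled value to an E4M3-like grid (sign, 3 mantissa bits).
    Simplified: ignores subnormals and NaN encodings."""
    if v == 0.0:
        return 0.0
    s = math.copysign(1.0, v)
    a = min(abs(v), E4M3_MAX)
    e = math.floor(math.log2(a))
    m = a / 2.0 ** e            # mantissa in [1, 2)
    q = round(m * 8) / 8        # keep 3 fractional mantissa bits
    return s * q * 2.0 ** e

def mxfp8_quantize(block):
    """Return (shared scale, quantized elements) for one MX block."""
    amax = max(abs(v) for v in block)
    # smallest power-of-two scale that maps amax into the E4M3 range
    scale = 2.0 ** math.ceil(math.log2(amax / E4M3_MAX)) if amax else 1.0
    return scale, [round_e4m3(v / scale) for v in block]

def mxfp8_dequantize(scale, qblock):
    return [q * scale for q in qblock]
```

Because the scale is a power of two, rescaling is exact in floating point, and the per-element relative error is bounded by the 3-bit mantissa grid for values that stay in the normal range.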

GitHub

AxoNN

UMD

A highly scalable, portable, open-source framework for parallel deep learning. AxoNN implements a four-dimensional hybrid parallel algorithm with asynchronous, message-driven execution to maximize GPU efficiency. It has achieved record-breaking performance on Frontier (1.381 Exaflop/s) and Alps (1.423 Exaflop/s).
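The shape of a four-dimensional decomposition can be sketched with some simple arithmetic: three dimensions shard each layer's weights intra-layer while a fourth replicates for data parallelism, so per-GPU parameter memory shrinks with the product of the three intra-layer degrees. This is a hypothetical layout for illustration only; AxoNN's actual tessellation and communication schedule differ in detail.

```python
def shard_shape(n_rows, n_cols, gx, gy, gz):
    """Per-GPU parameter count for an n_rows x n_cols layer under a
    hypothetical 3D intra-layer split: rows over gx, columns over gy,
    and each 2D tile split again over a depth dimension gz.
    (Illustrative only, not AxoNN's exact layout.)"""
    assert n_rows % gx == 0 and n_cols % gy == 0
    tile = (n_rows // gx) * (n_cols // gy)
    assert tile % gz == 0
    return tile // gz

def data_parallel_degree(world_size, gx, gy, gz):
    """Factor the cluster as dp x gx x gy x gz; data-parallel replicas
    hold identical shards, so dp does not affect parameters per GPU."""
    assert world_size % (gx * gy * gz) == 0
    return world_size // (gx * gy * gz)
```

For example, on 1024 GPUs with gx = gy = 4 and gz = 2, an 8192 x 8192 layer leaves 2,097,152 parameters per GPU, replicated across 32 data-parallel groups.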

My contributions: Co-created and led the development of AxoNN, including its core parallelism algorithms, communication optimizations, and scaling to exascale systems. ACM Gordon Bell Finalist (SC'24).

GitHub

DeepSpeed

Microsoft

Microsoft's open-source deep learning optimization library that makes distributed training and inference easy, efficient, and effective. DeepSpeed powers training of many of the world's largest language models.

My contributions: Designed and implemented DeepSpeed-TED, a hybrid tensor-expert-data parallelism approach that enables training MoE models with base models 4-8x larger than the prior state of the art.
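The core of any hybrid tensor-expert-data scheme is assigning each GPU a coordinate in a 3D grid and forming a communication group along each axis. The sketch below uses an illustrative rank layout (tensor dimension fastest-varying); the function names and ordering are assumptions for exposition, not DeepSpeed-TED's actual implementation.

```python
def ted_coords(rank, tp, ep, dp):
    """Map a flat rank to (tensor, expert, data) coordinates, with the
    tensor dimension fastest-varying. Illustrative layout only."""
    assert 0 <= rank < tp * ep * dp
    return (rank % tp, (rank // tp) % ep, rank // (tp * ep))

def process_groups(tp, ep, dp):
    """Group ranks by the dimension along which they communicate:
    a tensor-parallel group is the set of ranks sharing (expert, data)
    coordinates, and likewise for the other two dimensions."""
    groups = {"tensor": {}, "expert": {}, "data": {}}
    for r in range(tp * ep * dp):
        t, e, d = ted_coords(r, tp, ep, dp)
        groups["tensor"].setdefault((e, d), []).append(r)
        groups["expert"].setdefault((t, d), []).append(r)
        groups["data"].setdefault((t, e), []).append(r)
    return groups
```

With tp = ep = dp = 2, the eight ranks split into four tensor groups of two, four expert groups of two, and four data groups of two, each orthogonal to the others.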

GitHub