Research Engineer @ Databricks Mosaic AI
I am a research engineer at Databricks focusing on inference performance and GPU kernels. I enjoy working on ML systems and performance engineering, and have been thinking about this area for a while!
At Databricks, I played a key role in building a custom inference runtime optimized for serving open-source language models. My core focus was on techniques that improve hardware utilization: advanced quantization algorithms, an efficient fused MoE architecture, novel self-attention and grouped GEMM GPU kernels, and hardware techniques that better overlap kernel execution. The core work I led is listed in this Databricks research blog. Variations of the techniques I proposed have since been adopted by popular open-source projects like vLLM and SGLang.
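To give a flavor of the kernel-overlap idea, here is a minimal sketch (illustrative only, not the runtime's actual code): two independent kernels launched on separate CUDA streams, so the GPU scheduler is free to execute them concurrently instead of serializing them on the default stream.

```cuda
// Illustrative sketch only: overlapping two independent kernels
// with CUDA streams. Kernel and variable names are hypothetical.
#include <cuda_runtime.h>

__global__ void kernel_a(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernel_b(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // No data dependency between the two launches, so placing them
    // on different streams lets the hardware run them concurrently.
    kernel_a<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    kernel_b<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```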
Prior to Databricks, I was at the University of Waterloo, where I did research on both ML algorithms and systems. I also completed several internships working across the stack, building an understanding of the practical bottlenecks in ML applications and learning about performance engineering at scale.
Selected Work
- Fast PEFT Serving At Scale. Building a lightning-fast inference runtime isn't just about raw speed; it's about solving the right problems for real customers. At Databricks, our focus on Data Intelligence means helping customers turn their proprietary data into AI agents that serve production workloads at massive scale. Over the past year, we've built a custom inference engine that not only outperforms open-source alternatives on our customer workloads, by 2x in some cases, but also produces fewer errors on common benchmarks.
- A Streaming End-to-End Framework For Spoken Language Understanding IJCAI 2021. End-to-end spoken language understanding (SLU) has recently attracted increasing interest. We propose a streaming end-to-end framework that can process multiple intentions in an online and incremental way. Detection accuracy is about 97% on all multi-intent settings, comparable to state-of-the-art non-streaming models.
- On Implementing Wear Leveling in Persistent Synchronization Structures DISC 2023. We consider how wear-leveling mechanisms can be co-designed with synchronization structures that generate highly skewed memory access patterns in persistent memory systems.
- Sequential End-to-End Intent and Slot Label Classification and Localization Interspeech 2021. We propose a compact end-to-end SLU architecture for streaming scenarios, where chunks of the speech signal are processed continuously to predict intent and slot values. It reaches accuracy as high as 98.97% on single-label classification.
Experience
- Research Engineer, Databricks Mosaic AI (Jul 2024 - Present)
Accelerating inference performance for generative AI!
- Software Engineering Intern, Citadel (Sep 2023 - Dec 2023)
Worked on the Electronic Trading Engineering team with a focus on fixed-income securities, introducing C++ algorithms and systems that handled a large volume of transactions and improved reliability.
- Software Engineering Intern, Databricks (May 2023 - Aug 2023)
Worked on ML systems challenges for inference.
- Undergraduate Research Assistant, University of Waterloo (Jan 2023 - Apr 2023)
Worked on distributed systems research involving persistent memory under the supervision of Dr. Wojciech Golab.
- Software Engineering Intern, Uber (Jan 2022 - Apr 2022)
Worked on building distributed backend systems related to ML orchestration and data processing on the safety data team.
- Software Engineering Intern, Tesla (May 2021 - Aug 2021)
Worked primarily on building a distributed caching system to improve data retrieval latency for model features.
- Software Engineer (NLP) Intern, Huawei Noah's Ark Lab (Sep 2020 - Jan 2021)
Worked on machine learning in the speech team. Developed spoken language understanding models and training pipelines using PyTorch, Kaldi, CUDA, AWS S3, and Bash scripting.