Inspiration

Modern deep learning models, including language models (such as BERT and GPT-3), image models (such as ResNet-101), and audio models (such as M5), spend a significant share of their training time loading data from storage servers. According to reports from Microsoft [1], Google [2], and other organizations and research agencies, data loading can account for up to 70% of total training time, driving up operational costs and delaying time-to-insight. This project aims to address slow I/O in large-scale model training.

What it does

Our project proposes algorithms for collaborative large-scale model training that reduce I/O wait time by up to 3x compared to current state-of-the-art solutions.
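
This writeup leaves the algorithm's details open, but the core idea behind collaborative training is that concurrent jobs reading the same dataset should not each pay the storage cost separately. Below is a purely hypothetical sketch of that idea (SharedFetcher and remote_read are illustrative names, not our actual implementation):

```python
import threading

class SharedFetcher:
    """Illustrative sketch: concurrent training jobs on the same dataset
    share one storage fetch per sample instead of each hitting the
    storage server. `remote_read` stands in for a slow remote read."""

    def __init__(self, remote_read):
        self.remote_read = remote_read
        self.cache = {}
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            if key not in self.cache:
                # The first job to request a sample pays the I/O cost once.
                self.cache[key] = self.remote_read(key)
        # Every other job reuses the already-fetched bytes.
        return self.cache[key]
```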

How we built it

We prototyped our approach on NVIDIA's DALI data-loading pipeline and validated it with a Python 3 simulation.
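
For context, a minimal DALI image-loading pipeline of the kind we built on looks like the following (the dataset path, batch size, and image dimensions are placeholders, not our exact configuration):

```python
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def training_pipeline(data_dir):
    # Read encoded JPEGs and labels from disk, shuffled each epoch.
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    # Decode with "mixed" backend (CPU parse + GPU decode), then resize and normalize.
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels

pipe = training_pipeline(data_dir="/data/train")  # placeholder path
pipe.build()
images, labels = pipe.run()  # one batch, ready for the training step
```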

Challenges we ran into

One of the challenges we faced was the lack of access to large-scale infrastructure. We worked around it by digging into published data from large organizations such as Microsoft, obtaining real-world traces and hardware performance metrics. Using this data, we built a performance model that estimates how much our algorithm speeds up the I/O pipeline when training on modern multi-terabyte datasets.
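
To give a sense of scale, a performance model of this kind can be quite small. The sketch below is a toy version under simplifying assumptions (prefetching overlaps I/O with compute, and a cache hit skips the storage fetch); the numbers in the example are illustrative, not our measured traces:

```python
def epoch_time_s(num_batches, fetch_s, prep_s, compute_s, cache_hit_rate):
    """Toy analytical model of one training epoch with prefetching.

    Fetch/prep of batch i+1 overlaps GPU compute on batch i, so the GPU
    stalls only when expected I/O work exceeds compute time per batch.
    """
    expected_fetch = (1.0 - cache_hit_rate) * fetch_s  # misses pay the full fetch
    stall = max(0.0, expected_fetch + prep_s - compute_s)
    return num_batches * (compute_s + stall)

# Illustrative numbers: 10k batches, 80 ms fetch, 20 ms prep, 50 ms compute.
baseline = epoch_time_s(10_000, 0.080, 0.020, 0.050, cache_hit_rate=0.0)
cached = epoch_time_s(10_000, 0.080, 0.020, 0.050, cache_hit_rate=0.9)
print(f"estimated speedup: {baseline / cached:.2f}x")
```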

Accomplishments that we're proud of

During the past 24 hours, we evaluated and quantified the impact of data stalls on the duration of a single epoch in AI applications. We also built a training pipeline on the NVIDIA DALI framework and used it to train AI models on Google Cloud.
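
One straightforward way to quantify data stalls, close in spirit to what we measured, is to time the data loader and the training step separately. In this sketch, loader and train_step are placeholders for any iterable data loader and training function (with GPU training, synchronize inside train_step so the timings are accurate):

```python
import time

def measure_epoch(loader, train_step):
    """Split one epoch's wall time into data-stall time and compute time."""
    stall_s = compute_s = 0.0
    t = time.perf_counter()
    for batch in loader:
        now = time.perf_counter()
        stall_s += now - t       # time spent waiting on the next batch
        train_step(batch)
        t = time.perf_counter()
        compute_s += t - now     # time spent in the training step
    return stall_s, compute_s
```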

What we learned

Our project showed that introducing an intermediate low-latency storage tier can alleviate the data-input bottleneck and improve training performance for deep learning models.
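
As an illustration of that idea (not our exact implementation), a low-latency tier can sit in front of remote storage as a read-through cache. Here TieredReader and remote_read are hypothetical names; remote_read stands in for a slow storage-server fetch:

```python
import os

class TieredReader:
    """Read samples through a local low-latency cache (e.g., NVMe/SSD)
    placed in front of slow remote storage."""

    def __init__(self, cache_dir, remote_read):
        self.cache_dir = cache_dir
        self.remote_read = remote_read
        os.makedirs(cache_dir, exist_ok=True)

    def read(self, key):
        path = os.path.join(self.cache_dir, key)
        if os.path.exists(path):
            # Cache hit: serve from the local low-latency tier.
            with open(path, "rb") as f:
                return f.read()
        data = self.remote_read(key)  # cache miss: pay the remote fetch once
        with open(path, "wb") as f:   # populate the cache for later epochs
            f.write(data)
        return data
```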

What's next for HPDSL

Our next step is to integrate our algorithms into DALI and other data loaders, such as tf.data [2], and run them on multi-cloud, multi-GPU, multi-tenant setups to evaluate the efficiency of our proposed algorithms.

References

[1] Mohan, Jayashree, et al. "Analyzing and Mitigating Data Stalls in DNN Training." arXiv preprint arXiv:2007.06775 (2020).

[2] Murray, Derek G., et al. "tf.data: A Machine Learning Data Processing Framework." arXiv preprint arXiv:2101.12127 (2021).
