<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by HPC-AI Tech on Medium]]></title>
        <description><![CDATA[Stories by HPC-AI Tech on Medium]]></description>
        <link>https://medium.com/@hpcaitech?source=rss-ec86827abf4------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*aY_qlaO5o1gGk5B9-5LuFg.png</url>
            <title>Stories by HPC-AI Tech on Medium</title>
            <link>https://medium.com/@hpcaitech?source=rss-ec86827abf4------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 07 Apr 2026 15:00:49 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@hpcaitech/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[HPC-AI Tech is Joining NYU’s Endless Frontier Labs Program, Which Has an Under 7% Global…]]></title>
            <link>https://medium.com/@hpcaitech/hpc-ai-tech-is-joining-nyus-endless-frontier-labs-program-which-has-an-under-7-global-927a2dd663a4?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/927a2dd663a4</guid>
            <category><![CDATA[high-performace-computing]]></category>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Wed, 05 Oct 2022 09:47:22 GMT</pubDate>
            <atom:updated>2022-10-05T09:47:22.401Z</atom:updated>
            <content:encoded><![CDATA[<h3>HPC-AI Tech is Joining NYU’s Endless Frontier Labs Program, Which Has an Under 7% Global Acceptance Rate</h3><p>9 Sept, 2022 — <a href="https://hpcaitech.com/">HPC-AI Tech</a> was chosen from 1,121 applicants to join the <a href="https://endlessfrontierlabs.com/">Endless Frontier Labs (EFL)</a> 2022–2023 cohort Digital Tech track after a rigorous selection process by the EFL Team. NYU’s EFL program provides an opportunity for early-stage science and technology startups to grow in partnership with the New York University Stern School of Business. This year, over 1,100 startups from 66 countries worldwide and 43 U.S. states participated in the competition. HPC-AI Tech was among only 78 startups selected as finalists for the program.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nHtf-_C8JmPJiyxWC2y8Hg.png" /></figure><h4><strong>More About EFL</strong></h4><p>Endless Frontier Labs (EFL) is a nine-month program at NYU’s Stern School of Business that offers early-stage science and technology startups three tracks: Deep Tech, Life Science, and Digital Tech. Deep Tech is suited to startups that develop technologies in the fields of physical and material science. Life Science is for companies with breakthroughs in biology, medicine, and healthcare. Lastly, Digital Tech focuses on startups in software, data analytics, AI, and ML.</p><p>Over the past few years, the EFL Team has used its extraordinary expertise to help startups in the program transform their ideas into strong, successful businesses. Examples of innovative startups that have partnered with EFL include Immunai, which created an atlas of the human immune system in 2018 and recently raised about $295 million, and Jetpack Aviation, a company specializing in turbine-powered micro VTOLs, whose 150 mph Speeder flying motorcycles will hit the market in 2023 at $380,000.</p><blockquote>“Our mission is to bridge the gap between science and markets,” said NYU Stern Professor <a href="https://www.stern.nyu.edu/faculty/bio/deepak-hegde">Deepak Hegde</a>, Founding Director of EFL. “We believe business strategy, validation by scientific peers, and connections to investors are critical for the successful commercialization of scientific breakthroughs. EFL provides these inputs to help transform founders’ ideas into commercial and societal impact. With the opening of our third track focused on software and data oriented startups, we look forward to deepening our impact on the information economy.”</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PFCeKTHq_Fp33UqQ7rpMAQ.png" /></figure><p>Companies must pass a critical and extensive evaluation by the EFL Team in order to be a part of the <a href="https://endlessfrontierlabs.com/">EFL</a> program. The evaluation process included interviews with over 175 startup teams from over 60 countries, technical analysis of the companies by EFL’s experts, and consideration of how well each company suited the program.</p><p>Given such stringent requirements, what made HPC-AI Tech stand out among 1,121 applicants?</p><p><a href="https://hpcaitech.com/">HPC-AI Tech</a> is a global company focused on High Performance Computing and Artificial Intelligence.
The company has offices in Delaware (USA) and Singapore and has built a highly skilled team whose members come from top universities (e.g., the University of California, Berkeley, Stanford University, etc.) and leading companies around the world (e.g., Google, Microsoft, Nvidia, IBM, Intel, etc.).</p><p>HPC-AI Tech strives to increase AI productivity, aiming to build an outstanding distributed AI development and deployment platform that enables supercomputers and cloud platforms to serve AI at a much lower cost. Based on years of academic achievements and industry experience, HPC-AI Tech has developed an efficient large AI model training and inference system — <a href="https://github.com/hpcaitech/ColossalAI">Colossal-AI</a> — which offers efficient AI deployment services with high integration, automation, and intelligence. Recently, HPC-AI Tech completed $6 million in seed and angel round fundraising, led by BlueRun Ventures in the angel round and co-led by Sinovation Ventures and ZhenFund in the seed round. HPC-AI Tech will continue its path of rapid growth and plans to keep expanding globally.</p><blockquote>“The team at Endless Frontier Labs is pleased to invite HPC-AI Technology Inc. to join our 2022–2023 Digital Tech track. Congratulations! We are thrilled for the opportunity to work with you and transform your breakthrough science into a high-growth business.”</blockquote><blockquote>— Deepak Hegde, Founding Director of EFL</blockquote><blockquote>“As a startup focused on high-performance computing and Artificial Intelligence, we’re elated to join NYU’s <a href="https://endlessfrontierlabs.com/">Endless Frontier Labs</a>. We will passionately work with the professionals from the EFL Team to equip our innovations and form a successful business. This collaboration is another step towards achieving our goal of becoming a globally renowned AI company, and we hope we can empower AI businesses in the future.”</blockquote><blockquote>— Yang You, Chairman of HPC-AI Tech</blockquote><h4><strong>Reference:</strong></h4><p>[1] <a href="https://endlessfrontierlabs.com/">https://endlessfrontierlabs.com/</a></p><p>[2] <a href="https://www.stern.nyu.edu/experience-stern/news-events/endless-frontier-labs-announces-2021-2022-application-launch-new-digital-tech-track-graduating">https://www.stern.nyu.edu/experience-stern/news-events/endless-frontier-labs-announces-2021-2022-application-launch-new-digital-tech-track-graduating</a></p><p>[3] <a href="https://www.swisspod.ch/stories/joining-nyus-endless-frontier-labs">https://www.swisspod.ch/stories/joining-nyus-endless-frontier-labs</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[HPC-AI Tech Completes $6 Million Seed and Angel Round Fundraising, Led by BlueRun Ventures in the…]]></title>
            <link>https://medium.com/@hpcaitech/hpc-ai-tech-completes-6-million-seed-and-angel-round-fundraising-led-by-bluerun-ventures-in-the-892468cc2b02?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/892468cc2b02</guid>
            <category><![CDATA[angel-investors]]></category>
            <category><![CDATA[funding]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[high-performace-computing]]></category>
            <category><![CDATA[fundraising]]></category>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Tue, 06 Sep 2022 14:43:16 GMT</pubDate>
            <atom:updated>2022-09-06T14:43:16.421Z</atom:updated>
            <content:encoded><![CDATA[<h3>HPC-AI Tech Completes $6 Million Seed and Angel Round Fundraising, Led by BlueRun Ventures in the Angel Round</h3><p><strong>6 Sept, 2022 — </strong><a href="https://hpcaitech.com/">HPC-AI Tech</a>, the developer of <a href="https://github.com/hpcaitech/ColossalAI">Colossal-AI</a>, an open-source deep learning system for the big-model era, today announced the completion of $6 million in seed and angel round fundraising, led by <a href="https://www.brv.com/">BlueRun Ventures</a> in the angel round and co-led by <a href="https://www.sinovationventures.com/">Sinovation Ventures</a> and <a href="https://en.zhenfund.com/">ZhenFund</a> in the seed round. HPC-AI Tech completed the two rounds of fundraising within just one year. With this investment, HPC-AI Tech will continue its rapid growth by preparing for global expansion.</p><p>HPC-AI Tech is a global company focusing on High Performance Computing and Artificial Intelligence. HPC-AI Tech’s entities are registered in Delaware (USA) and Singapore, and it has built a world-class team with core members from top universities (e.g., the University of California, Berkeley, Stanford University, etc.) and leading companies around the world (e.g., Google, Microsoft, Nvidia, IBM, Intel, etc.).</p><p>HPC-AI Tech strives to increase AI productivity and build a world-class distributed AI development and deployment platform that enables supercomputers and cloud platforms to serve AI at a much lower cost. Based on years of academic achievements and industry experience, HPC-AI Tech has developed an efficient large AI model training and inference system,<a href="https://github.com/hpcaitech/ColossalAI"> Colossal-AI</a>, to deal with the industry’s pain points. HPC-AI Tech is dedicated to the deep integration of software system design and hardware architecture, offering efficient AI deployment services with high integration, automation, and intelligence.</p><p>As cutting-edge AI models grow larger, the budgets for training and deploying them are becoming unaffordable for enterprises and individuals alike. Even after spending huge amounts of money on supercomputer clusters or hiring experts, it remains difficult to apply large-scale AI models efficiently.</p><p>To this end, HPC-AI Tech has developed a user-friendly deep learning system, <a href="https://github.com/hpcaitech/ColossalAI">Colossal-AI</a>, that enables companies to maximize the efficiency of AI deployments while drastically reducing costs. Colossal-AI integrates many advanced technologies such as multi-dimensional tensor parallelism, sequence parallelism, heterogeneous memory management, large-scale optimization, adaptive task scheduling, etc.</p><p>Since going open source, Colossal-AI has reached No. 1 among trending projects on GitHub and Papers With Code several times, ahead of other projects with as many as <strong>10K stars</strong>. Furthermore, Colossal-AI keeps increasing the availability of AI solutions for industry and is already showing tremendous potential across a variety of fields including <strong>medicine, autonomous vehicles, cloud computing, retail, and chip production.</strong></p><p>“The global market size of AI is valued at hundreds of billions of dollars and it’s growing fast due to an increase in demand for improving efficiency and productivity of society.
HPC-AI Tech’s solution can help enterprises and users effectively reduce the cost of big AI model training and inference, making large AI models more accessible, and thus benefiting wider society. We will continue to explore more effective industrial solutions of large AI models to democratize complicated large model techniques and empower AI-enabled businesses.”</p><p>— Yang You, Chairman of HPC-AI Tech</p><p>“With the rapid evolution and development of AI models, AI is capable of effectively solving various kinds of problems in the industry. However, enterprises face high costs in the training, maintenance, and inference of AI models. To enable customers to effectively integrate extensive AI capabilities into the enterprise lifecycle, HPC-AI Tech has developed an efficient open-source deep learning system that allows users to significantly improve model training and inference efficiency. As an early investor of HPC-AI Tech, BlueRun Ventures is glad to see that HPC-AI Tech has started to develop at a high speed and has been recognized by more and more users in terms of its product and global open source ecology. HPC-AI Tech is believed to grow into a globally renowned AI company and empower various industries with AI in the near future.”</p><p>— Jimmy Shi, Venture Partner at BlueRun Ventures China</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NFKuJL4GbUllcXAm" /><figcaption><em>Yang You, Chairman of HPC-AI Tech (left), and Jimmy Shi, Venture Partner at BlueRun Ventures China (right)</em></figcaption></figure><p><strong>About HPC-AI Tech</strong></p><p>HPC-AI Tech is a global company focusing on High Performance Computing and Artificial Intelligence. The company was founded by Dr. Yang You, who is the Presidential Young Professor at the National University of Singapore and received his Ph.D. in Computer Science from UC Berkeley.</p><p>HPC-AI Tech has developed an efficient large AI model training and inference system, Colossal-AI, which integrates many advanced technologies such as multi-dimensional tensor parallelism, sequence parallelism, heterogeneous memory management, large-scale optimization, adaptive task scheduling, etc. Colossal-AI helps users deploy large AI model training and inference efficiently and quickly, reducing training budgets and the labor cost of learning and deployment.</p><p><strong>Media Contact:</strong></p><p>HPC-AI Technology</p><p>Yang You</p><p>contact@hpcaitech.com</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Meet Colossal-AI Team at SC22 and Other 3 Renowned International Conferences]]></title>
            <link>https://medium.com/@hpcaitech/meet-colossal-ai-team-at-sc22-and-other-3-renowned-international-conferences-5b4e0b080cf?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/5b4e0b080cf</guid>
            <category><![CDATA[high-performace-computing]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[computer-science]]></category>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Tue, 30 Aug 2022 14:41:21 GMT</pubDate>
            <atom:updated>2022-08-30T14:41:21.346Z</atom:updated>
            <content:encoded><![CDATA[<p>Recently, the Colossal-AI team, which developed a unified deep learning system for the big-model era, has been accepted and invited to deliver keynote speeches at a series of notable international conferences, including <strong>SuperComputing 2022 (SC22),</strong> <strong>the Open Data Science Conference (ODSC), the World Artificial Intelligence Conference (WAIC), and AWS Summit. </strong>At these events, the Colossal-AI team will share the latest technologies in High Performance Computing (HPC) and Artificial Intelligence (AI) that will change the world. Follow us and stay tuned!</p><p><strong>Colossal-AI Open Source Code</strong>: <a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a></p><h3>SuperComputing 2022 (SC22)</h3><p><strong>Time: Monday, 14 November 2022, 1:30pm — 5pm CST</strong></p><p><strong>Link:</strong> <a href="https://sc22.supercomputing.org/presentation/?id=tut129&amp;sess=sess211">https://sc22.supercomputing.org/presentation/?id=tut129&amp;sess=sess211</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*X_0QC5CWd32DdVh2" /></figure><p><strong>SC</strong> (formerly <strong>Supercomputing</strong>), the <strong>International Conference for High Performance Computing</strong>, <strong>Networking</strong>, <strong>Storage and Analysis</strong>, is an annual conference established in 1988 by the <a href="https://en.wikipedia.org/wiki/Association_for_Computing_Machinery">Association for Computing Machinery</a> and the <a href="https://en.wikipedia.org/wiki/IEEE_Computer_Society">IEEE Computer Society</a>. SC brings together the world’s top research institutions and companies in the computer industry to share the cutting-edge developments and innovations in HPC, networking, storage, and analysis that will unlock new solutions and change our world.</p><h3>Open Data Science Conference (ODSC)</h3><p><strong>Time: 1 November — 3 November, 2022</strong></p><p><strong>Link:</strong> <a href="https://odsc.com/california/">https://odsc.com/california/</a></p><p>The Open Data Science Conference (ODSC) hosts one of the largest gatherings of professional data scientists, with major conferences in the USA, Europe, and Asia. ODSC brings together the leading practitioners, innovation experts, and business professionals who drive artificial intelligence across a range of industries. At ODSC, attendees can learn about insightful ideas and innovative AI techniques that are gaining momentum across industries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oTbikzcdiZ6nh6pl" /></figure><h3>AWS Summit</h3><p><strong>Time: 22 September — 23 September, 2022</strong></p><p><strong>Link: </strong><a href="https://summit.awsevents.cn/2022/">https://summit.awsevents.cn/2022/</a></p><p>AWS Summit is Amazon Cloud Technology’s largest annual technology event in China. As a global cloud computing trendsetter event, AWS Summit gathers leading technology practitioners and business professionals from around the world to share their insights.
At the summit, they will discuss how Amazon Cloud Technology continues to explore the possibilities of cloud technology and empower different industries to realize more of their potential.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*c9eV0WZDnfmzWRyY" /></figure><h3>World Artificial Intelligence Conference (WAIC)</h3><p><strong>Time: 3 September, 2022</strong></p><p><strong>Link:</strong> <a href="https://www.worldaic.com.cn/">https://www.worldaic.com.cn/</a></p><p>The World Artificial Intelligence Conference (WAIC) has become one of the most influential international platforms for professional exchange and cooperation on the development of artificial intelligence. Over the past three years, the event has invited nearly 300 high-profile speakers, including Turing Award winners, executives from leading multinational conglomerates, and founders of unicorn AI companies. At WAIC 2021, more than 1,000 speakers shared their views and nearly 300 media outlets covered the event.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GM7XqmPbzFWW4csc" /></figure><h3>About us</h3><p><strong>HPC-AI Tech</strong></p><p>HPC-AI Tech is a global company focusing on High Performance Computing and Artificial Intelligence. The company was founded by Dr. Yang You, who is the Presidential Young Professor at the National University of Singapore and received his Ph.D. in Computer Science from UC Berkeley.</p><p>HPC-AI Tech has developed an efficient large AI model training and inference system, Colossal-AI, which integrates many advanced technologies such as multi-dimensional tensor parallelism, sequence parallelism, heterogeneous memory management, large-scale optimization, adaptive task scheduling, etc. Colossal-AI helps users deploy large AI model training and inference efficiently and quickly, reducing training budgets and the labor cost of learning and deployment.</p><p><strong>Prof. Yang You</strong></p><p>Prof. Yang You is a Presidential Young Professor at the National University of Singapore. He received his Ph.D. in Computer Science from UC Berkeley. His current research focuses on scaling up deep neural network training on distributed systems and supercomputers. In 2017, his team broke the world record for ImageNet training speed, which was covered by technology media such as NSF, ScienceDaily, Science NewsLine, and i-programmer. In 2019, his team broke the world record for BERT training speed, and their BERT training techniques have been used by many tech giants such as Google, Microsoft, and NVIDIA. Yang You’s LARS and LAMB optimizers are available in the industry benchmark MLPerf. He is a winner of the IPDPS 2015 Best Paper Award (0.8%), the ICPP 2018 Best Paper Award (0.3%), and the ACM/IEEE George Michael HPC Fellowship. Yang You is a Siebel Scholar and a winner of the Lotfi A. Zadeh Prize. He was nominated by UC Berkeley for the ACM Doctoral Dissertation Award (2 out of the 81 Berkeley EECS Ph.D. students who graduated in 2020). He also made the Forbes 30 Under 30 Asia list (2021) for young leaders and won the IEEE-CS TCHPC Early Career Award.</p><p><strong>Prof. James Demmel</strong></p><p>Prof. James Demmel is the Dr. Richard Carl Dehmel Distinguished Professor of Computer Science and Mathematics at the University of California at Berkeley, and former Chair of the EECS Dept.
He also serves as Chief Strategy Officer for the startup HPC-AI Tech, whose goal is to make large-scale machine learning much more efficient with little programming effort required of users. Demmel’s research is in high performance computing, numerical linear algebra, and communication-avoiding algorithms. He is known for his work on the widely used LAPACK and ScaLAPACK linear algebra libraries. He is a member of the National Academy of Sciences, National Academy of Engineering, and American Academy of Arts and Sciences; a Fellow of the AAAS, ACM, AMS, IEEE, and SIAM; and a winner of the IPDPS Charles Babbage Award, the IEEE Computer Society Sidney Fernbach Award, the ACM Paris Kanellakis Award, the J. H. Wilkinson Prize in Numerical Analysis and Scientific Computing, and numerous best paper prizes.</p><p><strong>Reference</strong></p><p><a href="https://www.163.com/dy/article/HEDANKUM0552XXBZ.html">https://www.163.com/dy/article/HEDANKUM0552XXBZ.html</a></p><p><a href="https://www.worldaic.com.cn/forum">https://www.worldaic.com.cn/forum</a></p><p><a href="https://odsc.com/california/events-west/">https://odsc.com/california/events-west/</a></p><p><a href="https://sc22.supercomputing.org/">https://sc22.supercomputing.org/</a></p><p><a href="https://mp.weixin.qq.com/s/kdjXzsaPdmCl8gAeEZ9_1w">https://mp.weixin.qq.com/s/kdjXzsaPdmCl8gAeEZ9_1w</a></p><p><a href="https://baike.baidu.com/item/%E5%85%A8%E7%90%83%E8%B6%85%E7%BA%A7%E8%AE%A1%E7%AE%97%E5%A4%A7%E4%BC%9A/16765988">https://baike.baidu.com/item/%E5%85%A8%E7%90%83%E8%B6%85%E7%BA%A7%E8%AE%A1%E7%AE%97%E5%A4%A7%E4%BC%9A/16765988</a></p><p><a href="https://www.baike.com/wikiid/7900314974072756281?from=wiki_content&amp;prd=innerlink&amp;view_id=1s5vstdve7b400">https://www.baike.com/wikiid/7900314974072756281?from=wiki_content&amp;prd=innerlink&amp;view_id=1s5vstdve7b400</a></p><p><a href="https://summit.awsevents.cn/2022/">https://summit.awsevents.cn/2022/</a></p><p><a href="https://en.wikipedia.org/wiki/ACM/IEEE_Supercomputing_Conference">https://en.wikipedia.org/wiki/ACM/IEEE_Supercomputing_Conference</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Accelerating Structure Prediction of Protein Monomers and Multimer by 11 Times!]]></title>
            <link>https://medium.com/@hpcaitech/accelerating-structure-prediction-of-protein-monomers-and-multimer-by-11-times-769715dcb5b5?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/769715dcb5b5</guid>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Tue, 23 Aug 2022 10:03:31 GMT</pubDate>
            <atom:updated>2022-08-23T10:03:31.577Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>Accelerating Structure Prediction of Protein Monomers and Multimers by 11 Times! An Open Source Solution from Colossal-AI and BioMap</strong></h3><p>The latest solution from the Colossal-AI team (<a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a>) and BioMap for protein monomer and multimer structure prediction, xTrimo Multimer, has recently been open-sourced to the public. This new solution can predict <strong>both monomer and multimer structures</strong> simultaneously, accelerating the process by up to <strong>11 times</strong>!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/852/1*ziOxbOQprGaxTEfSe2uiGg.png" /></figure><p>The hero behind it is Colossal-AI, a powerful deep learning system that aims to make large AI model training easy and accessible to the community and industry. By integrating the large model training techniques and optimizations provided by Colossal-AI, we can significantly reduce the time and cost of both protein monomer and multimer structure prediction during model training and inference. As an important practice of the Colossal-AI system in the pharmaceutical industry, xTrimo Multimer can greatly increase the pace of model design and development for protein structure prediction, facilitating breakthroughs in large AI model applications in healthcare and bioinformatics.</p><p>Learn more about our solutions here: <a href="https://github.com/hpcaitech/ColossalAI/#xTrimoMultimer">https://github.com/hpcaitech/ColossalAI/#xTrimoMultimer</a></p><p>Colossal-AI is a user-friendly deep learning system that enables companies to maximize the efficiency of AI deployments while drastically reducing costs. Since going open source, Colossal-AI has reached №1 among trending projects on GitHub and Papers With Code several times, ahead of other projects with as many as 10K stars. Furthermore, Colossal-AI keeps increasing the availability of AI solutions for industry and is already showing tremendous potential across a variety of fields including <strong>medicine, autonomous vehicles, cloud computing, retail, and chip production. </strong>The most recent application is Colossal-AI’s partnership with BioMap to propose the latest cost-effective solution for protein monomer and multimer structure prediction. This application can help healthcare providers and pharmaceutical companies in diagnosis and stimulate novel drug research and discovery.</p><p>Protein structure prediction is one of the most important topics in structural biology and supplements the understanding of gene translation and protein function. Unfortunately, the multi-level structure and sophisticated protein interactions make it challenging to predict the 3D structure accurately.</p><p>In recent years, the success of deep neural networks has transformed various practices. Since DeepMind’s release of AlphaFold (which accurately predicts protein structure from amino acid sequences), the field of biology has witnessed a boom in the use of AI for protein structure prediction.</p><p>Specifically, AlphaFold can generate end-to-end 3D structure predictions of protein monomers directly from amino acid sequences. AlphaFold’s use also extends beyond the realm of monomers.
Since the majority of proteins function as multimers, DeepMind recently released the AlphaFold-Multimer model to predict the structure of multimers.</p><p>To boost the development of AlphaFold, the Colossal-AI team released FastFold, an open-source, optimized implementation of AlphaFold, in the past few months. The team successfully reduced AlphaFold’s training time from 11 days to only 67 hours and accelerated inference by up to ~11.6 times, and it continues its efforts to democratize large-scale AI model applications in the pharmaceutical field.</p><p>Interactions between proteins are critical to their biological functions. To address the difficulties of protein monomer and multimer structure prediction, the Colossal-AI team proposed the industry’s latest solution, xTrimo Multimer. xTrimo Multimer better reflects protein interactions, enhancing potential target analysis, protein structure prediction/simulation, and high-precision antibody design in drug discovery and development.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/852/1*7YYfBXpv3TzjIcCm2Zol1A.jpeg" /></figure><p>The unaffordable economic and time costs of AlphaFold’s inference have led to challenges in its research and development, particularly for long-sequence inference, where computational complexity and memory consumption rise sharply. Based on the computational features of the AlphaFold-Multimer model, the Colossal-AI team introduced CUDA optimization and kernel fusion techniques for xTrimo Multimer, achieving remarkable inference performance. Compared to AlphaFold 2 and OpenFold (from Columbia University), xTrimo Multimer improves inference performance on a single GPU by 1.58–2.14 times and 1.14–2.23 times, respectively.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/852/1*QvwRRHq0oZoBL509toFYlQ.png" /></figure><p>Additionally, the xTrimo Multimer model supports distributed inference for lengthy sequences. By introducing Dynamic Axial Parallelism, xTrimo Multimer efficiently distributes computation and part of the GPU memory footprint across multiple devices, thereby solving the computational and memory challenges that long sequences pose; a minimal sketch of the idea follows the figure below. xTrimo Multimer achieves 8.47x and 11.15x speedups compared to OpenFold and AlphaFold 2 on multiple GPUs with sequence lengths ranging from 2–3K, and a 4.45x acceleration compared to Uni-Fold 2.0. Furthermore, xTrimo Multimer supports inference with sequences of up to 4K, whereas OpenFold and AlphaFold 2 are restricted by GPU memory at such lengths. With xTrimo Multimer, scientists can run a 4K-length sequence inference in about 20 minutes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/852/1*q89vx01y5gsxUIxSKpgATw.png" /></figure>
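<p>For intuition, here is a toy NumPy sketch of the axis-sharding idea behind Dynamic Axial Parallelism. Everything in it is an illustrative assumption rather than the FastFold/xTrimo Multimer implementation: the shapes, the mean operators standing in for row- and column-wise modules, and the concatenation standing in for device communication.</p><pre>import numpy as np

# Toy &quot;pair representation&quot; of shape (L, L, c), as in AlphaFold-style
# models; L is the sequence length, c the channel dimension.
L, c, world_size = 8, 4, 2
pair = np.arange(L * L * c, dtype=float).reshape(L, L, c)

# Shard along the first axis: each device holds L / world_size rows,
# so a row-wise operator (here: a mean over axis 1) is fully local.
row_shards = np.split(pair, world_size, axis=0)
row_means = [shard.mean(axis=1) for shard in row_shards]
assert np.allclose(np.concatenate(row_means, axis=0), pair.mean(axis=1))

# A column-wise operator needs the other axis, so the shards are
# re-partitioned along axis 1 (an all-to-all in a real system).
full = np.concatenate(row_shards, axis=0)  # stands in for communication
col_shards = np.split(full, world_size, axis=1)
col_means = [shard.mean(axis=0) for shard in col_shards]
assert np.allclose(np.concatenate(col_means, axis=0), pair.mean(axis=0))</pre>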
<p>“The collaboration with HPC-AI Tech brings together the cutting-edge technology in large AI model training from the Colossal-AI team and the biocomputing domain expertise of BioMap. The release of xTrimo Multimer is an important step towards integrating the advantages of large AI model training and inference into the construction of BioMap’s xTrimo multimodal system.” — Le Song, Chief AI Scientist of BioMap</p><p>“Our latest solution for protein monomer and multimer structure prediction is important progress for Colossal-AI in solving real-world industrial problems. In the future, we will cooperate more deeply with BioMap on large biocomputing models to stimulate the application and implementation of deep learning in innovative drug development.” — Yang You, Chairman of HPC-AI Tech</p><p>The accelerated implementation, xTrimo Multimer, will serve as one of the important products, alongside other industrial solutions built upon Colossal-AI, that facilitate large-scale AI modeling for global companies. The Colossal-AI team will continue to explore emerging possibilities of AI model training in various fields, tackle issues in modern industry, and empower the future of the global AI market.</p><p><strong>Portal</strong></p><p>Open Source xTrimoMultimer: <a href="https://github.com/hpcaitech/ColossalAI/#xTrimoMultimer">https://github.com/hpcaitech/ColossalAI/#xTrimoMultimer</a></p><p>Open Source Colossal-AI: <a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a></p><p><strong>About BioMap</strong></p><p>BioMap is a team of world-renowned scientists with extensive expertise in disease biology, bioinformatics, machine learning/deep learning, and antibody engineering. BioMap was co-founded by Baidu’s Founder/CEO Robin Li and the former CEO of Baidu Ventures, Wei Liu. They are committed to bringing first-in-class medicines for unmet medical needs in the areas of immuno-oncology, autoimmune diseases, fibrosis, and aging-related diseases.</p><p><strong>About HPC-AI Tech</strong></p><p>HPC-AI Tech is a global company that aims to help users improve the efficiency of training and deploying large AI models. The company was founded by Dr. Yang You, who received his Ph.D. in Computer Science from UC Berkeley and is currently the Presidential Young Professor at the National University of Singapore. HPC-AI Tech has developed an efficient large AI model training and inference system, Colossal-AI, which integrates advanced technologies that help users efficiently deploy large AI model training and inference at low cost.</p><p><strong>Reference:</strong></p><p><a href="https://www.biomap.com/en/team">https://www.biomap.com/en/team</a></p><p><a href="https://www.technologyreview.com/2021/07/22/1029973/deepmind-alphafold-protein-folding-biology-disease-drugs-proteome/">https://www.technologyreview.com/2021/07/22/1029973/deepmind-alphafold-protein-folding-biology-disease-drugs-proteome/</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Job Opening: Sales Director]]></title>
            <link>https://medium.com/@hpcaitech/job-opening-sales-director-1f0f807f0f1f?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/1f0f807f0f1f</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[hiring]]></category>
            <category><![CDATA[marketing]]></category>
            <category><![CDATA[sales]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Fri, 29 Jul 2022 13:47:24 GMT</pubDate>
            <atom:updated>2022-09-02T03:21:39.347Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://www.hpcaitech.com/">HPC-AI Tech</a> is a dynamic startup blazing the way in applying high-performance computing techniques to artificial intelligence. We are pioneers in writing performant, extensible, and easy-to-use AI systems that run on the cloud.</p><p>HPC-AI Tech is now looking for an experienced Sales Director to manage and oversee the company’s sales operations. Main duties include maintaining good relationships with key clients and customers, designing and executing plans to meet sales targets, evaluating costs to determine product pricing, keeping abreast of AI-era trends, and training sales managers.</p><p>HPC-AI Tech offers a high base salary (<strong>80k — 160k USD</strong> per year) plus a high sales commission (<strong>15% of each order</strong>) or a bonus (<strong>80k — 160k USD </strong>per year). The position is preferably based in the U.S.</p><p>Come and join us!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*15nzog5gvLFgQ106" /></figure><p><strong>Roles and Responsibilities</strong></p><ol><li>Develop the company’s marketing strategy and help the company’s products find application in industry.</li><li>Own the sales of the company’s products to customers, developing and executing strategic plans to achieve sales targets.</li><li>Build and maintain long-term, strong relationships with customers while partnering with them to better understand their business objectives and needs.</li><li>Understand industry-specific trends and landscapes, match customer needs with solutions, and define a clear path to client satisfaction and revenue growth for the company.</li><li>Increase brand awareness and market share of the company’s products.</li></ol><p><strong>Qualifications</strong></p><ol><li>Bachelor’s degree in business, marketing, communications, or a related field.</li><li>At least 3 years of B2B sales experience in AI, Cloud Computing, or related fields and a basic understanding of AI. Familiarity with industry-specific trends and landscapes in High-Performance Computing and AI is advantageous.</li><li>Proven performance in industry sales and a rich accumulation of convertible industry customer resources in AI, Internet, medical, automotive, and other related fields.</li><li>Excellent listening, negotiation, presentation, and communication skills.</li><li>Experienced, energetic, and able to demonstrate the ability to sell.</li></ol><p><strong>Join Us</strong>: contact@hpcaitech.com</p><p>Image Source: <a href="https://pixabay.com/zh/photos/ladder-success-success-ladder-2713346/">https://pixabay.com/zh/photos/ladder-success-success-ladder-2713346/</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[HPC-AI Tech Joins NVIDIA Inception]]></title>
            <link>https://medium.com/@hpcaitech/hpc-ai-tech-joins-nvidia-inception-798e057fe76f?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/798e057fe76f</guid>
            <category><![CDATA[high-performace-computing]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Tue, 28 Jun 2022 13:00:15 GMT</pubDate>
            <atom:updated>2022-06-28T13:00:15.771Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>June, 2022 — </strong><a href="https://hpcaitech.com/">HPC-AI Tech</a> today announced it has joined <a href="https://nvda.ws/2BvtUc9">NVIDIA Inception</a>, a program designed to nurture startups revolutionizing industries with technology advancements.</p><p>HPC-AI Tech is focused on increasing AI productivity and building a world-class distributed AI development and deployment platform that enables supercomputers and cloud platforms to serve AI at a much lower cost. Based on years of academic achievements and industry experience, HPC-AI Tech has developed an efficient large AI model training and inference system, <a href="https://github.com/hpcaitech/ColossalAI">Colossal-AI</a>, to deal with the industry’s pain points. HPC-AI Tech is dedicated to the deep integration of software system design and hardware architecture, offering efficient AI deployment services with high integration, automation, and intelligence.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6PlwYnlc6Y1p1NOF" /></figure><p>NVIDIA Inception will allow HPC-AI Tech to evolve faster through access to cutting-edge technology and NVIDIA experts, connections with venture capitalists, and co-marketing support to heighten HPC-AI Tech’s visibility. NVIDIA will work closely with HPC-AI Tech to provide the best technical tools, the latest resources, and opportunities to connect with investors. The program will also offer HPC-AI Tech the opportunity to collaborate with industry-leading experts and other AI-driven organizations.</p><blockquote><em>“We are very happy to join NVIDIA Inception. It is an impressive program that helps us drive our business forward through go-to-market support, training, technical assistance, etc. We will work with NVIDIA to explore more effective applications of GPUs to democratize complicated large model techniques in the AI community.” — Yang You, Chairman of HPC-AI Tech</em></blockquote><p>NVIDIA Inception helps startups during critical stages of product development, prototyping, and deployment. Every NVIDIA Inception member gets a custom set of ongoing benefits, such as <a href="https://www.nvidia.com/en-us/deep-learning-ai/education/?ncid=em-ded-n1-96486">NVIDIA Deep Learning Institute</a> credits, marketing support, and technology assistance, which provide startups with the fundamental tools to help them grow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NNwNje1sQqNMJ_Bf" /></figure><p><strong>About HPC-AI Tech</strong></p><p>HPC-AI Tech is a global company focusing on High Performance Computing and Artificial Intelligence. The company was founded by Dr. Yang You, who is the Presidential Young Professor at the National University of Singapore and received his Ph.D. in Computer Science from UC Berkeley.</p><p>HPC-AI Tech has developed an efficient large AI model training and inference system, Colossal-AI, which integrates many advanced technologies such as multi-dimensional tensor parallelism, sequence parallelism, heterogeneous memory management, large-scale optimization, adaptive task scheduling, etc.
Colossal-AI helps users deploy large AI model training and inference efficiently and quickly, reducing training budgets and the labor cost of learning and deployment.</p><p><strong>Media Contact:</strong></p><p>HPC-AI Technology</p><p>Yang You</p><p>contact@hpcaitech.com</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Surpassing NVIDIA FasterTransformer’s Inference Performance by 50%, Open Source Project Powers into…]]></title>
            <link>https://medium.com/@hpcaitech/surpassing-nvidia-fastertransformers-inference-performance-by-50-open-source-project-powers-into-6139c5bc7790?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/6139c5bc7790</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Mon, 30 May 2022 23:53:55 GMT</pubDate>
            <atom:updated>2022-05-30T23:53:55.320Z</atom:updated>
            <content:encoded><![CDATA[<h3>Surpassing NVIDIA FasterTransformer’s Inference Performance by 50%, Open Source Project Powers into the Future of Large Models Industrialization</h3><p>The development of artificial intelligence (AI) and high-performance computing (HPC) has dramatically revolutionized our lives and brought significant potential to our society. For example, Facebook has 1.82 billion daily active users and issues tens of trillions of inference queries per day, which requires essential changes in development processes and system design.</p><p>Traditional deep learning (DL) systems usually focus on the single-model, single-machine inference setting. However, the exponential growth of DL models makes it impossible to execute large pre-trained models on a single machine. Specifically, super-large NLP models like GPT-3 require more than one hundred GBs of memory for inference; a single GPU cannot hold such tasks, which makes collaborative inference across multiple distributed computing devices the future of large-model inference.</p><p>The Colossal-AI team developed Energon-AI, a subsystem that provides inference services for super-scale DL models. Focusing on these specific pain points, the team delved into multi-device inference and built the large-scale inference system Energon-AI around its idea of “High Performance, High Usability, High Versatility”.</p><p><strong>With few changes to existing projects, users can easily deploy large models for inference and achieve superlinear speedups when scaling out. Compared to FasterTransformer from NVIDIA, Energon-AI reaches an improvement of 50% in parallelized inference speed on large AI models.</strong></p><p>Moreover, unlike existing inference solutions, Energon-AI requires no manual configuration of communication or memory usage, and no extra compilation. With Energon-AI, inference becomes far easier than before.</p><p>Open source address: <a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a></p><p><strong>Difficulties with Large AI Model Inference</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/563/0*lLVjZWoqYhSu6a08" /><figcaption>Rapid growth of model parameters [https://arxiv.org/abs/2111.14247]</figcaption></figure><p>Recently, computing devices such as GPUs have been greatly upgraded in parallel computing capability, memory capacity, memory speed, etc. However, performance improvements on a single device can never meet the requirements of large models whose parameter counts grow exponentially. Current deep learning inference systems mainly focus on simple scenarios like multi-model single-machine and single-model single-machine, overlooking the challenges and opportunities of the single-model multi-machine scenario that is essential for large AI model inference. We introduce the Energon-AI system to address these issues.</p><h3>Large-Scale Inference System: Energon-AI</h3><h4><strong>System Design</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/645/0*KDiIVhDUBOFB08j5" /><figcaption>Schematic diagram of Energon-AI</figcaption></figure><p>Focusing on the deployment of large AI models, we designed Energon-AI as a single-model multi-machine inference system. The <strong>system comprises three levels: the runtime system (Runtime), the distributed inference instance (Engine), and the service system (Serving)</strong>:</p><ul><li><strong>Runtime:</strong> In designing the runtime system, we found that as models scale up, the share of time taken by general matrix multiplications gradually increases, while the share taken by memory-intensive operators and kernel launches decreases. The workload shifts from memory-intensive to compute-intensive, and the benefit of TensorRT and other inference systems specialized for memory-intensive operations is greatly reduced. Energon-AI Runtime relies on Colossal-AI to achieve tensor parallelism. Meanwhile, we also designed a pipeline-parallel packaging method for cases of insufficient memory. In addition, we introduced a large number of specialized inference operators and methods. For example, for variable-length inputs in NLP, we introduced operators such as transpose_padding_rebulid and transpose_padding_remove to efficiently process the MLP layers of Encoder and Decoder models.</li><li><strong>Engine:</strong> To give the Engine exactly the same behavior as single-device inference through encapsulation, we adopt a semi-centralized method: the main process uses RPC to call initialization or inference methods on each device, so that distributed inference can be centrally controlled, while each device maintains its own communication logic for tensor parallelism and pipeline parallelism. We design and maintain a distributed message queue in each process to ensure the consistency of multi-threaded call execution across processes.</li><li><strong>Serving:</strong> Energon-AI introduces a dynamic batching mechanism. Requests in the request queue are optimally packaged according to machine performance, and Energon-AI then selects the highest-priority batch based on waiting time, batch size, batch expansion possibility (based on sentence length after padding), and other factors. This maximizes GPU utilization while avoiding starvation and reducing average request latency (a minimal sketch of this selection logic follows the diagram below).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/0*jL7Pqc8YcygPvPKF" /><figcaption>Schematic diagram of the batch management process</figcaption></figure>
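<p>To make the Serving level concrete, below is a minimal Python sketch of priority-driven dynamic batch selection. It illustrates the idea described above and is not the actual Energon-AI implementation; the Request class, the max_batch_tokens budget, and the wait_weight scoring factor are assumptions made for this example.</p><pre>import time

# A request carries its input tokens and an arrival timestamp.
class Request:
    def __init__(self, tokens):
        self.tokens = tokens
        self.arrival = time.time()

def pick_batch(queue, max_batch_tokens=8192, wait_weight=0.1):
    # Sort pending requests by length so that neighbours pad efficiently.
    pending = sorted(queue, key=lambda r: len(r.tokens))
    best, best_score = [], float(&quot;-inf&quot;)
    now = time.time()
    for start in range(len(pending)):
        batch = []
        for req in pending[start:]:
            # Cost of the padded batch = batch size * longest sequence,
            # because shorter requests are padded to the longest one.
            padded_cost = (len(batch) + 1) * len(req.tokens)
            if padded_cost &gt; max_batch_tokens:
                break
            batch.append(req)
        if not batch:
            continue
        # Score by batch size plus the oldest waiting time, so that
        # long-waiting requests cannot starve.
        oldest_wait = now - min(r.arrival for r in batch)
        score = len(batch) + wait_weight * oldest_wait
        if score &gt; best_score:
            best, best_score = batch, score
    for req in best:
        queue.remove(req)
    return best</pre>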
<h3>Performance Testing</h3><h4>Superlinear Scaling of Parallel Inference</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rG5r34uxCibG-LPF" /><figcaption><em>Tensor parallel scalability test results. Hardware: 8*A100 GPU 80GB. Since the memory of a single device cannot meet the inference requirements of GPT-3, the figure shows test results for a 12-layer GPT-3, with sentence length set to half the padding length.</em></figcaption></figure><p>With a batch size of 32, Energon-AI parallel inference on 8 GPUs achieves an <strong>8.5x superlinear speedup compared to </strong>direct PyTorch inference on one GPU.</p><h4>Runtime Inference Performance Improved by 50%</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KrGW7UIxNocG30BV" /><figcaption><em>Comparison of inference latency of tensor parallel runtime systems. Hardware: 8*A100 GPU 80GB. Sentence length is set to half the padding length. GPT-3–24-Layers for TP=2, GPT-3–48-Layers for TP=4.</em></figcaption></figure><p>We compared Energon-AI with NVIDIA’s highly optimized FasterTransformer on GPT-3.
FasterTransformer introduced its distributed inference feature in version 4.0 and currently supports distributed inference of the GPT-3 model. However, its highly coupled, pure C++ codebase limits its flexibility and usability. In addition, for the variable-length input sentences typical of NLP inference, its distributed inference provides no mechanism to eliminate redundant computation.</p><p>For the GPT-3 model, the Energon-AI runtime system performs slightly below FasterTransformer when the batch size is 1, but achieves a <strong>performance improvement of more than 50% as the batch size increases</strong>.</p><h4>30% Increase in Dynamic Batching Throughput</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SpxiXs_-_bNXJV5Y" /><figcaption><em>Dynamic batching versus direct packaging batch throughput. Hardware: 8*A100 GPU 80GB. The model used in the test is GPT-3, test sentence lengths are randomly generated with a maximum of 256, and the padding strategy pads to the longest sentence in the batch.</em></figcaption></figure><p>We simulated a real-world scenario in which multiple users send a large number of variable-length inference requests at the same time, and compared the throughput of our dynamic batching method with the traditional FIFO (first-in-first-out) method. Since the dynamic batching algorithm alleviates the massive redundant computation caused by direct padding, its throughput is improved by 34.7%.</p><h3>High Usability</h3><pre>from gpt import gpt3
from gpt_batch_server import launch_engine

# for engine
model_class = gpt3
model_type = &quot;gpt&quot;
host = &quot;127.0.0.1&quot;
port = 29401
half = True
backend = &quot;nccl&quot;

# for parallel
tp_init_size = 4
pp_init_size = 2

# for server
engine_server = launch_engine
server_host = &quot;127.0.0.1&quot;
server_port = 8020
rm_padding = False</pre><p>The service is then launched from this configuration file:</p><pre>energonai service init --config_file=gpt_config.py</pre><p>While pursuing high performance, Energon-AI aims to keep the system highly flexible and usable. Users only need to customize the parallel model, the parallel parameters, and the service request logic in a configuration file to start an inference task. Currently, we provide the most common GPT, BERT, and ViT models as examples, and more detailed tutorials will be provided in the near future.</p><p>When building a new parallel model, Energon-AI uses Python, with usage similar to PyTorch: it has the concept of layers, and the logic of initialization and execution is clear. Users do not need to consider memory management or parallel communication. The following code shows how to run a model with two Linear layers in parallel with Energon-AI.</p><pre>import torch.nn as nn
# Linear1D_Col and Linear1D_Row are the 1D tensor-parallel layers
# provided by Colossal-AI (the exact import path depends on the
# installed version).

class MLP(nn.Module):
    def __init__(self, dim, dtype, bias):
        super().__init__()
        self.dense_0 = Linear1D_Col(dim, dim, dtype=dtype, bias=bias, gather_output=False)
        self.dense_1 = Linear1D_Row(dim, dim, dtype=dtype, bias=bias, parallel_input=True)

    def forward(self, x):
        x = self.dense_0(x)
        x = self.dense_1(x)
        return x</pre>
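<p>To see why this column-then-row split works, here is a small NumPy sketch of the underlying arithmetic. It is a hand-rolled illustration under our own assumptions (plain arrays standing in for devices), not Colossal-AI internals: the first weight is split by columns, the second by rows, and a single all-reduce at the end recovers the exact serial result.</p><pre>import numpy as np

rng = np.random.default_rng(0)
dim, world_size = 8, 2
x = rng.standard_normal((4, dim))      # input batch
w0 = rng.standard_normal((dim, dim))   # weight of dense_0
w1 = rng.standard_normal((dim, dim))   # weight of dense_1

# Serial reference result: y = (x @ w0) @ w1
reference = x @ w0 @ w1

# Linear1D_Col splits w0 by columns; Linear1D_Row splits w1 by rows.
w0_shards = np.split(w0, world_size, axis=1)
w1_shards = np.split(w1, world_size, axis=0)

# Each rank keeps its partial activation (gather_output=False) and feeds
# it directly into its own row shard (parallel_input=True), so there is
# no communication between the two layers.
partials = [(x @ w0_shards[r]) @ w1_shards[r] for r in range(world_size)]

# One all-reduce (modelled here as a sum) recovers the exact result.
y = sum(partials)
assert np.allclose(y, reference)</pre>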
<p>In contrast, when building a new parallel model with FasterTransformer, users are required to write C++ and to manually manage underlying behavior such as memory allocation and communication. Due to space limitations, the following code shows only the parts needed for memory management and communication setup of a two-Linear-layer model running in parallel with FasterTransformer. Moreover, users must spend considerable time and energy debugging before the code executes correctly, and the C++ code requires additional compilation work. All of this places serious demands on the user’s knowledge of parallelism and programming ability.</p><pre>// Memory allocation (shown for only a single parameter).
T *d_inter_kernel = NULL;
device_malloc(&amp;d_inter_kernel, dim * dim);
param_.ffn.intermediate_weight.kernel = d_inter_kernel;

// Two MLP layers
cublasMM_cublasLtMM_wrapper(param_.cublaslt_handle, param_.cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &amp;alpha, param_.ffn.intermediate_weight.kernel, AType_, n, attr_matmul_buf_, BType_, k, &amp;beta, (DataType_ *)inter_matmul_buf_, CType_, n, param_.stream, cublasAlgoMap_, sm_, cublas_workspace_);
add_bias_act_kernelLauncher&lt;DataType_&gt;(inter_matmul_buf_, param_.ffn.intermediate_weight.bias, m, n, ActivationType::GELU, param_.stream);
n = k;
cublasMM_cublasLtMM_wrapper(param_.cublaslt_handle, param_.cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &amp;alpha, param_.ffn.output_weight.kernel, AType_, n, inter_matmul_buf_, BType_, k, &amp;beta, (DataType_ *)(param_.transformer_out), CType_, n, param_.stream, cublasAlgoMap_, sm_, cublas_workspace_);
add_bias_input_layernorm_kernelLauncher&lt;DataType_&gt;(param_.transformer_out, attr_matmul_buf_, param_.ffn.output_weight.bias, param_.ffn_layernorm.gamma, param_.ffn_layernorm.beta, m, n, param_.stream);

// Communication
if(t_parallel_param_.world_size &gt; 1)
{
    all2all_gather(nccl_logits_buf_, nccl_logits_buf_, local_batch * n, t_parallel_param_, decoding_params.stream);
}</pre><h3>More Features</h3><p>The recently released Energon-AI subsystem is a beta version. Guided by user feedback and our development roadmap, we will iterate intensively and provide an official version as soon as possible to fully cover different inference deployment needs. Requirements and suggestions for Energon-AI are always welcome.</p><h3>Building a Large AI Model Ecosystem</h3><p>In the age of large AI models, to solve the pain points of existing solutions such as limited parallel dimensions, low efficiency, poor versatility, difficult deployment, and lack of maintenance, Colossal-AI uses technologies such as efficient multi-dimensional parallelism and heterogeneous parallelism to <strong>allow users to deploy large AI models efficiently and quickly with only a few modifications to their code</strong>.</p><p>For example, for a super-large AI model such as GPT-3, compared with the NVIDIA solution, Colossal-AI <strong>needs only half the computing resources</strong>; with the same computing resources, speed improves by a further 11%, which could <strong>reduce the training cost of GPT-3 by over a million dollars</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/0*UcHMdtU4AoDE92ce" /></figure><p>For AlphaFold, which is used for protein structure prediction, our team introduced FastFold, based on the Colossal-AI acceleration scheme.
FastFold successfully surpassed the schemes proposed by Google and Columbia University, reducing AlphaFold's training time from 11 days to 67 hours while lowering the total cost as well. In long-sequence inference, it achieves a 9.3x to 11.6x speedup.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/0*PBbvKa5h18PDz_o4" /></figure><p>Colossal-AI is compatible with low-end devices and can <strong>train GPT with up to 18 billion parameters on a PC with only one GPU</strong>; ordinary laptops can also train models with more than one billion parameters. Compared with current popular solutions, the parameter capacity can be increased by more than 10 times, which greatly lowers the threshold for downstream tasks and application deployment, such as fine-tuning and inference of large AI models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/750/0*Q8SJt2o2LMUz2T5X" /></figure><p>Colossal-AI also values building its open-source community, providing tutorials in English and Chinese and supporting the latest cutting-edge applications such as PaLM and AlphaFold. Colossal-AI will roll out new and innovative features regularly as well. We always welcome suggestions and discussions from the community, and we would be more than willing to help if you encounter any issues.</p><p>Recently, Colossal-AI reached <strong>No. 1 among the top trending projects on GitHub</strong>, against a backdrop of many projects with as many as 10K stars.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/0*YXhtxU5tvxY8rQ3i" /></figure><p>On Papers With Code, a website that highlights trending machine learning research, Colossal-AI has also topped the trending list and attracted many researchers' attention with its impressive features.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Mha4NCBFC7nciEF9" /></figure><h3>Portal</h3><p>Project address: <a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a></p><h3>Funding</h3><p>HPC-AI Tech raised 4.7 million USD from top VC firms within 3 months of its founding. For more information, please email contact@hpcaitech.com.</p><p>Reference:</p><p><a href="https://arxiv.org/pdf/2111.14247.pdf">https://arxiv.org/pdf/2111.14247.pdf</a></p><p><a href="https://www.sciencedirect.com/science/article/abs/pii/S0950584920301373">https://www.sciencedirect.com/science/article/abs/pii/S0950584920301373</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6139c5bc7790" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Implementing the Gargantuan Pathways with Colossal-AI, easy and efficient!]]></title>
            <link>https://medium.com/@hpcaitech/implementing-the-gargantuan-pathways-with-colossal-ai-easy-and-efficient-6362c1144e53?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/6362c1144e53</guid>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[parallel-computing]]></category>
            <category><![CDATA[google]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Wed, 18 May 2022 06:25:19 GMT</pubDate>
            <atom:updated>2022-05-18T06:25:19.475Z</atom:updated>
            <content:encoded><![CDATA[<p>Today’s AI models are great at specializing in one task, whether that be object detection over a particular domain of labels (for example, cats and dogs) or generating natural language (e.g. a book or an essay) based on a prompt. Ask a model to do both, however, and you get into very murky territory. That was the case until 2021, when Google’s Jeff Dean came up with what may be the future of deep learning: Pathways. Pathways is a multi-modal, sparse, deep learning architecture that can <em>generalize to millions of tasks.</em> It is an incremental step towards a more artificially general and intelligent machine.</p><p>Compared to the original Transformer architecture, the Pathways Language Model (PaLM) makes a series of bold innovations. Due to its inordinate complexity, however, it once again places the burden of an efficient implementation on modern hardware on programmers. Moreover, significant parts of PaLM are not open sourced, making it difficult to implement on widely available hardware such as GPUs.</p><p>Fret not! The Colossal-AI team has implemented the model structure of PaLM and applied several state-of-the-art (SOTA) high performance computing (HPC) techniques to its implementation, in a parallelised manner across GPUs, extracting the very last iota of performance.</p><p>Let’s talk about the Colossal-AI project, Pathways and our efficient implementation.</p><p>If you are interested, do check out our GitHub repo: <a href="https://github.com/hpcaitech/PaLM-colossalai">https://github.com/hpcaitech/PaLM-colossalai</a></p><p><strong>About Colossal-AI</strong></p><p>Colossal-AI is a deep learning system that makes it easy to write performant parallel code for your large AI models. It is based on PyTorch and provides an entire ecosystem to scale up your training to the next level.</p><p>GitHub page: <a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a></p><p>Colossal-AI’s claim to fame is that it supports a variety of distribution methods, including tensor parallelism, pipeline parallelism, zero-redundancy data parallelism, etc. We have provided, as examples, efficient implementations of BERT, GPT and ViT that support hybrid parallelism (combining multiple distributed acceleration methods at once). Evolving with the machine learning community, we now attempt to support distributed training of the PaLM model.</p><p><strong>About Pathways</strong></p><p>Pathways is the next stage in the evolution of deep learning architectures. It is a large multi-modal system that is capable of generalizing across as many as millions of tasks. Unfortunately, Pathways is built for Google’s needs: it runs on highly specialized hardware (TPUs) and over Google’s own specialized network. This means that, whilst efficient within the Google ecosystem, additional work is needed to scale it to hardware (GPUs) that is accessible to the broader community. The model size of PaLM is also huge, at 540 billion parameters, which strongly limits the efficiency a naive data parallel method can achieve. The question that begs to be answered is: can we use our common GPU clusters to implement a performant PaLM model? Well, the engineers of HPC-AI Tech have proposed an answer with Colossal-AI.</p><p>But before we delve into the implementation details, let’s talk about the intricacies of what makes the PaLM model great.</p><p><strong>PaLM Model Explained</strong></p><p>Compared to the regular Transformer architecture, PaLM innovates primarily in the following ways:</p><ol><li>It uses the SwiGLU activation function in place of the ReLU, GeLU or even Swish activations commonly used in Transformer architectures (see the sketch below).</li></ol>
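<p>For readers unfamiliar with SwiGLU, here is a minimal sketch of the activation as described in the PaLM paper; the module below is our own illustration (names and shapes are ours, not PaLM’s code): SwiGLU(x) = Swish(xW) * (xV), a gated unit with two weight matrices.</p><pre>import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # SwiGLU(x) = Swish(x @ W) * (x @ V), where * is an elementwise product.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        # PaLM omits bias weights in its linear layers (see point 4 below).
        self.w = nn.Linear(dim, hidden_dim, bias=False)
        self.v = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return F.silu(self.w(x)) * self.v(x)  # silu == Swish with beta = 1</pre>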
<p>2. A parallelised Transformer layer that differs from the normal formulation</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AHaPDnGi5Y7TGgw0IKNbJw.jpeg" /></figure><p>The Transformer model has two main modules: the Attention module and the MLP module. In general, the attention layer precedes the MLP module. In PaLM, however, the attention and MLP layers are combined for an overall more compute-efficient architecture.</p><p>As shown in Fig. 1, the normal Transformer layer can be represented as follows</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/710/0*rI6NuJ9ow6Ek6EhI" /></figure><p>while, as detailed in Fig. 2, the Transformer layer of PaLM can be represented as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/729/0*y9wSuQFaOIdtjyzS" /></figure><p>According to the paper, fusing the first linear layer of the MLP with the first linear layer of Attention brings an improvement of about 15%! However, we found that even the second layers can be fused, granting an even greater improvement. The architecture of the model after fusion is represented in the following diagram.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/797/0*17_hlZMDXYhOZP9p" /><figcaption>Figure 2: Structure of PaLM Transformer layer</figcaption></figure><p>3. A Multi-Query attention mechanism</p><p>Unlike conventional multi-head attention, PaLM has only one head for both the keys and values, while the queries keep multiple heads as in normal multi-head attention. This reduces the number of parameters and thus increases the speed of both training and inference, whilst retaining great model accuracy. We provide both multi-query and multi-head mechanisms in our implementation.</p><p>4. Neither the linear nor the layernorm layers use bias weights, which researchers from Google claim is beneficial for training stability.</p><p>We first reproduced the PaLM model architecture on one GPU according to the description in the PaLM paper, referring to the following repo:<a href="https://github.com/lucidrains/PaLM-pytorch"> https://github.com/lucidrains/PaLM-pytorch</a></p><p><strong>ColossalAI’s Enhanced Parallelisation</strong></p><p>Colossal-AI helps users easily implement the PaLM model in a parallel manner. Let us describe some of the interesting features that let us write a parallelised, distributed PaLM with Colossal-AI.</p><p><strong>Tensor parallel</strong></p><p>One of the core tenets of Colossal-AI is ease of use. That’s why we built a system that is consistent with PyTorch’s interface and does not require deep learning practitioners to learn a completely new framework (yet again). For example, simply replacing torch.nn.Linear with colossalai.nn.Linear enables users to enjoy all of Colossal-AI’s SOTA HPC techniques (such as tensor parallelism), as sketched below.</p>
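<p>A minimal sketch of the drop-in swap (our illustration; the module itself is hypothetical, but the substitution is the one described above):</p><pre>import torch.nn as nn
import colossalai.nn as col_nn

class FeedForward(nn.Module):
    # Identical to a plain PyTorch module, except that the Linear layers come
    # from colossalai.nn, so they are sharded according to the parallel mode
    # chosen in the Colossal-AI config file.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            col_nn.Linear(dim, hidden_dim),  # was: nn.Linear(dim, hidden_dim)
            nn.GELU(),
            col_nn.Linear(hidden_dim, dim),  # was: nn.Linear(hidden_dim, dim)
        )

    def forward(self, x):
        return self.net(x)</pre>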
<p>More information can be found here: <a href="https://www.colossalai.org/docs/features/1D_tensor_parallel/">https://www.colossalai.org/docs/features/1D_tensor_parallel/</a></p><p>Tensor parallelism, afforded in an intuitive, easy-to-use manner by Colossal-AI, is important, particularly in the quest to implement an efficient Pathways model.</p><p>Due to the special structure of PaLM’s attention, Colossal-AI needs to deal with a problem involving the query, key, and value (Q, K, V). The existing tensor parallelism in Colossal-AI cuts the last dimension of the query, key and value (the first dimension may be cut depending on the parallel mode, but it does not affect the computation, so it is ignored below), and since the key and value are single-head, we need to perform additional communication to ensure correctness. Here we use B to denote batch size, S sequence length, H hidden size, N the number of attention heads, A the size of a single attention head, and P the degree of tensor parallelism, where H = NA.</p><p>In the non-parallel case, the size of our multi-head Q is (B, S, H) and the size of the single-head K and V is (B, S, A). By converting Q into (B, S, N, A), we can directly compute attention together with K and V. But in the parallel situation, Q is (B, S, H/P) and K and V are (B, S, A/P). We can reshape Q into (B, S, N/P, A) so that the head dimension of Q is cut across different GPUs. But this is still not computable, because the values of K and V are not sufficient to form a complete attention head; we therefore need an additional all-gather operation to form a complete head, i.e., (B, S, A/P) -&gt; (B, S, A). In this way, the normal attention computation can be performed.</p>
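<p>The following single-process sketch (our own toy shapes, not the Colossal-AI kernels) walks through exactly this shape juggling: Q is reshaped so that each rank holds N/P complete heads, while the single-head K slice must be gathered before attention scores can be formed.</p><pre>import torch

B, S, N, A, P = 2, 16, 8, 64, 4      # batch, seq len, heads, head size, TP degree
H = N * A

q_local = torch.randn(B, S, H // P)  # this rank's slice of multi-head Q
k_local = torch.randn(B, S, A // P)  # this rank's slice of the single-head K

# Q: (B, S, H/P) -&gt; (B, S, N/P, A); each rank owns N/P complete heads.
q_heads = q_local.view(B, S, N // P, A)

# K: (B, S, A/P) is only a fragment of the one shared head, so it must be
# all-gathered to (B, S, A). We fake the other ranks' slices here.
other_slices = [torch.randn(B, S, A // P) for _ in range(P - 1)]
k_full = torch.cat([k_local] + other_slices, dim=-1)

scores = torch.einsum('bsna,bta-&gt;bsnt', q_heads, k_full) / A ** 0.5
print(scores.shape)  # torch.Size([2, 16, 2, 16])</pre>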
<p><strong>ZeRO Parallelism</strong></p><p>Colossal-AI additionally provides a memory optimisation that removes the redundant storage of the data-parallel approach, dubbed ZeRO parallelism (initially proposed by Microsoft). We can combine ZeRO with the different forms of tensor parallelism mentioned above to enjoy further gains in efficiency.</p><p><strong>Heterogeneous Training</strong></p><p>To support large-scale AI model training on a single node, we implement a dynamic heterogeneous memory management mechanism. We optimize the placement of tensors on CPUs and GPUs so as to minimize the movement of large tensors, thereby leveraging heterogeneous memory efficiently.</p><p><strong>Training process</strong></p><p>The training process is as simple as it gets with Colossal-AI. A single configuration file specifies which parallel approaches we require. In the PaLM training process, we specify tensor parallelism as well as the ZeRO configuration.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pB1B3KJAb03BgS3b" /></figure><p>With this configuration file set up, we are good to go! We can use a simple API, colossalai.initialize, to build our training engine, which provides APIs that resemble PyTorch-style code. In this manner we can use Colossal-AI very easily for large-scale training without sacrificing performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7F6bFUmY9E4aL-p6" /></figure><p><strong>Performance Testing</strong></p><p>We tested all our results on a single server with 8 A100 40GB GPUs. Adjacent pairs of GPUs are interconnected at high speed via NVLink, while the four pairs communicate with each other over PCI-E.</p><p>We constructed a PaLM-structured network with 8 billion parameters and trained it using a hybrid parallel strategy (a mix of 1D, 2D and 2.5D tensor parallel approaches as well as the ZeRO optimization). Colossal-AI can also switch between training strategies inexpensively, by simply changing the comprehensible configuration file.</p><p>In the following figure, b denotes the batch size of each data-parallel process group; XXtpY denotes the tensor-parallel strategy, where XX is the 1D, 2D or 2.5D parallel scheme and Y is the degree of tensor parallelism; and zeroX denotes a ZeRO setting with parallel degree X. The data parallel degree times the model parallel degree equals the total number of GPUs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/750/0*Y6zaBckyy4ht_Lfd" /></figure><p>Through numerous experiments, we found that heterogeneous training is a necessity. All the aforementioned schemes enjoyed greater compute efficiency when implemented with heterogeneous training; without it, running a model with 8 billion parameters would not be possible at all.</p><p>Specifically, in the cases of 2, 4 and 8 GPUs, we find that a 1D tensor parallel degree of 2 works best. This is because, in our setup, the communication bandwidth between adjacent GPUs is high, and with this configuration most of the communication is placed between adjacent GPUs. Under different network configurations, 2D or even 2.5D tensor parallelism could be more apt. However, with Colossal-AI, all that needs to be changed is a simple configuration file, and training can be adapted to a new parallel configuration.</p><p><strong>Summary</strong></p><p>To conclude, we would like to reiterate that we have reproduced the architecture of PaLM. Unfortunately, due to the scarcity of computational resources, we could not fully reproduce the model at the hundred-billion-parameter scale. Moreover, since PaLM is not open-sourced, our implementation could deviate slightly from Google’s original.</p><p>If you have any questions, please feel free to raise an issue or a PR on GitHub and we will try our best to answer them as soon as possible!</p><p><strong>Funding</strong></p><p>HPC-AI Tech raised 4.7 million USD from top VC firms in just 3 months after the company was founded. For more information, please email <a href="mailto:contact@hpcaitech.com">contact@hpcaitech.com</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6362c1144e53" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Train 18-billion-parameter GPT models with a single GPU on your personal computer!]]></title>
            <link>https://medium.com/@hpcaitech/train-18-billion-parameter-gpt-models-with-a-single-gpu-on-your-personal-computer-8793d08332dc?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/8793d08332dc</guid>
            <category><![CDATA[parallel-computing]]></category>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Mon, 16 May 2022 23:51:50 GMT</pubDate>
            <atom:updated>2023-03-07T09:12:05.929Z</atom:updated>
            <content:encoded><![CDATA[<h3>Train 18-billion-parameter GPT models with a single GPU on your personal computer! Open source project Colossal-AI has added new features!</h3><p>When it comes to training large AI models, people think of thousands of GPUs and training costs that only a few tech giants can afford, while other AI users, such as researchers from startups or universities, can do nothing but watch the news about large models roll by.</p><p><strong>Now, a PC with only one GPU can train GPT with up to 18 billion parameters, and a laptop can train a model with more than one billion parameters. Compared with the existing mainstream solutions, the parameter capacity can be increased by more than ten times!</strong></p><p>Such a significant improvement comes from <a href="https://github.com/hpcaitech/ColossalAI"><strong>Colossal-AI</strong></a>, an efficient training system for general large AI models. Best of all, it is completely open source and requires only minimal modifications to let existing deep learning projects train much larger models on a single consumer-grade graphics card, allowing everyone to train large AI models at home! In particular, <strong>it makes downstream tasks and application deployments, such as fine-tuning and inference of large AI models, much easier!</strong></p><p>By providing a variety of popular, efficient parallelisms, <a href="https://github.com/hpcaitech/ColossalAI">Colossal-AI</a> also helps its users easily deploy existing projects to large-scale computing clusters.</p><p>Check out the project here: <a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a></p><h3><strong>Tech giants strive for large AI models</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y6N5OU3OZ3Gp5YlaIJ8C7A.png" /></figure><p>After Google proposed the BERT model with 300 million parameters in 2018, the parameter record for large models was broken many times in just a few years: GPT-3 with 175 billion parameters proposed by OpenAI, MT-NLG with 530 billion parameters introduced jointly by Microsoft and NVIDIA, and more.</p><p>Dense models have reached scales of hundreds of billions of parameters, while sparse Mixture of Experts (MoE) models, such as the Switch Transformer released by Google in 2021, have pushed the parameter count to the trillion level.</p><p>However, training such large models from scratch can be extremely expensive. It usually requires hundreds or even thousands of professional high-performance GPUs, such as the NVIDIA A100, working at the same time.
If we use a dedicated InfiniBand high-speed network to build a supercomputer cluster, the cost of training could even reach ten million dollars.</p><h3><strong>Use a single consumer-level GPU to train large AI models</strong></h3><p>Obviously, AI users such as university students and individual developers cannot afford such high costs to train large models; for them, the most accessible computing resources in the AI community are NVIDIA RTX GPUs.</p><p>In order to enhance AI productivity, allow large models to benefit more developers, and truly realize our vision of making large AI models “fast and cheap” to use, <strong>Colossal-AI requires only a few lines of code to achieve a ten-fold increase in model training capacity</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/942/1*EEyMJG-U9tEHvrSR1XuTRA.png" /></figure><p>On all types of hardware, Colossal-AI performs better than vanilla PyTorch and mainstream distributed solutions such as Microsoft’s DeepSpeed.</p><p>For GPT, the representative large model, Colossal-AI can train up to 1.5 billion parameters on a gaming laptop with an RTX 2060 6GB. On a PC with an RTX 3090 24GB, Colossal-AI can train GPT with 18 billion parameters. On high-performance graphics cards such as the Tesla V100, Colossal-AI brings significant improvements as well.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XiWXXfU_wcTvVemUvmRSig.png" /></figure><p>Colossal-AI has also successfully implemented Google’s recently published PaLM (Pathways Language Model), showing excellent performance improvements on various hardware; Microsoft DeepSpeed has not yet published an official PaLM implementation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ygsxhvL_JagpdmUvxnmtcA.png" /></figure><h3><strong>Key technology: Enhanced heterogeneous training</strong></h3><p>The biggest problem with using a single consumer-grade GPU to train a large AI model is that GPU memory capacity is extremely limited, which severely restricts the number of model parameters that can be accommodated. The ZeRO-offload method proposed by Microsoft DeepSpeed tries to split the model and utilize CPU memory, which has larger capacity and lower cost. Several modified versions of DeepSpeed for heterogeneous training now exist. But as shown on the left of the figure below, when the GPU memory cannot satisfy its corresponding model requirements, the system crashes even if there is still memory available on the CPU.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Pcaic_oHvRfp0_ObFHYevA.png" /></figure><p>Unlike these DeepSpeed derivatives, the Colossal-AI team built its core technologies, such as ZeRO, from scratch, solving problems in the current approach: DeepSpeed statically partitions model data between CPU and GPU memory and uses a fixed memory layout across different training configurations. Colossal-AI makes many improvements to use GPU and CPU memory more efficiently.
After all, CPU memory is much cheaper than high-performance graphics cards with large memory.</p><p>The Gemini mechanism designed by Colossal-AI <strong>efficiently manages and utilizes the heterogeneous memory of the GPU and CPU</strong>, so that tensors are dynamically distributed across the CPU-GPU storage space during training, allowing model training to break through the GPU memory barrier.</p><p>We take advantage of the iterative nature of deep learning training and divide training into a warmup stage and a non-warmup stage according to the iteration count. In the initial warmup stage, memory information is monitored; in the non-warmup stage, the collected information is used to move tensors efficiently, minimizing CPU-GPU data movement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uA5UCmg6h5YiXlUXCo9anA.png" /></figure><p>This sounds simple, but the implementation is tricky. It is hard to obtain the memory usage of non-model data, because the life cycle of non-model data is not managed by the user, and existing deep learning frameworks do not expose any interface for tracking non-model data. Secondly, non-framework overhead, such as the CUDA context, also needs to be taken into account.</p><p>Colossal-AI obtains CPU and GPU memory usage by sampling during the warmup stage. The usage of non-model data can then be obtained by comparing the maximum system memory usage and the model memory usage between two moments, where the model’s memory usage is known by querying the memory manager, as shown by the black solid line in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JZT1eWQUK4JB8NBF0g8fOw.png" /></figure><p>All model tensors are managed by the memory manager, and each tensor is marked with state information, including HOLD, COMPUTE, FREE, etc. Based on the dynamically queried memory usage, Colossal-AI continuously changes tensor states and adjusts tensor placement. The result is efficient use of GPU and CPU memory that maximizes model capacity and balances training speed under extremely limited hardware, which is of <strong>great significance for AI democratization and low-cost fine-tuning of downstream tasks for large models</strong>.</p>
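<p>As a rough mental model of this state machine (a toy sketch of our own, far simpler than Colossal-AI’s actual Gemini implementation), one can picture a manager that pins COMPUTE tensors on the GPU and evicts idle HOLD tensors to CPU memory when capacity runs short:</p><pre>from enum import Enum, auto

class TensorState(Enum):
    HOLD = auto()     # resident but idle; may be moved between CPU and GPU
    COMPUTE = auto()  # needed on the GPU by the current operator; pinned
    FREE = auto()     # no longer needed; its memory can be reused

class ToyMemoryManager:
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = {}  # tensor name -&gt; [size, state]
        self.cpu = {}

    def compute(self, name, size):
        # Bring a tensor to the GPU for computation, evicting idle HOLD
        # tensors to CPU memory until it fits.
        entry = self.cpu.pop(name, None) or self.gpu.pop(name, None) or [size, TensorState.HOLD]
        self.gpu[name] = entry
        used = lambda: sum(s for s, _ in self.gpu.values())
        while used() &gt; self.gpu_capacity:
            victim = next((n for n, (_, st) in self.gpu.items()
                           if st is TensorState.HOLD and n != name), None)
            if victim is None:
                break  # nothing evictable; a real system would need more room
            self.cpu[victim] = self.gpu.pop(victim)
        self.gpu[name][1] = TensorState.COMPUTE

    def release(self, name):
        # The operator is done: the tensor becomes movable again.
        self.gpu[name][1] = TensorState.HOLD</pre>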
<h3><strong>Furthermore: convenient and efficient parallelization</strong></h3><p>Parallel and distributed technologies are important methods to further accelerate model training. To train the world’s largest and most advanced AI models in the shortest time, efficient distributed parallelization remains a necessity. Addressing the pain points of existing solutions, such as limited parallel dimensions, low efficiency, poor versatility, difficult deployment, and lack of maintenance, Colossal-AI uses technologies such as efficient multi-dimensional parallelism and heterogeneous parallelism to <strong>allow users to deploy large AI models efficiently and quickly with only a few modifications to their code</strong>.</p><p>For example, for a super-large AI model such as GPT-3, Colossal-AI <strong>needs only half the computing resources </strong>of the NVIDIA solution to start training; with the same computing resources, the speed can be further increased by 11%, which could <strong>reduce the training cost of GPT-3 by over a million dollars</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*MhslE3-o2uS6GRy-EZgVKw.png" /></figure><p>For AlphaFold, which is used for protein structure prediction, our team has introduced FastFold based on the Colossal-AI acceleration scheme. FastFold successfully surpassed the schemes proposed by Google and Columbia University, reducing AlphaFold’s training time from 11 days to 67 hours while lowering the total cost as well. In long-sequence inference, it achieves a 9.3x to 11.6x speedup.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*A9wLN0RYtiMnUHTzcFdefQ.png" /></figure><p>Colossal-AI also values building its open-source community, providing tutorials in English and Chinese and supporting the latest cutting-edge applications such as PaLM and AlphaFold. Colossal-AI will roll out new and innovative features regularly as well. We always welcome suggestions and discussions from the community, and we would be more than willing to help if you encounter any issues. You can raise an<a href="https://github.com/hpcaitech/ColossalAI/issues"> issue</a> or create a discussion topic in our <a href="https://github.com/hpcaitech/ColossalAI/discussions">forum</a>; your suggestions are highly appreciated. Recently, <a href="https://github.com/hpcaitech/ColossalAI">Colossal-AI</a> reached <strong>No. 1 among the top trending projects on GitHub</strong>, against a backdrop of many projects with as many as 10K stars.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*Epbz0Y-nXNRxr9OrdjkyOw.png" /></figure><h3><strong>About us</strong></h3><p>The core members of the <a href="https://hpcaitech.com">HPC-AI Tech</a> team come from UC Berkeley, Stanford University, Tsinghua University, Peking University, the National University of Singapore, Nanyang Technological University and other well-known universities, and have work experience at tech giants such as Google Brain, IBM, Intel, Microsoft, and NVIDIA. The company has received seed funding from many top VC institutions, such as Sinovation Ventures and ZhenFund.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/772/1*UIx3nqth_D9mvrWpXGNLmQ.png" /></figure><p><em>Prof. Yang You, Founder of HPC-AI Tech</em></p><p><em>Ph.D., University of California, Berkeley</em></p><p><em>IPDPS/ICPP Best Paper Author</em></p><p><em>ACM/IEEE CS George Michael Memorial HPC Fellowship</em></p><p><em>Forbes 30 Under 30 (Asia 2021)</em></p><p><em>IEEE-CS Outstanding Newcomer Award in Supercomputing</em></p><p><em>UC Berkeley EECS Lotfi A. Zadeh Prize</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*xvgCkbeT5gz42gm7siof7Q.png" /></figure><p><em>Prof.
James Demmel, CSO of HPC-AI Tech</em></p><p><em>Distinguished Professor, University of California, Berkeley</em></p><p><em>ACM/IEEE Fellow</em></p><p><em>Member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences</em></p><h3><strong>Portal</strong></h3><p>Code:</p><p><a href="https://github.com/hpcaitech/ColossalAI">GitHub - hpcaitech/ColossalAI: Making large AI models cheaper, faster and more accessible</a></p><p>Reference:</p><p><a href="https://www.deepspeed.ai">DeepSpeed</a></p><p>Fang, Jiarui, et al. “Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management.” <em>IEEE Transactions on Parallel and Distributed Systems</em> 34.1 (2022): 304–315.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8793d08332dc" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[5 Must-Follow Features That Are Seeing Colossal-AI’s Success]]></title>
            <link>https://medium.com/@hpcaitech/5-must-follow-features-that-are-seeing-colossal-ais-success-2d5361e27e4b?source=rss-ec86827abf4------2</link>
            <guid isPermaLink="false">https://medium.com/p/2d5361e27e4b</guid>
            <dc:creator><![CDATA[HPC-AI Tech]]></dc:creator>
            <pubDate>Tue, 05 Apr 2022 12:59:36 GMT</pubDate>
            <atom:updated>2022-04-07T02:16:21.168Z</atom:updated>
            <content:encoded><![CDATA[<p>Colossal-AI, HPC-AI Tech’s flagship large-scale parallel AI training system, has become an overnight success. Recently, it was touted as one of the top trending projects on GitHub, against a backdrop of many projects with as many as 10K stars.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BheRKCvQSon1oB5z" /></figure><p>Colossal-AI is an easy-to-use deep learning system that enables users to maximize the efficiency of AI deployments whilst drastically reducing costs. If you would like to learn more about the project, do check out our GitHub repository: <a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/989/0*uI-01zh9dvJSG593" /></figure><p>In celebration of the first official release of Colossal-AI, we describe some of the awesome new features that we believe are responsible for its success:</p><ul><li>A brand new ZeRO module;</li><li>A beta version of the Profiler TensorBoard plugin (finer-grained monitoring than PyTorch for communication, memory, etc.);</li><li>An MoE feature and example;</li><li>Bug fixes and completed tutorials;</li><li>Compatibility with HuggingFace;</li></ul><p>We also revisit some of the core features that have made Colossal-AI a popular choice for training and deploying large AI models.</p><h3><strong>Professional Help with Large Model Training</strong></h3><p>Firstly, it’s important to note that Colossal-AI solves an <em>important</em> problem. In recent years, deep learning has seen a surge of large models dominating performance charts. AI models that operate at the frontier have grown ten thousand-fold in just a few years, far outpacing the growth of hardware technologies. Not only are these large AI models far beyond the capacity of a single GPU, they also often demand hundreds of GPU-years of compute.</p><p>It thus becomes imperative to improve the capacity of a single GPU through its efficient use to achieve high-performance training of large AI models. This is precisely what Colossal-AI does, <em>without burdening the programmer</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0TJwgyIMR8BFELenLabRzA.png" /></figure><p>Colossal-AI empowers developers to write performant code for AI deployments through a variety of techniques, such as multi-dimensional parallelism and better tools for the maintenance and deployment of models. It allows the programmer to quickly and efficiently deploy large AI models, and to train them in a distributed manner with only minimal code-level modifications. Colossal-AI’s claims to fame are its efficient multi-dimensional parallelism, GPU memory optimization, large-scale optimizer library and fine-grained monitoring.</p><p>Let’s dive into some of these key features to see what makes Colossal-AI really great!</p><h4><strong>[Key Features] Multi-dimensional Parallelism</strong></h4><p>Existing solutions apply limited modes of parallelism when training large-scale AI models, primarily data parallelism, one-dimensional tensor parallelism and pipeline parallelism. Colossal-AI, however, provides further modes of parallelism, including 2-, 2.5- and 3-dimensional tensor parallelism as well as sequence parallelism, and offers additional multidimensional hybrid parallel solutions on top of these (see the sketch of the 2D idea below).</p>
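<p>To give a flavour of what higher-dimensional tensor parallelism means, here is a toy, single-process sketch of the 2D idea (our illustration of block matrix multiplication on a device grid, not Colossal-AI’s actual kernels): each of the q x q devices stores only one block of each matrix, so per-device memory shrinks as the grid grows.</p><pre>import torch

q = 2                      # 2D grid dimension (q * q devices in a real run)
A = torch.randn(8, 8)
B = torch.randn(8, 8)

# Split both matrices into a q x q grid of blocks, one block per "device".
Ab = [list(row.chunk(q, dim=1)) for row in A.chunk(q, dim=0)]
Bb = [list(row.chunk(q, dim=1)) for row in B.chunk(q, dim=0)]

# Each device computes its C block as a sum of q block products; in a real
# system the blocks it lacks arrive via broadcasts along its grid row/column.
C = torch.cat([torch.cat([sum(Ab[i][k] @ Bb[k][j] for k in range(q))
                          for j in range(q)], dim=1)
               for i in range(q)], dim=0)

assert torch.allclose(C, A @ B, atol=1e-5)</pre>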
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Yawca62I3W3J22fG" /><figcaption>14x batch size and 5x training speed when the tensor parallel degree for ViT is 64</figcaption></figure><p>Amongst these new parallel techniques, high-dimensional tensor parallelism can greatly reduce memory consumption, improve communication efficiency, and make more efficient use of computing resources.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HGPsY1UFMg-r2abQ" /><figcaption>Sequence parallelism helps BERT improve training speed by 2x, or handle 1.5x the sequence length</figcaption></figure><p>Colossal-AI even engineers a new form of parallelism: sequence parallelism, which can be applied to large images, video, long text, long-term medical monitoring and other such data. It enables machines to process long pieces of sequential data that were previously impossible to handle.</p><h4><strong>[New Features] GPU Memory Optimization</strong></h4><p>Next up are Colossal-AI’s GPU memory optimization innovations. It combines multiple GPU memory optimization technologies, including multi-dimensional parallelism, ZeRO redundant memory elimination, CPU offloading, gradient checkpointing, and Automatic Mixed Precision (AMP), to help users avoid memory bottlenecks and reduce training resource requirements.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bB8TF7VYRXWiP0kO" /><figcaption>GPT-2 uses Colossal-AI to train a model 24 times the size, or 3 times faster with the same hardware</figcaption></figure><h4><strong>[Product Improvement &amp; New Features] Ease of Use</strong></h4><p>Finally, Colossal-AI prizes simplicity and ease of use. It is designed to be compatible with PyTorch, allowing existing projects to work with minimal modifications. In addition, the system is easily extensible, making it easy to add new features as needed whilst maintaining performance.</p><p>It adds a new <strong>fine-grained monitoring</strong> tool that allows developers to monitor the network, communication, and memory states within an iteration. Compared to existing frameworks, which can only record the training process at the granularity of whole iterations, Colossal-AI makes it easy for developers to accurately analyze and debug deep learning code.</p><p>Lastly, Colossal-AI provides a <strong>large-scale optimizer library</strong> including efficient optimizers like LAMB and LARS, which extend the training batch size to 65536 for the first time. It is also compatible with all of PyTorch’s own optimizers. It is now easier than ever to use large-batch optimization methods to train large AI models with Colossal-AI.</p>
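<p>For readers curious why optimizers like LARS make such large batches trainable, the core idea (sketched below from the published algorithm; this is our single-tensor toy, not Colossal-AI’s optimizer code) is a layer-wise “trust ratio” that rescales each layer’s update by the ratio of its weight norm to its gradient norm:</p><pre>import torch

def lars_step(w, grad, lr=0.01, weight_decay=1e-4, eps=1e-8):
    # One LARS update for a single layer's weight tensor (momentum omitted).
    g = grad + weight_decay * w
    trust_ratio = w.norm() / (g.norm() + eps)  # layer-wise local LR scaling
    return w - lr * trust_ratio * g

w = torch.randn(1024, 1024)
g = torch.randn(1024, 1024)  # stand-in for a large-batch gradient
w = lars_step(w, g)</pre>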
<h4><strong>[Benefits] Solutions for Industry</strong></h4><p>This is all great in theory, but what about in practice? Colossal-AI has proven its capabilities through applications to hard problems across a variety of industries, such as autonomous driving, cloud computing, retail, medicine and chip production. It has also established cooperation with top open-source AI organizations such as Hugging Face.</p><ul><li><strong>Helping Drug Research and Development: FastFold</strong></li></ul><p>One monumental application is in the domain of protein folding. Recently, AlphaFold was selected by Science and Nature as one of the top 10 scientific breakthroughs of 2021 for its powerful AI ability to predict protein structure. Nonetheless, it suffers from lengthy training times and high costs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ru7gcu_JJ7bo64_X" /></figure><p>We applied Colossal-AI to develop FastFold, an accelerated AI model for predicting protein structures. It brings Colossal-AI’s novel GPU optimization and large-model training techniques to AlphaFold training and inference. FastFold successfully outperforms the solutions from Google and Columbia University, reducing AlphaFold training time <strong>from 11 days to 67 hours at a lower total cost, whilst achieving a 9.3~11.6x speedup</strong> in long-sequence inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*W97sNt93r-s66_r0" /><figcaption>Comparison of inference performance for long sequences</figcaption></figure><ul><li><strong>GPT-3 Training with Half of the Machines</strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CK62i-FlDAky47O3" /></figure><p>For veteran state-of-the-art models, Colossal-AI yet again makes headway. For the notorious GPT-3 model, Colossal-AI requires only half the computational resources to start training compared with NVIDIA’s Megatron-LM. With the same computational resources, Colossal-AI delivers a speedup of roughly 11%, which can reduce the cost of training GPT-3 by over a million dollars!</p><h3>Your Thoughts?</h3><p>Colossal-AI will roll out new and innovative features regularly. We always welcome suggestions and discussions from the community, and we would be more than willing to help you if you encounter any issues. You can raise an<a href="https://github.com/hpcaitech/ColossalAI/issues"> issue</a> or create a discussion topic in the<a href="https://github.com/hpcaitech/ColossalAI/discussions"> forum</a>. Your voice is the perfect tutor for Colossal-AI.</p><h3>Portal</h3><p>Code: <a href="https://github.com/hpcaitech/ColossalAI">https://github.com/hpcaitech/ColossalAI</a></p><p>Paper: <a href="https://arxiv.org/abs/2110.14883">https://arxiv.org/abs/2110.14883</a></p><p>Tutorial: <a href="https://www.colossalai.org/">https://www.colossalai.org/</a></p><h3><strong>Join Us</strong></h3><p><a href="https://www.hpcaitech.com/">HPC-AI Tech</a> is a global team whose core members come from the University of California, Berkeley, Stanford University, Tsinghua University, Peking University, the National University of Singapore, Nanyang Technological University, and other top universities around the world. Currently, HPC-AI Tech is recruiting full-time/intern AI system/architecture/compiler/network/CUDA/SaaS/k8s core system developers, open-source program operators, and sales personnel.</p><p>HPC-AI Tech provides highly competitive compensation packages, and our staff can work remotely. You are also welcome to recommend outstanding talent to HPC-AI Tech; if they successfully join us, we will provide you with a referral fee of thousands of US dollars.</p><p>Resume delivery mailbox: <a href="mailto:hr@hpcaitech.com">hr@hpcaitech.com</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/494/0*tWybKqr1l4NV3K7N" /></figure><blockquote>Prof.
Yang You, Founder of HPC-AI Tech</blockquote><blockquote>Ph.D., University of California, Berkeley</blockquote><blockquote>IPDPS/ICPP Best Paper Author</blockquote><blockquote>ACM/IEEE CS George Michael Memorial HPC Fellowship</blockquote><blockquote>Forbes 30 Under 30 (Asia 2021)</blockquote><blockquote>IEEE-CS Outstanding Newcomer Award in Supercomputing</blockquote><blockquote>UC Berkeley EECS Lotfi A. Zadeh Prize</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/750/0*MMwWC-VHqpAZjuRb" /></figure><blockquote>Prof. James Demmel, CSO of HPC-AI Tech</blockquote><blockquote>Distinguished Professor, University of California, Berkeley</blockquote><blockquote>ACM/IEEE Fellow</blockquote><blockquote>Member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences</blockquote><h3><strong>Funding</strong></h3><p><strong>HPC-AI Tech raised 4.7 million USD from top VC firms in just 3 months after the company was founded. For more information, please email contact@hpcaitech.com</strong></p><h3>Reference</h3><p><a href="https://unsplash.com/photos/4YoINz4XvnQ?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditShareLink">https://unsplash.com/photos/4YoINz4XvnQ?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditShareLink</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2d5361e27e4b" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>