Azure-powered hyperscale model training on DeepSpeed
In recent years, large-scale deep learning models trained on large data sets have made many new products and features possible. Demand to train and optimize ever more complex models is growing at an unprecedented scale.
Microsoft already trains some of the largest models on Azure, including the 530-billion-parameter Megatron-Turing model (MT-NLG 530B) and GPT-3. Until now, training such models has usually required users to work through several manual, error-prone setup steps.
Thanks to a newly optimized tech stack, DeepSpeed can now be used on Azure through straightforward training pipelines, built on either the recommended AzureML recipes or bash scripts for environments based on virtual machine scale sets (VMSS).
DeepSpeed is a deep learning optimization library for PyTorch. It is intended to reduce the compute and memory required to train large distributed models, with improved parallelism on existing hardware.
Designed for training with low latency and high throughput, DeepSpeed features the Zero Redundancy Optimizer (ZeRO), which is useful for training models with one trillion or more parameters.
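To see why partitioning matters at this scale, here is a rough back-of-the-envelope sketch. It assumes the usual accounting for mixed-precision Adam training, about 16 bytes of model state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), and it ignores activations and temporary buffers. The function name and the example GPU count are illustrative choices, not part of DeepSpeed.

```python
def model_state_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate model-state memory per GPU for a given ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, 12 bytes/param for optimizer states.
    Activations and temporary buffers are NOT included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:  # stage 1 partitions optimizer states across GPUs
        optim /= num_gpus
    if stage >= 2:  # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 also partitions the parameters themselves
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    one_trillion = 1e12
    gpus = 1024  # illustrative GPU count
    for stage in range(4):
        gb = model_state_bytes_per_gpu(one_trillion, gpus, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:,.1f} GB of model states per GPU")
```

Under these assumptions, a one-trillion-parameter model needs 16 TB of model states in total. Unpartitioned, that is far beyond any single GPU, while stage 3 across 1024 GPUs brings it down to roughly 15.6 GB per GPU.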
Full stack optimization
Microsoft takes a full stack optimization approach to achieve excellent performance and scalability while avoiding unnecessary complexity. In other words, every necessary component has been optimized, integrated, and thoroughly tested: the hardware, the OS, the VM image, the Docker image (containing optimized Python packages such as PyTorch, DeepSpeed, and ONNX Runtime), and the Azure ML APIs for user interaction.
Thanks to this newly optimized stack, large models can be trained efficiently at scale with DeepSpeed on Azure.
To handle large-scale AI training, Azure Machine Learning (AzureML) deploys sizable fleets of the newest GPUs connected by an InfiniBand network. The upgraded tech stack delivers better results on three fronts: it scales to twice as many GPUs (1024 vs. 512), supports models twice as large, and achieves up to 1.8 times higher compute throughput per GPU (150 TFLOPs vs. 81 TFLOPs).
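The arithmetic behind these figures can be checked with a short sketch. The per-GPU numbers and GPU counts come from the text above. The aggregate figures are an extrapolation that assumes the quoted per-GPU throughput holds at the maximum GPU count, which the text does not state.

```python
# Figures quoted in the text; the dictionary layout is just for illustration.
OLD_STACK = {"gpus": 512, "tflops_per_gpu": 81.0}
NEW_STACK = {"gpus": 1024, "tflops_per_gpu": 150.0}


def aggregate_pflops(stack: dict) -> float:
    """Total throughput in PFLOPs, ASSUMING every GPU sustains the quoted rate."""
    return stack["gpus"] * stack["tflops_per_gpu"] / 1000.0


per_gpu_speedup = NEW_STACK["tflops_per_gpu"] / OLD_STACK["tflops_per_gpu"]
gpu_scale = NEW_STACK["gpus"] / OLD_STACK["gpus"]
aggregate_speedup = aggregate_pflops(NEW_STACK) / aggregate_pflops(OLD_STACK)

if __name__ == "__main__":
    print(f"Per-GPU speedup:       {per_gpu_speedup:.2f}x")
    print(f"GPU count scale-up:    {gpu_scale:.0f}x")
    print(f"Aggregate (old stack): {aggregate_pflops(OLD_STACK):.1f} PFLOPs")
    print(f"Aggregate (new stack): {aggregate_pflops(NEW_STACK):.1f} PFLOPs")
    print(f"Aggregate speedup:     {aggregate_speedup:.2f}x (extrapolated)")
```

The per-GPU ratio is about 1.85, consistent with the "up to 1.8 times" figure. Combined with twice as many GPUs, the extrapolated aggregate gain would be roughly 3.7 times.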
Using DeepSpeed ZeRO-3 with its novel CPU offloading capabilities, together with a high-performance Azure stack powered by InfiniBand interconnects and A100 GPUs, Microsoft engineers maintained efficient per-GPU throughput (>157 TFLOPs) with near-linear scaling as the model size grew from 175 billion to 2 trillion parameters.
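For readers who want a feel for what such a setup looks like, here is a minimal sketch of a DeepSpeed configuration that enables ZeRO stage 3 with CPU offloading of parameters and optimizer states. The batch-size values and the helper name are placeholders chosen for illustration. This is not the configuration used for the results above, which the article does not give.

```python
import json


def make_zero3_offload_config(micro_batch_size: int = 1,
                              grad_accum_steps: int = 1) -> dict:
    """Build a DeepSpeed config dict with ZeRO-3 and CPU offload enabled.

    The numeric defaults are placeholders, not tuned or recommended values.
    """
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        # bfloat16 is supported on A100-class GPUs; fp16 is an alternative.
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,  # partition optimizer states, gradients, and parameters
            # Keep optimizer states in CPU memory instead of GPU memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            # Keep partitioned parameters in CPU memory when not in use.
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
    }


if __name__ == "__main__":
    config = make_zero3_offload_config()
    # Write to a file that can be passed to a DeepSpeed training script.
    print(json.dumps(config, indent=2))
```

A dictionary like this can be passed to `deepspeed.initialize` through its `config` argument, or saved as JSON and referenced from the launch script. The right batch sizes and offload choices depend on the model and hardware.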
By effectively breaking the GPU memory barrier, Azure and DeepSpeed enable users to train trillion-parameter models quickly and efficiently at scale. On this evidence, the combination is a strong candidate for the future of hyperscale model training.