Azure-powered hyperscale model training on DeepSpeed

19 December 2022

AzureML, DeepSpeed, GPT-3, Microsoft Azure, Python, PyTorch, ZeRO

In recent years, large-scale deep learning models trained on large data sets have driven many new products and features. Demand to train and optimize ever more complex models is growing at an unprecedented scale.

Microsoft already trains some of the largest models on Azure, including the 530-billion-parameter Megatron-Turing model (MT-NLG 530B) and GPT-3. Training such models has usually required users to carry out several manual, error-prone setup steps.

However, thanks to a newly optimized tech stack, DeepSpeed can now be used on Azure in straightforward training pipelines built on either the recommended AzureML recipes or bash scripts for environments based on Azure Virtual Machine Scale Sets (VMSS).

Why DeepSpeed

DeepSpeed is a deep learning optimization library for PyTorch. It is designed to reduce the compute and memory needed to train large distributed models, with better parallelism on existing hardware.

DeepSpeed is built for low-latency, high-throughput training. Its Zero Redundancy Optimizer (ZeRO) partitions optimizer states, gradients, and parameters across GPUs instead of replicating them on each one, which makes it possible to train models with one trillion or more parameters.
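To see why ZeRO matters at this scale, the sketch below estimates per-GPU memory for model states at each ZeRO stage. It is a rough illustration, not a DeepSpeed API: the function name is hypothetical, and it assumes the accounting commonly used for mixed-precision Adam (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of optimizer state per parameter). Activations, buffers, and fragmentation are ignored.

```python
def zero_model_state_gb(params, num_gpus, stage, k=12):
    """Approximate per-GPU memory (GB) for model states under mixed-precision Adam.

    Assumed bytes per parameter: 2 (fp16 weights) + 2 (fp16 gradients)
    + k (optimizer state; k=12 covers fp32 weights, momentum, and variance).
    """
    weights = 2.0 * params
    grads = 2.0 * params
    optim = float(k) * params
    if stage >= 1:  # stage 1 partitions optimizer state
        optim /= num_gpus
    if stage >= 2:  # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 also partitions the parameters themselves
        weights /= num_gpus
    return (weights + grads + optim) / 1e9


# Example: a 1-trillion-parameter model spread over 1024 GPUs.
for stage in range(4):
    gb = zero_model_state_gb(1e12, 1024, stage)
    print(f"ZeRO stage {stage}: {gb:,.1f} GB per GPU")
```

Under these assumptions, only stage 3 brings the model states of a trillion-parameter model down to a size that a single GPU can hold.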

Full stack optimization

Microsoft takes a full stack optimization approach to achieve excellent performance and scalability while avoiding unnecessary complexity. In other words, every component has been optimized, integrated, and thoroughly tested: the hardware, the OS, the VM image, the Docker image (containing optimized Python packages such as PyTorch, DeepSpeed, and ONNX Runtime), and the AzureML APIs that users interact with.

Thanks to this newly optimized stack, large models can be trained efficiently at scale with DeepSpeed on Azure.

To handle large-scale AI training, Azure Machine Learning (AzureML) provides sizable fleets of the latest GPUs connected by an InfiniBand network. The upgraded tech stack improved performance on three fronts: scaling to twice as many GPUs (1024 vs. 512), model sizes twice as large, and up to 1.8 times higher compute throughput per GPU (150 TFLOPs vs. 81 TFLOPs).
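The ratios quoted above can be checked with a few lines of arithmetic. The aggregate figure at the end is not from the source; it is an idealized upper bound that assumes every one of the 1024 GPUs sustains the quoted per-GPU throughput at the same time.

```python
# Figures quoted in the paragraph above.
gpus_new, gpus_old = 1024, 512
tflops_new, tflops_old = 150.0, 81.0

gpu_scale = gpus_new / gpus_old            # ratio of GPU counts
per_gpu_speedup = tflops_new / tflops_old  # ratio of per-GPU throughput

# Assumption: perfect scaling, so this is an upper bound rather than a measurement.
aggregate_pflops = gpus_new * tflops_new / 1000.0

print(f"GPU count ratio: {gpu_scale:.1f}x")
print(f"Per-GPU throughput ratio: {per_gpu_speedup:.2f}x")
print(f"Ideal aggregate throughput: {aggregate_pflops:.1f} PFLOPs")
```

The per-GPU ratio comes out at about 1.85, which is consistent with the "up to 1.8 times" claim.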

Using DeepSpeed ZeRO-3 with its novel CPU offloading capabilities, together with a high-performance Azure stack powered by InfiniBand interconnects and A100 GPUs, Microsoft engineers maintained an efficient per-GPU throughput (>157 TFLOPs) that held in a near-linear fashion as the model size grew from 175 billion to 2 trillion parameters.
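For readers who want to try ZeRO-3 with CPU offloading, a DeepSpeed configuration along the following lines is a reasonable starting point. This is a minimal sketch: the batch size, accumulation steps, and bf16 setting are placeholder assumptions, not the settings used in the runs described above.

```python
import json

# Placeholder values for illustration; tune them for your own model and cluster.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        # Stage 3 partitions parameters, gradients, and optimizer state.
        "stage": 3,
        # Keep optimizer state and parameters in pinned CPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Serialize to the JSON form that DeepSpeed reads from a config file.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

The resulting JSON can be saved to a file and handed to the training script, or the dictionary can be passed directly as the `config` argument of `deepspeed.initialize`.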

Key Takeaway

By effectively breaking the GPU memory barrier, Azure and DeepSpeed enable users to train trillion-parameter models quickly and efficiently at scale. On this evidence, the combination is a strong candidate for the future of hyperscale model training.
