How Does DeepSpeed + Kubernetes Easily Implement Large-Scale Distributed Training?

This article describes how to build and run DeepSpeed distributed training tasks based on the cloud-native AI suite of ACK.

By Lin Liu


With the widespread adoption of ChatGPT, numerous large-scale language models have emerged.

As models grow, a single GPU can no longer hold an entire model, which makes distributed training necessary. DeepSpeed is a key framework for distributed model training and is used by projects such as GPT-NeoX and BLOOM.

DeepSpeed is an open-source training library designed for deep learning. It offers various optimization strategies, including mixed-precision training, data parallelism, model parallelism, and pipeline parallelism. These strategies significantly accelerate the training of large-scale models. Moreover, DeepSpeed provides a high-performance distributed training framework that supports mainstream deep learning frameworks and can be utilized across different hardware and cloud platforms. By leveraging DeepSpeed, algorithm engineers can train large-scale deep learning models more rapidly, thereby enhancing model accuracy and efficiency.

Currently, an increasing number of enterprises are conducting large-scale distributed deep learning training on the cloud using containers and Kubernetes. By capitalizing on the advantages of elasticity, extensibility, automation, and high availability, they significantly improve the efficiency and reliability of distributed training while reducing management costs and complexity.

However, as model scale expands and enterprises strive for production efficiency, several challenges persist in building and running DeepSpeed distributed training tasks in Kubernetes. These challenges include low GPU resource utilization, poor distributed training extensibility, and difficulties in obtaining real-time logs and monitoring data.

Solution Description

Currently, the cloud-native AI suite of Alibaba Cloud Container Service for Kubernetes (ACK) supports DeepSpeed distributed training, providing you with an efficient and convenient solution.

After you prepare the training code and data, you can use Arena to deploy a DeepSpeed-based distributed training job in an ACK cluster. In addition, the status and results of training jobs can be easily viewed through the TensorBoard visualization tool, making DeepSpeed distributed training easier and more efficient.


Core Advantages

Building and running DeepSpeed distributed training tasks based on the cloud-native AI suite of ACK provides the following benefits:

1. Large-scale Heterogeneous Resource Management

ACK enables the management of large-scale and heterogeneous resources, allowing you to quickly build standard Kubernetes clusters based on different types of computing resources such as CPUs, GPUs, and FPGAs. This unified and flexible management approach allows for efficient scheduling and maintenance of heterogeneous resources. The cloud-native AI suite also supports various GPU scheduling policies (such as sharing + isolation, priority, and topology awareness) and provides GPU monitoring and alerting capabilities to optimize resource utilization.

2. Flexibility and Cost Optimization

With ACK elastic node pools and HPA/VPA, you can easily scale the number of GPU nodes and pods according to your needs. Additionally, AHPA elastic prediction based on GPU metrics can be implemented. The cloud-native AI suite supports advanced scheduling for hybrid elastic resources, including ECS and ECI, with features such as affinity/anti-affinity, pod topology spreading, and deployment set-aware scheduling. By supporting the termination of insufficient resources, automatic checkpoint storage, fault tolerance, and failover, this solution addresses the availability issues of distributed training based on preemptible instances. It reduces training costs without compromising the success rate of training jobs. ACK also provides cost monitoring and analysis capabilities to manage and optimize the cost of distributed training tasks.

3. Efficient Task Management and Scheduling

The cloud-native AI suite provides the command-line tool Arena, which simplifies and efficiently manages core production tasks in deep learning. It covers data management, model development, model training, model evaluation, and inference service deployment. Arena allows for quick submission of distributed training tasks, improves the performance of training tasks from submission to execution, and manages the lifecycle of tasks effectively. The cloud-native AI suite also offers scheduling policies for optimizing distributed scenarios, including GPU card allocation using the binpack algorithm to improve GPU utilization. It supports custom task priority management and tenant elastic resource quota control. By ensuring proper resource allocation for users and promoting resource sharing, overall resource utilization of the cluster is enhanced.

Quick Start

The following describes how to quickly build and run DeepSpeed distributed training based on the cloud-native AI suite of ACK.


Prerequisites

• A Kubernetes cluster that contains GPUs is created. For more information, see Create a Kubernetes Cluster that Includes GPUs [1].
• Cloud-native AI Suite (ack-arena 0.9.6 or later) is installed. For more information, see Deploy Cloud-native AI Suite [2].
• An Arena client whose version is 0.9.6 or later is installed. For more information, see Install Arena [3].
• A PVC used by Arena is configured for the cluster. For more information, see Configure a NAS shared storage [4] or Configure a CPFS shared storage [5].

Usage Notes

This example uses DeepSpeed to train a masked language model. For convenience, the sample code [6] and dataset are already included in the sample image. If you do not want to use the sample image, you can download the source code from the Git URL and store the dataset in the shared storage system (NAS-based PV and PVC). This example assumes that a PVC named training-data (shared storage) already exists and uses it to store the training results.

To customize a training image, use either of the following methods:

Method 1:

Install OpenSSH in the base image by referring to the Dockerfile [7].

Method 2:

Use the DeepSpeed base image provided by ACK:


Example Introduction

In this example, a Transformer-based BERT model is trained to fill in the missing words of a sentence based on the surrounding context. Take the following sentence as an example.

In the beautiful season of ____ the ____ shed their leaves.

Given the surrounding words 'In the beautiful season of' and 'shed their leaves', the model can predict that the blanks should be filled with 'Autumn' and 'trees'.
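To make the task concrete, the following minimal Python sketch shows how one masked training pair could be built from the sentence above. It is not the data pipeline of the sample code: the function name mask_tokens, the whitespace tokenization, and the "[MASK]" placeholder are assumptions chosen for illustration.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Return (masked_tokens, labels) for one masked-language-model example.

    labels maps each masked position to the original token that the
    model is expected to predict.
    """
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, labels


sentence = "In the beautiful season of Autumn the trees shed their leaves."
tokens = sentence.split()  # naive whitespace tokenization (assumption)

# Hide the tokens at index 5 ("Autumn") and index 7 ("trees").
masked, labels = mask_tokens(tokens, positions=[5, 7])
print(" ".join(masked))
print(labels)
```

During training, the model sees the masked sequence as input and is scored on how well it recovers the tokens stored in labels.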

This example improves the speed and efficiency of training by integrating the capabilities of DeepSpeed. It is optimized in the following aspects:

• Mixed precision training: DeepSpeed supports mixed precision training by using the fp16 data type. You can enable mixed precision training by setting the following configuration in ds_config.

"fp16": {
  "enabled": True
}

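Half precision has a narrow numeric range, which is why fp16 training is commonly paired with loss scaling. The sketch below uses only the Python standard library to illustrate the effect. The gradient value and the scale factor of 1024 are made-up numbers, and the helper to_fp16 is hypothetical, not a DeepSpeed API.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8     # illustrative gradient magnitude (assumption)
loss_scale = 1024.0  # illustrative scale factor (assumption)

# Cast directly to fp16, the gradient is too small and underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaling before the cast keeps the value representable. Dividing by the
# same factor afterwards, in higher precision, recovers roughly the
# original value.
scaled = to_fp16(tiny_grad * loss_scale)
recovered = scaled / loss_scale

print(unscaled, scaled, recovered)
```

The recovered value is not exact, because fp16 has only about three decimal digits of precision, but it is no longer lost entirely.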
• ZeRO data parallelism: The Zero Redundancy Optimizer (ZeRO) lets each GPU store only a portion of the model parameters, gradients, and optimizer states, which reduces GPU memory usage and makes larger models trainable. Three stages are currently supported: stage 1 shards the optimizer states, stage 2 also shards the gradients, and stage 3 further shards the model weights. You can enable stage 1 by setting the following configuration in ds_config.

"zero_optimization": {
  "stage": 1
}
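A rough way to see what each stage saves is to count bytes per parameter. For mixed-precision Adam, the usual accounting is 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for optimizer states (fp32 weights, momentum, and variance). The sketch below estimates per-GPU memory for these model states only. It ignores activations, temporary buffers, and fragmentation, and the function name and the 1.5-billion-parameter, 3-GPU example are assumptions.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Estimate per-GPU bytes for model states under mixed-precision Adam."""
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 shards the optimizer states
    if stage >= 2:
        grads /= num_gpus     # stage 2 also shards the gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also shards the weights
    return params + grads + optim


for stage in range(4):
    gb = zero_model_state_bytes(1_500_000_000, num_gpus=3, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Stage 0 here means plain data parallelism with no sharding. The savings from stage 1 grow with the number of GPUs, because the optimizer states are the largest of the three components.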

• ZeRO-Offload: A single GPU can train a larger model by using the compute and memory resources of both the GPU and the CPU, for example by keeping optimizer states and gradients in host memory. A model with 2 billion parameters cannot be trained on a single P40 GPU without offloading, but it can be trained with ZeRO-Offload. You can enable ZeRO-Offload by setting the following configuration in ds_config.

"zero_optimization": {
  "offload_optimizer": {
    "device": "cpu"
  }
}
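The 2-billion-parameter claim can be checked with back-of-the-envelope arithmetic. The sketch assumes mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for optimizer states), assumes that offloading moves at least the optimizer states to host memory, and ignores activations. The 24 GB figure is the P40's memory size; the constant and function names are hypothetical.

```python
NUM_PARAMS = 2_000_000_000
P40_MEMORY_GB = 24

BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADS = 2
BYTES_OPTIMIZER = 12  # fp32 weights + Adam momentum + variance


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


# Everything on the GPU: weights, gradients, and optimizer states.
without_offload = to_gb(
    NUM_PARAMS * (BYTES_FP16_WEIGHTS + BYTES_FP16_GRADS + BYTES_OPTIMIZER)
)

# Optimizer states moved to host memory: only weights and gradients remain.
with_offload = to_gb(NUM_PARAMS * (BYTES_FP16_WEIGHTS + BYTES_FP16_GRADS))

print(f"without offload: {without_offload:.0f} GB (GPU has {P40_MEMORY_GB} GB)")
print(f"with offload:    {with_offload:.0f} GB (GPU has {P40_MEMORY_GB} GB)")
```

Under these assumptions the model states alone exceed the GPU's memory without offloading, while the offloaded variant leaves room for activations.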

In this example, the complete configuration file ds_config of DeepSpeed is as follows.

ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4
        }
    },
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu"
        }
    }
}
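As a sketch of how such a configuration might be used, the snippet below wraps the settings in a small helper and prints them as JSON. The helper name build_ds_config and the batch size of 8 are assumptions. The commented-out deepspeed.initialize call shows one common way to pass the dictionary to DeepSpeed, and it is left as a comment because it needs a model and a GPU environment.

```python
import json


def build_ds_config(batch_size):
    """Return the DeepSpeed settings used in this example as a Python dict."""
    return {
        "train_micro_batch_size_per_gpu": batch_size,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 1,
            "offload_optimizer": {"device": "cpu"},
        },
    }


ds_config = build_ds_config(batch_size=8)  # 8 is an illustrative value

# Python's True is written as JSON true if the config is saved to a file.
print(json.dumps(ds_config, indent=2))

# In a training script, the dict would typically be handed to DeepSpeed:
#   model_engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
```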

DeepSpeed is an AllReduce-based distributed training framework. It records information about all workers in a hostfile. The launcher reads the hostfile to obtain the worker information and then uses PDSH (Parallel Distributed Shell) to start a training task on each worker.
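A DeepSpeed hostfile lists one worker per line in the form "hostname slots=N", where slots is the number of GPUs on that host. The following sketch renders and parses such a file. The host names worker-0 to worker-2 and both function names are hypothetical, and in ACK the operator generates the real file for you.

```python
def render_hostfile(worker_hosts, slots_per_worker):
    """Render one '<host> slots=<gpus>' line per worker."""
    lines = [f"{host} slots={slots_per_worker}" for host in worker_hosts]
    return "\n".join(lines) + "\n"


def parse_hostfile(text):
    """Return {host: slots}, skipping blank lines and '#' comments."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        host, slots = line.split()
        hosts[host] = int(slots.split("=", 1)[1])
    return hosts


workers = [f"worker-{i}" for i in range(3)]  # hypothetical host names
hostfile = render_hostfile(workers, slots_per_worker=1)
print(hostfile, end="")
print(parse_hostfile(hostfile))
```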

The operator component in ACK cloud-native AI Suite automatically generates the preceding configurations. After you prepare the model training code and data, you can submit a DeepSpeed distributed training task in Arena. As shown in the following figure, the operator creates a launcher pod and multiple worker pods, and starts a training task on the worker pods.



1. Submit a DeepSpeed job

Use the following sample code to submit a DeepSpeed training task that contains one launcher node and three worker nodes. This task will use three machines with one GPU card on each machine for training.

arena submit etjob \
    --name=deepspeed-helloworld \
    --gpus=1 \
    --workers=3 \
    --image=registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed \
    --data=training-data:/data \
    --tensorboard \
    --logdir=/data/deepspeed_data \
    "deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py --checkpoint_dir /data/deepspeed_data"
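If you prefer to submit the job from a script, the same command can be assembled as an argument list. This is only a sketch: the helper name build_arena_submit is hypothetical, it uses just the flags shown in the command above, and the final subprocess call is left as a comment because it requires an installed and configured Arena client.

```python
import shlex


def build_arena_submit(name, workers, gpus, image, data, logdir, command):
    """Assemble the argument list for the arena submit command shown above."""
    return [
        "arena", "submit", "etjob",
        f"--name={name}",
        f"--gpus={gpus}",
        f"--workers={workers}",
        f"--image={image}",
        f"--data={data}",
        "--tensorboard",
        f"--logdir={logdir}",
        command,  # the training command stays a single argument
    ]


argv = build_arena_submit(
    name="deepspeed-helloworld",
    workers=3,
    gpus=1,
    image="registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed",
    data="training-data:/data",
    logdir="/data/deepspeed_data",
    command=(
        "deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py "
        "--checkpoint_dir /data/deepspeed_data"
    ),
)
print(shlex.join(argv))

# To actually submit the job:
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing a list instead of a shell string avoids quoting problems with the embedded training command.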

The expected command output is as follows:

trainingjob.kai.alibabacloud.com/deepspeed-helloworld created
INFO[0007] The Job deepspeed-helloworld has been submitted successfully
INFO[0007] You can run `arena get deepspeed-helloworld --type etjob` to check the job status

2. Run the following command to obtain job details

arena get deepspeed-helloworld

The expected command output is as follows:

Name:      deepspeed-helloworld
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   ETJOB
Duration:  6m

  NAME                           STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                           ------   ---  --------  --------------  ----
  deepspeed-helloworld-launcher  Running  6m   true      0               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-0  Running  6m   false     1               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-1  Running  6m   false     1               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-2  Running  6m   false     1               cn-beijing.192.1xx.x.x

Your tensorboard will be available on:
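For automation, the pod table in this output can be checked with a few lines of Python. The sketch below parses text shaped like the sample output above. The function name parse_pod_table is hypothetical, and the column layout is taken from this article, so it may differ between Arena versions.

```python
def parse_pod_table(output):
    """Extract {pod_name: status} from the instance table of `arena get`."""
    pods = {}
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATUS"]:
            in_table = True          # header row found, data rows follow
            continue
        if not in_table:
            continue
        if not fields:
            break                    # a blank line ends the table
        if set(fields[0]) == {"-"}:
            continue                 # skip the dashed separator row
        pods[fields[0]] = fields[1]
    return pods


sample = """\
Name:      deepspeed-helloworld
Status:    RUNNING

  NAME                           STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                           ------   ---  --------  --------------  ----
  deepspeed-helloworld-launcher  Running  6m   true      0               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-0  Running  6m   false     1               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-1  Running  6m   false     1               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-2  Running  6m   false     1               cn-beijing.192.1xx.x.x

Your tensorboard will be available on:
"""

pods = parse_pod_table(sample)
all_running = all(status == "Running" for status in pods.values())
print(pods)
print("all pods running:", all_running)
```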

3. View TensorBoard in a browser

Run the following command to map TensorBoard in the cluster to the local 9090 port.

kubectl port-forward svc/deepspeed-helloworld-tensorboard 9090:6006

Visit localhost:9090 in a browser to view TensorBoard.


4. View the job log

Run the following command to obtain the job log:

arena logs deepspeed-helloworld

For more information about operating commands and parameters, see Cloud-native AI Suite Documentation [8].

Related Links

[1] Create a Kubernetes Cluster that Includes GPUs
[2] Deploy the Cloud-native AI Suite
[3] Install Arena
[4] Configure NAS Shared Storage
[5] Configure CPFS Shared Storage
[6] Sample Code
[7] Dockerfile
[8] Cloud-native AI Suite Documentation
