By Lin Liu
With the widespread adoption of ChatGPT, numerous large-scale language models have emerged.
Because model sizes keep increasing, a single GPU can no longer load an entire model, making distributed training inevitable. A key framework for distributed model training is DeepSpeed, which has been used to train models such as GPT-NeoX and BLOOM.
DeepSpeed is an open-source training library designed for deep learning. It offers various optimization strategies, including mixed-precision training, data parallelism, model parallelism, and pipeline parallelism. These strategies significantly accelerate the training of large-scale models. Moreover, DeepSpeed provides a high-performance distributed training framework that supports mainstream deep learning frameworks and can be utilized across different hardware and cloud platforms. By leveraging DeepSpeed, algorithm engineers can train large-scale deep learning models more rapidly, thereby enhancing model accuracy and efficiency.
Currently, an increasing number of enterprises are conducting large-scale distributed deep learning training on the cloud using containers and Kubernetes. By capitalizing on the advantages of elasticity, extensibility, automation, and high availability, they significantly improve the efficiency and reliability of distributed training while reducing management costs and complexity.
However, as model scale expands and enterprises strive for production efficiency, several challenges persist in building and running DeepSpeed distributed training tasks in Kubernetes. These challenges include low GPU resource utilization, poor distributed training extensibility, and difficulties in obtaining real-time logs and monitoring data.
Currently, the cloud-native AI suite of Alibaba Cloud Container Service for Kubernetes (ACK) supports DeepSpeed distributed training, providing you with an efficient and convenient solution.
After you prepare the training code and data, you can use Arena to deploy a DeepSpeed-based distributed training job in an ACK cluster. In addition, the status and results of training jobs can be easily viewed through the TensorBoard visualization tool, making DeepSpeed distributed training easier and more efficient.
Building and running DeepSpeed distributed training tasks based on the cloud-native AI suite of ACK provides the following benefits:
• ACK enables the management of large-scale and heterogeneous resources, allowing you to quickly build standard Kubernetes clusters based on different types of computing resources such as CPUs, GPUs, and FPGAs. This unified and flexible management approach allows for efficient scheduling and maintenance of heterogeneous resources. The cloud-native AI suite also supports various GPU scheduling policies (such as sharing + isolation, priority, and topology awareness) and provides GPU monitoring and alerting capabilities to optimize resource utilization.
• With ACK elastic node pools and HPA/VPA, you can easily scale the number of GPU nodes and pods according to your needs. Additionally, AHPA elastic prediction based on GPU metrics can be implemented. The cloud-native AI suite supports advanced scheduling for hybrid elastic resources, including ECS and ECI, with features such as affinity/anti-affinity, pod topology spreading, and deployment set-aware scheduling. By handling the reclamation of preemptible instances with automatic checkpoint saving, fault tolerance, and failover, this solution addresses the availability issues of distributed training based on preemptible instances. It reduces training costs without compromising the success rate of training jobs. ACK also provides cost monitoring and analysis capabilities to manage and optimize the cost of distributed training tasks.
• The cloud-native AI suite provides the command-line tool Arena, which simplifies and streamlines the management of the core production tasks in deep learning, covering data management, model development, model training, model evaluation, and inference service deployment. Arena allows for quick submission of distributed training tasks, improves the efficiency of training tasks from submission to execution, and manages the lifecycle of tasks effectively. The cloud-native AI suite also offers scheduling policies optimized for distributed scenarios, including GPU allocation using the binpack algorithm to improve GPU utilization. It supports custom task priority management and tenant elastic resource quota control. By ensuring proper resource allocation for users and promoting resource sharing, it improves the overall resource utilization of the cluster.
The following describes how to quickly build and run DeepSpeed distributed training based on the cloud-native AI suite of ACK.
• A Kubernetes cluster that contains GPUs is created. For more information, see Create a Kubernetes Cluster that Includes GPUs [1].
• Cloud-native AI Suite (ack-arena 0.9.6 or later) is installed. For more information, see Deploy Cloud-native AI Suite [2].
• An Arena client whose version is 0.9.6 or later is installed. For more information, see Install Arena [3].
• A PVC used by Arena is configured for the cluster. For more information, see Configure a NAS shared storage [4] or Configure a CPFS shared storage [5].
This example uses DeepSpeed to train a masked language model. For convenience, the sample code [6] and dataset are packaged into the sample image. If you do not want to use the sample image, you can download the source code from the Git repository and store the dataset in a shared storage system (a NAS-based PV and PVC). This example assumes that you have created a PVC named training-data (shared storage) to store the training results.
To customize a training image, perform the following steps:
• Install OpenSSH in the base image by referring to the Dockerfile [7].
• Use the DeepSpeed base image provided by ACK: registry.cn-beijing.aliyuncs.com/acs/deepspeed:v072_base
In this example, a Transformer-based BERT model is trained to fill in the blanks in sentences based on the context. Take the following sentence as an example.
In the beautiful season of ____ the ____ shed their leaves.
Based on the surrounding words 'In the beautiful season of' and 'shed their leaves', the model can predict that the blanks should be filled with 'Autumn' and 'trees'.
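The training script implements this masked-language-modeling objective: a fraction of the input tokens is hidden, and the model learns to predict the original tokens from the context. Purely as an illustration of the idea (word-level masking, not the tokenizer-based implementation used in train_bert_ds.py), the masking step can be sketched as follows:

import random

def mask_words(words, mask_prob=0.15, mask_token="[MASK]"):
    # Randomly hide a fraction of the words; the model is trained to predict
    # the original word at each masked position from the surrounding context.
    masked, targets = [], {}
    for i, word in enumerate(words):
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = word
        else:
            masked.append(word)
    return masked, targets

sentence = "In the beautiful season of Autumn the trees shed their leaves".split()
masked_sentence, targets = mask_words(sentence)
# Possible result: ['In', 'the', 'beautiful', 'season', 'of', '[MASK]',
#                   'the', '[MASK]', 'shed', 'their', 'leaves']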
This example improves the speed and efficiency of training by integrating the capabilities of DeepSpeed. It is optimized in the following aspects:
• Mixed precision training: DeepSpeed supports mixed precision training with the fp16 data type. You can enable it by setting the following configuration in ds_config:
"fp16": {
    "enabled": True
}
• ZeRO data parallelism: The Zero Redundancy Optimizer (ZeRO) lets each GPU store only a portion of the model parameters, gradients, and optimizer states, which reduces GPU memory usage and supports larger models. Three stages are currently supported: Stage 1 shards the optimizer states, Stage 2 also shards the gradients, and Stage 3 further shards the model weights. Enable Stage 1 by setting the following configuration in ds_config:
"zero_optimization": {
    "stage": 1
}
• ZeRO-Offload: By leveraging both the compute and memory resources of the GPU and the CPU, for example by keeping optimizer states and gradients in CPU memory, ZeRO-Offload allows a single GPU to support a larger model. For example, a model with 2 billion parameters cannot be trained on a single P40 GPU, but it can be trained with ZeRO-Offload. Enable ZeRO-Offload by setting the following configuration in ds_config:
"zero_optimization": {
    "offload_optimizer": {
        "device": "cpu"
    }
}
In this example, the complete DeepSpeed configuration ds_config is as follows.
ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4
        }
    },
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu"
        }
    }
}
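To show how such a configuration is typically consumed, the following minimal sketch passes the ds_config dict to deepspeed.initialize. The names model, batch_size, and train_dataset are placeholders for objects defined elsewhere (in this sample, inside train_bert_ds.py); the sketch illustrates the API usage rather than the sample's actual training loop.

import deepspeed

# Minimal sketch: assumes `model`, `batch_size`, and `train_dataset` are defined elsewhere.
# deepspeed.initialize wraps the model and optimizer according to ds_config and returns
# an engine that applies the fp16 and ZeRO settings internally.
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=train_dataset,
    config=ds_config,
)

for batch in train_loader:
    loss = model_engine(batch)    # forward pass; assumes the model's forward returns the loss
    model_engine.backward(loss)   # DeepSpeed-managed backward pass (loss scaling, ZeRO)
    model_engine.step()           # optimizer step and gradient zeroing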
DeepSpeed is an AllReduce-based distributed training framework. It records the information of all workers in a hostfile. The launcher reads the hostfile to obtain the worker information and starts the training process on each worker through PDSH.
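For reference, a hostfile lists one worker per line together with the number of GPU slots available on it. You do not write this file yourself in ACK; it is generated automatically, as described next. For the three single-GPU workers in this example, it would look similar to the following (hostnames are illustrative):

deepspeed-helloworld-worker-0 slots=1
deepspeed-helloworld-worker-1 slots=1
deepspeed-helloworld-worker-2 slots=1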
The operator component of the ACK cloud-native AI suite automatically generates these configurations. After you prepare the model training code and data, you can use Arena to submit a DeepSpeed distributed training job: the operator creates a launcher pod and multiple worker pods, and starts the training process on the worker pods.
Use the following sample command to submit a DeepSpeed training job that consists of one launcher pod and three worker pods. The job uses three nodes, each with one GPU, for training.
arena submit etjob \
--name=deepspeed-helloworld \
--gpus=1 \
--workers=3 \
--image=registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed \
--data=training-data:/data \
--tensorboard \
--logdir=/data/deepspeed_data \
"deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py --checkpoint_dir /data/deepspeed_data"
The expected command output is as follows:
trainingjob.kai.alibabacloud.com/deepspeed-helloworld created
INFO[0007] The Job deepspeed-helloworld has been submitted successfully
INFO[0007] You can run `arena get deepspeed-helloworld --type etjob` to check the job status
Run the following command to query the status of the training job:
arena get deepspeed-helloworld
The expected command output is as follows:
Name: deepspeed-helloworld
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: ETJOB
Duration: 6m
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
deepspeed-helloworld-launcher Running 6m true 0 cn-beijing.192.1xx.x.x
deepspeed-helloworld-worker-0 Running 6m false 1 cn-beijing.192.1xx.x.x
deepspeed-helloworld-worker-1 Running 6m false 1 cn-beijing.192.1xx.x.x
deepspeed-helloworld-worker-2 Running 6m false 1 cn-beijing.192.1xx.x.x
Your tensorboard will be available on:
http://192.1xx.x.xx:31870
Run the following command to forward the TensorBoard service in the cluster to local port 9090.
kubectl port-forward svc/deepspeed-helloworld-tensorboard 9090:6006
Visit http://localhost:9090 in a browser to view TensorBoard.
Execute the following command to obtain the job log:
arena logs deepspeed-helloworld
For more information about operating commands and parameters, see Cloud-native AI Suite Documentation [8].
[1] Create a Kubernetes Cluster that Includes GPUs
https://www.alibabacloud.com/help/en/doc-detail/171074.html?spm=a2c4g.171073.0.0.4c78f95a00Mb5P
[2] Deploy the Cloud-native AI Suite
https://www.alibabacloud.com/help/en/doc-detail/201997.htm#section-2al-4w3-dpd
[3] Install Arena
https://www.alibabacloud.com/help/en/doc-detail/212117.htm#task-1917487
[4] Configure NAS Shared Storage
https://www.alibabacloud.com/help/en/doc-detail/212177.htm#task-1917776
[5] Configure CPFS Shared Storage
https://www.alibabacloud.com/help/en/doc-detail/212178.htm#task-1917776
[6] Sample Code
https://github.com/microsoft/DeepSpeedExamples/tree/master/training/HelloDeepSpeed
[7] Dockerfile
https://github.com/kubeflow/arena/blob/master/samples/deepspeed/Dockerfile
[8] Cloud-native AI Suite Documentation
https://www.alibabacloud.com/help/en/doc-detail/2249322.html?spm=a2c4g.201994.0.0.23257371FAcYuc