How to use Arena to submit and view distributed training jobs - Container Service for Kubernetes

DeepSpeed is an open source deep learning optimization software suite that provides distributed training and model optimization to accelerate model training. This topic describes how to use Arena to quickly submit DeepSpeed distributed training jobs and how to use TensorBoard to visualize training jobs.

Prerequisites

A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK cluster with GPU-accelerated nodes.
The cloud-native AI suite is installed, and the version of the ack-arena component (the CLI for machine learning) is 0.9.10 or later. For more information, see Deploy the cloud-native AI suite.
The Arena client is installed and the Arena version is 0.9.10 or later. For more information, see Configure the Arena client.
Persistent volume claims (PVCs) are created in the cluster. For more information, see Configure a shared NAS volume.

Usage notes

In this example, DeepSpeed is used to train a masked language model. The sample code and dataset are already included in the sample image registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed. If you do not want to use the sample image, you can download the source code and dataset from the GitHub URL and save the dataset to a shared NAS volume that is provisioned by using a persistent volume (PV) and a PVC. In this example, a PVC named training-data claims a shared NAS volume that stores the training results.

If you want to use a custom image, use one of the following methods:

Refer to Dockerfile and install OpenSSH in the base image.
Note
Training jobs can be accessed only through passwordless SSH. Therefore, you must keep the Secrets confidential in production environments.

Use the DeepSpeed base image provided by ACK.

registry.cn-beijing.aliyuncs.com/acs/deepspeed:v072_base

Procedure

Run the following command to query available GPU resources in the cluster:

arena top node

Expected output:

NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0.0%)

The output indicates that three GPU-accelerated nodes can be used to run training jobs.
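The free GPU count can also be derived from the command output instead of being read by eye. The following is a minimal sketch, not part of Arena: the helper name free_gpus is hypothetical, and it assumes that node rows have six whitespace-separated fields with a STATUS of Ready, as in the sample output above.

```shell
# Sum the GPU(Total) and GPU(Allocated) columns of `arena top node` output
# read from stdin, and print the number of unallocated GPUs.
# Assumption: node rows have six whitespace-separated fields and the fourth
# field (STATUS) is "Ready", as in the sample output above.
free_gpus() {
    awk 'NF == 6 && $4 == "Ready" { total += $5; used += $6 }
         END { print total - used }'
}

# Usage (requires a configured Arena client):
#   arena top node | free_gpus
```

Header, separator, and summary lines are skipped because they do not match the row pattern. If your Arena version prints additional columns, adjust the field numbers accordingly.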

Run the arena submit deepspeedjob [--flag] command to submit a DeepSpeed job.

If you use the DeepSpeed base image provided by ACK, run the following command to submit a DeepSpeed job that uses one launcher node and three worker nodes.

arena submit deepspeedjob \
    --name=deepspeed-helloworld \
    --gpus=1 \
    --workers=3 \
    --image=registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed \
    --data=training-data:/data \
    --tensorboard \
    --logdir=/data/deepspeed_data \
    "deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py --checkpoint_dir /data/deepspeed_data"

Parameter	Required	Description	Default value
--name	Yes	The name of the job, which is globally unique.	None
--gpus	No	The number of GPUs that are used by each worker node of the training job.	0
--workers	No	The number of worker nodes.	1
--image	Yes	The address of the image that is used to deploy the runtime.	None
--data	No	Allow the training job to access the data stored in the PVC by mounting the PVC to the runtime. The value consists of two parts separated by a colon (:). Specify the name of the PVC on the left side of the colon. Run the `arena data list` command to view the available PVCs in the cluster. Specify the path of the runtime to which the PVC will be mounted on the right side of the colon. The training job retrieves data from the specified path. If no PVC is available, you can create one. For more information, see Configure a shared NAS volume.	None
--tensorboard	No	Enable a TensorBoard service to visualize the training job. You must configure both this parameter and the --logdir parameter, which specifies the path from which TensorBoard reads event files. If you do not specify this parameter, TensorBoard is not used.	None
--logdir	No	The path from which TensorBoard reads event files. You must specify both this parameter and the --tensorboard parameter.	/training_logs
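To make the format of the --data value concrete, the following sketch splits a value at the first colon into the PVC name and the mount path. The helper name split_data_arg is hypothetical and only illustrates the format described in the table; Arena performs its own parsing internally.

```shell
# Split a --data value of the form <pvc-name>:<mount-path> at the first colon.
# Prints the PVC name on the first line and the mount path on the second.
split_data_arg() {
    printf '%s\n%s\n' "${1%%:*}" "${1#*:}"
}

# The value used in the example job above.
split_data_arg "training-data:/data"
```

For the value used in this topic, the first line is the PVC name training-data and the second line is the mount path /data, which is why --logdir and --checkpoint_dir both point to a directory under /data.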

If you use a non-public Git repository, you can run the following command to submit a DeepSpeed job:

# --sync-mode: the code synchronization mode, which can be git or rsync.
# --sync-source: the address of the repository. You must specify both this
#   parameter and the --sync-mode parameter. If you set --sync-mode to git,
#   you can set this parameter to the address of any GitHub project.
arena submit deepspeedjob \
    ...
    --sync-mode=git \
    --sync-source=<Address of the non-public Git repository> \
    --env=GIT_SYNC_USERNAME=yourname \
    --env=GIT_SYNC_PASSWORD=yourpwd \
    "deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py --checkpoint_dir /data/deepspeed_data"

In the preceding code block, the Arena client synchronizes the source code in git-sync mode. You can customize the environment variables that are defined in the git-sync project.

Expected output:

trainingjob.kai.alibabacloud.com/deepspeed-helloworld created
INFO[0007] The Job deepspeed-helloworld has been submitted successfully
INFO[0007] You can run `arena get deepspeed-helloworld --type deepspeedjob` to check the job status

Run the following command to query all training jobs submitted by using Arena:

arena list

Expected output:

NAME                  STATUS   TRAINER         DURATION  GPU(Requested)  GPU(Allocated)  NODE
deepspeed-helloworld  RUNNING  DEEPSPEEDJOB    3m        3               3               192.168.9.69
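If you want to check the status of a single job from a script, you can filter the list output by job name. The following is a minimal sketch: the helper name job_status is hypothetical, and it assumes that the job name is the first column and the status is the second column, as in the sample output above.

```shell
# Print the STATUS column for the named job from `arena list` output on stdin.
# Assumption: NAME is the first column and STATUS is the second column.
# Prints nothing if no row matches the job name.
job_status() {
    awk -v job="$1" '$1 == job { print $2 }'
}

# Usage (requires a configured Arena client):
#   arena list | job_status deepspeed-helloworld
```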

Run the following command to query the GPU resources that are used by the jobs:

arena top job

Expected output:

NAME                  STATUS   TRAINER         AGE  GPU(Requested)  GPU(Allocated)  NODE
deepspeed-helloworld  RUNNING  DEEPSPEEDJOB    4m   3               3               192.168.9.69

Total Allocated/Requested GPUs of Training Jobs: 3/3

Run the following command to query the GPU resources that are used in the cluster:

arena top node

Expected output:

NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           1
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           1
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           1
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
3/3 (100%)

Run the following command to query the detailed information about a job and the address of the TensorBoard web service:

arena get deepspeed-helloworld

Expected output:

Name:      deepspeed-helloworld
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   DEEPSPEEDJOB
Duration:  6m

Instances:
  NAME                           STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                           ------   ---  --------  --------------  ----
  deepspeed-helloworld-launcher  Running  6m   true      0               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-0  Running  6m   false     1               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-1  Running  6m   false     1               cn-beijing.192.1xx.x.x
  deepspeed-helloworld-worker-2  Running  6m   false     1               cn-beijing.192.1xx.x.x

Your tensorboard will be available on:
http://192.1xx.x.xx:31870

TensorBoard is enabled in this example. Therefore, the URL of TensorBoard appears in the last two lines of the job information. If TensorBoard is not enabled, these two lines are not returned.
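Because the URL always follows the line "Your tensorboard will be available on:", it can be extracted in a script. The following is a minimal sketch: the helper name tensorboard_url is hypothetical, and it assumes the output layout shown above.

```shell
# Print the line that follows "Your tensorboard will be available on:" in
# `arena get` output read from stdin. Prints nothing if TensorBoard is not
# enabled for the job, because the marker line is then absent.
tensorboard_url() {
    awk 'found { print; exit }
         /Your tensorboard will be available on:/ { found = 1 }'
}

# Usage (requires a configured Arena client):
#   arena get deepspeed-helloworld | tensorboard_url
```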

Use a browser to view the training results in TensorBoard.
1. Run the following command to map TensorBoard to the local port 9090:
```
kubectl port-forward svc/deepspeed-helloworld-tensorboard 9090:6006
```
2. Enter localhost:9090 in the address bar of your web browser to access TensorBoard. The following figure shows an example.

Print the log of a job.

Run the following command to print the log of a job:

arena logs deepspeed-helloworld

Expected output:

deepspeed-helloworld-worker-0: [2023-03-31 08:38:11,201] [INFO] [logging.py:68:log_dist] [Rank 0] step=7050, skipped=24, lr=[0.0001], mom=[(0.9, 0.999)]
deepspeed-helloworld-worker-0: [2023-03-31 08:38:11,254] [INFO] [timer.py:198:stop] 0/7050, RunningAvgSamplesPerSec=142.69733028759384, CurrSamplesPerSec=136.08094834473613, MemAllocated=0.06GB, MaxMemAllocated=1.68GB
deepspeed-helloworld-worker-0: 2023-03-31 08:38:11.255 | INFO     | __main__:log_dist:53 - [Rank 0] Loss: 6.7574
deepspeed-helloworld-worker-0: [2023-03-31 08:38:13,103] [INFO] [logging.py:68:log_dist] [Rank 0] step=7060, skipped=24, lr=[0.0001], mom=[(0.9, 0.999)]
deepspeed-helloworld-worker-0: [2023-03-31 08:38:13,134] [INFO] [timer.py:198:stop] 0/7060, RunningAvgSamplesPerSec=142.69095076844823, CurrSamplesPerSec=151.8552037291255, MemAllocated=0.06GB, MaxMemAllocated=1.68GB
deepspeed-helloworld-worker-0: 2023-03-31 08:38:13.136 | INFO     | __main__:log_dist:53 - [Rank 0] Loss: 6.7570
deepspeed-helloworld-worker-0: [2023-03-31 08:38:14,924] [INFO] [logging.py:68:log_dist] [Rank 0] step=7070, skipped=24, lr=[0.0001], mom=[(0.9, 0.999)]
deepspeed-helloworld-worker-0: [2023-03-31 08:38:14,962] [INFO] [timer.py:198:stop] 0/7070, RunningAvgSamplesPerSec=142.69048436022115, CurrSamplesPerSec=152.91029839772997, MemAllocated=0.06GB, MaxMemAllocated=1.68GB
deepspeed-helloworld-worker-0: 2023-03-31 08:38:14.963 | INFO     | __main__:log_dist:53 - [Rank 0] Loss: 6.7565
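To follow how the loss changes without reading every log line, you can keep only the loss values. The following is a minimal sketch: the helper name loss_values is hypothetical, and it assumes that loss lines contain the text "[Rank 0] Loss: " followed by a number, as in the sample output above.

```shell
# Print only the numeric loss values from job log lines read from stdin.
# Assumption: loss lines contain "[Rank 0] Loss: <number>", as in the sample
# log above. Lines without that text (step and timer lines) are dropped.
loss_values() {
    sed -n 's/.*\[Rank 0] Loss: \([0-9.]*\).*/\1/p'
}

# Usage (requires a configured Arena client):
#   arena logs deepspeed-helloworld | loss_values
```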

You can run the arena logs $job_name -f command to print the job log in real time and run the arena logs $job_name -t N command to print N lines from the bottom of the log. You can also run the arena logs --help command to query the parameters for printing logs.

For example, you can run the following command to print five lines from the bottom of the log:

arena logs deepspeed-helloworld -t 5

Expected output:

deepspeed-helloworld-worker-0: [2023-03-31 08:47:08,694] [INFO] [launch.py:318:main] Process 80 exits successfully.
deepspeed-helloworld-worker-2: [2023-03-31 08:47:08,731] [INFO] [launch.py:318:main] Process 44 exits successfully.
deepspeed-helloworld-worker-1: [2023-03-31 08:47:08,946] [INFO] [launch.py:318:main] Process 44 exits successfully.
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
