Container Service for Kubernetes:DeepSpeed distributed training

Last Updated: Oct 09, 2023

DeepSpeed is an open source deep learning optimization software suite that provides distributed training and model optimization to accelerate model training. This topic describes how to use Arena to quickly submit DeepSpeed distributed training jobs and how to use TensorBoard to visualize training jobs.

Prerequisites

Usage notes

In this example, DeepSpeed is used to train a masked language model. The sample code and dataset are packaged into the sample image registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed. If you do not want to use the sample image, you can download the source code and dataset from the GitHub URL and store the dataset on a shared NAS volume that is mounted by using a persistent volume (PV) and a persistent volume claim (PVC). In this example, a PVC named training-data is created to claim a shared NAS volume that stores the training results.
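
The PVC and the shared NAS volume are required by the --data parameter that is used later in this topic. The following is a minimal sketch, not the authoritative procedure, of how such a PVC might be statically provisioned. The PV name training-data-pv, the capacity, and the NAS mount target address are placeholders that you must replace, and the example assumes that the CSI plug-in of ACK is installed. For the authoritative steps, see Configure a shared NAS volume.

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv                              # placeholder PV name
spec:
  capacity:
    storage: 100Gi                                    # placeholder capacity
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com            # assumes the ACK CSI plug-in is installed
    volumeHandle: training-data-pv
    volumeAttributes:
      server: "xxxx.cn-beijing.nas.aliyuncs.com"      # placeholder NAS mount target
      path: "/"
  mountOptions:
    - nolock,tcp,noresvport
    - vers=3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data                                 # the PVC name referenced by --data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""                                # bind statically to the PV above
  volumeName: training-data-pv
  resources:
    requests:
      storage: 100Gi
EOF

After the PVC is bound, it appears in the output of the arena data list command.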

If you want to use a custom image, use one of the following methods:

  • Refer to Dockerfile and install OpenSSH in the base image. A minimal sketch is provided after this list.

    Note

    Training jobs can be accessed only over passwordless SSH. Therefore, you must keep the related Secrets confidential in the production environment.

  • Use the DeepSpeed base image provided by ACK.

    registry.cn-beijing.aliyuncs.com/acs/deepspeed:v072_base
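
For the first method, the following is a minimal, hypothetical sketch of building a custom image that installs OpenSSH. The base image name and the target image address are placeholders, and a Debian-based or Ubuntu-based base image is assumed. Refer to the linked Dockerfile for the authoritative version.

cat > Dockerfile <<'EOF'
FROM <your-deepspeed-base-image>
# Install the OpenSSH server and client so that the launcher can reach the workers
# over passwordless SSH (assumes a Debian-based or Ubuntu-based base image).
RUN apt-get update && \
    apt-get install -y --no-install-recommends openssh-server openssh-client && \
    rm -rf /var/lib/apt/lists/* && \
    mkdir -p /var/run/sshd
EOF
docker build -t <your-registry>/deepspeed:custom .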

Procedure

  1. Run the following command to query available GPU resources in the cluster:

    arena top node

    Expected output:

    NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    0/3 (0.0%)

    The output indicates that three GPU-accelerated nodes can be used to run training jobs.

  2. Run the arena submit deepspeedjob [--flags] command to submit a DeepSpeed job.

    • If you use the DeepSpeed base image provided by ACK, run the following command to submit a DeepSpeed job that uses one launcher node and three worker nodes:

      arena submit deepspeedjob \
          --name=deepspeed-helloworld \
          --gpus=1 \
          --workers=3 \
          --image=registry.cn-beijing.aliyuncs.com/acs/deepspeed:hello-deepspeed \
          --data=training-data:/data \
          --tensorboard \
          --logdir=/data/deepspeed_data \
          "deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py --checkpoint_dir /data/deepspeed_data"

      The following list describes the parameters in the preceding command. Whether each parameter is required and its default value are noted in parentheses.

      • --name (required, no default value): The name of the job. The name must be globally unique.

      • --gpus (optional, default value: 0): The number of GPUs that are used by the worker nodes where the training job runs.

      • --workers (optional, default value: 1): The number of worker nodes.

      • --image (required, no default value): The address of the image that is used to deploy the runtime.

      • --data (optional, no default value): Mounts the specified PVC to the runtime to allow the training job to access the data that is stored in the PVC. The value consists of two parts separated by a colon (:).

        • The left side of the colon specifies the name of the PVC. You can run the arena data list command to view the PVCs that are available in the cluster.

        • The right side of the colon specifies the path of the runtime to which the PVC is mounted. The training job retrieves data from the specified path.

        If no PVC is available, you can create one. For more information, see Configure a shared NAS volume.

      • --tensorboard (optional, no default value): Enables a TensorBoard service to visualize the training job. You must configure this parameter together with the --logdir parameter, which specifies the path from which TensorBoard reads event files. If you do not specify this parameter, TensorBoard is not used.

      • --logdir (optional, default value: /training_logs): The path from which TensorBoard reads event files. You must specify this parameter together with the --tensorboard parameter.

    • If you use a non-public Git repository, you can run the following command to submit a DeepSpeed job:

       arena submit deepspeedjob \
              ...
              --sync-mode=git \ # The code synchronization mode, which can be git or rsync. 
              --sync-source=<Address of the non-public Git repository>  \ # The address of the repository. You must specify both this parameter and the --sync-mode parameter. If you set --sync-mode to git, you can set this parameter to the address of any GitHub project. 
              --env=GIT_SYNC_USERNAME=yourname \
              --env=GIT_SYNC_PASSWORD=yourpwd \
              "deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py --checkpoint_dir /data/deepspeed_data"

      In the preceding code block, Arena synchronizes the source code in git-sync mode. You can customize the environment variables that are defined in the git-sync project by using the --env parameter, as shown in the following example.
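
      For example, the following hypothetical variation of the preceding command pins the synchronized code to a specific branch by additionally setting the GIT_SYNC_BRANCH variable that is defined in the git-sync project. The branch name is a placeholder, and the other parameters are unchanged.

       arena submit deepspeedjob \
              ...
              --sync-mode=git \
              --sync-source=<Address of the non-public Git repository> \
              --env=GIT_SYNC_USERNAME=yourname \
              --env=GIT_SYNC_PASSWORD=yourpwd \
              --env=GIT_SYNC_BRANCH=<Branch to check out> \
              "deepspeed /workspace/DeepSpeedExamples/HelloDeepSpeed/train_bert_ds.py --checkpoint_dir /data/deepspeed_data"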

    Expected output:

    trainingjob.kai.alibabacloud.com/deepspeed-helloworld created
    INFO[0007] The Job deepspeed-helloworld has been submitted successfully
    INFO[0007] You can run `arena get deepspeed-helloworld --type deepspeedjob` to check the job status
  3. Run the following command to query all training jobs submitted by using Arena:

    arena list

    Expected output:

    NAME                  STATUS   TRAINER         DURATION  GPU(Requested)  GPU(Allocated)  NODE
    deepspeed-helloworld  RUNNING  DEEPSPEEDJOB    3m        3               3               192.168.9.69
  4. Run the following command to query the GPU resources that are used by the jobs:

    arena top job

    Expected output:

    NAME                  STATUS   TRAINER         AGE  GPU(Requested)  GPU(Allocated)  NODE
    deepspeed-helloworld  RUNNING  DEEPSPEEDJOB    4m   3               3               192.168.9.69
    
    Total Allocated/Requested GPUs of Training Jobs: 3/3
  5. Run the following command to query the GPU resources that are used in the cluster:

    arena top node

    Expected output:

    NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           1
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           1
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           1
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    3/3 (100%)
  6. Run the following command to query the detailed information about a job and the address of the TensorBoard web service:

    arena get deepspeed-helloworld

    Expected output:

    Name:      deepspeed-helloworld
    Status:    RUNNING
    Namespace: default
    Priority:  N/A
    Trainer:   DEEPSPEEDJOB
    Duration:  6m
    
    Instances:
      NAME                           STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----                           ------   ---  --------  --------------  ----
      deepspeed-helloworld-launcher  Running  6m   true      0               cn-beijing.192.1xx.x.x
      deepspeed-helloworld-worker-0  Running  6m   false     1               cn-beijing.192.1xx.x.x
      deepspeed-helloworld-worker-1  Running  6m   false     1               cn-beijing.192.1xx.x.x
      deepspeed-helloworld-worker-2  Running  6m   false     1               cn-beijing.192.1xx.x.x
    
    Your tensorboard will be available on:
    http://192.1xx.x.xx:31870

    TensorBoard is used in this example. Therefore, the URL of TensorBoard is displayed in the last two lines of the job information. If TensorBoard is not used, the last two lines are not returned.

  7. Use a browser to view the training results in TensorBoard.

    1. Run the following command to map TensorBoard to the local port 9090:

      kubectl port-forward svc/deepspeed-helloworld-tensorboard 9090:6006
    2. Enter localhost:9090 into the address bar of the web browser to access TensorBoard. The following figure shows an example.

      (Figure: example of the TensorBoard page that displays the training results)
  8. Print the log of a job.

    • Run the following command to print the log of a job:

      arena logs deepspeed-helloworld

      Expected output:

      deepspeed-helloworld-worker-0: [2023-03-31 08:38:11,201] [INFO] [logging.py:68:log_dist] [Rank 0] step=7050, skipped=24, lr=[0.0001], mom=[(0.9, 0.999)]
      deepspeed-helloworld-worker-0: [2023-03-31 08:38:11,254] [INFO] [timer.py:198:stop] 0/7050, RunningAvgSamplesPerSec=142.69733028759384, CurrSamplesPerSec=136.08094834473613, MemAllocated=0.06GB, MaxMemAllocated=1.68GB
      deepspeed-helloworld-worker-0: 2023-03-31 08:38:11.255 | INFO     | __main__:log_dist:53 - [Rank 0] Loss: 6.7574
      deepspeed-helloworld-worker-0: [2023-03-31 08:38:13,103] [INFO] [logging.py:68:log_dist] [Rank 0] step=7060, skipped=24, lr=[0.0001], mom=[(0.9, 0.999)]
      deepspeed-helloworld-worker-0: [2023-03-31 08:38:13,134] [INFO] [timer.py:198:stop] 0/7060, RunningAvgSamplesPerSec=142.69095076844823, CurrSamplesPerSec=151.8552037291255, MemAllocated=0.06GB, MaxMemAllocated=1.68GB
      deepspeed-helloworld-worker-0: 2023-03-31 08:38:13.136 | INFO     | __main__:log_dist:53 - [Rank 0] Loss: 6.7570
      deepspeed-helloworld-worker-0: [2023-03-31 08:38:14,924] [INFO] [logging.py:68:log_dist] [Rank 0] step=7070, skipped=24, lr=[0.0001], mom=[(0.9, 0.999)]
      deepspeed-helloworld-worker-0: [2023-03-31 08:38:14,962] [INFO] [timer.py:198:stop] 0/7070, RunningAvgSamplesPerSec=142.69048436022115, CurrSamplesPerSec=152.91029839772997, MemAllocated=0.06GB, MaxMemAllocated=1.68GB
      deepspeed-helloworld-worker-0: 2023-03-31 08:38:14.963 | INFO     | __main__:log_dist:53 - [Rank 0] Loss: 6.7565
    • You can run the arena logs $job_name -f command to print the log of a job in real time, or run the arena logs $job_name -t N command to print the last N lines of the log. You can also run the arena logs --help command to view all the parameters that are related to printing logs.

      For example, you can run the following command to print five lines from the bottom of the log:

      arena logs deepspeed-helloworld -t 5

      Expected output:

      deepspeed-helloworld-worker-0: [2023-03-31 08:47:08,694] [INFO] [launch.py:318:main] Process 80 exits successfully.
      deepspeed-helloworld-worker-2: [2023-03-31 08:47:08,731] [INFO] [launch.py:318:main] Process 44 exits successfully.
      deepspeed-helloworld-worker-1: [2023-03-31 08:47:08,946] [INFO] [launch.py:318:main] Process 44 exits successfully.
      /opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
        warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)