All Products
Search
Document Center

Container Service for Kubernetes:Work with topology-aware GPU scheduling (TensorFlow edition)

Last Updated:Sep 21, 2023

Container Service for Kubernetes (ACK) supports topology-aware GPU scheduling based on the scheduling framework. This feature selects a combination of GPUs from GPU-accelerated nodes to achieve optimal GPU acceleration for training jobs. This topic describes how to use topology-aware GPU scheduling to achieve optimal GPU acceleration for TensorFlow distributed jobs.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster is created and the instance type of the cluster is set to Elastic GPU Service. For more information, see Create an ACK managed cluster.

  • Arena is installed.

  • The topology-aware GPU scheduling component is installed.

  • The versions of the system components meet the following requirements.

    Component

    Version

    Kubernetes

    1.18.8 and later

    Nvidia

    418.87.01 and later

    NVIDIA Collective Communications Library (NCCL)

    2.7+

    Operating system

    • CentOS 7.6

    • CentOS 7.7

    • Ubuntu 16.04

    • Ubuntu 18.04

    • Alibaba Cloud Linux 2

    • Alibaba Cloud Linux 3

    GPU

    V100

Usage notes

  • Topology-aware GPU scheduling is applicable to only Message Passing Interface (MPI) jobs that are trained by using a distributed framework.

  • The resources that are requested by pods must meet specific requirements before the pods can be created to submit and run jobs. Otherwise, the requests remain pending for resources.

Procedure

Configure nodes

Run the following command to set node labels and explicitly enable topology-aware GPU scheduling for nodes:

kubectl label node <Your Node Name> ack.node.gpu.schedule=topology
Note

After topology-aware GPU scheduling is enabled on nodes, regular GPU scheduling cannot be enabled. You can run the following command to change the label and enable regular GPU scheduling:

kubectl label node <Your Node Name> ack.node.gpu.schedule=default --overwrite

Submit jobs

Submit an MPI job and set --gputopology to true.

arena submit --gputopology=true --gang ***

Example 1: Train VGG16

Note

The cluster used in this example consists of two nodes. Each node provides eight V100 GPUs.

Use topology-aware GPU scheduling to train VGG16

  1. Run the following command to submit a job to the cluster:

    arena submit mpi \
      --name=tensorflow-topo-4-vgg16 \
      --gpus=1 \
      --workers=4 \
      --gang \
      --gputopology=true \
      --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
      "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"
  2. Run the following command to query the status of the job:

    arena get tensorflow-topo-4-vgg16 --type mpijob

    Expected output:

    Name:      tensorflow-topo-4-vgg16
    Status:    RUNNINGNamespace: default
    Priority:  N/A
    Trainer:   MPIJOB
    Duration:  2m
    
    Instances:
      NAME                                    STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----                                    ------   ---  --------  --------------  ----
      tensorflow-topo-4-vgg16-launcher-lmhjl  Running  2m   true      0               cn-shanghai.192.168.16.172
      tensorflow-topo-4-vgg16-worker-0        Running  2m   false     1               cn-shanghai.192.168.16.173
      tensorflow-topo-4-vgg16-worker-1        Running  2m   false     1               cn-shanghai.192.168.16.173
      tensorflow-topo-4-vgg16-worker-2        Running  2m   false     1               cn-shanghai.192.168.16.173
      tensorflow-topo-4-vgg16-worker-3        Running  2m   false     1               cn-shanghai.192.168.16.173
  3. Run the following command to print the job log:

    arena logs -f tensorflow-topo-4-vgg16

    Expected output:

    total images/sec: 991.92

Use regular GPU scheduling to train VGG16

  1. Run the following command to submit a job to the cluster:

    arena submit mpi \
      --name=tensorflow-4-vgg16 \
      --gpus=1 \
      --workers=4 \
      --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
      "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"
  2. Run the following command to query the status of the job:

    arena get tensorflow-4-vgg16 --type mpijob

    Expected output:

    Name:      tensorflow-4-vgg16
    Status:    RUNNING
    Namespace: default
    Priority:  N/A
    Trainer:   MPIJOB
    Duration:  9s
    
    Instances:
      NAME                               STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----                               ------   ---  --------  --------------  ----
      tensorflow-4-vgg16-launcher-xc28k  Running  9s   true      0               cn-shanghai.192.168.16.172
      tensorflow-4-vgg16-worker-0        Running  9s   false     1               cn-shanghai.192.168.16.172
      tensorflow-4-vgg16-worker-1        Running  9s   false     1               cn-shanghai.192.168.16.173
      tensorflow-4-vgg16-worker-2        Running  9s   false     1               cn-shanghai.192.168.16.172
      tensorflow-4-vgg16-worker-3        Running  9s   false     1               cn-shanghai.192.168.16.173
  3. Run the following command to print the job log:

    arena logs -f tensorflow-4-vgg16

    Expected output:

    total images/sec: 200.47

Example 2: Train ResNet50

Use topology-aware GPU scheduling to train ResNet50

  1. Run the following command to submit a job to the cluster:

    arena submit mpi \
      --name=tensorflow-topo-4-resnet50 \
      --gpus=1 \
      --workers=4 \
      --gang \
      --gputopology=true \
      --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
      "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64  --variable_update=horovod"
  2. Run the following command to query the status of the job:

    arena get tensorflow-topo-4-resnet50 --type mpijob

    Expected output:

    Name:      tensorflow-topo-4-resnet50
    Status:    RUNNING
    Namespace: default
    Priority:  N/A
    Trainer:   MPIJOB
    Duration:  8s
    
    Instances:
      NAME                                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----                                       ------   ---  --------  --------------  ----
      tensorflow-topo-4-resnet50-launcher-7ln8j  Running  8s   true      0               cn-shanghai.192.168.16.172
      tensorflow-topo-4-resnet50-worker-0        Running  8s   false     1               cn-shanghai.192.168.16.173
      tensorflow-topo-4-resnet50-worker-1        Running  8s   false     1               cn-shanghai.192.168.16.173
      tensorflow-topo-4-resnet50-worker-2        Running  8s   false     1               cn-shanghai.192.168.16.173
      tensorflow-topo-4-resnet50-worker-3        Running  8s   false     1               cn-shanghai.192.168.16.173
  3. Run the following command to print the job log:

    arena logs -f tensorflow-topo-4-resnet50

    Expected output:

    total images/sec: 1471.55

Use regular GPU scheduling to train ResNet50

  1. Run the following command to submit a job to the cluster:

    arena submit mpi \
      --name=tensorflow-4-resnet50 \
      --gpus=1 \
      --workers=4 \
      --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
      "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64  --variable_update=horovod"
  2. Run the following command to query the status of the job:

    arena get tensorflow-4-resnet50 --type mpijob

    Expected output:

    Name:      tensorflow-4-resnet50
    Status:    RUNNING
    Namespace: default
    Priority:  N/A
    Trainer:   MPIJOB
    Duration:  9s
    
    Instances:
      NAME                                  STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----                                  ------   ---  --------  --------------  ----
      tensorflow-4-resnet50-launcher-q24hv  Running  9s   true      0               cn-shanghai.192.168.16.172
      tensorflow-4-resnet50-worker-0        Running  9s   false     1               cn-shanghai.192.168.16.172
      tensorflow-4-resnet50-worker-1        Running  9s   false     1               cn-shanghai.192.168.16.173
      tensorflow-4-resnet50-worker-2        Running  9s   false     1               cn-shanghai.192.168.16.172
      tensorflow-4-resnet50-worker-3        Running  9s   false     1               cn-shanghai.192.168.16.173
  3. Run the following command to print the job log:

    arena logs -f tensorflow-4-resnet50

    Expected output:

    total images/sec: 745.38

Performance comparison

The following figure shows the performance difference between topology-aware GPU scheduling and regular GPU scheduling based on the preceding examples.GPU31

The figure shows that after topology-aware GPU scheduling is enabled, the TensorFlow distributed jobs are accelerated.

Important

The performance values in this topic are theoretical values. The performance of topology-aware GPU scheduling varies based on your model and cluster environment. The actual performance statistics shall prevail. You can repeat the preceding steps to evaluate your models.