Container Service for Kubernetes (ACK) allows you to install the managed Prometheus plug-in, which you can use to monitor GPU resources. The cGPU solution allows you to schedule multiple applications to one GPU and isolate the GPU memory and computing power that are allocated to each application. This topic describes how to use the managed Prometheus plug-in to monitor the GPU memory usage of a cluster and how to use cGPU to isolate GPU memory.

Scenarios

This topic applies to dedicated and professional Kubernetes clusters that have cGPU enabled.

Background information

The development of AI is fueled by high computing power, large amounts of data, and optimized algorithms. NVIDIA GPUs provide the heterogeneous computing capabilities that underpin high-performance deep learning. However, GPUs are expensive: if each application uses a dedicated GPU in model inference scenarios, computing resources may be wasted. Sharing a GPU among applications improves resource utilization, but you must consider how to achieve the highest query rate at the lowest cost and how to fulfill the service level agreement (SLA) of each application.

Use the managed Prometheus plug-in to monitor dedicated GPUs

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, click Prometheus Monitoring.
  3. On the Prometheus Monitoring page, select the region where the cluster is deployed and click Install in the Actions column.
  4. In the Confirmation message, click OK.
    The installation takes about two minutes. After the Prometheus plug-in is installed, it appears in the Installed Dashboards column.
  5. You can deploy the following sample application by using a CLI. For more information, see Manage applications by using commands.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app-3g-v1
      labels:
        app: app-3g-v1
    spec:
      replicas: 1
      serviceName: "app-3g-v1"
      podManagementPolicy: "Parallel"
      selector: # define how the StatefulSet finds the pods it manages
        matchLabels:
          app: app-3g-v1
      updateStrategy:
        type: RollingUpdate
      template: # define the pod specifications
        metadata:
          labels:
            app: app-3g-v1
        spec:
          containers:
          - name: app-3g-v1
            image: registry.cn-shanghai.aliyuncs.com/tensorflow-samples/cuda-malloc:3G
            resources:
              limits:
                nvidia.com/gpu: 1
    After the application is deployed, run the following command to query the status of the application. The output indicates that the pod that runs the application is named app-3g-v1-0.
    kubectl get pod

    Expected output:

    NAME          READY   STATUS    RESTARTS   AGE
    app-3g-v1-0   1/1     Running   1          2m56s
  6. Find and click the cluster where the application is deployed. On the Dashboards page, click GPU APP in the Name column.
    The following figure shows that the application uses only about 20% of the GPU memory, which means that the remaining 80% is wasted. The total GPU memory is about 16 GB, but the memory usage of the application stabilizes at about 3.4 GB. If you allocate one GPU to each application, a large amount of GPU resources is wasted. To improve GPU utilization, you can use cGPU to share one GPU among multiple applications. GPU memory usage
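As a rough sketch of the arithmetic behind this observation (the 16 GB total, the 3.4 GB usage, the 15 GiB of schedulable memory, and the 4 GB slice size are all figures from this topic; the script itself is illustrative and not part of any tooling):

```python
# Rough sketch of the utilization arithmetic described above.
# All figures are taken from this topic and are approximate.

TOTAL_GPU_MEM_GIB = 16.0   # total memory of the GPU shown in the dashboard
APP_USAGE_GIB = 3.4        # steady-state memory usage of app-3g-v1

utilization = APP_USAGE_GIB / TOTAL_GPU_MEM_GIB
print(f"Utilization with a dedicated GPU: {utilization:.0%}")  # ~21%
print(f"Wasted GPU memory: {1 - utilization:.0%}")             # ~79%

# With cGPU, the node exposes 15 GiB of schedulable GPU memory and each
# pod later requests a 4 GiB slice (aliyun.com/gpu-mem: 4), so one GPU
# can host several such applications instead of one:
print(f"Pods per GPU under cGPU: {15 // 4}")  # 3
```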

Share one GPU among multiple containers

  1. Add labels to GPU-accelerated nodes.
    1. Log on to the ACK console.
    2. In the left-side navigation pane of the ACK console, click Clusters.
    3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster, or click Applications in the Actions column.
    4. In the left-side navigation pane of the details page, choose Nodes > Nodes.
    5. On the Nodes page, click Manage Labels and Taints in the upper-right corner of the page.
    6. On the Manage Labels and Taints page, select the nodes that you want to manage and click Add Label.
    7. In the Add dialog box, set Name to cgpu, set Value to true, and then click OK.
      Notice After the cgpu=true label is added to a worker node, the GPU resource nvidia.com/gpu is no longer exclusive to pods on that node. To disable cGPU for the node, set the value of cgpu to false. This restores nvidia.com/gpu as an exclusive resource for pods on the node.
  2. Install the cGPU component.
    1. Log on to the ACK console.
    2. In the left-side navigation pane of the ACK console, choose Marketplace > App Catalog.
    3. On the App Catalog page, search for ack-cgpu and click ack-cgpu in the search results.
    4. In the Deploy section on the right side of the page, select the cluster you created, select the namespace where you want to deploy ack-cgpu, and then click Create.
    5. Log on to a master node and run the following command to query GPU resources.

      For more information, see Connect to ACK clusters by using kubectl.

      kubectl inspect cgpu

      Expected output:

      NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
      cn-hangzhou.192.168.2.167  192.168.2.167  0/15                   0/15
      ----------------------------------------------------------------------
      Allocated/Total GPU Memory In Cluster:
      0/15 (0%)
      Note The output indicates that the schedulable GPU resource on the node is now GPU memory (measured in GiB) instead of whole GPUs.
  3. Deploy workloads that share GPU resources.
    1. Modify the YAML file that was used to deploy the sample application.
      • Change the number of replicas from 1 to 2 so that two pods run the application. Before cGPU is enabled, the GPU is exclusive to the single pod. After cGPU is enabled, the GPU is shared by the two pods.
      • Change the resource type from nvidia.com/gpu to aliyun.com/gpu-mem. The unit of GPU resources is changed to GB.
      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        name: app-3g-v1
        labels:
          app: app-3g-v1
      spec:
        replicas: 2
        serviceName: "app-3g-v1"
        podManagementPolicy: "Parallel"
        selector: # define how the StatefulSet finds the pods it manages
          matchLabels:
            app: app-3g-v1
        template: # define the pod specifications
          metadata:
            labels:
              app: app-3g-v1
          spec:
            containers:
            - name: app-3g-v1
              image: registry.cn-shanghai.aliyuncs.com/tensorflow-samples/cuda-malloc:3G
              resources:
                limits:
                  aliyun.com/gpu-mem: 4   # Each pod requests 4 GB of GPU memory. Two replicated pods are configured. Therefore, a total of 8 GB of GPU memory is requested by the application.
    2. Recreate the workload based on the modified configuration. To do this, delete the original StatefulSet and then apply the modified YAML file.
      After the pods are recreated, run the following command. The output indicates that the two pods are scheduled to the same GPU-accelerated node.
      kubectl inspect cgpu -d

      Expected output:

      NAME:       cn-hangzhou.192.168.2.167
      IPADDRESS:  192.168.2.167
      
      NAME         NAMESPACE  GPU0(Allocated)
      app-3g-v1-0  default    4
      app-3g-v1-1  default    4
      Allocated :  8 (53%)
      Total :      15
      --------------------------------------------------------
      
      Allocated/Total GPU Memory In Cluster:  8/15 (53%)
    3. Log on to the two containers one by one and query the GPU memory that each container can use.
      The output indicates that the GPU memory limit is 4,301 MiB, which means that each container can use at most 4,301 MiB of GPU memory.
      • Run the following command to log on to container app-3g-v1-0:
        kubectl exec -it app-3g-v1-0 -- nvidia-smi

        Expected output:

        Mon Apr 13 01:33:10 2020
        +-----------------------------------------------------------------------------+
        | NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
        |-------------------------------+----------------------+----------------------+
        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        |===============================+======================+======================|
        |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
        | N/A   37C    P0    57W / 300W |   3193MiB /  4301MiB |      0%      Default |
        +-------------------------------+----------------------+----------------------+
        
        +-----------------------------------------------------------------------------+
        | Processes:                                                       GPU Memory |
        |  GPU       PID   Type   Process name                             Usage      |
        |=============================================================================|
        +-----------------------------------------------------------------------------+
      • Run the following command to log on to container app-3g-v1-1:
        kubectl exec -it app-3g-v1-1 -- nvidia-smi

        Expected output:

        Mon Apr 13 01:36:07 2020
        +-----------------------------------------------------------------------------+
        | NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
        |-------------------------------+----------------------+----------------------+
        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        |===============================+======================+======================|
        |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
        | N/A   38C    P0    57W / 300W |   3193MiB /  4301MiB |      0%      Default |
        +-------------------------------+----------------------+----------------------+
        
        +-----------------------------------------------------------------------------+
        | Processes:                                                       GPU Memory |
        |  GPU       PID   Type   Process name                             Usage      |
        |=============================================================================|
        +-----------------------------------------------------------------------------+
    4. Verify that GPU memory is isolated among containers.
      On the GPU-accelerated node, the total GPU memory in use is 6,396 MiB, which is approximately the sum of the memory used by the two containers. This shows that cGPU has isolated GPU memory among the containers. If you log on to a container and request more GPU memory than its limit allows, a memory allocation error is reported.
      1. Run the following command to log on to container app-3g-v1-1:
        kubectl exec -it app-3g-v1-1 -- bash
      2. Run the following command to attempt to allocate 1,024 MiB of additional GPU memory:
        cuda_malloc -size=1024

        Expected output:

        gpu_cuda_malloc starting...
        Detected 1 CUDA Capable device(s)
        
        Device 0: "Tesla V100-SXM2-16GB"
          CUDA Driver Version / Runtime Version          10.1 / 10.1
          Total amount of global memory:                 4301 MBytes (4509925376 bytes)
        Try to malloc 1024 MBytes memory on GPU 0
        CUDA error at cgpu_cuda_malloc.cu:119 code=2(cudaErrorMemoryAllocation) "cudaMalloc( (void**)&dev_c, malloc_size)"
You can monitor the GPU usage of each application or node in the ARMS console.
  • GPU APP: You can view the amount and percentage of GPU memory used by each application. GPU App
  • GPU Node: You can view the memory usage of each GPU. GPU node
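The GPU memory accounting that `kubectl inspect cgpu -d` reports in the steps above can be reproduced with a short sketch. The 15 GiB allocatable figure and the 4 GB per-pod requests come from the outputs in this topic; the script is an illustration, not part of the cgpu tooling:

```python
# Sketch of the GPU memory accounting reported by `kubectl inspect cgpu -d`.
# Figures are taken from the outputs shown in this topic.

allocatable_gib = 15        # GPU memory exposed by the node
requests_gib = {
    "app-3g-v1-0": 4,       # aliyun.com/gpu-mem requested by each pod
    "app-3g-v1-1": 4,
}

allocated = sum(requests_gib.values())
assert allocated <= allocatable_gib, "the pods would not fit on this GPU"

percent = round(allocated / allocatable_gib * 100)
print(f"Allocated/Total GPU Memory In Cluster: {allocated}/{allocatable_gib} ({percent}%)")
# Prints: Allocated/Total GPU Memory In Cluster: 8/15 (53%)
```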

Use the managed Prometheus plug-in to monitor GPU sharing

If an application attempts to use more GPU memory than it requested, the GPU memory isolation module of cGPU prevents the other applications from being affected.

  1. Deploy a new application that uses the shared GPU.
    The application requests 4 GB of GPU memory, but it actually attempts to use 6 GB.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app-6g-v1
      labels:
        app: app-6g-v1
    spec:
      replicas: 1
      serviceName: "app-6g-v1"
      podManagementPolicy: "Parallel"
      selector: # define how the StatefulSet finds the pods it manages
        matchLabels:
          app: app-6g-v1
      template: # define the pod specifications
        metadata:
          labels:
            app: app-6g-v1
        spec:
          containers:
          - name: app-6g-v1
            image: registry.cn-shanghai.aliyuncs.com/tensorflow-samples/cuda-malloc:6G
            resources:
              limits:
                aliyun.com/gpu-mem: 4 # Each pod requests 4 GB of GPU memory. One replicated pod is configured for the application. Therefore, a total of 4 GB of GPU memory is requested by the application.
  2. Run the following command to query the status of the pods.
    The pod that runs the new application remains in the CrashLoopBackOff state. The two existing pods continue to run as normal.
    kubectl get pod

    Expected output:

    NAME          READY   STATUS             RESTARTS   AGE
    app-3g-v1-0   1/1     Running            0          7h35m
    app-3g-v1-1   1/1     Running            0          7h35m
    app-6g-v1-0   0/1     CrashLoopBackOff   5          3m15s
  3. Run the following command to check errors in the container log.
    The output indicates that a cudaErrorMemoryAllocation error has occurred.
    kubectl logs app-6g-v1-0

    Expected output:

    CUDA error at cgpu_cuda_malloc.cu:119 code=2(cudaErrorMemoryAllocation) "cudaMalloc( (void**)&dev_c, malloc_size)"
  4. Use the GPU APP dashboard provided by the managed Prometheus plug-in to view the status of containers.
    The following figure shows that the existing containers are not affected after the new application is deployed. Memory isolation details in the GPU APP dashboard
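The isolation behavior described in this section can be summarized with a small sketch. The cuda_malloc function below is a hypothetical model of the per-container cap that cGPU enforces, not the real CUDA API or the cuda_malloc tool used in the containers; the 4,301 MiB limit and the memory figures come from the outputs in this topic.

```python
# Hypothetical model of cGPU memory isolation: each container is capped
# at the GPU memory it requested, and allocations beyond the cap fail
# inside that container without affecting its neighbors.

LIMIT_MIB = 4301  # per-container cap reported by nvidia-smi in this topic

def cuda_malloc(used_mib: int, request_mib: int, limit_mib: int = LIMIT_MIB) -> bool:
    """Return True if the allocation fits under the container's cap."""
    return used_mib + request_mib <= limit_mib

# app-6g-v1 requests 4 GB of gpu-mem but tries to allocate about 6 GB
# (6,144 MiB), so its allocation fails and the pod crash-loops:
print(cuda_malloc(used_mib=0, request_mib=6144))   # False
# The two app-3g-v1 pods stay within their own caps and keep running:
print(cuda_malloc(used_mib=3193, request_mib=0))   # True
```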