Container Service for Kubernetes (ACK) provides graphics processing unit (GPU) sharing and scheduling capabilities. The cGPU kernel module isolates the GPU memory used by each container. This topic describes how to implement GPU sharing and scheduling by installing the cGPU resource isolation module and a status query tool for GPU scheduling on GPU-accelerated nodes.

Prerequisites

Instructions

The following table lists the versions supported by cGPU.

Item                Supported versions
Kubernetes          1.16.6
Helm                3.0 and later
NVIDIA driver       418.87.01 and later
Docker              19.03.5
Operating system    CentOS 7.6, CentOS 7.7, Ubuntu 16.04, and Ubuntu 18.04
Graphics card       Tesla P4, Tesla P100, Tesla T4, and Tesla V100 (16 GB)
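
You can verify that a node meets these requirements before you proceed. The following commands are a sketch and assume that you have shell access to the GPU node.

    # Check the installed NVIDIA driver version and GPU model.
    nvidia-smi --query-gpu=driver_version,name --format=csv
    # Check the Docker engine version.
    docker version --format '{{.Server.Version}}'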

Step 1: Label the nodes on which GPUs are installed

  1. Log on to the ACK console.
  2. In the left-side navigation pane, choose Clusters > Nodes.
  3. On the Nodes page, select the target cluster and click Manage Labels in the upper-right corner.
  4. On the Manage Labels page, select target nodes and click Add Label.
  5. In the Add dialog box that appears, set Name and Value.
    Notice: You must set Name to cgpu and Value to true.
  6. Click OK.
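
If you prefer the command line, you can apply the same label with kubectl instead of using the console. This is a minimal sketch; <your-node-name> is a placeholder for an actual node name from the output of kubectl get nodes.

    # Label a GPU node so that the cGPU components are scheduled to it.
    kubectl label node <your-node-name> cgpu=true
    # Verify that the label has been applied.
    kubectl get nodes -l cgpu=true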

Step 2: Install cGPU for the labeled nodes

  1. Log on to the ACK console.
  2. In the left-side navigation pane, choose Marketplace > App Catalog.
  3. On the App Catalog page, click ack-cgpu.
  4. In the Deploy section on the App Catalog - ack-cgpu page, select the target cluster, and click Create.
    You do not need to set Namespace or Release Name. The default values of these parameters are used.
    You can run the helm get manifest cgpu -n kube-system | kubectl get -f - command to check whether cGPU is installed. If output similar to the following is displayed, cGPU is installed.
    # helm get manifest cgpu -n kube-system | kubectl get -f -
    NAME                                    SECRETS   AGE
    serviceaccount/gpushare-device-plugin   1         39s
    serviceaccount/gpushare-schd-extender   1         39s
    
    NAME                                                           AGE
    clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin   39s
    clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender   39s
    
    NAME                                                                  AGE
    clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin   39s
    clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender   39s
    
    NAME                             TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
    service/gpushare-schd-extender   NodePort   10.6.13.125   <none>        12345:32766/TCP   39s
    
    NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR    AGE
    daemonset.apps/cgpu-installer              4         4         4       4            4           cgpu=true        39s
    daemonset.apps/device-plugin-evict-ds      4         4         4       4            4           cgpu=true        39s
    daemonset.apps/device-plugin-recover-ds    0         0         0       0            0           cgpu=false       39s
    daemonset.apps/gpushare-device-plugin-ds   4         4         4       4            4           cgpu=true        39s
    
    NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/gpushare-schd-extender   1/1     1            1           38s
    
    NAME                           COMPLETIONS   DURATION   AGE
    job.batch/gpushare-installer   3/1 of 3      3s         38s
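
After cGPU is installed, containers share a GPU by requesting GPU memory through the aliyun.com/gpu-mem extended resource, which is measured in GiB. The following pod definition is a minimal sketch; the pod name, image, and requested amount are illustrative examples, not values from this topic.

    # gpu-share-sample.yaml -- apply with: kubectl apply -f gpu-share-sample.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-share-sample             # example name
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda:10.0-base     # example image
        command: ["sleep", "infinity"]
        resources:
          limits:
            # Request 3 GiB of GPU memory on a shared GPU.
            aliyun.com/gpu-mem: 3

If you have also installed the kubectl-inspect-cgpu plugin, the status query tool mentioned at the beginning of this topic, you can run kubectl inspect cgpu to view how GPU memory is allocated across nodes.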