Container Service for Kubernetes (ACK) provides graphics processing unit (GPU) sharing and scheduling capabilities. The NVIDIA kernel driver ensures that the GPU memory used by each container is efficiently allocated. This topic describes how to enable GPU sharing and scheduling by installing a resource isolation module and a GPU scheduling inspection tool on GPU nodes.

Prerequisites

An ACK cluster that contains GPU-accelerated nodes is created.

Limits

Item                Supported version
Kubernetes          1.16.06
Helm                3.0 and later
NVIDIA driver       418.87.01 and later
Docker              19.03.5
Operating system    CentOS 7.6, CentOS 7.7, Ubuntu 16.04, and Ubuntu 18.04
Graphics card       Tesla P4, Tesla P100, Tesla T4, and Tesla V100 (16 GB)
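
To verify that a node meets the NVIDIA driver requirement, you can run nvidia-smi on the node and query the installed driver version. This is a quick sanity check that uses standard nvidia-smi query options:

  nvidia-smi --query-gpu=driver_version --format=csv,noheader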

Step 1: Label the nodes on which GPU devices are installed

  1. Log on to the ACK console.
  2. In the left-side navigation pane, click Clusters.
  3. On the Clusters page, click the name of a cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane, click Nodes.
  5. On the Nodes page, click Manage Labels and Taints.
  6. On the Labels tab, select one or more nodes that you want to manage and click Add Label.
  7. In the Add dialog box, set Name and Value.
    Notice
    • You must set Name to cgpu and set Value to true.
    • Deleting the cgpu label does not disable GPU sharing. To disable GPU sharing, set Name to cgpu and Value to false.
  8. Click OK.
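
Alternatively, you can manage the label from the command line with kubectl. A minimal sketch, assuming <YOUR-NODE-NAME> is replaced with the name of an actual GPU node in the cluster:

  # Enable GPU sharing on a node
  kubectl label node <YOUR-NODE-NAME> cgpu=true
  # Disable GPU sharing later by overwriting the value (do not delete the label)
  kubectl label node <YOUR-NODE-NAME> cgpu=false --overwrite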

Step 2: Install ack-cgpu on the labeled nodes

  1. Log on to the ACK console.
  2. In the left-side navigation pane, choose Marketplace > App Catalog.
  3. On the App Catalog page, click ack-cgpu.
  4. In the Deploy section of the App Catalog - ack-cgpu page, select the target cluster and click Create.
    You do not need to set Namespace or Release Name. The default values of these parameters are used, and the default release name is cgpu.
    You can run the helm get manifest cgpu -n kube-system | kubectl get -f - command to check whether cGPU is installed. The following output indicates that cGPU is installed.
    helm get manifest cgpu -n kube-system | kubectl get -f -
    NAME                                    SECRETS   AGE
    serviceaccount/gpushare-device-plugin   1         39s
    serviceaccount/gpushare-schd-extender   1         39s
    
    NAME                                                           AGE
    clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin   39s
    clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender   39s
    
    NAME                                                                  AGE
    clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin   39s
    clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender   39s
    
    NAME                             TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
    service/gpushare-schd-extender   NodePort   10.6.13.125   <none>        12345:32766/TCP   39s
    
    NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR    AGE
    daemonset.apps/cgpu-installer              4         4         4       4            4           cgpu=true        39s
    daemonset.apps/device-plugin-evict-ds      4         4         4       4            4           cgpu=true        39s
    daemonset.apps/device-plugin-recover-ds    0         0         0       0            0           cgpu=false       39s
    daemonset.apps/gpushare-device-plugin-ds   4         4         4       4            4           cgpu=true        39s
    
    NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/gpushare-schd-extender   1/1     1            1           38s
    
    NAME                           COMPLETIONS   DURATION   AGE
    job.batch/gpushare-installer   3/1 of 3      3s         38s
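
After cGPU is installed, pods can request a slice of GPU memory through the aliyun.com/gpu-mem extended resource (measured in GiB) instead of an entire GPU. The following pod spec is a minimal sketch; the pod name, container image, and the 3 GiB request are placeholders for illustration:

  apiVersion: v1
  kind: Pod
  metadata:
    name: cgpu-test                   # hypothetical name
  spec:
    restartPolicy: Never
    containers:
    - name: cuda
      image: nvidia/cuda:10.1-base    # placeholder image; use your own GPU workload
      command: ["sleep", "3600"]
      resources:
        limits:
          aliyun.com/gpu-mem: 3       # request 3 GiB of GPU memory on a shared GPU

You can also confirm that a labeled node exposes the shared resource, for example by running kubectl get node <YOUR-NODE-NAME> -o jsonpath='{.status.allocatable.aliyun\.com/gpu-mem}'.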