Install the ack-cgpu component

Last Updated: Feb 18, 2024

Container Service for Kubernetes (ACK) provides GPU sharing, which lets multiple model prediction workloads share the same GPU. GPU sharing can also isolate GPU memory in NVIDIA kernel mode. This topic describes how to install the ack-cgpu component, which you can use to share GPUs, isolate GPU memory, and query GPU allocation information.

Prerequisites

Limits

  • You cannot set the CPU policy to static for nodes that support GPU sharing.

  • cGPU 1.5.0 and earlier are incompatible with NVIDIA drivers whose version numbers start with 5, such as 510.47.03.
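The driver limit above can be checked before installation. The following is a minimal sketch, not an official tool: the function name cgpu_driver_compatible is hypothetical, it encodes only the limit stated above, and it assumes that GNU sort (with the -V option) is available on the node. On a GPU-accelerated node, the driver version can be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```shell
# Hypothetical helper: succeeds (exit status 0) if the cGPU version and the
# NVIDIA driver version do not hit the limit described above.
# Usage: cgpu_driver_compatible <cgpu_version> <driver_version>
cgpu_driver_compatible() {
  cgpu="$1"
  driver="$2"
  # Lower of the given cGPU version and 1.5.0, by version ordering (GNU sort -V).
  lowest=$(printf '%s\n%s\n' "$cgpu" "1.5.0" | sort -V | head -n 1)
  case "$driver" in
    # Driver versions that start with 5 require a cGPU version later than 1.5.0.
    5*) [ "$lowest" = "1.5.0" ] && [ "$cgpu" != "1.5.0" ] ;;
    # No limit is stated for other driver versions.
    *)  return 0 ;;
  esac
}

# Example: cGPU 1.5.0 with driver 510.47.03 is reported as incompatible.
cgpu_driver_compatible 1.5.0 510.47.03 || echo "incompatible"
```

The check covers only the combination named in this topic. It does not confirm that a given cGPU release supports a given driver.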

The following table describes other limits.

Item          Supported version
Kubernetes    Kubernetes 1.12.6 and later
OS            CentOS 7.x, Alibaba Cloud Linux 2.x, Alibaba Cloud Linux 3.x, Ubuntu 16.04, and Ubuntu 18.04
GPU           Tesla P4, Tesla P100, Tesla T4, and Tesla V100

Step 1: Add labels to GPU-accelerated nodes

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Nodes in the left-side navigation pane.

  3. On the Nodes page, click Manage Labels and Taints in the upper-right corner.

  4. On the Labels tab of the Manage Labels and Taints page, select the nodes that you want to manage and click Add Label.

  5. In the Add dialog box, set Name and Value and then click OK.

    To enable cGPU, you must set Name to cgpu and Value to true.

Important

To disable cGPU, set Name to cgpu and Value to false. You cannot disable cGPU by deleting the cgpu label.
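This topic describes only the console procedure. If you have kubectl access to the cluster, the same label can presumably be set from the command line. The following sketch makes that assumption: the helper name cgpu_label_cmd and the node name my-gpu-node are hypothetical. The helper only prints the kubectl command so that you can review it before you run it.

```shell
# Hypothetical helper: prints the kubectl command that sets the cgpu label.
# Usage: cgpu_label_cmd <node_name> <true|false>
cgpu_label_cmd() {
  node="$1"    # node name as shown by `kubectl get nodes`
  value="$2"   # "true" enables cGPU, "false" disables it
  case "$value" in
    true|false) ;;
    *) echo "value must be true or false" >&2; return 1 ;;
  esac
  # --overwrite is required if the cgpu label already exists on the node.
  echo "kubectl label node $node cgpu=$value --overwrite"
}

# Example: print the command that enables cGPU on a node named my-gpu-node.
cgpu_label_cmd my-gpu-node true
```

After you run the printed command, kubectl get nodes -l cgpu=true lists the nodes that carry the label. To disable cGPU, pass false instead of removing the label, as described in the preceding note.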

Step 2: Install the ack-cgpu component on the labeled nodes

  1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

  2. On the Marketplace page, search for ack-cgpu and click the ack-cgpu card.

  3. On the ack-cgpu page, click Deploy. On the Basic Information wizard page, specify Cluster, Namespace, and Release Name, and then click Next.

  4. On the Parameters wizard page, set the parameters and click OK.

  5. Log on to a control plane node of the cluster and run the following command to check whether the ack-cgpu component is installed. For more information about how to log on to a control plane node, see Connect to an instance by using VNC.

    helm get manifest cgpu -n kube-system | kubectl get -f -

    If the following output is returned, the ack-cgpu component is installed:

    NAME                                    SECRETS   AGE
    serviceaccount/gpushare-device-plugin   1         39s
    serviceaccount/gpushare-schd-extender   1         39s
    
    NAME                                                           AGE
    clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin   39s
    clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender   39s
    
    NAME                                                                  AGE
    clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin   39s
    clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender   39s
    
    NAME                             TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
    service/gpushare-schd-extender   NodePort   10.6.13.125   <none>        12345:32766/TCP   39s
    
    NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR    AGE
    daemonset.apps/cgpu-installer              4         4         4       4            4           cgpu=true        39s
    daemonset.apps/device-plugin-evict-ds      4         4         4       4            4           cgpu=true        39s
    daemonset.apps/device-plugin-recover-ds    0         0         0       0            0           cgpu=false       39s
    daemonset.apps/gpushare-device-plugin-ds   4         4         4       4            4           cgpu=true        39s
    
    NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/gpushare-schd-extender   1/1     1            1           38s
    
    NAME                           COMPLETIONS   DURATION   AGE
    job.batch/gpushare-installer   3/1 of 3      3s         38s
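In the preceding output, the DaemonSets that select cgpu=true show the same value for DESIRED and READY. A quick way to check this automatically is sketched below. The helper name check_ds_ready is hypothetical, and the sketch assumes the column order shown in the preceding output (NAME, DESIRED, CURRENT, READY).

```shell
# Hypothetical helper: reads `kubectl get` output on stdin and reports every
# DaemonSet whose READY count (column 4) differs from DESIRED (column 2).
# Exits with status 1 if any such DaemonSet is found, 0 otherwise.
check_ds_ready() {
  awk '
    $1 ~ /^daemonset\.apps\// && $2 != $4 {
      print $1 " not ready: " $4 "/" $2
      bad = 1
    }
    END { exit bad }
  '
}

# Usage (requires access to the cluster):
#   helm get manifest cgpu -n kube-system | kubectl get -f - | check_ds_ready
```

A DaemonSet such as device-plugin-recover-ds, which shows 0 for both columns, passes the check because no node is labeled cgpu=false.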