Container Service for Kubernetes (ACK) provides GPU sharing based on cGPU. cGPU allows multiple containers, such as model prediction workloads, to share a single GPU, while its kernel-level driver enforces GPU memory isolation among the containers. This topic describes how to install the GPU isolation module and an inspection tool in a dedicated Kubernetes cluster that contains GPU-accelerated nodes. This enables GPU sharing and memory isolation.
- Only dedicated Kubernetes clusters that contain GPU-accelerated nodes support the ack-cgpu component. Managed Kubernetes clusters that contain GPU-accelerated nodes do not support the ack-cgpu component.
- If you want to install ack-cgpu in professional Kubernetes clusters, see Install and use ack-ai-installer and the GPU scheduling inspection tool.
- The CPU policy of the nodes for which you want to enable GPU sharing is not set to static.
- A dedicated Kubernetes cluster that contains GPU-accelerated nodes is created. For more information, see Create a dedicated Kubernetes cluster with GPU-accelerated nodes.
- A kubectl client is connected to the created cluster. For more information, see Connect to ACK clusters by using kubectl.
|Item|Supported versions|
|---|---|
|Kubernetes|1.12.6 and later. Only dedicated Kubernetes clusters are supported.|
|Helm|3.0 and later|
|NVIDIA driver|418.87.01 and later|
|Operating system|CentOS 7.x, Alibaba Cloud Linux 2.x, Ubuntu 16.04, and Ubuntu 18.04|
|GPU|Tesla P4, Tesla P100, Tesla T4, and Tesla V100|
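You can verify the NVIDIA driver requirement before installing ack-cgpu. The following is a minimal offline sketch of the version comparison; the `DRIVER_VERSION` value is a hypothetical sample, and on a real GPU node you would obtain it with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`:

```shell
# Hypothetical sample value; on a GPU node, read it from nvidia-smi instead.
DRIVER_VERSION="450.102.04"
MIN="418.87.01"

# sort -V sorts version strings numerically; the driver meets the requirement
# if the minimum version sorts first (i.e. is the smaller of the two).
if [ "$(printf '%s\n%s\n' "$MIN" "$DRIVER_VERSION" | sort -V | head -n1)" = "$MIN" ]; then
  echo "driver OK"
else
  echo "driver too old"
fi
```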
Step 1: Add labels to GPU-accelerated nodes
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose Nodes > Nodes.
- On the Nodes page, click Manage Labels and Taints in the upper-right corner.
- On the Labels tab of the Manage Labels and Taints page, select the nodes that you want to manage and click Add Label.
- In the Add dialog box, set Name and Value.

  Notice:
  - To enable cGPU, you must set Name to cgpu and Value to true.
  - Deleting the cgpu label does not disable cGPU. To disable cGPU, set Name to cgpu and Value to false.
- Click OK.
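The same cgpu label can also be applied from the command line with `kubectl label`. The sketch below only prints the commands for review rather than running them (remove `echo` to apply); the node names are hypothetical placeholders:

```shell
# Hypothetical node names; list your own with: kubectl get nodes
NODES="cn-hangzhou.192.168.1.107 cn-hangzhou.192.168.1.108"

for node in $NODES; do
  # Print the command instead of running it; remove `echo` to apply.
  # To disable cGPU on a node, set cgpu=false rather than deleting the label.
  echo kubectl label node "$node" cgpu=true --overwrite
done
```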
Step 2: Install ack-cgpu on the labeled nodes
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, choose Marketplace > App Catalog.
- On the App Catalog page, search for ack-cgpu and click ack-cgpu after it appears.
- On the App Catalog - ack-cgpu page, select the cluster that you want to manage in the Deploy section and click Create. You do not need to set Namespace or Release Name; the default values are used.

  To check whether ack-cgpu is installed, run the following command:

  ```
  helm get manifest cgpu -n kube-system | kubectl get -f -
  ```

  If output similar to the following is returned, ack-cgpu is installed:

  ```
  NAME                                     SECRETS   AGE
  serviceaccount/gpushare-device-plugin    1         39s
  serviceaccount/gpushare-schd-extender    1         39s

  NAME                                                           AGE
  clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin   39s
  clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender   39s

  NAME                                                                  AGE
  clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin   39s
  clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender   39s

  NAME                             TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
  service/gpushare-schd-extender   NodePort   10.6.13.125   <none>        12345:32766/TCP   39s

  NAME                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
  daemonset.apps/cgpu-installer               4         4         4       4            4           cgpu=true       39s
  daemonset.apps/device-plugin-evict-ds       4         4         4       4            4           cgpu=true       39s
  daemonset.apps/device-plugin-recover-ds     0         0         0       0            0           cgpu=false      39s
  daemonset.apps/gpushare-device-plugin-ds    4         4         4       4            4           cgpu=true       39s

  NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
  deployment.apps/gpushare-schd-extender   1/1     1            1           38s

  NAME                           COMPLETIONS   DURATION   AGE
  job.batch/gpushare-installer   3/1 of 3      3s         38s
  ```
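In healthy output, every DaemonSet that targets cgpu=true nodes should report matching DESIRED and READY counts. The following is a minimal offline sketch of that check using sample lines from the output as a heredoc; against a live cluster you would pipe `helm get manifest cgpu -n kube-system | kubectl get -f -` in instead:

```shell
# Sample DaemonSet lines copied from the verification output shown above.
OUT=$(cat <<'EOF'
daemonset.apps/cgpu-installer              4   4   4   4   4   cgpu=true   39s
daemonset.apps/device-plugin-evict-ds      4   4   4   4   4   cgpu=true   39s
daemonset.apps/gpushare-device-plugin-ds   4   4   4   4   4   cgpu=true   39s
EOF
)

# Column 2 is DESIRED and column 4 is READY; fail if any row differs.
echo "$OUT" | awk '$2 != $4 { bad = 1 } END { exit bad }' \
  && echo "all cgpu DaemonSets ready"
```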