Container Service for Kubernetes (ACK) provides GPU sharing based on cGPU. cGPU allows multiple containers, such as model prediction workloads, to share a single GPU, while its kernel-level driver enforces GPU memory isolation among the containers. This topic describes how to install the GPU isolation module and an inspection tool in a dedicated Kubernetes cluster that contains GPU-accelerated nodes. This enables GPU sharing and memory isolation.
- Only dedicated Kubernetes clusters that contain GPU-accelerated nodes support the ack-cgpu component. Managed Kubernetes clusters that contain GPU-accelerated nodes do not support the ack-cgpu component.
- If you want to install ack-cgpu in professional Kubernetes clusters, see Install and use ack-ai-installer and the GPU scheduling inspection tool.
- The CPU policy of the nodes for which you want to enable GPU sharing is not set to static.
- A dedicated Kubernetes cluster that contains GPU-accelerated nodes is created. For more information, see Create a dedicated Kubernetes cluster with GPU-accelerated nodes.
- A kubectl client is connected to the created cluster. For more information, see Connect to ACK clusters by using kubectl.
|Item|Supported versions|
|---|---|
|Kubernetes|1.12.6 and later. Only dedicated Kubernetes clusters are supported.|
|Helm|3.0 and later|
|NVIDIA driver|418.87.01 and later|
|Operating system|CentOS 7.x, Alibaba Cloud Linux 2.x, Ubuntu 16.04, and Ubuntu 18.04|
|GPU|Tesla P4, Tesla P100, Tesla T4, and Tesla V100|
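You can verify the NVIDIA driver requirement before installing ack-cgpu. The following is a minimal offline sketch of the version comparison; the `DRIVER_VERSION` value is a hypothetical sample, and on a real GPU node you would obtain it with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`:

```shell
# Hypothetical sample value; on a GPU node, read it from nvidia-smi instead.
DRIVER_VERSION="450.102.04"
MIN="418.87.01"

# sort -V sorts version strings numerically; the driver meets the requirement
# if the minimum version sorts first (i.e. is the smaller of the two).
if [ "$(printf '%s\n%s\n' "$MIN" "$DRIVER_VERSION" | sort -V | head -n1)" = "$MIN" ]; then
  echo "driver OK"
else
  echo "driver too old"
fi
```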
Step 1: Add labels to GPU-accelerated nodes
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose Nodes > Nodes.
- On the Nodes page, click Manage Labels and Taints in the upper-right corner.
- On the Labels tab of the Manage Labels and Taints page, select the nodes that you want to manage and click Add Label.
- In the Add dialog box, set Name and Value.

  Notice:
  - To enable cGPU, you must set Name to cgpu and Value to true.
  - Deleting the cgpu label does not disable cGPU. To disable cGPU, set Name to cgpu and Value to false.
- Click OK.
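The same cgpu label can also be applied from the command line with `kubectl label`. The sketch below only prints the commands for review rather than running them (remove `echo` to apply); the node names are hypothetical placeholders:

```shell
# Hypothetical node names; list your own with: kubectl get nodes
NODES="cn-hangzhou.192.168.1.107 cn-hangzhou.192.168.1.108"

for node in $NODES; do
  # Print the command instead of running it; remove `echo` to apply.
  # To disable cGPU on a node, set cgpu=false rather than deleting the label.
  echo kubectl label node "$node" cgpu=true --overwrite
done
```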
Step 2: Install ack-cgpu on the labeled nodes
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, choose Marketplace > App Catalog.
- On the App Catalog page, search for ack-cgpu and click ack-cgpu after it appears.
- On the App Catalog - ack-cgpu page, select the cluster that you want to manage in the Deploy section and click Create. You do not need to set Namespace or Release Name; the default values are used.

  To check whether ack-cgpu is installed, run the following command:

  ```
  helm get manifest cgpu -n kube-system | kubectl get -f -
  ```

  If output similar to the following is returned, ack-cgpu is installed:

  ```
  NAME                                     SECRETS   AGE
  serviceaccount/gpushare-device-plugin    1         39s
  serviceaccount/gpushare-schd-extender    1         39s

  NAME                                                           AGE
  clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin   39s
  clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender   39s

  NAME                                                                  AGE
  clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin   39s
  clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender   39s

  NAME                             TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
  service/gpushare-schd-extender   NodePort   10.6.13.125   <none>        12345:32766/TCP   39s

  NAME                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
  daemonset.apps/cgpu-installer               4         4         4       4            4           cgpu=true       39s
  daemonset.apps/device-plugin-evict-ds       4         4         4       4            4           cgpu=true       39s
  daemonset.apps/device-plugin-recover-ds     0         0         0       0            0           cgpu=false      39s
  daemonset.apps/gpushare-device-plugin-ds    4         4         4       4            4           cgpu=true       39s

  NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
  deployment.apps/gpushare-schd-extender   1/1     1            1           38s

  NAME                           COMPLETIONS   DURATION   AGE
  job.batch/gpushare-installer   3/1 of 3      3s         38s
  ```
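In healthy output, every DaemonSet that targets cgpu=true nodes should report matching DESIRED and READY counts. The following is a minimal offline sketch of that check using sample lines from the output as a heredoc; against a live cluster you would pipe `helm get manifest cgpu -n kube-system | kubectl get -f -` in instead:

```shell
# Sample DaemonSet lines copied from the verification output shown above.
OUT=$(cat <<'EOF'
daemonset.apps/cgpu-installer              4   4   4   4   4   cgpu=true   39s
daemonset.apps/device-plugin-evict-ds      4   4   4   4   4   cgpu=true   39s
daemonset.apps/gpushare-device-plugin-ds   4   4   4   4   4   cgpu=true   39s
EOF
)

# Column 2 is DESIRED and column 4 is READY; fail if any row differs.
echo "$OUT" | awk '$2 != $4 { bad = 1 } END { exit bad }' \
  && echo "all cgpu DaemonSets ready"
```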