Container Service for Kubernetes (ACK) provides GPU sharing based on cGPU. You can
use cGPU to share one GPU among multiple containers in model prediction scenarios. In
addition, the NVIDIA kernel driver ensures GPU memory isolation among containers. This
topic describes how to install the GPU isolation module and a GPU inspection tool in an
ACK dedicated cluster that contains GPU-accelerated nodes, which enables GPU sharing
and memory isolation.
Limits
- Only ACK dedicated clusters that contain GPU-accelerated nodes support the ack-cgpu component. ACK managed clusters that contain GPU-accelerated nodes do not support ack-cgpu.
- To install ack-cgpu in ACK Pro clusters, see Install and use ack-ai-installer and the GPU inspection tool.
Prerequisites

| Component | Supported version |
| --- | --- |
| Kubernetes | 1.12.6 and later. Only ACK dedicated clusters are supported. |
| Helm | 3.0 and later |
| NVIDIA driver | 418.87.01 and later |
| Docker | 19.03.5 |
| Operating system | CentOS 7.x, Alibaba Cloud Linux 2.x, Ubuntu 16.04, and Ubuntu 18.04 |
| GPU | Tesla P4, Tesla P100, Tesla T4, and Tesla V100 |
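Before you install ack-cgpu, you can verify that a GPU-accelerated node meets the version requirements in the table. A minimal check, assuming `nvidia-smi` and `docker` are available on the node:

```shell
# Check the NVIDIA driver version (must be 418.87.01 or later).
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check the Docker version (19.03.5 is the supported version).
docker version --format '{{.Server.Version}}'

# Check the Kubernetes version reported by the node (must be 1.12.6 or later).
kubectl version --short
```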
Step 1: Add labels to GPU-accelerated nodes
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster
or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose Nodes > Nodes.
- On the Nodes page, click Manage Labels and Taints in the upper-right corner.
- On the Labels tab of the Manage Labels and Taints page, select the nodes that you want to manage and click Add Label.
- In the Add dialog box, set Name and Value.
Notice
- To enable cGPU, you must set Name to cgpu and Value to true.
- If you delete the cgpu label, cGPU is not disabled. To disable cGPU, set Name to cgpu and Value to false.
- Click OK.
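The labeling steps above can also be performed from the command line. A sketch using kubectl, assuming your kubeconfig points at the cluster and `<NODE_NAME>` is a placeholder for a GPU-accelerated node:

```shell
# Add the label that enables cGPU on a GPU-accelerated node.
kubectl label nodes <NODE_NAME> cgpu=true

# To disable cGPU later, set the label to false instead of deleting it.
kubectl label nodes <NODE_NAME> cgpu=false --overwrite

# Verify which nodes carry the cgpu label.
kubectl get nodes -L cgpu
```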
Step 2: Install ack-cgpu on the labeled nodes
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, choose Marketplace > Marketplace.
- On the Marketplace page, click the App Catalog tab. Find and click ack-cgpu.
- On the ack-cgpu page, click Deploy.
- In the Deploy wizard, select a cluster and namespace, and then click Next.
- On the Parameters wizard page, set the parameters and click OK.
Run the following command to check whether ack-cgpu is installed. If output similar
to the following is returned, ack-cgpu is installed.
helm get manifest cgpu -n kube-system | kubectl get -f -
NAME SECRETS AGE
serviceaccount/gpushare-device-plugin 1 39s
serviceaccount/gpushare-schd-extender 1 39s
NAME AGE
clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin 39s
clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender 39s
NAME AGE
clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin 39s
clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender 39s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/gpushare-schd-extender NodePort 10.6.13.125 <none> 12345:32766/TCP 39s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/cgpu-installer 4 4 4 4 4 cgpu=true 39s
daemonset.apps/device-plugin-evict-ds 4 4 4 4 4 cgpu=true 39s
daemonset.apps/device-plugin-recover-ds 0 0 0 0 0 cgpu=false 39s
daemonset.apps/gpushare-device-plugin-ds 4 4 4 4 4 cgpu=true 39s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/gpushare-schd-extender 1/1 1 1 38s
NAME COMPLETIONS DURATION AGE
job.batch/gpushare-installer 3/1 of 3 3s 38s
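After ack-cgpu is installed, pods can request a slice of GPU memory through the extended resource registered by the gpushare device plugin. A minimal sketch, assuming the resource name aliyun.com/gpu-mem (in GiB) and a hypothetical container image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo
spec:
  containers:
  - name: demo
    # Hypothetical image; replace with your own inference image.
    image: registry.example.com/model-server:latest
    resources:
      limits:
        # Request 3 GiB of GPU memory on a shared GPU.
        aliyun.com/gpu-mem: 3
```

The gpushare scheduler extender places the pod on a node whose GPU has at least the requested memory free, and the isolation module caps the container at that amount.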