GPU sharing is a resource management policy that allows multiple tasks or processes to share one GPU. You can use GPU sharing in a registered cluster to avoid the resource waste caused by traditional exclusive GPU scheduling and improve GPU utilization.
Prerequisites
A registered cluster is created and an external cluster is connected to the registered cluster. For more information, see Create a registered cluster.
A kubectl client is connected to the registered cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
The following table lists the versions of the system components that are required.
Component           Version requirement
Kubernetes          1.22 or later
Operating system    Ubuntu 16.04, Ubuntu 18.04, or Alibaba Cloud Linux 3
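You can check the Kubernetes version of the registered cluster from the kubectl client. The Server Version in the output must be 1.22 or later:

kubectl version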
Billing
The cloud-native AI suite is activated before you use GPU sharing. For more information about the cloud-native AI suite and how it is billed, see Overview of the cloud-native AI suite and Billing of the cloud-native AI suite.
Limits
Do not set the CpuPolicy parameter to static for nodes that have GPU sharing enabled.

The pods managed by the DaemonSet of the shared GPU do not have the highest priority. Therefore, resources may be allocated to pods that have a higher priority, and the node may evict the pods managed by the DaemonSet. To prevent this issue, you can modify the actual DaemonSet of the shared GPU. For example, you can modify the gpushare-device-plugin-ds DaemonSet used to share GPU memory and specify priorityClassName: system-node-critical to ensure the priority of the pods managed by the DaemonSet.
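The following one-line patch applies this change. It is a sketch that assumes the gpushare-device-plugin-ds DaemonSet is deployed in the kube-system namespace; adjust the namespace if your installation differs:

kubectl patch daemonset gpushare-device-plugin-ds -n kube-system \
  --type merge \
  -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'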
Step 1: Install components
Install the ack-ai-installer component in the registered cluster. This component implements scheduling capabilities such as GPU sharing (including GPU memory isolation) and topology-aware GPU scheduling.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Helm.
On the Helm page, click Create. Search for and install the ack-ai-installer component.
Install the ack-co-scheduler component in the registered cluster. This component allows you to create ResourcePolicy custom resources (CRs) to use the multilevel resource scheduling feature.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Operations > Add-ons.
On the Add-ons page, search for the ack-co-scheduler component, and click Install at the bottom right of the card.
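With ack-co-scheduler installed, you can create ResourcePolicy CRs to define the order in which different types of resources are used by matching pods. The following is a minimal sketch, assuming the scheduling.alibabacloud.com/v1alpha1 API group and illustrative names; see the ACK documentation on ResourcePolicy for the full schema:

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: gpu-share-policy    # Hypothetical name.
  namespace: default
spec:
  selector:
    app: gpu-share-sample   # Apply the policy to pods with this label.
  strategy: prefer          # Try the units in the order listed below.
  units:
  - resource: ecs           # Schedule to ECS nodes first.
  - resource: eci           # Fall back to elastic container instances.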
Step 2: Install and use the GPU inspection tool
Download kubectl-inspect-cgpu. The executable file must be downloaded to a directory that is included in the PATH environment variable. This section uses /usr/local/bin/ as an example.

If you use Linux, run the following command to download kubectl-inspect-cgpu:

wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-linux -O /usr/local/bin/kubectl-inspect-cgpu

If you use macOS, run the following command to download kubectl-inspect-cgpu:
wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-darwin -O /usr/local/bin/kubectl-inspect-cgpu
Run the following command to grant execute permissions to kubectl-inspect-cgpu:
chmod +x /usr/local/bin/kubectl-inspect-cgpu
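Because the file name starts with kubectl-, kubectl discovers the executable as a plugin once it is on the PATH. You can verify this with the built-in plugin lister; the output should include /usr/local/bin/kubectl-inspect-cgpu:

kubectl plugin list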
Step 3: Create GPU-accelerated nodes
Create an Elastic GPU Service instance, and install the NVIDIA driver and nvidia-container-runtime. For more information, see Create and manage a node pool.
Skip this step if you have added GPU-accelerated nodes to the node pool and configured the environment when you created the node pool.
For more information about the driver installation script, see Manually update the NVIDIA driver of a node.
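Before you label the nodes, you can run a quick sanity check on a GPU-accelerated node. nvidia-smi ships with the NVIDIA driver, and command -v only confirms that the runtime binary is on the PATH:

# Verify that the NVIDIA driver is installed and can see the GPUs:
nvidia-smi
# Verify that nvidia-container-runtime is on the PATH:
command -v nvidia-container-runtime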
Nodes that have GPU sharing enabled must be labeled with ack.node.gpu.schedule=share. You can manually add this label to on-premises nodes, as shown in the sketch below. For cloud nodes, you can update the label value to ack.node.gpu.schedule=cgpu by using the labeling feature provided by the node pool, which enables GPU memory isolation. For more information, see Labels for enabling GPU scheduling policies.
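For an on-premises node, you can add the label directly with kubectl. In the following sketch, <NODE_NAME> is a placeholder for a node name from the output of kubectl get nodes:

# Enable GPU sharing on the node:
kubectl label node <NODE_NAME> ack.node.gpu.schedule=share
# Or enable GPU sharing with memory isolation (cgpu):
kubectl label node <NODE_NAME> ack.node.gpu.schedule=cgpu --overwrite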
Step 4: Work with GPU sharing
Run the following command to query the GPU usage of the cluster:
kubectl inspect cgpu

Expected output:

NAME                           IPADDRESS       GPU0(Allocated/Total)  GPU Memory(GiB)
cn-zhangjiakou.192.168.66.139  192.168.66.139  0/15                   0/15
---------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/15 (0%)

Create a file named GPUtest.yaml and copy the following content to the file.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-share-sample
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: gpu-share-sample
    spec:
      schedulerName: ack-co-scheduler
      containers:
      - name: gpu-share-sample
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            # The unit is GiB. This pod requests a total of 3 GiB of GPU memory.
            aliyun.com/gpu-mem: 3 # Set the amount of GPU memory.
        workingDir: /root
      restartPolicy: Never

Run the following command to deploy a sample application that has GPU sharing enabled and requests 3 GiB of GPU memory for the application:
kubectl apply -f GPUtest.yaml

Run the following command to query the memory usage of the GPU:

kubectl inspect cgpu

Expected output:

NAME                           IPADDRESS       GPU0(Allocated/Total)  GPU Memory(GiB)
cn-zhangjiakou.192.168.66.139  192.168.66.139  3/15                   3/15
---------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/15 (20%)

The output shows that the total GPU memory of the cn-zhangjiakou.192.168.66.139 node is 15 GiB, and 3 GiB of GPU memory is allocated.
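To confirm the allocation from inside the container, you can list the pods created by the Job and run nvidia-smi in one of them. <POD_NAME> is a placeholder for a pod name taken from the output of the first command. With GPU memory isolation enabled, nvidia-smi inside the pod reports a total close to the requested 3 GiB instead of the full 15 GiB:

# List the pods created by the sample Job:
kubectl get pods -l app=gpu-share-sample
# Run nvidia-smi inside a pod:
kubectl exec <POD_NAME> -- nvidia-smi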
References
For more information about GPU sharing, see GPU sharing.