Container Service for Kubernetes:Use cGPU to configure a computing power allocation policy for GPU sharing

Last Updated: Oct 25, 2023

GPU sharing in ACK Pro clusters is implemented based on cGPU, which supports multiple computing power allocation policies. To select a policy, specify the policy that is used by the cGPU component in your ACK Pro cluster. This topic describes how to configure a suitable computing power allocation policy for GPU sharing based on your business requirements.

For more information about cGPU, see What is cGPU?

Prerequisites

  • An ACK Pro cluster is created and the Kubernetes version of the cluster is 1.18.8 or later. For more information about how to update the Kubernetes version, see Update an ACK cluster.

  • cGPU 1.0.6 or later is used. For more information about how to update cGPU, see Update the cGPU version on a node.

Precautions

  • If the cGPU isolation module is installed on a node before you install the cGPU component, you must restart the node to make the cGPU policy take effect. For more information about how to restart a node, see Restart instances.

    Note

    To check whether the cGPU isolation module is installed on a node, log on to the node and run the cat /proc/cgpu_km/version command, as shown in the example after this list. If the system returns a cGPU version number, the cGPU isolation module is installed.

  • If the cGPU isolation module is not installed or the module has been uninstalled, you must install the module to make the cGPU policy take effect.

  • The nodes that have GPU sharing enabled in a cluster use the same cGPU policy.
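
For example, the following shell sketch wraps the check described in the preceding note. It uses only the /proc/cgpu_km/version file documented above and must be run on the GPU node itself.

    # The proc file exists only when the cGPU isolation module is installed.
    if cat /proc/cgpu_km/version 2>/dev/null; then
      echo "cGPU isolation module is installed"
    else
      echo "cGPU isolation module is not installed"
    fi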

Step 1: Check whether the cGPU component is installed

The operations that are required for configuring a computing power allocation policy vary based on whether the cGPU component is installed. You must check whether the cGPU component is installed before you configure a computing power allocation policy.

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Helm in the left-side navigation pane.

  3. On the Helm page, check whether the ack-ai-installer component exists.

    If ack-ai-installer exists, the cGPU component is installed. If ack-ai-installer does not exist, the cGPU component is not installed.
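
If you prefer the command line, a quick check with Helm can replace the preceding console steps. This is a sketch: it assumes the component is deployed as a Helm release named ack-ai-installer, which may be listed under a different namespace in your cluster.

    # List Helm releases in all namespaces and look for ack-ai-installer.
    # An empty result means the cGPU component is not installed.
    helm list -A | grep ack-ai-installer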

Step 2: Configure a computing power allocation policy

The following sections describe how to configure a computing power allocation policy for GPU sharing, depending on whether the cGPU component is installed.

The cGPU component is not installed

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Cloud-native AI Suite in the left-side navigation pane.

  3. On the Cloud-native AI Suite page, click Deploy.

  4. In the Scheduling section, select Scheduling Component (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling, and NPU Scheduling), and then click Advanced.

  5. On the Parameters page, modify the policy field based on the following description, and then click OK. For more information about time slices and scheduling policies, see Examples of computing power scheduling by using cGPU. For a sample pod configuration that uses the weight-based policy, see the sketch after these steps.

    Value | Description
    ------|------------
    0     | Fair-share scheduling. Each container occupies a fixed time slice. The proportion of the time slice is 1/max_inst.
    1     | Preemptive scheduling. Each container occupies as many time slices as possible. The proportion of the time slices is 1/<number of containers>.
    2     | Weight-based preemptive scheduling. Used when ALIYUN_COM_GPU_SCHD_WEIGHT is set to a value greater than 1.
    3     | Fixed percentage scheduling. Computing power is scheduled at a fixed percentage.
    4     | Soft scheduling. Compared with preemptive scheduling, soft scheduling isolates GPU resources in a softer manner.
    5     | Built-in scheduling. The built-in scheduling policy of the GPU driver.

  6. In the lower part of the page, click Deploy Cloud-native AI Suite.
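
For reference, the following pod sketch targets the weight-based policy (value 2). It is a minimal sketch, not a definitive manifest: it assumes GPU sharing is enabled on the node and that GPU memory is requested through the aliyun.com/gpu-mem extended resource used by ACK GPU sharing; replace the image and the resource amount with values that fit your workload.

    # Hypothetical pod for the weight-based policy (value 2).
    # ALIYUN_COM_GPU_SCHD_WEIGHT greater than 1 gives this container a
    # larger share of GPU time slices under weight-based scheduling.
    apiVersion: v1
    kind: Pod
    metadata:
      name: cgpu-weight-demo
    spec:
      containers:
      - name: app
        image: your-image:tag          # replace with your workload image
        env:
        - name: ALIYUN_COM_GPU_SCHD_WEIGHT
          value: "2"
        resources:
          limits:
            aliyun.com/gpu-mem: 4      # GiB of GPU memory on a shared GPU (assumed resource name)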

The cGPU component is installed

  1. Run the following command to modify the DaemonSet in which the cGPU isolation module of the cGPU component runs:

    kubectl edit daemonset cgpu-installer -n kube-system
  2. In the editor, modify the DaemonSet configuration and save the changes.

    1. View the image version of the DaemonSet in the image field.

      Make sure that the image version is 1.0.6 or later. Example of the image field:

       image: registry-vpc.cn-hongkong.aliyuncs.com/acs/cgpu-installer:<Image version>
    2. Modify the value field.

      In the containers.env section, set the value field of the POLICY environment variable. If you prefer a non-interactive change, see the sketch after these steps.

      # Other fields are omitted.
      spec:
        containers:
        - env:
          - name: POLICY
            value: "1"
      # Other fields are omitted.

      The following table describes the values of the value field.

      Value | Description
      ------|------------
      0     | Fair-share scheduling. Each container occupies a fixed time slice. The proportion of the time slice is 1/max_inst.
      1     | Preemptive scheduling. Each container occupies as many time slices as possible. The proportion of the time slices is 1/<number of containers>.
      2     | Weight-based preemptive scheduling. Used when ALIYUN_COM_GPU_SCHD_WEIGHT is set to a value greater than 1.
      3     | Fixed percentage scheduling. Computing power is scheduled at a fixed percentage.
      4     | Soft scheduling. Compared with preemptive scheduling, soft scheduling isolates GPU resources in a softer manner.
      5     | Built-in scheduling. The built-in scheduling policy of the GPU driver.

  3. Restart the node that has GPU sharing enabled.

    For more information about how to restart a node, see Restart instances.
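
If you prefer not to open an editor, the same change can be made non-interactively. The following sketch uses standard kubectl commands and assumes the POLICY environment variable shown in the YAML above; after it runs, you must still restart the nodes that have GPU sharing enabled for the policy to take effect.

    # Check the image version of the DaemonSet without opening an editor.
    kubectl get daemonset cgpu-installer -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].image}'

    # Set the POLICY environment variable, for example to preemptive scheduling (1).
    kubectl set env daemonset/cgpu-installer -n kube-system POLICY=1

    # Verify the change.
    kubectl get daemonset cgpu-installer -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].env}'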