Use cGPU to configure a computing power allocation policy for GPU sharing

Last Updated: Feb 21, 2024

Container Service for Kubernetes (ACK) dedicated clusters support GPU sharing and GPU memory isolation, both of which are implemented based on cGPU. For more information about cGPU, see What is cGPU? GPU sharing supports multiple computing power allocation policies. To select a policy, specify it for the cGPU component in your ACK dedicated cluster. This topic describes how to configure a computing power allocation policy for GPU sharing based on your business requirements.

Prerequisites

An ACK dedicated cluster with GPU-accelerated nodes is created. For more information, see Create an ACK cluster with GPU-accelerated nodes.

Precautions

  • If GPU isolation is configured for the node on which you want to install cGPU, you must restart the node after cGPU is installed so that the computing power allocation policy can take effect. If GPU isolation is not configured for the node, the computing power allocation policy takes effect immediately after cGPU is installed. For more information about how to restart a node, see Restart instances.

    Note
    • To check whether GPU isolation is configured for the node, log on to the node and run the relevant command. If the system returns a cGPU version number, GPU isolation is configured.

    • Only cGPU 1.0.6 or later is supported. For more information about how to update the cGPU version, see Update the cGPU version on a node.

  • All nodes that have GPU sharing enabled in a cluster use the same computing power allocation policy.
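The checks described in the preceding precautions can be sketched as a small script. The kernel module name cgpu_km and the /proc/cgpu_km/version path are assumptions that are not confirmed in this topic; verify them against your cGPU version's documentation before relying on this:

```shell
# Hedged sketch of the precaution checks. The module name cgpu_km and the
# /proc/cgpu_km/version path are assumptions; verify them for your setup.

# Returns success when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Report whether GPU isolation is configured on this node.
if lsmod 2>/dev/null | grep -q '^cgpu_km'; then
  ver="$(cat /proc/cgpu_km/version 2>/dev/null)"
  echo "GPU isolation is configured (cGPU version: ${ver:-unknown})"
  version_ge "$ver" 1.0.6 || echo "warning: cGPU is older than the required 1.0.6"
else
  echo "GPU isolation is not configured"
fi
```

The version comparison uses sort -V so that, for example, 1.0.10 correctly compares as newer than 1.0.6.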

Step 1: Check whether the cGPU component is installed

The operations that are required to configure a computing power allocation policy vary based on whether the cGPU component is installed. You must check whether the cGPU component is installed before you configure a computing power allocation policy.

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Helm in the left-side navigation pane.

  3. On the Helm page, check whether the cgpu component exists.

    If cgpu exists, the cGPU component is installed. If cgpu does not exist, the cGPU component is not installed.
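If you prefer the command line over the console, the Helm check above can be approximated with the helm CLI. This topic only describes the console path, so treat the following as a sketch that assumes the helm CLI is configured for your cluster and that the release is named cgpu as in this step:

```shell
# Hedged CLI alternative to the console check in Step 1. Assumes the helm
# CLI has access to the cluster; the release name cgpu matches Step 1.
cgpu_installed() {
  helm list --all-namespaces 2>/dev/null | grep -qw cgpu \
    && echo "installed" || echo "not installed"
}
cgpu_installed
```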

Step 2: Configure a computing power allocation policy

The cGPU component is not installed

  1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

  2. Find and click ack-cgpu. In the upper-right corner of the ack-cgpu page, click Deploy.

  3. On the Basic Information wizard page, set the Cluster, Namespace, and Release Name parameters. Then, click Next.

  4. On the Parameters wizard page, set the Chart Version parameter and set the policy field in the Parameters section. Then, click OK.

    Set the policy field based on the following description. For more information about time slices and scheduling policies, see Examples of computing power scheduling by using cGPU.

    Value descriptions:

    • 0: Fair-share scheduling. Each container occupies a fixed time slice. The proportion of each time slice is 1/max_inst.

    • 1: Preemptive scheduling. Each container occupies as many time slices as possible. The proportion of the time slices is 1/(number of containers).

    • 2: Weight-based preemptive scheduling. Used when ALIYUN_COM_GPU_SCHD_WEIGHT is set to a value greater than 1.

    • 3: Fixed percentage scheduling. Computing power is scheduled at a fixed percentage.

    • 4: Soft scheduling. Compared with preemptive scheduling, soft scheduling isolates GPU resources in a softer manner.

    • 5: Built-in scheduling. The scheduling policy that is built into the GPU driver.
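For example, to deploy with preemptive scheduling, the policy field could be set as follows. Only the policy field and its values are described in this topic; the exact location of the field within the ack-cgpu chart values is an assumption, so check the Parameters section on the deploy page for the actual structure:

```yaml
# Sketch of the Parameters section; only the policy field is taken from
# this topic, and its exact position in the values file is an assumption.
policy: 1   # Preemptive scheduling: containers compete for time slices.
```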

The cGPU component is installed

  1. Run the following command to modify the DaemonSet in which the cGPU isolation module of the cGPU component runs:

    kubectl edit daemonset cgpu-installer -n kube-system
  2. In the editor that opens, modify the DaemonSet and save the changes.

    1. View the image version of the DaemonSet in the image field.

      Make sure that the image version is 1.0.6 or later. Example of the image field:

       image: registry-vpc.cn-hongkong.aliyuncs.com/acs/cgpu-installer:<Image version>
    2. Modify the value field.

      In the containers.env parameter, set the value field for the POLICY key. For more information about the valid values, see the preceding value descriptions.

      # Other fields are omitted.
      spec:
        containers:
        - env:
          - name: POLICY
            value: "1"
      # Other fields are omitted.
  3. Restart the node that has GPU sharing enabled.

    For more information about how to restart a node, see Restart instances.
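The interactive edit in the steps above can also be scripted with kubectl patch, which avoids opening an editor. The following is a sketch; the container name cgpu-installer inside the DaemonSet is an assumption, so confirm it with kubectl get daemonset cgpu-installer -n kube-system -o yaml before patching:

```shell
# Hedged sketch: set POLICY non-interactively with kubectl patch instead of
# kubectl edit. The container name cgpu-installer is an assumption.
make_policy_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"cgpu-installer","env":[{"name":"POLICY","value":"%s"}]}]}}}}' "$1"
}

# Example (run against your cluster, then restart the node as described above):
# kubectl patch daemonset cgpu-installer -n kube-system --patch "$(make_policy_patch 1)"
make_policy_patch 1
```

A strategic merge patch matches the container by its name field, so only the POLICY environment variable is changed.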
