Container Service for Kubernetes: Manage the shared GPU scheduling component

Last Updated: Jun 23, 2026

Container Service for Kubernetes (ACK) provides GPU sharing and scheduling for model inference workloads that share a single GPU. It also uses the NVIDIA driver's kernel module to isolate GPU memory between those workloads. If the shared GPU scheduling component is already installed in your cluster and the GPU driver or operating system on a node is incompatible with the installed cGPU version, upgrade the component to the latest version. This topic explains how to install and upgrade the shared GPU scheduling component on GPU nodes to enable GPU scheduling and isolation.

Prerequisites

Limitations

  • The cGPU memory isolation feature is supported only on ECS nodes. Do not add the ack.node.gpu.schedule=cgpu or ack.node.gpu.schedule=core_mem label to non-ECS nodes.

  • Do not set the CPU policy to static for nodes that use shared GPU scheduling.

  • If you need to specify a custom path for the KubeConfig file, run the export KUBECONFIG=<kubeconfig> command. The kubectl inspect cgpu command does not support the --kubeconfig parameter.

  • Because cGPU isolation does not support Unified Virtual Memory (UVM), you cannot call cudaMallocManaged() to allocate GPU memory. Instead, call cudaMalloc(). For more information, see the NVIDIA documentation.

  • The shared GPU DaemonSet Pods do not have the highest priority, so higher-priority Pods can preempt node resources and evict them. To prevent this, modify the DaemonSet that you use, such as gpushare-device-plugin-ds for shared GPU memory, and add priorityClassName: system-node-critical to its Pod specification.

  • For performance reasons, you can create at most 20 Pods on a single GPU card when using cGPU. Once this limit is reached, any further Pods scheduled to that card fail to run and return the error message Error occurs when creating cGPU instance: unknown.

  • You can install the shared GPU component in any region. However, the GPU memory isolation feature is available only in the following regions. Make sure your cluster is in one of these regions.

    Region               Region ID
    China (Beijing)      cn-beijing
    China (Shanghai)     cn-shanghai
    China (Hangzhou)     cn-hangzhou
    China (Zhangjiakou)  cn-zhangjiakou
    China (Wulanchabu)   cn-wulanchabu
    China (Shenzhen)     cn-shenzhen
    China (Chengdu)      cn-chengdu
    China (Heyuan)       cn-heyuan
    China (Hong Kong)    cn-hongkong
    Japan (Tokyo)        ap-northeast-1
    Indonesia (Jakarta)  ap-southeast-5
    Singapore            ap-southeast-1
    US (Virginia)        us-east-1
    US (Silicon Valley)  us-west-1
    Germany (Frankfurt)  eu-central-1

  • Version compatibility:

    Component                  Supported versions
    Kubernetes version
    NVIDIA driver version      418.87.01 or later
    Container runtime version  Docker: 19.03.5 or later; containerd: 1.4.3 or later
    Operating system           Alibaba Cloud Linux 3.x (container-optimized edition requires ack-ai-installer v1.12.6 or later), Alibaba Cloud Linux 2.x, CentOS 7.6, CentOS 7.7, CentOS 7.9, Ubuntu 22.04
    Supported GPUs             P-series, T-series, V-series, A-series, H-series
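
For the DaemonSet priority limitation listed above, one way to add the priority class is a merge patch on the DaemonSet's Pod template. The following is a minimal sketch, not an official procedure: the helper name set_ds_priority is hypothetical, and the kube-system namespace in the usage line is an assumption, so check where the DaemonSet runs in your cluster first.

```shell
# Hypothetical helper: set priorityClassName on a DaemonSet's Pod template.
# $1 = namespace (kube-system below is an assumption), $2 = DaemonSet name.
set_ds_priority() {
  kubectl -n "$1" patch daemonset "$2" --type merge \
    -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'
}

# Usage (requires cluster access), for the shared GPU memory device plugin:
#   set_ds_priority kube-system gpushare-device-plugin-ds
```

Changing the Pod template normally triggers a rollout of the DaemonSet Pods, and a later component upgrade may overwrite the patch, so it is worth re-checking the field after each upgrade.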

Install the shared GPU scheduling component

Step 1: Install the shared GPU component

AI Suite not deployed

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Cloud-native AI Suite.

  3. On the Cloud-native AI Suite page, click Deploy.

  4. On the Deploy Cloud-native AI Suite page, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling).

  5. (Optional) To customize the cGPU policy, click Advanced to the right of Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling). In the Parameters dialog box, modify the policy field for cGPU and click OK.

    If you have no special requirements for cGPU compute power sharing, we recommend that you keep the default policy, 5 (native scheduling). For more information about the policies supported by cGPU, see Install and use the cGPU service.

  6. At the bottom of the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.

    After installation, the ack-ai-installer component appears in the component list on the Cloud-native AI Suite page.

AI Suite deployed

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Cloud-native AI Suite.

  3. Find the ack-ai-installer component and click Deploy in the Actions column.

  4. (Optional) In the Parameters dialog box that appears, modify the policy field for cGPU.

    If you have no special requirements for cGPU compute power sharing, we recommend that you keep the default policy, 5 (native scheduling). For more information about the policies supported by cGPU, see Install and use the cGPU service.

  5. After you finish modifying the parameters, click OK.

    After the component is installed, the Status of ack-ai-installer changes to Deployed.

Step 2: Enable shared GPU scheduling and isolation

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

  2. On the Node Pools page, click Create Node Pool.

  3. On the Create Node Pool page, configure the parameters for the node pool and then click Confirm. The key parameters are described below. For the remaining parameters, see Create and manage a node pool.

    Expected number of nodes
      Enter the initial number of nodes for the node pool. Use 0 if no nodes are needed initially.

    Node labels
      Set the label value based on your business requirements. For more information about node labels, see Enable scheduling.

      This example uses the label value cgpu, which enables shared GPU scheduling on the node. Each Pod only needs to request GPU memory. Multiple Pods on a single GPU card share compute power, and their GPU memory is isolated.

      Click the node label icon next to Node Labels. Set the Key to ack.node.gpu.schedule and the Value to cgpu.

      Important
      • For notes on using the cGPU isolation feature, see cGPU FAQ.

      • After you add the shared GPU scheduling label, do not change it with the kubectl label nodes command or with the label management feature on the Nodes page. Doing so can cause issues. For details and the recommended method, see Enable scheduling.
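
To illustrate what "each Pod only needs to request GPU memory" can look like on a node labeled cgpu, the sketch below prints a Pod manifest with a GPU memory limit. It rests on assumptions that do not come from this topic: the extended resource name aliyun.com/gpu-mem (in GiB), the placeholder image, and the helper name render_gpu_mem_pod. Verify the resource name against the shared GPU scheduling usage documentation before relying on it.

```shell
# Hypothetical helper: print a Pod manifest that requests only GPU memory.
# $1 = Pod name, $2 = GPU memory in GiB.
render_gpu_mem_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
  - name: app
    image: registry.example.com/inference:latest  # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: $2  # assumed extended resource name, in GiB
EOF
}

# Usage (requires cluster access):
#   render_gpu_mem_pod demo 3 | kubectl apply -f -
```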

Step 3: Add GPU nodes

Note

If you already created GPU nodes when you added the node pool, you can skip this step.

After creating the node pool, add GPU nodes to it. When doing so, ensure you specify a GPU cloud server as the instance type. For more information, see Add existing nodes to a node pool or Create and manage a node pool.

Step 4: Install the GPU query tool

  1. Download the kubectl-inspect-cgpu executable to a directory in your PATH, such as /usr/local/bin/.

    • If you are using Linux, run the following command to download kubectl-inspect-cgpu.

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-linux -O /usr/local/bin/kubectl-inspect-cgpu

    • If you are using macOS, run the following command to download kubectl-inspect-cgpu.

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-darwin -O /usr/local/bin/kubectl-inspect-cgpu

  2. Make the file executable:

    chmod +x /usr/local/bin/kubectl-inspect-cgpu

  3. Check the GPU usage in the cluster:

    kubectl inspect cgpu

    Expected output:

    NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.168.6.104  192.168.6.104  0/15                   0/15
    ----------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    0/15 (0%)
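
When the check is scripted, the cluster-wide summary can be pulled out of this output. The sketch below assumes the summary line keeps the form shown above, such as 0/15 (0%). The helper name cgpu_summary is hypothetical. The usage comment also shows the environment-variable route for a custom kubeconfig path, because kubectl inspect cgpu does not accept --kubeconfig.

```shell
# Hypothetical helper: read `kubectl inspect cgpu` output on stdin and print
# "<allocated> <total>" (GiB) from the cluster summary line, e.g. "0/15 (0%)".
cgpu_summary() {
  awk 'NF == 2 && $1 ~ /^[0-9]+\/[0-9]+$/ && $2 ~ /^\([0-9]+%\)$/ {
    split($1, mem, "/")
    print mem[1], mem[2]
  }'
}

# Usage (requires cluster access):
#   export KUBECONFIG=<kubeconfig>
#   kubectl inspect cgpu | cgpu_summary
```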

Upgrade the shared GPU scheduling component

Step 1: Determine the upgrade method

The upgrade method depends on how the ack-ai-installer component was originally installed. There are two possibilities:

  • Installed from the Cloud-native AI Suite (recommended): You installed the ack-ai-installer component from the Cloud-native AI Suite page.

  • Installed from the App Catalog (no longer available): You installed the ack-ai-installer component from the App Catalog page in the Marketplace. This installation method is no longer offered for new installations, but components that were installed this way can still be upgraded with it.

    Important

    If you uninstall a component that was installed from the App Catalog, you can reinstall it only by activating the Cloud-native AI Suite and installing the component from there.

Determine the installation method

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Cloud-native AI Suite.

  3. On the Cloud-native AI Suite page, check if the ack-ai-installer component is deployed.

    If the component is deployed, it was installed via the Cloud-native AI Suite. Otherwise, it was installed from the App Catalog.

Step 2: Upgrade the component

Cloud-native AI Suite

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Cloud-native AI Suite.

  3. In the Components section, locate the ack-ai-installer component and click Upgrade in the Actions column.

App Catalog

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Helm.

  3. On the Helm page, locate the ack-ai-installer release. In the Actions column, click Update. Follow the prompts to select the latest Chart version and update the component.

    Important

    If you need to customize the Chart configuration, confirm the component update after making your changes.

    After the update, check the Helm page to confirm that the Chart version of the ack-ai-installer release is the latest version.

Step 3: Upgrade existing nodes

Upgrading the ack-ai-installer component does not automatically upgrade the cGPU version on existing nodes. Whether you need to upgrade those nodes depends on whether the cGPU isolation feature is enabled on them.

  • If your cluster contains GPU nodes with the cGPU isolation feature enabled, you must also upgrade the cGPU version on those nodes. For more information, see Upgrade the cGPU version of a node.

  • If your cluster does not have any nodes with the cGPU isolation feature enabled, skip this step.

    Note
    • If a node has the label ack.node.gpu.schedule=cgpu or ack.node.gpu.schedule=core_mem, the cGPU isolation feature is enabled.

    • Upgrading the cGPU version on existing nodes requires stopping all business Pods on those nodes. Perform this operation during off-peak hours to minimize business disruption.
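
The label check in the note above can be done with a read-only label selector. This is a minimal sketch, and the helper name list_cgpu_isolated_nodes is hypothetical. It only lists nodes and does not change any label.

```shell
# Hypothetical helper: list nodes whose GPU scheduling label enables cGPU
# isolation (cgpu or core_mem), with the label value shown as an extra column.
list_cgpu_isolated_nodes() {
  kubectl get nodes -l 'ack.node.gpu.schedule in (cgpu,core_mem)' -L ack.node.gpu.schedule
}

# Usage (requires cluster access):
#   list_cgpu_isolated_nodes
```

If the command returns no nodes, no node in the cluster has the cGPU isolation feature enabled and this step can be skipped.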