Container Service for Kubernetes: Manage the GPU sharing component

Last Updated: Jun 17, 2025

Container Service for Kubernetes (ACK) provides GPU sharing capabilities that allow multiple models to share one GPU and support GPU memory isolation based on the NVIDIA driver. If your cluster has the GPU sharing component installed but the GPU driver version or operating system version on the node is incompatible with the existing cGPU version in the cluster, you need to upgrade the GPU sharing component to the latest version. This topic describes how to manage the GPU sharing component on GPU-accelerated nodes to implement GPU scheduling and isolation capabilities.


Limits

  • Do not set the CPU Policy to static for nodes where GPU sharing is enabled.

  • The kubectl inspect cgpu command does not support the --kubeconfig parameter. To use a KubeConfig file in a custom path, run the export KUBECONFIG=<kubeconfig> command before you run kubectl inspect cgpu.

  • If you use cGPU to isolate GPU resources, you cannot request GPU memory by using Unified Virtual Memory (UVM). Therefore, you cannot request GPU memory by calling cudaMallocManaged() of the Compute Unified Device Architecture (CUDA) API. You can request GPU memory by using other methods. For example, you can call cudaMalloc(). For more information, see Unified Memory for CUDA Beginners.

  • The pods created by the GPU sharing DaemonSets do not have the highest priority by default. As a result, node resources may be allocated to pods with a higher priority, and the DaemonSet pods may be evicted from the node. To prevent this issue, modify the DaemonSets that are actually deployed for GPU sharing. For example, specify priorityClassName: system-node-critical in the gpushare-device-plugin-ds DaemonSet, which is used to share GPU memory, to raise the priority of its pods. A sample patch command is provided after this list.

  • For performance optimization, a maximum of 20 pods can be created per GPU when using cGPU. If the number of created pods exceeds this limit, subsequent pods scheduled to the same GPU will fail to run and return the error: Error occurs when creating cGPU instance: unknown.

  • You can install the GPU sharing component without region limits. However, GPU memory isolation is supported only in the regions described in the following table. Make sure that your ACK cluster is deployed in one of these regions.

    Regions

    Region                 ID
    China (Beijing)        cn-beijing
    China (Shanghai)       cn-shanghai
    China (Hangzhou)       cn-hangzhou
    China (Zhangjiakou)    cn-zhangjiakou
    China (Ulanqab)        cn-wulanchabu
    China (Shenzhen)       cn-shenzhen
    China (Chengdu)        cn-chengdu
    China (Heyuan)         cn-heyuan
    China (Hong Kong)      cn-hongkong
    Japan (Tokyo)          ap-northeast-1
    Indonesia (Jakarta)    ap-southeast-5
    Singapore              ap-southeast-1
    US (Virginia)          us-east-1
    US (Silicon Valley)    us-west-1
    Germany (Frankfurt)    eu-central-1

  • Version requirements.

    Configuration              Version requirement
    Kubernetes version         • ack-ai-installer earlier than 1.12.0: Kubernetes 1.18.8 or later
                               • ack-ai-installer 1.12.0 or later: Kubernetes 1.20 or later
    NVIDIA driver version      418.87.01 or later
    Container runtime version  • Docker: 19.03.5 or later
                               • containerd: 1.4.3 or later
    Operating system           Alibaba Cloud Linux 3.x, Alibaba Cloud Linux 2.x, CentOS 7.6, CentOS 7.7,
                               CentOS 7.9, Ubuntu 22.04
    GPU model                  NVIDIA P, NVIDIA T, NVIDIA V, NVIDIA A, and NVIDIA H series
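
To raise the priority of the GPU memory sharing DaemonSet described in the limits above, you can patch it as shown in the following sketch. The command assumes that gpushare-device-plugin-ds is deployed in the kube-system namespace; adjust the namespace and DaemonSet name to match your cluster.

    # Patch the GPU memory sharing DaemonSet so that its pods use the
    # system-node-critical priority class. The kube-system namespace is an
    # assumption; change it if the DaemonSet runs elsewhere in your cluster.
    kubectl patch daemonset gpushare-device-plugin-ds \
      --namespace kube-system \
      --type merge \
      --patch '{"spec": {"template": {"spec": {"priorityClassName": "system-node-critical"}}}}'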

Install the GPU sharing component

Step 1: Install the GPU sharing component

The cloud-native AI suite is not deployed

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. On the Cloud-native AI Suite page, click Deploy.

  4. On the Deploy Cloud-native AI Suite page, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling).

  5. (Optional) Click Advanced to the right of Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling). In the Parameters panel, modify the policy parameter of cGPU. After you complete the modification, click OK.

    If you do not have requirements for the computing power sharing feature provided by cGPU, we recommend that you keep the default setting policy: 5, which uses native scheduling. For more information about the policies supported by cGPU, see Install and use cGPU.

  6. In the lower part of the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.

    After the deployment is complete, you can find the GPU sharing component ack-ai-installer in the component list on the Cloud-native AI Suite page.

The cloud-native AI suite is deployed

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. Find the ack-ai-installer component and click Deploy in the Actions column.

  4. (Optional) In the Parameters panel, modify the policy parameter of cGPU.

    If you do not have requirements for the computing power sharing feature provided by cGPU, we recommend that you keep the default setting policy: 5, which uses native scheduling. For more information about the policies supported by cGPU, see Install and use cGPU.

  5. After you complete the modification, click OK.

    After the component is installed, the Status of ack-ai-installer changes to Deployed.
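
After you install the component by using either of the preceding methods, you can optionally verify from the command line that the GPU sharing DaemonSets were created. The following command is a sketch; it assumes that the gpushare components run in the kube-system namespace, which may differ in your cluster.

    # List the GPU sharing DaemonSets; the kube-system namespace is an
    # assumption and may differ in your cluster.
    kubectl get daemonset -n kube-system | grep gpushare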

Step 2: Enable GPU sharing and GPU memory isolation

  1. On the Clusters page, find the cluster to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  2. On the Node Pools page, click Create Node Pool. For more information about how to configure the node pool, see Create and manage a node pool.

  3. On the Create Node Pool page, configure the parameters to create a node pool and click Confirm. The following table describes the key parameters:

    Parameter       Description

    Expected Nodes  The initial number of nodes in the node pool. If you do not want to
                    create nodes in the node pool, set this parameter to 0.

    Node Labels     Add labels based on your business requirements. For more information
                    about node labels, see Labels for enabling GPU scheduling policies.

                    In this example, the label value is set to cgpu, which indicates that
                    GPU sharing is enabled on the node: pods on the node request only GPU
                    memory, and multiple pods can share the same GPU to implement GPU
                    memory isolation and computing power sharing. Click the node label
                    icon next to the Node Labels parameter, set the Key field to
                    ack.node.gpu.schedule, and set the Value field to cgpu. You can
                    verify the label later by using the command shown after this table.

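
After nodes exist in the node pool, you can confirm that the label was applied. The following command is a sketch; nodes in the new node pool should report cgpu in the label column of the output.

    # Show the ack.node.gpu.schedule label of every node in the cluster.
    kubectl get nodes -L ack.node.gpu.schedule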

Step 3: Add GPU-accelerated nodes

Note

If you have already added GPU-accelerated nodes to the node pool when you created the node pool, skip this step.

After the node pool is created, you can add GPU-accelerated nodes to it. When you add nodes, select an instance type whose architecture is GPU-accelerated. For more information, see Add existing ECS instances or Create and manage a node pool.
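
After a GPU-accelerated node with the cgpu label becomes ready, it reports its shareable GPU memory as the aliyun.com/gpu-mem extended resource. The following command is a sketch for checking this; the node name is a placeholder that you must replace with a real node name from your cluster.

    # Print the allocatable resources of a GPU-accelerated node. On nodes with
    # GPU sharing enabled, the output includes aliyun.com/gpu-mem.
    kubectl get node <YOUR_NODE_NAME> -o jsonpath='{.status.allocatable}'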

Step 4: Install and use the GPU inspection tool

  1. Download kubectl-inspect-cgpu. The executable file must be downloaded to a directory included in the PATH environment variable. In this example, /usr/local/bin/ is used.

    • If you use Linux, run the following command to download kubectl-inspect-cgpu:

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-linux -O /usr/local/bin/kubectl-inspect-cgpu
    • If you use macOS, run the following command to download kubectl-inspect-cgpu:

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-darwin -O /usr/local/bin/kubectl-inspect-cgpu
  2. Run the following command to grant execute permissions to kubectl-inspect-cgpu:

    chmod +x /usr/local/bin/kubectl-inspect-cgpu
  3. Run the following command to query the GPU usage of the cluster:

    kubectl inspect cgpu

    Expected output:

    NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.168.6.104  192.168.6.104  0/15                   0/15
    ----------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    0/15 (0%)
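
To see the allocated amounts change, you can deploy a workload that requests a slice of GPU memory through the aliyun.com/gpu-mem resource (the unit is GiB). The following commands are a minimal sketch: the pod name, image, and requested amount are placeholders for illustration.

# Create a test pod that requests 3 GiB of shared GPU memory. The image and
# the requested amount are placeholders.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cgpu-memory-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["sleep", "3600"]
    resources:
      limits:
        aliyun.com/gpu-mem: 3
EOF
# Run the inspection tool again; the Allocated column should now include the
# 3 GiB requested by the pod on the GPU to which it was scheduled.
kubectl inspect cgpu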

Update the GPU sharing component

Step 1: Determine the update method for the GPU sharing component

You must select an update method based on how the GPU sharing component (ack-ai-installer) was installed in your cluster. There are two ways to install the GPU sharing component.

  • Use the cloud-native AI suite (recommended): Install the GPU sharing component ack-ai-installer on the Cloud-native AI Suite page.

  • Use the App Catalog (no longer available): Install the GPU sharing component ack-ai-installer from the App Catalog page in the Marketplace. This installation method is no longer available. However, components that were already installed by using this method can still be updated by using this method.

    Important

    If you uninstall a component that was installed using this method from your cluster, you must activate the cloud-native AI suite service and complete the installation when you reinstall the component.

How do I determine the installation method of the GPU sharing component in my cluster?

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. Check whether the ack-ai-installer component is deployed on the Cloud-native AI Suite page.

    If it is deployed, the GPU sharing component was installed through the Cloud-native AI Suite. Otherwise, it was installed through the App Catalog.

Step 2: Update the component

Update through the cloud-native AI suite

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. In the Components section, find the ack-ai-installer component and click Upgrade in the Actions column.

Update through the App Catalog

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Helm.

  3. In the Helm list, find the ack-ai-installer component and click Update in the Actions column. Follow the page instructions to select the latest chart version and complete the component update.

    Important

    If you want to customize the chart configuration, confirm the component update after you modify the configuration.

    After the update, check the Helm list to confirm that the chart version of the ack-ai-installer component is the latest version.

Step 3: Update existing nodes

After the ack-ai-installer component is updated, the cGPU version on existing nodes is not automatically updated. Refer to the following instructions to determine whether nodes have cGPU isolation enabled.

  • If your cluster contains GPU-accelerated nodes with cGPU isolation enabled, you must update the cGPU version on these existing nodes. For more information, see Update the cGPU version on a node.

  • If your cluster does not contain nodes with cGPU isolation enabled, skip this step.

    Note
    • If a node has the ack.node.gpu.schedule=cgpu or ack.node.gpu.schedule=core_mem label, cGPU isolation is enabled on the node. A command for listing such nodes is provided after this note.

    • Updating the cGPU version on existing nodes requires stopping all application pods on the nodes. Perform this operation during off-peak hours based on your business scenario.
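
The following command is a sketch for listing the nodes on which cGPU isolation is enabled, using a standard kubectl label selector.

    # List nodes labeled ack.node.gpu.schedule=cgpu or
    # ack.node.gpu.schedule=core_mem.
    kubectl get nodes -l 'ack.node.gpu.schedule in (cgpu,core_mem)'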