Container Service for Kubernetes: Configure the GPU sharing component

Last Updated: Nov 01, 2023

Container Service for Kubernetes (ACK) provides GPU sharing capabilities to allow multiple models to share one GPU and support GPU memory isolation based on the NVIDIA kernel mode driver. This topic describes how to install the GPU sharing component and the GPU inspection tool on a GPU-accelerated node to support GPU sharing and GPU memory isolation.

Prerequisites

An ACK Pro cluster is created. For more information, see Create an ACK Pro cluster.

Limits

  • Do not set the CPU policy to static for nodes that have GPU sharing enabled.

  • cGPU is a GPU memory isolation module. cGPU versions earlier than 1.5.0 are incompatible with 5xx-series GPU drivers, such as 510.47.03. To check the driver version of a node, see the command after the version requirements below.

  • cGPU does not support the CUDA API cudaMallocManaged(). This means that you cannot request GPU memory by using the Unified Virtual Memory (UVM) method. Use other methods, such as cudaMalloc(), to request GPU memory instead. For more information, see the NVIDIA official website.

  • The installation of the GPU sharing component is not subject to region limits. However, GPU memory isolation is supported only in the following regions. Make sure that your ACK cluster is deployed in one of them.

    Region name          Region ID
    China (Beijing)      cn-beijing
    China (Shanghai)     cn-shanghai
    China (Hangzhou)     cn-hangzhou
    China (Zhangjiakou)  cn-zhangjiakou
    China (Ulanqab)      cn-wulanchabu
    China (Shenzhen)     cn-shenzhen
    China (Chengdu)      cn-chengdu
    China (Heyuan)       cn-heyuan
    China (Hong Kong)    cn-hongkong
    Japan (Tokyo)        ap-northeast-1
    Indonesia (Jakarta)  ap-southeast-5
    Singapore            ap-southeast-1
    US (Virginia)        us-east-1
    US (Silicon Valley)  us-west-1

  • Version requirements

    Item               Version requirement
    Kubernetes         ≥ 1.18.8
    NVIDIA driver      ≥ 418.87.01 and < 520.x.x
    Container runtime  Docker ≥ 19.03.5; containerd ≥ 1.4.3
    OS                 CentOS 7.6, CentOS 7.7, CentOS 7.9, Ubuntu 16.04, Ubuntu 18.04, Alibaba Cloud Linux 2.x, and Alibaba Cloud Linux 3.x
    GPU model          Tesla P4, Tesla P100, Tesla T4, Tesla A10, and Tesla V100
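
To check the NVIDIA driver version that is installed on a node, you can query it with nvidia-smi. The following is a minimal sketch; it assumes that you can run commands on the node, for example over SSH:

    # Print the installed NVIDIA driver version. The version must be at
    # least 418.87.01 and earlier than 520.x.x, and cGPU versions earlier
    # than 1.5.0 do not work with 5xx-series drivers.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader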

Step 1: Install the GPU sharing component

If the cloud-native AI suite is not deployed

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Cloud-native AI Suite in the left-side navigation pane.

  3. On the Cloud-native AI Suite page, click Deploy.

  4. On the Cloud-native AI Suite page, select Scheduling Component (Batch Task Scheduling, GPU Sharing, Topology-aware GPU scheduling, and NPU scheduling).

  5. Optional. Click Advanced to the right of Scheduling Component (Batch Task Scheduling, GPU Sharing, Topology-aware GPU scheduling, and NPU scheduling). In the Parameters panel, modify the policy parameter of cGPU. Click OK.

    If you do not require the computing power sharing feature provided by cGPU, we recommend that you keep the default setting, policy: 5. For more information about the policies supported by cGPU, see Install and use cGPU on a Docker container.

  6. In the lower part of the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.

    After the cloud-native AI suite is installed, you can find that ack-ai-installer is in the Deployed state on the Cloud-native AI Suite page.

If the cloud-native AI suite is already deployed

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Cloud-native AI Suite in the left-side navigation pane.

  3. Click Deploy in the Actions column of ack-ai-installer.

  4. Optional. In the Parameters panel, modify the policy parameter of cGPU.

    If you do not require the computing power sharing feature provided by cGPU, we recommend that you keep the default setting, policy: 5. For more information about the policies supported by cGPU, see Install and use cGPU on a Docker container.

  5. After you complete the configuration, click OK.

    After ack-ai-installer is installed, the Status column of the component displays Deployed.
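
After the component is installed, you can also confirm from the command line that its workloads are running. The following is a minimal sketch; it assumes that the GPU sharing components are deployed to the kube-system namespace and that their names contain "gpushare", which can vary by component version:

    # List the GPU sharing workloads deployed by ack-ai-installer.
    kubectl get pods -n kube-system | grep -i gpushare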

Step 2: Enable GPU sharing and GPU memory isolation

  1. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Node Pools in the left-side navigation pane.

  2. In the upper-right corner of the Node Pools page, click Create Node Pool.

  3. In the Create Node Pool dialog box, configure the node pool parameters and click Confirm Order.

    The following table describes the key parameters. For more information about other parameters, see Create a node pool.

    Parameter       Description
    Expected Nodes  Specify the initial number of nodes in the node pool. If you do not want to create nodes in the node pool, set this parameter to 0.
    Node Label      Add labels based on your business requirements. For more information about node labels, see Labels used by ACK to control GPUs.

    In this example, the node label value is set to cgpu, which indicates that GPU sharing is enabled on the node: pods on the node request only GPU memory, and multiple pods can share the same GPU with GPU memory isolation and computing power sharing. To add the label, click the add icon next to Node Label, set Key to ack.node.gpu.schedule, and then set Value to cgpu. You can also set the label on an existing node with kubectl, as shown below.
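
    The following is a minimal sketch of setting the label from the command line; <NODE_NAME> is a placeholder for the name of an existing GPU-accelerated node in your cluster:

      # Enable GPU sharing with GPU memory isolation on the node.
      kubectl label node <NODE_NAME> ack.node.gpu.schedule=cgpu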

Step 3: Add GPU-accelerated nodes

Note

If you already added GPU-accelerated nodes when you created the node pool, skip this step.

After the node pool is created, you can add GPU-accelerated nodes to the node pool. To add GPU-accelerated nodes, you need to select ECS instances that use the GPU-accelerated architecture. For more information, see Add existing ECS instances to an ACK cluster or Create a node pool.
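
After the nodes are added, you can verify that GPU sharing is enabled on them by filtering nodes on the label that you set in Step 2:

    # List the nodes that have GPU sharing enabled.
    kubectl get nodes -l ack.node.gpu.schedule=cgpu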

Step 4: Install and use the GPU inspection tool

  1. Download kubectl-inspect-cgpu.

    • If you use Linux, run the following command to download kubectl-inspect-cgpu:

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-linux -O /usr/local/bin/kubectl-inspect-cgpu
    • If you use macOS, run the following command to download kubectl-inspect-cgpu:

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-darwin -O /usr/local/bin/kubectl-inspect-cgpu
  2. Run the following command to grant execute permissions to kubectl-inspect-cgpu:

    chmod +x /usr/local/bin/kubectl-inspect-cgpu
  3. Run the following command to query the GPU usage of the cluster:

    kubectl inspect cgpu

    Expected output:

    NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.168.6.104  192.168.6.104  0/15                   0/15
    ----------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    0/15 (0%)
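
To verify GPU sharing end to end, you can deploy a test pod that requests only GPU memory and then run kubectl inspect cgpu again. The following is a minimal sketch; it assumes that the GPU sharing component exposes GPU memory through the aliyun.com/gpu-mem extended resource in GiB, and <YOUR_GPU_IMAGE> is a placeholder for a CUDA-capable image of your choice:

    # Create a pod that requests 3 GiB of GPU memory from a shared GPU.
    # The request size is an example; adjust it to your workload.
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cgpu-test
    spec:
      restartPolicy: Never
      containers:
      - name: cgpu-test
        image: <YOUR_GPU_IMAGE>
        command: ["sleep", "infinity"]
        resources:
          limits:
            aliyun.com/gpu-mem: 3
    EOF

After the pod is scheduled, running kubectl inspect cgpu again shows the 3 GiB allocation on the GPU that the pod shares.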