
Container Service for Kubernetes:GPU FAQ

Last Updated: Jul 02, 2024

This topic provides answers to some frequently asked questions about GPUs.

Does ACK support vGPU-accelerated instances?

A vGPU-accelerated instance works as expected only if an NVIDIA GRID license is purchased and a GRID license server is set up. However, Alibaba Cloud does not provide GRID license servers. As a result, after a Container Service for Kubernetes (ACK) cluster that contains vGPU-accelerated instances is created, you cannot directly use the vGPU-accelerated instances in the cluster. Therefore, ACK no longer allows you to select vGPU-accelerated instances when you create clusters in the ACK console.

You cannot select the vGPU-accelerated Elastic Compute Service (ECS) instance types whose names are prefixed with ecs.vgn5i, ecs.vgn6i, ecs.vgn7i, or ecs.sgn7i in the ACK console. If your workloads are strongly reliant on vGPU-accelerated instances, you can purchase NVIDIA GRID licenses and set up GRID license servers on your own.

Note
  • GRID license servers are required for renewing the NVIDIA driver licenses of vGPU-accelerated instances.

  • You must purchase vGPU-accelerated ECS instances and familiarize yourself with the NVIDIA documentation about how to set up GRID license servers. For more information, see the NVIDIA official website.

After you have set up a GRID license server, perform the following steps to add a vGPU-accelerated instance to your ACK cluster.

Add a vGPU-accelerated instance to your ACK cluster

  1. Apply for permissions to use custom images in Quota Center.

  2. Create a custom image based on CentOS 7.x or Alibaba Cloud Linux 2. The custom image must have the NVIDIA GRID driver installed and an NVIDIA GRID license configured. For more information, see Create a custom image from an instance and Install a GRID driver on a vGPU-accelerated Linux instance.

  3. Create a node pool. For more information, see Create a node pool.

  4. Add a vGPU-accelerated instance to the node pool that you created in Step 3. For more information, see Add existing ECS instances to an ACK cluster.

What to do next: Renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster

For more information about how to renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster, see Renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster.

How do I manually update the kernel version of GPU-accelerated nodes in a cluster?

To manually update the kernel version of GPU-accelerated nodes in a cluster, perform the following steps:

Note

This procedure applies only if the current kernel version of the node is earlier than 3.10.0-957.21.3.

Confirm the kernel version to which you want to update. Proceed with caution when you perform the update.

The following procedure describes only how to update the NVIDIA driver after the kernel is updated. It does not describe how to update the kernel itself.
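
For example, you can check the kernel version that the node currently runs by logging on to the node and running the following command:

uname -r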

  1. Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  2. Set the GPU-accelerated node that you want to manage to unschedulable. In this example, the node cn-beijing.i-2ze19qyi8votgjz12345 is used.

    kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
  3. Migrate the pods on the GPU-accelerated node to other nodes.

    kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
    WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
    pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
  4. Uninstall the current NVIDIA driver.

    Note

    In this example, NVIDIA driver 384.111 is uninstalled. If your driver version is not 384.111, download the driver installation package from the official NVIDIA website and replace 384.111 in the following sample code with the version of the NVIDIA driver.

    1. Log on to the GPU-accelerated node and run the nvidia-smi command to check the driver version.

      sudo nvidia-smi -a | grep 'Driver Version'
      Driver Version                      : 384.111
    2. Download the driver installation package.

      cd /tmp/
      sudo curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
      Note

      The installation package is required for uninstalling the NVIDIA driver.

    3. Uninstall the NVIDIA driver.

      sudo chmod u+x NVIDIA-Linux-x86_64-384.111.run
      sudo sh ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
  5. Update the kernel version.

  6. Restart the GPU-accelerated node.

    sudo reboot
  7. Log on to the GPU-accelerated node and install the corresponding kernel-devel package.

    sudo yum install -y kernel-devel-$(uname -r)
  8. Go to the official NVIDIA website, download the required NVIDIA driver, and then install the driver on the GPU-accelerated node. In this example, NVIDIA driver 410.79 is downloaded and installed.

    cd /tmp/
    sudo curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
    sudo chmod u+x NVIDIA-Linux-x86_64-410.79.run
    sudo sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
    
    # warm up GPU
    sudo nvidia-smi -pm 1 || true
    sudo nvidia-smi -acp 0 || true
    sudo nvidia-smi --auto-boost-default=0 || true
    sudo nvidia-smi --auto-boost-permission=0 || true
    sudo nvidia-modprobe -u -c=0 -m || true
  9. Make sure that the /etc/rc.d/rc.local file includes the following configurations. If any of the lines are missing, add them to the file, for example as shown in the sketch after the list.

    sudo nvidia-smi -pm 1 || true
    sudo nvidia-smi -acp 0 || true
    sudo nvidia-smi --auto-boost-default=0 || true
    sudo nvidia-smi --auto-boost-permission=0 || true
    sudo nvidia-modprobe -u -c=0 -m || true
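
    The following commands are a minimal sketch of how to append the missing lines and make the file executable. They assume the default /etc/rc.d/rc.local path.

    RC_LOCAL=/etc/rc.d/rc.local
    for line in \
      "sudo nvidia-smi -pm 1 || true" \
      "sudo nvidia-smi -acp 0 || true" \
      "sudo nvidia-smi --auto-boost-default=0 || true" \
      "sudo nvidia-smi --auto-boost-permission=0 || true" \
      "sudo nvidia-modprobe -u -c=0 -m || true"; do
      # Append the line only if an identical line is not already in the file.
      grep -qxF "$line" "$RC_LOCAL" || echo "$line" | sudo tee -a "$RC_LOCAL" >/dev/null
    done
    # rc.local must be executable for the commands to run at boot.
    sudo chmod +x "$RC_LOCAL"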
  10. Restart the kubelet and Docker.

    sudo service kubelet stop
    sudo service docker restart
    sudo service kubelet start
  11. Set the GPU-accelerated node to schedulable.

    kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned
  12. Run the following command in the nvidia-device-plugin container to check the version of the driver installed on the GPU-accelerated node.

    kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 -- nvidia-smi
    Thu Jan 17 00:33:27 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Note

    If no container is started on the GPU-accelerated node after you run the docker ps command, resolve the issue by referring to What do I do if no container is launched on a GPU-accelerated node?.

What do I do if no container is launched on a GPU-accelerated node?

In certain Kubernetes versions, after you restart the kubelet and Docker on GPU-accelerated nodes, no container is launched on the nodes.

sudo service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
sudo service docker stop
Redirecting to /bin/systemctl stop docker.service
sudo service docker start
Redirecting to /bin/systemctl start docker.service
sudo service kubelet start
Redirecting to /bin/systemctl start kubelet.service

sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Run the following command to check the cgroup driver.

sudo docker info | grep -i cgroup
Cgroup Driver: cgroupfs

The output shows that the cgroup driver of Docker is set to cgroupfs, which does not match the systemd cgroup driver that the kubelet expects.
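
You can compare this with the cgroup driver that the kubelet is started with. The following command is only a quick check; if the kubelet reads the setting from a configuration file instead of a startup flag, the output is empty.

ps -ef | grep kubelet | grep -o 'cgroup-driver=[a-z]*'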

To resolve the issue, perform the following steps:

  1. Create a copy of /etc/docker/daemon.json as shown in the following example. Then, update /etc/docker/daemon.json by running the commands that follow the example.
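
    A minimal way to create the copy, assuming the file already exists (the backup file name is only an example):

    sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak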

    sudo tee /etc/docker/daemon.json > /dev/null <<-EOF
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m",
            "max-file": "10"
        },
        "oom-score-adjust": -1000,
        "storage-driver": "overlay2",
        "storage-opts":["overlay2.override_kernel_check=true"],
        "live-restore": true
    }
    EOF
  2. Run the following commands to restart Docker and the kubelet:

    sudo service kubelet stop
    Redirecting to /bin/systemctl stop kubelet.service
    sudo service docker restart
    Redirecting to /bin/systemctl restart docker.service
    sudo service kubelet start
    Redirecting to /bin/systemctl start kubelet.service
  3. Run the following command to verify that the cgroup driver is set to systemd:

    sudo docker info | grep -i cgroup
    Cgroup Driver: systemd

What do I do if I fail to add ECS Bare Metal instances that are equipped with NVIDIA A100 GPUs?

ECS Bare Metal instance types that are equipped with NVIDIA A100 GPUs, such as the ecs.ebmgn7 family, support the Multi-Instance GPU (MIG) feature. You may fail to add an ECS instance to a cluster because of the MIG configuration that is retained on the instance. To prevent this issue, when ACK adds an ECS Bare Metal instance equipped with NVIDIA A100 GPUs to a cluster, ACK automatically resets the retained MIG configuration on the instance. However, the reset may be time-consuming, in which case the execution of the node initialization script times out.

If you fail to add an ECS Bare Metal instance of the ecs.ebmgn7 family, run the following command on the instance:

sudo cat /var/log/ack-deploy.log

Check whether the following error is included in the output:

command timeout: timeout 300 nvidia-smi --gpu-reset

If the preceding error is included in the output, the execution of the node initialization script timed out due to the reset of the MIG configuration. Add the node again. For more information, see Add existing ECS instances to an ACK cluster.
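
If you want to check the MIG configuration that is retained on the instance before you add it again, you can log on to the instance and run the following commands. This is only a sketch and requires the NVIDIA driver to be installed on the instance.

# Check whether MIG mode is currently enabled on each GPU.
sudo nvidia-smi --query-gpu=index,mig.mode.current --format=csv
# List the MIG GPU instances that are still configured, if any.
sudo nvidia-smi mig -lgi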

Why does the system prompt Failed to initialize NVML: Unknown Error when I run a pod that requests GPU resources on Alibaba Cloud Linux 3?

Issue

After you run the systemctl daemon-reload and systemctl daemon-reexec commands on Alibaba Cloud Linux 3, the pod cannot use GPUs as expected. If you run the nvidia-smi command in the pod, the following error is returned:

sudo nvidia-smi

Failed to initialize NVML: Unknown Error

Cause

When you run the systemctl daemon-reload and systemctl daemon-reexec commands on Alibaba Cloud Linux 3, systemd reloads cgroup configurations, and the access permissions that the pod was granted for the NVIDIA devices may be removed. As a result, pods cannot use NVIDIA GPUs as expected. For more information, see issue 1671 and issue 48.

Solution

Perform the following operations to fix this issue:

  • Scenario 1: If the pod uses the environment variable NVIDIA_VISIBLE_DEVICES=all to request GPU resources, you can configure the pod to run in privileged mode. Example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: test-gpu-pod
    spec:
      containers:
        - name: test-gpu-pod
          image: centos:7
          command:
          - sh
          - -c
          - sleep 1d
          securityContext: # Configure the pod to run in privileged mode.
            privileged: true
  • Scenario 2: If the pod has GPU sharing enabled, we recommend that you run the pod on Alibaba Cloud Linux 2 or CentOS 7.

  • Scenario 3: Recreate the pod. Evaluate the impact of pod recreation before you proceed. Note that this issue may persist after you recreate the pod.

  • Scenario 4: If this issue persists, you can use another operating system, such as Alibaba Cloud Linux 2 or CentOS 7.

What do I do if a GPU falls off the bus due to an XID 119 or XID 120 error?

  • Problem description

    A GPU has fallen off the bus. For example, an error message indicating that the GPU cannot be started appears when you use the GPU on a Linux machine. After you run the sh nvidia-bug-report.sh command, you can find XID 119 or XID 120 error messages in the generated log. The following figure shows an example of an XID 119 error message.

    (Figure: example of an XID 119 error message)

    Note

    For more information, see Common XID Errors in the NVIDIA official documentation.

  • Cause

    This issue may be caused by an abnormal running state of the GPU System Processor (GSP) component of the GPU. NVIDIA does not provide a specific driver version that fixes the issue. We recommend that you disable the GSP feature before you use the GPU.

    Note

    For more information about GSP, see Chapter 42. GSP Firmware in the official NVIDIA documentation.

  • Solution

    Disable the GSP component based on your scenario, as described in the following sections.

    Add new nodes

    You can create a node pool or modify the configurations of an existing node pool. Add the ack.aliyun.com/disable-nvidia-gsp label and set the label value to true. This way, when you add new nodes to the node pool in subsequent operations, the GSP component is automatically disabled for the newly added nodes.

    For more information about the operations and parameters, see Create a node pool and Modify a node pool.


    Note

    If you disable the GSP component, the time required to scale out nodes may increase.

    Add existing nodes

    1. You can create a node pool or modify the configurations of an existing node pool to which you want to add nodes. Add the ack.aliyun.com/disable-nvidia-gsp label and set the label value to true. This way, when you add existing nodes to the node pool in subsequent operations, the GSP component is automatically disabled for the nodes.

      For more information about the operations and parameters, see Create a node pool and Modify a node pool.


      Note

      If you disable the GSP component, the time required to scale out nodes may increase.

    2. Add existing nodes to the node pool. For more information about the operations and usage notes, see Add existing ECS instances to an ACK cluster.

    Manage existing nodes

    You can use one of the following methods to disable GSP for an existing node:

    Disable GSP by adding a node pool label

    1. Add the ack.aliyun.com/disable-nvidia-gsp label to the node pool to which the node belongs and set the label value to true.

      For more information about the operations and parameters, see Modify a node pool.


    2. Remove the node from the cluster without releasing the ECS instance. For more information, see Remove nodes.

    3. Re-add the removed node to the cluster. For more information about the operations and usage notes, see Add existing ECS instances to an ACK cluster.

    Log on to the node and manually disable GSP

    If GSP cannot be disabled for the node after you remove the node from the cluster and then re-add it to the cluster, you can log on to the node and manually disable GSP. For more information, see What do I do if a GPU falls off the bus due to an XID 119 or XID 120 error?

    Note

    GSP is introduced in NVIDIA driver 510 and later. If you log on to the node and update an NVIDIA driver from 470 to 525, you do not need to disable GSP before the update. However, GSP-related issues may occur after the update. In this case, you must manually disable GSP after the update. For more information, see What do I do if a GPU falls off the bus due to an XID 119 or XID 120 error?.
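
    After the node is re-added or restarted, you can verify on the node whether GSP is disabled. The following commands are only a sketch; the output fields and the EnableGpuFirmware parameter name may vary with the driver version.

    # If GSP is disabled, the GSP firmware version is reported as N/A.
    nvidia-smi -q | grep -i "GSP Firmware"
    # EnableGpuFirmware: 0 indicates that GSP firmware is disabled for the driver.
    grep -i EnableGpuFirmware /proc/driver/nvidia/params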

How do I isolate a faulty GPU?

When GPU sharing is used, faulty GPUs can cause task failures. To avoid repeatedly scheduling a task to a faulty GPU, you can manually mark the faulty GPU. Then, the scheduler no longer schedules pods to the faulty GPU. This way, the faulty GPU is isolated.

Note
  • To use this feature, make sure that the scheduler and cluster meet the following requirements.

    • If the cluster runs Kubernetes 1.24 or later, make sure that the scheduler version is 1.xx.x-aliyun-6.4.3.xxx or later.

    • If the cluster runs Kubernetes 1.22, make sure that the scheduler version is 1.22.15-aliyun-6.2.4.xxx or later.

  • GPU sharing is used in the cluster. For more information, see GPU sharing overview.

You can submit a special ConfigMap to the cluster to mark a faulty GPU. Example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: <node-name>-device-status   # Replace <node-name> with the actual node name.
  namespace: kube-system
data:
  devices: |
    - deviceId: 0          # Run nvidia-smi to obtain the index of the GPU.
      deviceType: gpu
      healthy: false

The ConfigMap must belong to the kube-system namespace, and its name must be in the format <node-name>-device-status, where <node-name> is the name of the node that hosts the faulty GPU. Set deviceId in the data section to the GPU index displayed by nvidia-smi, set deviceType to gpu, and set healthy to false. After you submit the ConfigMap to the cluster, the scheduler automatically isolates the faulty GPU.
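
The following commands show an example of how to obtain the GPU index and submit the ConfigMap. The file name faulty-gpu-configmap.yaml is only an example.

# Display the index of each GPU on the node that hosts the faulty GPU.
nvidia-smi --query-gpu=index,name --format=csv
# Submit the ConfigMap to isolate the faulty GPU.
kubectl apply -f faulty-gpu-configmap.yaml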