FAQ about using GPUs in ACK clusters and solutions - Container Service for Kubernetes

This topic provides answers to some frequently asked questions about GPUs.

Does ACK support vGPU-accelerated instances?

A vGPU-accelerated instance works as expected only after you purchase an NVIDIA GRID license and set up a GRID license server. However, Alibaba Cloud does not provide GRID license servers. As a result, you cannot directly use vGPU-accelerated instances in a Container Service for Kubernetes (ACK) cluster after the cluster is created. For this reason, the ACK console no longer allows you to select vGPU-accelerated instances when you create clusters.

You cannot select the vGPU-accelerated Elastic Compute Service (ECS) instance types whose names are prefixed with ecs.vgn5i, ecs.vgn6i, ecs.vgn7i, or ecs.sgn7i in the ACK console. If your workloads are strongly reliant on vGPU-accelerated instances, you can purchase NVIDIA GRID licenses and set up GRID license servers on your own.

Note

GRID license servers are required for renewing the NVIDIA driver licenses of vGPU-accelerated instances.
You must purchase vGPU-accelerated ECS instances and familiarize yourself with the NVIDIA documentation about how to set up GRID license servers. For more information, see the NVIDIA official website.

After you have set up a GRID license server, perform the following steps to add a vGPU-accelerated instance to your ACK cluster.

Add a vGPU-accelerated instance to your ACK cluster

1. Apply for permissions to use custom images in Quota Center.
2. Create a custom image that is based on CentOS 7.X or Alibaba Cloud Linux 2. The custom image must be installed with the NVIDIA GRID driver and configured with an NVIDIA license. For more information, see Create a custom image from an instance and Install a GRID driver on a vGPU-accelerated Linux instance.
3. Create a node pool. For more information, see Create a node pool.
4. Add a vGPU-accelerated instance to the node pool that you created in Step 3. For more information, see Add existing ECS instances to an ACK cluster.

What to do next: Renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster

For more information about how to renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster, see Renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster.

How do I manually update the kernel version of GPU-accelerated nodes in a cluster?

To manually update the kernel version of GPU-accelerated nodes in a cluster, perform the following steps:

Note

The current kernel version is earlier than 3.10.0-957.21.3.

Confirm the kernel version to which you want to update. Proceed with caution when you perform the update.

The following procedure shows how to update the NVIDIA driver. Details about how to update the kernel version are not shown.
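As a hedged illustration of the precondition in the note, the following sketch checks whether a kernel version is earlier than 3.10.0-957.21.3. The helper name `kernel_older_than` is hypothetical, and the comparison assumes that GNU `sort -V` (version sort) is available on the node.

```shell
# Hypothetical helper: succeeds when version $1 sorts strictly before version $2.
# Assumes GNU coreutils `sort -V` (version sort) is available.
kernel_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example usage on a node (the sed call strips a suffix such as ".el7.x86_64"):
#   current=$(uname -r | sed 's/\.[a-z].*$//')
#   if kernel_older_than "$current" 3.10.0-957.21.3; then
#       echo "kernel $current is earlier than 3.10.0-957.21.3"
#   fi
```

The helper only compares version strings. It does not decide which kernel version to install.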

Obtain the kubeconfig file of the cluster and use kubectl to connect to the cluster.
Set the GPU-accelerated node that you want to manage to unschedulable. In this example, the node cn-beijing.i-2ze19qyi8votgjz12345 is used.
```
kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345

node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
```

Migrate the pods on the GPU-accelerated node to other nodes.

kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true

node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
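After the drain finishes, it can help to confirm which pods are still scheduled on the node. The following is a minimal sketch. The helper name `pods_on_node` is hypothetical, and it relies on the `spec.nodeName` field selector of `kubectl get pods`. Because the drain command ignores DaemonSet-managed pods, those pods are expected to remain in the list.

```shell
# Hypothetical helper: list every pod that is still scheduled on node $1.
pods_on_node() {
    kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Example usage:
#   pods_on_node cn-beijing.i-2ze19qyi8votgjz12345
```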

Uninstall nvidia-driver.
Note
In this example, the uninstalled driver version is 384.111. If your driver version is not 384.111, download the installation package of your driver from the official NVIDIA website and update the driver to 384.111 first.
1. Log on to the GPU-accelerated node and run the nvidia-smi command to check the driver version.
```
sudo nvidia-smi -a | grep 'Driver Version'
Driver Version                      : 384.111
```
2. Download the driver installation package.
```
cd /tmp/
sudo curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
```
  Note
  The installation package is required to uninstall the NVIDIA driver.
3. Uninstall the NVIDIA driver.
```
sudo chmod u+x NVIDIA-Linux-x86_64-384.111.run
sudo sh ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
```
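The version check and the download in the preceding sub-steps can be sketched as two small helpers. Both names, `driver_version` and `driver_installer_url`, are hypothetical. The URL layout is an assumption inferred from the two download links in this topic and is not confirmed for other driver versions.

```shell
# Hypothetical helper: read `nvidia-smi -a` output on stdin and print the driver version.
driver_version() {
    awk -F': *' '/Driver Version/ { print $2; exit }'
}

# Hypothetical helper: build the installer URL for driver version $1.
# Assumption: the URL layout matches the download links used in this topic.
driver_installer_url() {
    printf 'https://cn.download.nvidia.cn/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' "$1" "$1"
}

# Example usage on a node:
#   version=$(sudo nvidia-smi -a | driver_version)
#   curl -O "$(driver_installer_url "$version")"
```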
Update the kernel version.
Restart the GPU-accelerated node.
```
sudo reboot
```
Log on to the GPU-accelerated node and install the corresponding kernel-devel package.
```
sudo yum install -y kernel-devel-$(uname -r)
```

Go to the official NVIDIA website, download the required NVIDIA driver, and then install the driver on the GPU-accelerated node. In this example, the version of the NVIDIA driver is 410.79.

cd /tmp/
sudo curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
sudo chmod u+x NVIDIA-Linux-x86_64-410.79.run
sudo sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q

# warm up GPU
sudo nvidia-smi -pm 1 || true
sudo nvidia-smi -acp 0 || true
sudo nvidia-smi --auto-boost-default=0 || true
sudo nvidia-smi --auto-boost-permission=0 || true
sudo nvidia-modprobe -u -c=0 -m || true

Make sure that the /etc/rc.d/rc.local file contains the following configurations. Otherwise, add the configurations to the file.

sudo nvidia-smi -pm 1 || true
sudo nvidia-smi -acp 0 || true
sudo nvidia-smi --auto-boost-default=0 || true
sudo nvidia-smi --auto-boost-permission=0 || true
sudo nvidia-modprobe -u -c=0 -m || true
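One way to add only the missing lines to /etc/rc.d/rc.local is an idempotent append, sketched below. The helper name `ensure_line` is hypothetical. It appends a line only when no identical line exists in the file, so running it repeatedly does not create duplicates.

```shell
# Hypothetical helper: append line $2 to file $1 unless an identical line already exists.
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >>"$1"
}

# Example usage (run as root so that /etc/rc.d/rc.local is writable):
#   ensure_line /etc/rc.d/rc.local 'sudo nvidia-smi -pm 1 || true'
#   ensure_line /etc/rc.d/rc.local 'sudo nvidia-smi -acp 0 || true'
```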

Restart kubelet and Docker.

sudo service kubelet stop
sudo service docker restart
sudo service kubelet start

Set the GPU-accelerated node to schedulable.

kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345

node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned

Run the following command in the nvidia-device-plugin container to check the version of the NVIDIA driver that is installed on the GPU-accelerated node:

kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 -- nvidia-smi
Thu Jan 17 00:33:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Note

If the docker ps command shows that no container is running on the GPU-accelerated node, see What do I do if no container is launched on a GPU-accelerated node?

What do I do if no container is launched on a GPU-accelerated node?

In some Kubernetes versions, no container is launched on a GPU-accelerated node after you restart kubelet and Docker on the node, as the following output shows:

sudo service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
sudo service docker stop
Redirecting to /bin/systemctl stop docker.service
sudo service docker start
Redirecting to /bin/systemctl start docker.service
sudo service kubelet start
Redirecting to /bin/systemctl start kubelet.service

sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Run the following command to check the cgroup driver.

sudo docker info | grep -i cgroup
Cgroup Driver: cgroupfs

The output shows that the cgroup driver is set to cgroupfs.
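The check above can be scripted so that it yields only the driver name. The following is a minimal sketch. The helper name `cgroup_driver` is hypothetical, and it assumes that `docker info` prints a line of the form `Cgroup Driver: <name>`, as in the output above.

```shell
# Hypothetical helper: read `docker info` output on stdin and print the cgroup driver.
cgroup_driver() {
    awk -F': *' 'tolower($1) ~ /cgroup driver/ { print $2; exit }'
}

# Example usage:
#   if [ "$(sudo docker info 2>/dev/null | cgroup_driver)" = "cgroupfs" ]; then
#       echo "Docker uses the cgroupfs driver"
#   fi
```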

To resolve the issue, perform the following steps:

Back up /etc/docker/daemon.json. Then, run the following command to overwrite /etc/docker/daemon.json:

sudo tee /etc/docker/daemon.json >/dev/null <<-EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "10"
    },
    "oom-score-adjust": -1000,
    "storage-driver": "overlay2",
    "storage-opts":["overlay2.override_kernel_check=true"],
    "live-restore": true
}
EOF
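Before you overwrite the file as shown above, you can create the backup copy with a timestamped name so that earlier backups are not replaced. The following is a minimal sketch. The helper name `backup_file` and the `.bak.<timestamp>` suffix are hypothetical choices, not an ACK convention.

```shell
# Hypothetical helper: copy file $1 to "$1.bak.<timestamp>" and print the backup path.
backup_file() {
    backup="$1.bak.$(date +%Y%m%d%H%M%S)"
    cp -p -- "$1" "$backup" && printf '%s\n' "$backup"
}

# Example usage (run as root so that /etc/docker is writable):
#   backup_file /etc/docker/daemon.json
```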

Run the following commands in sequence to restart Docker and kubelet:

sudo service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
sudo service docker restart
Redirecting to /bin/systemctl restart docker.service
sudo service kubelet start
Redirecting to /bin/systemctl start kubelet.service

Run the following command to verify that the cgroup driver is set to systemd:
```
sudo docker info | grep -i cgroup
Cgroup Driver: systemd
```

What do I do if I fail to add ECS Bare Metal instances that are equipped with NVIDIA A100 GPUs?

ECS Bare Metal instance types that are equipped with NVIDIA A100 GPUs, such as the ecs.ebmgn7 family, support the Multi-Instance GPU (MIG) feature. You may fail to add an ECS instance to a cluster due to the MIG configuration that is retained on the instance. To prevent this issue, when ACK adds an ECS Bare Metal instance equipped with NVIDIA A100 GPUs to a cluster, ACK automatically resets the retained MIG configuration on the instance. However, the reset can be time-consuming and may cause the node initialization script to time out.

If you fail to add an ECS Bare Metal instance of the ecs.ebmgn7 family, run the following command on the instance:

sudo cat /var/log/ack-deploy.log

Check whether the following error is included in the output:

command timeout: timeout 300 nvidia-smi --gpu-reset

If the preceding error is included in the output, the execution of the node initialization script timed out due to the reset of the MIG configuration. Add the node again. For more information, see Add existing ECS instances to an ACK cluster.
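The log check described above can be sketched as a single test that looks for the exact error string. The helper name `mig_reset_timed_out` is hypothetical. The log path and the error text are taken from this topic.

```shell
# Hypothetical helper: succeed when log file $1 contains the MIG reset timeout error.
mig_reset_timed_out() {
    grep -qF -- 'command timeout: timeout 300 nvidia-smi --gpu-reset' "$1"
}

# Example usage on the instance:
#   if sudo sh -c 'test -r /var/log/ack-deploy.log' && mig_reset_timed_out /var/log/ack-deploy.log; then
#       echo "MIG reset timed out; add the node again"
#   fi
```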

Why does the system prompt Failed to initialize NVML: Unknown Error when I run a pod that requests GPU resources on Alibaba Cloud Linux 3?

Issue

After you run the systemctl daemon-reload and systemctl daemon-reexec commands on Alibaba Cloud Linux 3, the pod cannot use GPUs as expected. If you run the nvidia-smi command in the pod, the following error is returned:

sudo nvidia-smi

Failed to initialize NVML: Unknown Error

Cause

When you use systemd on Alibaba Cloud Linux 3 and run the systemctl daemon-reload and systemctl daemon-reexec commands, cgroup configurations are updated. As a result, pods cannot use NVIDIA GPUs as expected. For more information, see issue 1671 and issue 48.

Solution

Perform the following operations to fix this issue:

Scenario 1: If the pod uses the environment variable NVIDIA_VISIBLE_DEVICES=all to request GPU resources, you can configure the pod to run in privileged mode. Example:

apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-pod
spec:
  containers:
    - name: test-gpu-pod
      image: centos:7
      command:
      - sh
      - -c
      - sleep 1d
      securityContext: # Configure the pod to run in privileged mode.
        privileged: true

Scenario 2: If the pod has GPU sharing enabled, we recommend that you run the pod on Alibaba Cloud Linux 2 or CentOS 7.
Scenario 3: Recreate the pod. Evaluate the impact of pod recreation before you proceed. Note that this issue may recur after you recreate the pod.
Scenario 4: If this issue persists, you can use another operating system, such as Alibaba Cloud Linux 2 or CentOS 7.
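To find out whether a given pod is affected before you choose one of the preceding scenarios, you can check the output of nvidia-smi in the pod for the error string. The following is a minimal sketch. The helper name `pod_has_nvml_error` is hypothetical, and it assumes that the pod runs in the current namespace and that its image contains nvidia-smi.

```shell
# Hypothetical helper: succeed when nvidia-smi in pod $1 reports the NVML error.
# Assumes the pod is in the current namespace and its image contains nvidia-smi.
pod_has_nvml_error() {
    kubectl exec "$1" -- nvidia-smi 2>&1 |
        grep -qF 'Failed to initialize NVML: Unknown Error'
}

# Example usage:
#   if pod_has_nvml_error test-gpu-pod; then
#       echo "test-gpu-pod cannot access its GPUs"
#   fi
```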