This topic provides answers to some frequently asked questions about GPUs and NPUs.
- How do I update the kernel of a GPU node?
- How do I fix a container startup exception on a GPU node?
- Troubleshoot issues in GPU monitoring
- How do I fix the error that the number of available GPUs is less than the actual number of GPUs?
- How do I fix errors that occur on GPU nodes when kubelet or Docker is restarted?
- Fix the issue that the IDs of GPUs are changed after a GPU-accelerated ECS instance is restarted or replaced
- Does ACK support vGPU-accelerated instances?
- How do I manually update the kernel version of GPU-accelerated nodes in a cluster?
- What do I do if no container is launched on a GPU-accelerated node?
- What do I do if I fail to add ECS Bare Metal instances that are equipped with NVIDIA A100 GPUs?
- Why does the system prompt Failed to initialize NVML: Unknown Error when I run a pod that requests GPU resources on Alibaba Cloud Linux 3?
Does ACK support vGPU-accelerated instances?
A vGPU-accelerated instance can work as expected only if an NVIDIA GRID license is purchased and a GRID license server is set up. However, Alibaba Cloud does not provide GRID license servers. As a result, after a Container Service for Kubernetes (ACK) cluster that contains vGPU-accelerated instances is created, you cannot directly use the vGPU-accelerated instances in the cluster. Therefore, ACK no longer allows you to select vGPU-accelerated instances when you create clusters in the ACK console.
You cannot select the vGPU-accelerated Elastic Compute Service (ECS) instance types whose names are prefixed with ecs.vgn5i, ecs.vgn6i, ecs.vgn7i, or ecs.sgn7i in the ACK console. If your workloads are highly dependent on vGPU-accelerated instances, you can purchase NVIDIA GRID licenses and set up GRID license servers on your own.
- GRID license servers are required to renew the NVIDIA driver licenses of vGPU-accelerated instances.
- You must purchase vGPU-accelerated ECS instances and familiarize yourself with the NVIDIA documentation about how to set up GRID license servers. For more information, see the NVIDIA official website.
After you have set up a GRID license server, perform the following steps to add a vGPU-accelerated instance to your ACK cluster.
Add a vGPU-accelerated instance to your ACK cluster
- Apply to be allowed to use custom images in Quota Center.
- Create a custom image that is based on CentOS 7.X or Alibaba Cloud Linux 2. The custom image must be installed with the NVIDIA GRID driver and configured with an NVIDIA license. For more information, see Create a custom image from an instance and Install a GRID driver on a Linux vGPU-accelerated instance.
- Create a node pool. For more information, see Create a node pool.
- Add a vGPU-accelerated instance to the node pool that you created in Step 3. For more information, see Add existing ECS instances to an ACK cluster.
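After the instance is added, you can check on the node whether the GRID driver has obtained a license from your license server. The following is a minimal sketch; the exact field names in the nvidia-smi output depend on the GRID driver version.
# Run on the vGPU-accelerated node. The NVIDIA GRID driver must be installed.
nvidia-smi -q | grep -i -A 1 license
# A correctly licensed driver typically reports a line similar to "License Status : Licensed".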
What to do next: Renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster
For more information about how to renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster, see Renew the NVIDIA driver license of a vGPU-accelerated instance in an ACK cluster.
How do I manually update the kernel version of GPU-accelerated nodes in a cluster?
Confirm the kernel version to which you want to update, for example, 3.10.0-957.21.3, and proceed with caution when you perform the update.
The following procedure shows how to update the NVIDIA driver. Details about how to update the kernel version are not shown.
- Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
- Set the GPU-accelerated node that you want to manage as unschedulable. In this example, the node cn-beijing.i-2ze19qyi8votgjz12345 is used.
kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
- Migrate the pods on the GPU-accelerated node to other nodes.
kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
- Uninstall nvidia-driver.
Note: In this example, the version of the NVIDIA driver to be uninstalled is 384.111. If your NVIDIA driver version is not 384.111, download the installation package of your driver version from the official NVIDIA website and replace 384.111 in the following commands with your version.
- Log on to the GPU-accelerated node and run the nvidia-smi command to check the driver version.
nvidia-smi -a | grep 'Driver Version'
Driver Version                      : 384.111
- Download the driver installation package.
cd /tmp/
curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
Note: The installation package is required to uninstall the NVIDIA driver.
- Uninstall the NVIDIA driver.
chmod u+x NVIDIA-Linux-x86_64-384.111.run
./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
- Update the kernel version.
- Restart the GPU-accelerated node.
reboot
- Log on to the GPU-accelerated node and install the corresponding kernel-devel package.
yum install -y kernel-devel-$(uname -r)
- Go to the official NVIDIA website, download the required NVIDIA driver, and then install the driver on the GPU-accelerated node. In this example, the version of the NVIDIA driver is 410.79.
cd /tmp/
curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
chmod u+x NVIDIA-Linux-x86_64-410.79.run
sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q

# warm up GPU
nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true
- Make sure that the /etc/rc.d/rc.local file contains the following configurations. Otherwise, add the configurations to the file.
nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true
- Restart kubelet and Docker.
service kubelet stop
service docker restart
service kubelet start
- Set the GPU-accelerated node as schedulable.
kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345
node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned
- Run the following command in the nvidia-device-plugin container to check the version of the NVIDIA driver that is installed on the GPU-accelerated node.
kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 nvidia-smi
Thu Jan 17 00:33:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Note: If no container is launched on the GPU-accelerated node after you run the docker ps command, see What do I do if no container is launched on a GPU-accelerated node?
What do I do if no container is launched on a GPU-accelerated node?
If no container is launched on a GPU-accelerated node after you restart kubelet and Docker, as shown in the following example, check the cgroup driver that Docker uses.
service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
service docker stop
Redirecting to /bin/systemctl stop docker.service
service docker start
Redirecting to /bin/systemctl start docker.service
service kubelet start
Redirecting to /bin/systemctl start kubelet.service
docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
docker info | grep -i cgroup
Cgroup Driver: cgroupfs
The output shows that the cgroup driver is set to cgroupfs. To resolve the issue, perform the following steps:
- Create a copy of /etc/docker/daemon.json. Then, run the following commands to update /etc/docker/daemon.json.
cat >/etc/docker/daemon.json <<-EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "10"
    },
    "oom-score-adjust": -1000,
    "storage-driver": "overlay2",
    "storage-opts": ["overlay2.override_kernel_check=true"],
    "live-restore": true
}
EOF
- Run the following commands in sequence to restart Docker and kubelet.
service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
service docker restart
Redirecting to /bin/systemctl restart docker.service
service kubelet start
Redirecting to /bin/systemctl start kubelet.service
- Run the following command to verify that the cgroup driver is set to systemd.
docker info | grep -i cgroup
Cgroup Driver: systemd
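After kubelet is restarted, the containers that kubelet manages should be launched on the node again. You can verify this as follows; the exact container list depends on your workloads.
# Containers managed by kubelet should be listed again after the restart.
docker ps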
What do I do if I fail to add ECS Bare Metal instances that are equipped with NVIDIA A100 GPUs?
ECS Bare Metal instance types that are equipped with NVIDIA A100 GPUs, such as the ecs.ebmgn7 family, support the Multi-Instance GPU (MIG) feature. You may fail to add such an ECS instance to a cluster because of a MIG configuration that is retained on the instance. To prevent this issue, when ACK adds an ECS Bare Metal instance that is equipped with NVIDIA A100 GPUs to a cluster, ACK automatically resets the retained MIG configuration on the instance. However, the reset may be time-consuming. In this case, the execution of the node initialization script times out.
If you fail to add an ECS Bare Metal instance of the ecs.ebmgn7 family, run the following command on the instance:
cat /var/log/ack-deploy.log
command timeout: timeout 300 nvidia-smi --gpu-reset
If the preceding error is included in the output, the execution of the node initialization script timed out due to the reset of the MIG configuration. Add the node again. For more information, see Add existing ECS instances to an ACK cluster.
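If you want to inspect or clear a retained MIG configuration on the instance yourself before you add it to the cluster again, the following is a minimal sketch. It assumes that the NVIDIA driver is installed on the instance; disabling MIG mode takes effect only after a GPU reset or an instance reboot.
# Check whether MIG mode is currently enabled on each GPU.
nvidia-smi --query-gpu=index,mig.mode.current --format=csv
# Disable MIG mode on all GPUs. A GPU reset or an instance reboot is required for the change to take effect.
nvidia-smi -mig 0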
Why does the system prompt Failed to initialize NVML: Unknown Error when I run a pod that requests GPU resources on Alibaba Cloud Linux 3?
Issue
After you run the systemctl daemon-reload and systemctl daemon-reexec commands on Alibaba Cloud Linux 3, the pod cannot use GPUs as expected. If you run the nvidia-smi command in the pod, the following error is returned:
nvidia-smi
Failed to initialize NVML: Unknown Error
Cause
When you use systemd on Alibaba Cloud Linux 3 and run the systemctl daemon-reload and systemctl daemon-reexec commands, cgroup configurations are updated. As a result, pods cannot use NVIDIA GPUs as expected. For more information, see 1671 and 48.
Solution
Perform the following operations to fix this issue:
- Scenario 1: If the pod uses the environment variable NVIDIA_VISIBLE_DEVICES=all to request GPU resources, you can configure the pod to run in privileged mode. Example:
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-pod
spec:
  containers:
    - name: test-gpu-pod
      image: centos:7
      command:
        - sh
        - -c
        - sleep 1d
      securityContext:
        # Configure the pod to run in privileged mode.
        privileged: true
- Scenario 2: If the pod has GPU sharing enabled, we recommend that you run the pod on Alibaba Cloud Linux 2 or CentOS 7.
- Scenario 3: Recreate the pod. Evaluate the impact of pod recreation before you proceed. Note that the issue may recur after the pod is recreated. A minimal sketch is provided after this list.
- Scenario 4: If this issue persists, you can use another operating system, such as Alibaba Cloud Linux 2 or CentOS 7.
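For Scenario 3, the following is a minimal sketch of recreating the pod. The pod name test-gpu-pod is taken from the example in Scenario 1, and the manifest file name test-gpu-pod.yaml is a hypothetical placeholder for your own manifest.
# Delete the affected pod. Evaluate the impact before you run this command.
kubectl delete pod test-gpu-pod
# Recreate the pod from your manifest (hypothetical file name).
kubectl apply -f test-gpu-pod.yaml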