Overview
This document describes how to upgrade the kernel for GPU nodes in a Container Service for Kubernetes (ACK) cluster.
Details
- Log on to a GPU node. For more information, see Use kubectl to connect to an ACK cluster.
- Run the following command to set the node to the unschedulable state:
kubectl cordon [$Node_ID]
Note: Replace [$Node_ID] with the ID of the node.
The following command output is returned:
node/[$Node_ID] cordoned
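Cordoning only marks the node as unschedulable; as a quick check, `kubectl get nodes` should now show the node's STATUS as Ready,SchedulingDisabled. A minimal sketch of that check, shown here against a mocked output line (the helper name and node name are ours, for illustration only):

```shell
# Hypothetical helper: prints "yes" if a `kubectl get nodes` output line
# shows the node as cordoned (STATUS contains SchedulingDisabled).
is_cordoned() {
  case "$1" in
    *SchedulingDisabled*) echo yes ;;
    *) echo no ;;
  esac
}

# Mocked output line for illustration; on the cluster you would use:
#   kubectl get nodes | grep [$Node_ID]
is_cordoned "cn-beijing.i-abc123   Ready,SchedulingDisabled   12d   v1.16.9"   # prints yes
```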
- Run the following command to drain the GPU node that you want to upgrade, which evicts the pods on the node:
kubectl drain [$Node_ID] --grace-period=120 --ignore-daemonsets=true
The following command output is returned:
node/cn-beijing.[$Node_ID] cordoned
WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
- Uninstall the existing NVIDIA driver.
Note: In this example, NVIDIA 384.111 is uninstalled. If your driver version is not 384.111, download the installation package of the actual version from the official NVIDIA website, and replace 384.111 with the actual version number in the commands in this step.
- Log on to the GPU node and run the following command to view the driver version:
nvidia-smi -a | grep 'Driver Version'
The following command output is returned:
Driver Version : 384.111
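The version number printed above is reused in the download URL and installer filename of later steps. A small sketch (the helper name is ours, not part of any tool) that extracts just the number from that output line:

```shell
# Extract the bare version number from a "Driver Version : 384.111" line.
extract_driver_version() {
  printf '%s\n' "$1" | sed -n 's/.*Driver Version[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p'
}

# On the node: nvidia-smi -a | grep 'Driver Version' | extract it as below.
extract_driver_version "Driver Version                      : 384.111"   # prints 384.111
```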
- Run the following commands to download the installation package of the NVIDIA driver that you want to uninstall:
cd /tmp/
curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
- Run the following commands to uninstall the existing NVIDIA driver:
chmod u+x NVIDIA-Linux-x86_64-384.111.run
./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
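As a sanity check after uninstalling, no nvidia kernel modules should remain loaded. A sketch of that check (a hypothetical helper, shown here against mocked `lsmod` output rather than a live node):

```shell
# Count loaded nvidia kernel modules in lsmod-style output; 0 means clean.
count_nvidia_modules() {
  printf '%s\n' "$1" | grep -c '^nvidia' || true
}

# Mocked lsmod output with no nvidia modules; on the node: lsmod | grep -c '^nvidia'
count_nvidia_modules "Module                  Size  Used by
ext4                  594864  1"   # prints 0
```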
- Run the following commands to upgrade the kernel:
yum clean all && yum makecache
yum update kernel -y
- Run the reboot command to restart the GPU node.
- Log on to the GPU node again and run the following command to install the kernel-devel package:
yum install -y kernel-devel-$(uname -r)
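The NVIDIA installer compiles a kernel module, so the kernel-devel package must match the running kernel exactly; a mismatch (for example, when yum installs a newer kernel-devel than the kernel you booted) makes the driver installation fail. A comparison sketch with sample version strings (the helper is hypothetical):

```shell
# Compare the running kernel release with the installed kernel-devel version.
devel_matches_kernel() {
  [ "$1" = "$2" ] && echo match || echo mismatch
}

# On the node (assuming a single kernel-devel package is installed):
#   devel_matches_kernel "$(uname -r)" "$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}' kernel-devel)"
devel_matches_kernel "3.10.0-957.21.3.el7.x86_64" "3.10.0-957.21.3.el7.x86_64"   # prints match
devel_matches_kernel "3.10.0-957.21.3.el7.x86_64" "3.10.0-1062.el7.x86_64"       # prints mismatch
```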
- Run the following commands to download the installation package of the driver version to which you want to upgrade from the official NVIDIA website and install this package. In this example, the installation package of NVIDIA 410.79 is downloaded and installed.
cd /tmp/
curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
chmod u+x NVIDIA-Linux-x86_64-410.79.run
./NVIDIA-Linux-x86_64-410.79.run -a -s -q
- Run the following commands to configure parameters as needed:
nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true
- If you need the NVIDIA driver to start upon system startup, check whether the /etc/rc.d/rc.local file includes the following configurations. If it does not, manually add them to the file.
nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true
- Run the following commands to restart the kubelet and Docker:
service kubelet stop
service docker restart
service kubelet start
- Run the following command to set the GPU node to the schedulable state:
kubectl uncordon [$Node_ID]
The following command output is returned:
node/[$Node_ID] uncordoned
- Run the following command on the nvidia-device-plugin pod of the GPU node to check the driver version:
kubectl exec -n kube-system -t nvidia-device-plugin-[$Node_ID] -- nvidia-smi
The following command output is returned:
Thu Jan 17 00:33:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
- Run the docker ps command to check whether the containers on the GPU node are started. If a container fails to start, see What can I do if I fail to start a container on the GPU node?
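The rc.local step above asks you to add the startup commands by hand; a sketch of doing the same idempotently, so running it twice does not duplicate lines (assumptions: POSIX shell; RC_LOCAL is parameterized here so the snippet demonstrates against a temporary file instead of the real /etc/rc.d/rc.local):

```shell
# Demonstrates against a temp file; on the node, set RC_LOCAL=/etc/rc.d/rc.local.
RC_LOCAL="${RC_LOCAL:-$(mktemp)}"

add_line_once() {
  # Append the exact line only if it is not already present (idempotent).
  grep -qxF "$1" "$RC_LOCAL" 2>/dev/null || echo "$1" >> "$RC_LOCAL"
}

add_line_once "nvidia-smi -pm 1 || true"
add_line_once "nvidia-smi -acp 0 || true"
add_line_once "nvidia-smi --auto-boost-default=0 || true"
add_line_once "nvidia-smi --auto-boost-permission=0 || true"
add_line_once "nvidia-modprobe -u -c=0 -m || true"

chmod +x "$RC_LOCAL"   # rc.local must be executable to run at boot
wc -l < "$RC_LOCAL"    # prints 5; running the script again still prints 5
```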
Applicable scope
- ACK
Note: Make sure that the current kernel version of the node is earlier than 3.10.0-957.21.3 before you upgrade the kernel.
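A sketch of the pre-check implied by this note, using sort -V (GNU coreutils) for version comparison; the helper name is ours, not part of any tool:

```shell
# Prints "yes" if kernel version $1 is strictly older than threshold $2.
older_than() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ] &&
    echo yes || echo no
}

# On the node, compare against uname -r (which also carries a suffix such as .el7.x86_64).
older_than "3.10.0-862.14.4" "3.10.0-957.21.3"    # prints yes: the upgrade applies
older_than "3.10.0-1062.9.1" "3.10.0-957.21.3"    # prints no: already newer
```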