This topic describes common questions about node management and provides solutions.
How do I manually upgrade the kernel of the GPU nodes in a cluster?
You can perform the following steps to upgrade the kernel and the NVIDIA driver of the GPU nodes in a cluster.
Note This procedure applies when the current kernel version of the node is earlier than 3.10.0-957.21.3.
Confirm the kernel version to which you want to upgrade and proceed with caution.
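To check the current kernel version of a node, you can log on to the node and run the following command. This is shown for reference only.
# Prints the running kernel version. For this procedure, the version must be earlier than 3.10.0-957.21.3.
uname -r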
- Use kubectl to connect to a cluster.
- Set the GPU node that you want to manage to the unschedulable state. In this example,
the node cn-beijing.i-2ze19qyi8votgjz12345 is used.
kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
- Migrate the pods on the node to other nodes.
kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
- Uninstall the existing NVIDIA driver.
Note In this example, driver version 384.111 is uninstalled. If your driver version is not 384.111, download the installation package for your driver version from the official NVIDIA website and replace 384.111 with your actual version number in the following commands.
- Log on to the GPU node and run the following command to check the driver version.
nvidia-smi -a | grep 'Driver Version'
Driver Version : 384.111
- Download the driver installation package.
cd /tmp/
curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
Note The installation package is required to uninstall the driver.
- Uninstall the existing driver.
chmod u+x NVIDIA-Linux-x86_64-384.111.run
./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
- Upgrade the kernel.
Upgrade the kernel to the target version based on your business requirements. A minimal example is provided after this procedure.
- Restart the GPU node.
- Log on to the GPU node to install the required kernel-devel package.
yum install -y kernel-devel-$(uname -r)
- Go to the official NVIDIA website, download the required driver, and then install
the driver on the node. In this example, version 410.79 is used.
cd /tmp/
curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
chmod u+x NVIDIA-Linux-x86_64-410.79.run
sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
# warm up GPU
nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true
- Check the /etc/rc.d/rc.local file to verify that it contains the following configurations. If not, add them to the file.
nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true
- Restart kubelet and Docker.
service kubelet stop
service docker restart
service kubelet start
- Set the GPU node to the schedulable state.
kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345
node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned
- In the CLI, run the following command in the nvidia-device-plugin container to check
the version of the driver that is installed on the GPU node.
kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 nvidia-smi
Thu Jan 17 00:33:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
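The kernel upgrade step in the preceding procedure depends on your operating system and on the kernel version that you want to use. The following commands are a minimal sketch that assumes a CentOS 7 node that uses yum and takes 3.10.0-957.21.3 as the target version. Adjust the commands to your environment.
# Sketch only: replace the kernel version with the version that you want to use.
yum clean all && yum makecache
yum install -y kernel-3.10.0-957.21.3.el7
After the kernel package is installed, continue with the steps that restart the node and install the matching kernel-devel package.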
How do I fix the issue that no container is started on a GPU node?
When you restart kubelet and Docker on GPU nodes that run specific Kubernetes versions, you may find that no containers are started on the nodes after the restart, as shown in the following example.
service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
service docker stop
Redirecting to /bin/systemctl stop docker.service
service docker start
Redirecting to /bin/systemctl start docker.service
service kubelet start
Redirecting to /bin/systemctl start kubelet.service
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Run the following command to check the cgroup driver of Docker.
docker info | grep -i cgroup
Cgroup Driver: cgroupfs
The output shows that the cgroup driver of Docker is set to cgroupfs. If kubelet on the node is configured to use the systemd cgroup driver, this mismatch prevents containers from being started.
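You can also check which cgroup driver kubelet on the node is configured to use. The location of the kubelet configuration depends on how the node was set up, so the following commands are only a sketch.
# Check whether the kubelet startup parameters specify a cgroup driver.
ps -ef | grep kubelet | grep -i cgroup
# If the node uses a kubelet configuration file, look for the cgroupDriver field. The path is an assumption and may differ on your node.
grep -i cgroupdriver /var/lib/kubelet/config.yaml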
To fix the issue, set the cgroup driver of Docker to systemd. Perform the following steps:
- Create a copy of /etc/docker/daemon.json. In the CLI, run the following command to update /etc/docker/daemon.json.
cat >/etc/docker/daemon.json <<-EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "10"
    },
    "oom-score-adjust": -1000,
    "storage-driver": "overlay2",
    "storage-opts": ["overlay2.override_kernel_check=true"],
    "live-restore": true
}
EOF
- In the CLI, run the following command to restart Docker and kubelet.
service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
service docker restart
Redirecting to /bin/systemctl restart docker.service
service kubelet start
Redirecting to /bin/systemctl start kubelet.service
- In the CLI, run the following command to check whether the cgroup driver is set to
systemd.
docker info | grep -i cgroup
Cgroup Driver: systemd
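After the cgroup driver is set to systemd and kubelet is restarted, you can verify that containers are started on the node again. The following checks are a minimal sketch. Replace <your-node-name> with the name of your node.
# System containers should be listed again.
docker ps
# The node should report the Ready status.
kubectl get node <your-node-name>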