This topic provides answers to some frequently asked questions about node management.
- How do I manually update the kernel version of GPU-accelerated nodes in a cluster?
- What do I do if no container is launched on a GPU-accelerated node?
- FAQ about adding nodes to a cluster
- How do I fix the "drain-node job execute timeout" error that occurs when I remove a node?
- How do I change the hostname of a worker node in an ACK cluster?
- How do I change the operating system for a node pool?
- What are the differences between node pools that are configured with the Expected Nodes parameter and those that are not configured with this parameter?
- How do I add existing nodes to a cluster?
- How do I use preemptible instances in a node pool?
How do I manually update the kernel version of GPU-accelerated nodes in a cluster?
Confirm the kernel version to which you want to update. In this example, the kernel is updated to version 3.10.0-957.21.3. Proceed with caution when you perform the update.
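For example, you can check the kernel version that the node currently runs and the kernel versions that are available in the configured yum repositories. The following commands are a minimal sketch for CentOS-based node images:
# Show the kernel version that the node currently runs.
uname -r
# List the kernel versions that are available in the configured yum repositories.
yum list --showduplicates kernel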
The following procedure shows how to update the NVIDIA driver. Details about how to update the kernel version are not shown.
- Connect to ACK clusters by using kubectl.
- Set the GPU-accelerated node that you want to manage as unschedulable. In this example, the node cn-beijing.i-2ze19qyi8votgjz12345 is used.
kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
- Migrate the pods on the GPU-accelerated node to other nodes.
kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
- Uninstall nvidia-driver. Note: In this example, the version of the uninstalled NVIDIA driver is 384.111. If the version of your NVIDIA driver is not 384.111, download the installation package of the NVIDIA driver from the official NVIDIA website and update the driver to 384.111 first.
- Update the kernel version based on your business requirements.
- Restart the GPU-accelerated node.
reboot
- Log on to the GPU-accelerated node and install the corresponding kernel-devel package.
yum install -y kernel-devel-$(uname -r)
- Go to the official NVIDIA website, download the required NVIDIA driver, and then install the driver on the GPU-accelerated node. In this example, the version of the NVIDIA driver is 410.79.
cd /tmp/
curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
chmod u+x NVIDIA-Linux-x86_64-410.79.run
sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q

# warm up GPU
nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true
- Make sure that the /etc/rc.d/rc.local file contains the following configurations. Otherwise, add the configurations to the file.
nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true
- Restart kubelet and Docker.
service kubelet stop
service docker restart
service kubelet start
- Set the GPU-accelerated node to schedulable.
kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345
node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned
- Run the following command in the nvidia-device-plugin container to check the version of the NVIDIA driver that is installed on the GPU-accelerated node.
kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 nvidia-smi
Thu Jan 17 00:33:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Note: If no container is launched on the GPU-accelerated node after you run the docker ps command, see Failed to start a container on the GPU node.
What do I do if no container is launched on a GPU-accelerated node?
If no container is launched on a GPU-accelerated node, restart kubelet and Docker, and then check the output of the docker ps and docker info commands, as shown in the following example:
service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
service docker stop
Redirecting to /bin/systemctl stop docker.service
service docker start
Redirecting to /bin/systemctl start docker.service
service kubelet start
Redirecting to /bin/systemctl start kubelet.service
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
docker info | grep -i cgroup
Cgroup Driver: cgroupfs
The output shows that the cgroup driver of Docker is set to cgroupfs, which does not match the systemd cgroup driver that kubelet expects. To resolve the issue, perform the following steps:
- Create a copy of /etc/docker/daemon.json. Then, run the following commands to update /etc/docker/daemon.json.
cat >/etc/docker/daemon.json <<-EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "10"
    },
    "oom-score-adjust": -1000,
    "storage-driver": "overlay2",
    "storage-opts": ["overlay2.override_kernel_check=true"],
    "live-restore": true
}
EOF
- Run the following commands in sequence to restart Docker and kubelet.
service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
service docker restart
Redirecting to /bin/systemctl restart docker.service
service kubelet start
Redirecting to /bin/systemctl start kubelet.service
- Run the following command to verify that the cgroup driver is set to systemd.
docker info | grep -i cgroup
Cgroup Driver: systemd
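After kubelet and Docker are restarted with the systemd cgroup driver, you can confirm that the node returns to the Ready state and that containers are launched again. The following commands are a minimal sketch that uses the example node name from earlier in this topic:
# Confirm that the node is in the Ready state.
kubectl get node cn-beijing.i-2ze19qyi8votgjz12345
# Confirm that containers are launched on the node.
docker ps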
How do I change the hostname of a worker node in an ACK cluster?
- Remove the worker node.
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click its name.
- In the left-side navigation pane of the details page, choose Nodes > Nodes.
- On the Nodes page, find the worker node that you want to remove and choose More > Remove in the Actions column.
- In the dialog box that appears, select I understand the above information and want to remove the node(s)., and then click OK.
- Add the worker node to the node pool again. For more information, see Manually add ECS instances. The worker node is then renamed based on the node naming rule of the node pool, which you can verify with the commands shown below.
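To confirm the result, you can compare the node names before you remove the node and after you add it again. The following commands are a minimal sketch; the provider ID field is populated by the cloud provider and contains the ID of the backing ECS instance:
# List the node names and basic information about the nodes.
kubectl get nodes -o wide
# Show the ECS instance that backs a node. Replace the node name with your own.
kubectl get node cn-beijing.i-2ze19qyi8votgjz12345 -o jsonpath='{.spec.providerID}'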
How do I change the operating system for a node pool?
Update the operating system
If you want to update the operating system version for a node pool, for example, to use the latest CentOS version, modify the node pool in the ACK console. After you modify the node pool, only nodes that are newly added to the node pool use the new operating system version. To update the operating system of the existing nodes in the node pool, you must first add temporary (intermediate) nodes to the node pool so that the cluster has sufficient capacity for the pods that are evicted when you remove the existing nodes, as described in the following steps.
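Before you start, you can record which operating system image and kernel version each node currently runs. This also lets you verify, after the procedure, that all remaining nodes use the new version. The following commands are a minimal sketch that relies only on standard kubectl output:
# The OS-IMAGE and KERNEL-VERSION columns show the operating system of each node.
kubectl get nodes -o wide
# Alternatively, print only the node names and operating system images.
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS-IMAGE:.status.nodeInfo.osImage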
- Update the operating system.
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click its name.
- In the left-side navigation pane of the details page, choose Nodes > Node Pools.
- Find the node pool that you want to modify and click Edit in the Actions column.
- In the dialog box that appears, select the operating system that you want to use from the Operating System list and click Confirm.
- Scale out the node pool. We recommend that you scale out the node pool and wait until the cluster is stable. Otherwise, the cluster may not have sufficient nodes for the pods that are evicted from the nodes that you want to remove in the following steps.
- Remove nodes.
- In the left-side navigation pane of the details page, choose Nodes > Nodes.
- On the Nodes page, select the nodes that you want to remove and click Batch Remove.
- In the dialog box that appears, select Drain the Node and I understand the above information and want to remove the node(s)., and then click OK.
- Add the nodes. Important If you add existing Elastic Compute Service (ECS) instances to a node pool in auto mode, the system automatically formats the system disks of the ECS instances. To avoid data loss, save the data to the data disks of the ECS instances.
- Repeat Step 3 and Step 4 to remove nodes and add the nodes again until all existing nodes in the node pool use the specified operating system.
- Remove the nodes that were added to the node pool in Step 2 and release the ECS instances.
- In the left-side navigation pane of the details page, choose Nodes > Nodes.
- On the Nodes page, select the nodes that were added after you scale out the node pool, and then click Batch Remove.
- In the dialog box that appears, select Drain the Node, Release ECS Instance, and I understand the above information and want to remove the node(s)., and then click OK.
Change the type of operating system
If you want to change the type of operating system for a node pool, for example, from CentOS to Alibaba Cloud Linux, you must create a new node pool. This is because you cannot change the type of operating system used by the nodes in a node pool.
- Create a node pool.
- Scale out the newly created node pool and wait until the cluster is stable.
- Remove nodes.
- In the left-side navigation pane of the details page, choose Nodes > Nodes.
- On the Nodes page, select the nodes that you want to remove and click Batch Remove.
- In the dialog box that appears, select Drain the Node and I understand the above information and want to remove the node(s)., and then click OK.
- Add the nodes. Important If you add existing ECS instances to a node pool in auto mode, the system automatically formats the system disks of the ECS instances. To avoid data loss, save the data to the data disks of the ECS instances.
- Repeat Step 3 and Step 4 until all nodes in the original node pool are removed and added to the newly created node pool.
What are the differences between node pools that are configured with the Expected Nodes parameter and those that are not configured with this parameter?
The Expected Nodes parameter specifies the number of nodes that you want to keep in a node pool. You can change the value of this parameter to adjust the number of nodes in the node pool. This feature is disabled for existing node pools that are not configured with the Expected Nodes parameter.
Node pools that are configured with the Expected Nodes parameter and those that are not respond differently to operations such as removing nodes and releasing ECS instances. The following table describes the differences.
Operation | Node pool that is configured with the Expected Nodes parameter | Node pool that is not configured with the Expected Nodes parameter | Suggestion |
---|---|---|---|
Remove specified nodes in the ACK console or by calling the ACK API | The value of the Expected Nodes parameter automatically changes based on the number of nodes that you removed. For example, the value of the Expected Nodes parameter is 10 before you remove nodes. After you remove three nodes, the value is changed to 7. | The specified nodes are removed as expected. | To scale in a node pool, we recommend that you use this method. |
Remove nodes by running the kubectl delete node command. | The value of the Expected Nodes parameter remains unchanged. | The nodes are not removed. | We recommend that you do not use this method to remove nodes. |
Manually release ECS instances in the ECS console or by calling the ECS API. | New ECS instances are automatically added to the node pool to keep the expected number of nodes. | The node pool does not respond to the operation. No ECS instances are added to the node pool. After the ECS instances are released, the nodes remain in the Unknown state before they are removed from the Nodes list of the node pool details page in the ACK console. | This operation may cause an inconsistency among the ACK console, the Auto Scaling console, and the actual condition. We recommend that you do not use this method to remove nodes. To remove nodes, we recommend that you use the ACK console or call the ACK API. For more information, see Remove a node. |
The subscriptions of ECS instances expire. | New ECS instances are automatically added to the node pool to keep the expected number of nodes. | The node pool does not respond to the operation. No ECS instances are added to the node pool. After the subscriptions of ECS instances expire, the nodes remain in the Unknown state before they are removed from the Nodes list of the node pool details page in the ACK console. | This operation may cause an inconsistency among the ACK console, Auto Scaling console, and the actual condition. We recommend that you do not use this method to remove nodes. To remove nodes, we recommend that you use the ACK console or call the ACK API. For more information, see Remove a node. |
Manually enable the health check feature of Auto Scaling for ECS instances in a scaling group, and the ECS instances fail to pass the health checks, for example, because the ECS instances are suspended. | New ECS instances are automatically added to the node pool to keep the expected number of nodes. | New ECS instances are automatically added to replace the ECS instances that are suspended. | We recommend that you do not perform operations on the scaling group of a node pool. |
Remove ECS instances from scaling groups in the Auto Scaling console without changing the value of the Expected Nodes parameter. | New ECS instances are automatically added to the node pool to keep the expected number of nodes. | No ECS instances are added to the node pool. | We recommend that you do not perform operations on the scaling group of a node pool. |
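To compare the actual number of nodes in a node pool with the value of the Expected Nodes parameter, you can count the nodes that carry the node pool ID label. The following commands are a minimal sketch; they assume that the nodes in your cluster carry the alibabacloud.com/nodepool-id label, and <node-pool-id> is a placeholder for the ID of your node pool:
# List the nodes that belong to the node pool.
kubectl get nodes -l alibabacloud.com/nodepool-id=<node-pool-id>
# Count the nodes and compare the result with the Expected Nodes value of the node pool.
kubectl get nodes -l alibabacloud.com/nodepool-id=<node-pool-id> --no-headers | wc -l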
How do I add existing nodes to a cluster?
You can add existing ECS instances to a cluster by adding them to a node pool of the cluster, either in auto mode or manually. For more information, see Manually add ECS instances.
How do I use preemptible instances in a node pool?
You can select appropriate preemptible instance types with the help of the spot-instance-advisor command-line tool. For more information, see Best practices for preemptible instance-based node pools.
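For example, on a node that runs on a preemptible instance, you can check whether the instance has received a release notice by querying the ECS instance metadata service. This is a minimal sketch that assumes the standard metadata endpoint is reachable from the node; if no release is scheduled, the request returns a 404 error:
# Query the release (interruption) notice of the preemptible instance.
# A timestamp in the response indicates when the instance will be released.
curl -s http://100.100.100.200/latest/meta-data/instance/spot/termination-time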