This topic provides answers to some frequently asked questions about node management.

How do I manually update the kernel version of GPU-accelerated nodes in a cluster?

To manually update the kernel version of GPU-accelerated nodes in a cluster, perform the following steps:
Note This procedure applies only when the current kernel version of the node is earlier than 3.10.0-957.21.3.

Confirm the kernel version to which you want to update. Proceed with caution when you perform the update.

The following procedure shows how to update the NVIDIA driver. Details about how to update the kernel version are not shown.
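Before you start, you can check whether the node kernel is actually older than the threshold. The following is a minimal sketch; the version string is a hypothetical sample, and on a real node you would set current="$(uname -r)" instead:

```shell
# Sketch: check whether a kernel version is earlier than 3.10.0-957.21.3.
# The sample version below is hypothetical; on a real node, use current="$(uname -r)".
current="3.10.0-862.14.4.el7.x86_64"
required="3.10.0-957.21.3"
# sort -V orders version strings numerically; if the current version sorts first,
# it is older than the required version.
if [ "$(printf '%s\n%s\n' "$current" "$required" | sort -V | head -n1)" = "$current" ] \
   && [ "$current" != "$required" ]; then
    applies=yes
    echo "Kernel $current is earlier than $required: this procedure applies."
else
    applies=no
    echo "Kernel $current is already at or later than $required."
fi
```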

  1. Connect to ACK clusters by using kubectl.
  2. Set the GPU-accelerated node that you want to manage as unschedulable. In this example, the node cn-beijing.i-2ze19qyi8votgjz12345 is used.
    kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
  3. Migrate the pods on the GPU-accelerated node to other nodes.
    kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
    WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
    pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
  4. Uninstall nvidia-driver.
    Note In this example, the version of the NVIDIA driver to be uninstalled is 384.111. If your installed NVIDIA driver is not version 384.111, download the installation package that matches your installed driver version from the official NVIDIA website and use that package in the following substeps.
    1. Log on to the GPU-accelerated node and run the nvidia-smi command to check the driver version.
      nvidia-smi -a | grep 'Driver Version'
      Driver Version                      : 384.111
    2. Download the driver installation package.
      cd /tmp/
      curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
      Note The installation package is required to uninstall the NVIDIA driver.
    3. Uninstall the NVIDIA driver.
      chmod u+x NVIDIA-Linux-x86_64-384.111.run
      ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
  5. Update the kernel version.
    Update the kernel version based on your business requirements.
  6. Restart the GPU-accelerated node.
    reboot
  7. Log on to the GPU-accelerated node and install the corresponding kernel-devel package.
    yum install -y kernel-devel-$(uname -r)
  8. Go to the official NVIDIA website, download the required NVIDIA driver, and then install the driver on the GPU-accelerated node. In this example, the version of the NVIDIA driver is 410.79.
    cd /tmp/
    curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
    chmod u+x NVIDIA-Linux-x86_64-410.79.run
    sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
    
    # warm up the GPU
    nvidia-smi -pm 1 || true
    nvidia-smi -acp 0 || true
    nvidia-smi --auto-boost-default=0 || true
    nvidia-smi --auto-boost-permission=0 || true
    nvidia-modprobe -u -c=0 -m || true
  9. Make sure that the /etc/rc.d/rc.local file contains the following configurations. Otherwise, add the configurations to the file.
    nvidia-smi -pm 1 || true
    nvidia-smi -acp 0 || true
    nvidia-smi --auto-boost-default=0 || true
    nvidia-smi --auto-boost-permission=0 || true
    nvidia-modprobe -u -c=0 -m || true
  10. Restart kubelet and Docker.
    service kubelet stop
    service docker restart
    service kubelet start
  11. Set the GPU-accelerated node to schedulable.
    kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned
  12. Run the following command in the nvidia-device-plugin container to check the version of the NVIDIA driver that is installed on the GPU-accelerated node.
    kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 -- nvidia-smi
    Thu Jan 17 00:33:27 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Note If no container is launched on the GPU-accelerated node after you run the docker ps command, see the following section: What do I do if no container is launched on a GPU-accelerated node?
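The driver-version check in Step 4 can be scripted by parsing the `nvidia-smi -a` output. In the following sketch, the sample line is hard-coded so that the example is self-contained; on a real node, pipe the output of `nvidia-smi -a` instead:

```shell
# Sketch: extract the driver version from `nvidia-smi -a` output.
# The sample line below is hard-coded for illustration; on a real node, use:
#   driver_version=$(nvidia-smi -a | awk -F': *' '/Driver Version/ {print $2}')
sample_line='Driver Version                      : 384.111'
driver_version=$(printf '%s\n' "$sample_line" | awk -F': *' '/Driver Version/ {print $2}')
echo "Installed driver version: $driver_version"
```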

What do I do if no container is launched on a GPU-accelerated node?

In specific Kubernetes versions, no containers are launched on a GPU-accelerated node after you restart kubelet and Docker on the node. The following example shows the symptom:
service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
service docker stop
Redirecting to /bin/systemctl stop docker.service
service docker start
Redirecting to /bin/systemctl start docker.service
service kubelet start
Redirecting to /bin/systemctl start kubelet.service

docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
Run the following command to check the cgroup driver.
docker info | grep -i cgroup
Cgroup Driver: cgroupfs
The output shows that the cgroup driver is set to cgroupfs.

To resolve the issue, perform the following steps:

  1. Create a backup copy of /etc/docker/daemon.json. Then, run the following command to update /etc/docker/daemon.json.
    cat >/etc/docker/daemon.json <<-EOF
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m",
            "max-file": "10"
        },
        "oom-score-adjust": -1000,
        "storage-driver": "overlay2",
        "storage-opts":["overlay2.override_kernel_check=true"],
        "live-restore": true
    }
    EOF
  2. Run the following commands in sequence to restart Docker and kubelet.
    service kubelet stop
    Redirecting to /bin/systemctl stop kubelet.service
    service docker restart
    Redirecting to /bin/systemctl restart docker.service
    service kubelet start
    Redirecting to /bin/systemctl start kubelet.service
  3. Run the following command to verify that the cgroup driver is set to systemd.
    docker info | grep -i cgroup
    Cgroup Driver: systemd
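A malformed /etc/docker/daemon.json prevents Docker from starting, so it can be worth validating the file after Step 1 and before you restart Docker. The following is a minimal sketch; it writes to a temporary file so the example is self-contained, and on a real node you would point DAEMON_JSON at /etc/docker/daemon.json instead:

```shell
# Sketch: validate a daemon.json file before restarting Docker.
# A temporary file stands in for /etc/docker/daemon.json so the example is self-contained.
DAEMON_JSON=$(mktemp)
cat > "$DAEMON_JSON" <<'EOF'
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "live-restore": true
}
EOF
# python3 -m json.tool exits with a non-zero status on invalid JSON.
if python3 -m json.tool "$DAEMON_JSON" > /dev/null 2>&1; then
    status=valid
    echo "daemon.json is valid JSON; safe to restart Docker."
else
    status=invalid
    echo "daemon.json is invalid JSON; fix it before restarting Docker." >&2
fi
rm -f "$DAEMON_JSON"
```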

How do I change the hostname of a worker node in an ACK cluster?

After a Container Service for Kubernetes (ACK) cluster is created, you cannot directly change the hostnames of worker nodes. If you want to change the hostname of a worker node, modify the node naming rule of the relevant node pool, remove the worker node from the node pool, and then add the worker node to the node pool again.
Note When you create an ACK cluster, you can modify the hostnames of worker nodes in the Custom Node Name section. For more information, see Create an ACK managed cluster.
  1. Remove the worker node.
    1. Log on to the ACK console.
    2. In the left-side navigation pane of the ACK console, click Clusters.
    3. In the left-side navigation pane of the details page, choose Nodes > Nodes.
    4. On the Nodes page, find the worker node that you want to remove and choose More > Remove in the Actions column.
    5. In the dialog box that appears, select I understand the above information and want to remove the node(s). and click OK.
  2. Add the worker node to the node pool again. For more information, see Manually add ECS instances.
    Then, the worker node is renamed based on the new node naming rule of the node pool.

How do I change the operating system for a node pool?

Update the operating system

Notice After you modify the Operating System parameter of a node pool, the change takes effect only on nodes that are newly added to the node pool.

If you want to update the operating system version for a node pool, for example, to use the latest CentOS version, modify the node pool in the ACK console. After you modify the node pool, only nodes that are newly added to the node pool use the new operating system version. To update the operating system of the existing nodes in the node pool, you must first add temporary nodes to the node pool, and then remove and re-add the existing nodes, as described in the following steps.

  1. Update the operating system.
    1. Log on to the ACK console.
    2. In the left-side navigation pane of the ACK console, click Clusters.
    3. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
    4. Find the node pool that you want to modify and click Edit in the Actions column.
    5. In the dialog box that appears, select the operating system that you want to use from the Operating System list and click Confirm.
  2. Scale out the node pool.
    We recommend that you scale out the node pool and wait until the cluster is stable. Otherwise, the cluster may not have sufficient nodes to host the pods that are evicted from the nodes that you remove in the following steps.
    1. Log on to the ACK console.
    2. In the left-side navigation pane of the ACK console, click Clusters.
    3. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
    4. Find the node pool that you want to scale out and click Scale in the Actions column.
    5. Enter the expected number of nodes and click Confirm.
      Note As a best practice, make sure that the number of nodes that you add to the node pool equals the number of nodes that you remove or re-add each time in the following steps.
  3. Remove nodes.
    1. In the left-side navigation pane of the details page, choose Nodes > Nodes.
    2. On the Nodes page, select the nodes that you want to remove and click Batch Remove.
    3. In the dialog box that appears, select Drain the Node and I understand the above information and want to remove the node(s)., and then click OK.
  4. Add nodes.
    Notice If you add existing ECS instances to a node pool in auto mode, the system automatically formats the system disks of the ECS instances. To avoid data loss, save the data in the data disks of the ECS instances.
    1. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
    2. On the Node Pools page, find the node pool that you want to manage and choose More > Add Existing Node in the Actions column.
    3. On the Select Existing ECS Instance page, set Mode to Auto. Select the nodes that were removed in the preceding step, and then click Next Step.
      Note Read the considerations for adding nodes on the page. If you add existing ECS instances to a node pool in auto mode, the system automatically formats the system disks of the ECS instances. To avoid data loss, save the data in the data disks of the ECS instances.
    4. On the Specify Instance Information page, specify the information about the ACK cluster. Then, click Next Step. In the message that appears, click OK.
      For more information, see Automatically add ECS instances.
  5. Repeat Step 3 and Step 4 to remove nodes and add the nodes again until all existing nodes in the node pool use the specified operating system.
  6. Remove the nodes that were added to the node pool in Step 2 and release the ECS instances.
    1. In the left-side navigation pane of the details page, choose Nodes > Nodes.
    2. On the Nodes page, select the nodes that were added after you scale out the node pool, and then click Batch Remove.
    3. In the dialog box that appears, select Drain the Node, Release ECS Instance, and I understand the above information and want to remove the node(s)., and then click OK.

Change the type of operating system

If you want to change the type of operating system for a node pool, for example, from CentOS to Alibaba Cloud Linux, you must create a new node pool. This is because you cannot change the type of operating system used by the nodes in a node pool.

  1. Create a node pool.
    1. Log on to the ACK console.
    2. In the left-side navigation pane of the ACK console, click Clusters.
    3. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
    4. In the upper-right corner of the Node Pools page, click Create Node Pool.
    5. In the Create Node Pool dialog box, select the operating system that you want to use and configure other node pool parameters.
      For more information, see Create a node pool.
  2. Scale out the newly created node pool and wait until the cluster is stable.
    1. On the Node Pools page, find the newly created node pool and click Scale in the Actions column.
    2. Enter the expected number of nodes and click Confirm.
      Note As a best practice, make sure that the number of nodes that you add to the node pool equals the number of nodes that you remove or re-add each time in the following steps.
  3. Remove nodes.
    1. In the left-side navigation pane of the details page, choose Nodes > Nodes.
    2. On the Nodes page, select the nodes that you want to remove and click Batch Remove.
    3. In the dialog box that appears, select Drain the Node and I understand the above information and want to remove the node(s)., and then click OK.
  4. Add nodes.
    Notice If you add existing ECS instances to a node pool in auto mode, the system automatically formats the system disks of the ECS instances. To avoid data loss, save the data to the data disks of the ECS instances.
    1. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
    2. On the Node Pools page, find the node pool that you want to manage and choose More > Add Existing Node in the Actions column.
    3. On the Select Existing ECS Instance page, set Mode to Auto. Select the nodes that were removed in the preceding step, and then click Next Step.
      Note Read the considerations for adding nodes on the page. If you add existing ECS instances to a node pool in auto mode, the system automatically formats the system disks of the ECS instances. To avoid data loss, save the data in the data disks of the ECS instances.
    4. On the Specify Instance Information page, specify the information about the ACK cluster. Then, click Next Step. In the message that appears, click OK.
      For more information, see Automatically add ECS instances.
  5. Repeat Step 3 and Step 4 until all nodes in the original node pool are removed and added to the newly created node pool.

What are the differences between node pools that are configured with the Expected Nodes parameter and those that are not configured with this parameter?

The Expected Nodes parameter specifies the number of nodes that you want to keep in a node pool. You can change the value of this parameter to adjust the number of nodes in the node pool. This feature is disabled for existing node pools that are not configured with the Expected Nodes parameter.

Node pools that are configured with the Expected Nodes parameter respond differently to operations such as node removal and ECS instance release than node pools that are not configured with this parameter. The following list describes the differences.

  1. Remove specified nodes in the ACK console or by calling the ACK API.
    With Expected Nodes: The value of the Expected Nodes parameter automatically changes based on the number of nodes that you remove. For example, if the value is 10 before you remove nodes and you remove three nodes, the value changes to 7.
    Without Expected Nodes: The specified nodes are removed as expected.
    Suggestion: To scale in a node pool, we recommend that you use this method.
  2. Remove nodes by running the kubectl delete node command.
    With Expected Nodes: The value of the Expected Nodes parameter remains unchanged.
    Without Expected Nodes: The nodes are not removed.
    Suggestion: We recommend that you do not use this method to remove nodes.
  3. Manually release ECS instances in the ECS console or by calling the ECS API.
    With Expected Nodes: New ECS instances are automatically added to the node pool to keep the expected number of nodes.
    Without Expected Nodes: No ECS instances are added to the node pool. After you release the ECS instances, the nodes remain in the Unknown state until they are removed from the Nodes list on the node pool details page in the ACK console. This operation may cause inconsistencies among the ACK console, the Auto Scaling console, and the actual state of the cluster.
    Suggestion: We recommend that you do not use this method to remove nodes. Instead, use the ACK console or call the ACK API. For more information, see Remove a node.
  4. The subscriptions of ECS instances expire.
    With Expected Nodes: New ECS instances are automatically added to the node pool to keep the expected number of nodes.
    Without Expected Nodes: No ECS instances are added to the node pool. After the subscriptions expire, the nodes remain in the Unknown state until they are removed from the Nodes list on the node pool details page in the ACK console. This situation may cause inconsistencies among the ACK console, the Auto Scaling console, and the actual state of the cluster.
    Suggestion: We recommend that you do not use this method to remove nodes. Instead, use the ACK console or call the ACK API. For more information, see Remove a node.
  5. Manually enable the health check feature of Auto Scaling for ECS instances in a scaling group, and the ECS instances fail health checks, for example, because the instances are suspended.
    With Expected Nodes: New ECS instances are automatically added to the node pool to keep the expected number of nodes.
    Without Expected Nodes: New ECS instances are automatically added to replace the ECS instances that fail health checks.
    Suggestion: We recommend that you do not perform operations on the scaling group of a node pool.
  6. Remove ECS instances from the scaling group in the Auto Scaling console without changing the value of the Expected Nodes parameter.
    With Expected Nodes: New ECS instances are automatically added to the node pool to keep the expected number of nodes.
    Without Expected Nodes: No ECS instances are added to the node pool.
    Suggestion: We recommend that you do not perform operations on the scaling group of a node pool.