Container Service for Kubernetes:FAQ about nodes and node pools

Last Updated:May 31, 2024

This topic provides answers to some frequently asked questions (FAQ) about nodes and node pools. For example, you can obtain answers to questions such as how to change the maximum number of pods that are supported by a node, how to change the operating system for a node pool, and how to solve the timeout error related to a node.

How do I change the operating system for a node pool?

The method used to change the operating system for a node pool is similar to that used to update a node pool. To change the operating system for a node pool, perform the following steps:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Node Pools in the left-side navigation pane.

  3. On the Node Pools page, find the node pool that you want to modify and choose More > Upgrade in the Actions column.

  4. Select Change Operating System, select the image that is used to replace the original image, and then click Start Update.

    Note

    By default, Kubelet Update and Upgrade Node Pool by Replacing System Disk are selected when you change the operating system for a node pool. Select Create Snapshot before Update based on your business requirements.

Am I able to leave the Expected Nodes parameter empty when I create a node pool?

No, you cannot leave the Expected Nodes parameter empty when you create a node pool.

For more information about how to remove or release a node, see Remove a node. For more information about how to add a node, see Add existing ECS instances to an ACK cluster. After you remove nodes from or add existing nodes to a cluster, the value of the Expected Nodes parameter is automatically set to the actual number of nodes after the modification.

What are the differences between node pools that are configured with the Expected Nodes parameter and those that are not configured with this parameter?

The Expected Nodes parameter specifies the number of nodes that you want to keep in a node pool. You can change the value of this parameter to modify the number of nodes in the node pool. This feature is disabled for existing node pools that are not configured with the Expected Nodes parameter.

Node pools that are configured with the Expected Nodes parameter and node pools that are not configured with this parameter react differently to operations such as removing nodes and releasing Elastic Compute Service (ECS) instances. The differences are described below.

Operation: Decrease the expected number of nodes by calling the API operations of Container Service for Kubernetes (ACK) or by using the ACK console.

  • Node pool configured with the Expected Nodes parameter: Nodes are removed from the node pool until the number of existing nodes in the node pool equals the specified expected number of nodes.

  • Node pool not configured with the Expected Nodes parameter: If the number of existing nodes in the node pool is greater than the expected number of nodes, the system removes nodes until the number of existing nodes equals the expected number of nodes. At the same time, the system enables the Expected Nodes feature for the node pool.

  • Suggestion: N/A.

Operation: Remove specific nodes in the ACK console or by calling the API operations of ACK.

  • Node pool configured with the Expected Nodes parameter: The value of the Expected Nodes parameter automatically changes based on the number of removed nodes. For example, if the value of the Expected Nodes parameter is 10 and you remove three nodes, the value is changed to 7.

  • Node pool not configured with the Expected Nodes parameter: The specified nodes are removed as expected.

  • Suggestion: N/A.

Operation: Remove nodes by running the kubectl delete node command.

  • Node pool configured with the Expected Nodes parameter: The value of the Expected Nodes parameter remains unchanged.

  • Node pool not configured with the Expected Nodes parameter: The nodes are not removed.

  • Suggestion: We recommend that you do not use this method to remove nodes.

Operation: Manually release ECS instances in the ECS console or by calling the API operations of ECS.

  • Node pool configured with the Expected Nodes parameter: New ECS instances are automatically added to the node pool to maintain the expected number of nodes.

  • Node pool not configured with the Expected Nodes parameter: The node pool does not respond to the operation, and no ECS instances are added to the node pool. After the ECS instances are released, the nodes remain in the Unknown state for a period of time before they are removed from the node list on the node pool details page in the ACK console.

  • Suggestion: We recommend that you remove nodes by using the recommended method. Otherwise, the data in ACK and Auto Scaling may become inconsistent with the actual state of the cluster. For more information, see Remove nodes.

Operation: The subscriptions to ECS instances expire.

  • Node pool configured with the Expected Nodes parameter: New ECS instances are automatically added to the node pool to maintain the expected number of nodes.

  • Node pool not configured with the Expected Nodes parameter: The node pool does not respond to the operation, and no ECS instances are added to the node pool. After the subscriptions expire, the nodes remain in the Unknown state for a period of time before they are removed from the node list on the node pool details page in the ACK console.

  • Suggestion: We recommend that you remove nodes by using the recommended method. Otherwise, the data in ACK and Auto Scaling may become inconsistent with the actual state of the cluster. For more information, see Remove nodes.

Operation: Manually enable the health check feature of Auto Scaling for ECS instances in a scaling group, and the ECS instances fail to pass the health checks because, for example, the instances are suspended.

  • Node pool configured with the Expected Nodes parameter: New ECS instances are automatically added to the node pool to maintain the expected number of nodes.

  • Node pool not configured with the Expected Nodes parameter: New ECS instances are automatically added to replace the suspended ECS instances.

  • Suggestion: We recommend that you do not perform operations on the scaling group of a node pool.

Operation: Remove ECS instances from the scaling group by using Auto Scaling without modifying the expected number of nodes.

  • Node pool configured with the Expected Nodes parameter: New ECS instances are automatically added to the node pool to maintain the expected number of nodes.

  • Node pool not configured with the Expected Nodes parameter: No ECS instances are added to the node pool.

  • Suggestion: We recommend that you do not perform operations on the scaling group of a node pool.
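
To check how many nodes a node pool currently contains and compare the result with the expected number of nodes, you can list the nodes that carry the node pool label. This is a minimal sketch that assumes ACK adds the alibabacloud.com/nodepool-id label to the nodes of a node pool; the node pool ID is a placeholder.

# List the nodes that currently belong to a node pool.
# Replace np-example with the ID of your node pool.
kubectl get nodes -l alibabacloud.com/nodepool-id=np-example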

How do I add existing nodes to a cluster?

If you want to add existing nodes to a cluster that contains no node pool, create a node pool that contains no nodes in the cluster. Then, add the existing ECS instances to the node pool. When you create the node pool, select the vSwitches that are used by the existing ECS instances that you want to add and set the Expected Nodes parameter to 0. For more information about how to manually add existing ECS instances to a cluster, see Add existing ECS instances to an ACK cluster.

Note

Each node pool corresponds to a scaling group. No fees are charged for node pools. However, you are charged for the cloud resources that are used by node pools, such as ECS instances.

How do I use preemptible instances in a node pool?

You can use preemptible instances when you create a node pool. You can also use preemptible instances in a node pool by using the spot-instance-advisor command-line tool. For more information, see Best practices for preemptible instance-based node pools.

Note

When you create a cluster, you cannot select preemptible instances for the node pool of the cluster.

How do I change the maximum number of pods that are supported by a node or increase the quota of pods on a node?

  • The maximum number of pods supported by a node is limited and varies based on the type of the cluster. You can increase the maximum number of nodes for clusters of specific types. For more information, see the Quotas section of the "Quotas and limits" topic.

  • The network plug-in used by an ACK cluster also has limits on the maximum number of pods supported by a node. You can go to the Basic Information tab of the ACK cluster in the ACK console to check the network plug-in used by the cluster.

    • If your cluster uses Flannel, you cannot change the maximum number of pods that can be allocated to each node after the cluster is created. If you require more pods, you can scale out node pools to add nodes, or recreate the cluster and reconfigure the pod CIDR block.

      For more information about how to scale out node pools, see Scale a node pool. For more information about how to create an ACK cluster, see Create an ACK managed cluster.

    • If your cluster uses Terway, the maximum number of pods that are supported by a node depends on the number of elastic network interfaces (ENIs) provided by the ECS instance type. The components that are used in different Terway modes are different. The maximum number of pods that are supported by a node also varies based on the components that are used. You can increase the maximum number of pods that are supported by a node by changing the instance type or scaling node pools to add nodes.

      For more information about the supported maximum number of pods on a node in different Terway modes, see the Compare Terway modes section of the "Work with Terway" topic. For more information about how to change the instance type, see Overview of instance configuration changes.

      Note
      • If you increase the supported maximum number of pods by adding nodes to a cluster, we recommend that you properly plan the size of the cluster. An excessively large cluster may compromise the availability and performance of the cluster. For more information, see Suggestions on using large ACK Pro clusters.

      • After you change the instance type, you need to set the nodes to the Unschedulable state, drain and restart the nodes, and then initiate pod scheduling. For more information, see Set node schedulability.

      • For more information about the maximum number of ENIs supported by each ECS instance type and the maximum number of private IP addresses supported by an ENI, see Overview of instance families.
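
Regardless of the network plug-in, you can check the maximum number of pods that kubelet currently reports for a node, for example, by running the following command. The node name is an example.

# Query the number of pods that kubelet reports as allocatable on a node.
kubectl get node cn-beijing.i-2ze19qyi8votgjz12345 -o jsonpath='{.status.allocatable.pods}{"\n"}'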

How do I modify the configurations of a node?

To ensure business continuity and facilitate node management, the following limits apply when you modify node configurations:

  • You cannot modify some of the configuration items such as the container runtime and the virtual private cloud (VPC) to which the node belongs after the node pool is created.

  • You can modify some of the configuration items, but the modifications are limited. For example, when you change the operating system of a node, you can only update the original image to the latest version. You cannot change the image type.

  • You can modify some of the configuration items, such as the vSwitch, billing method, and instance type, without limits.

In addition, modifications to specific configuration items, such as the public IP address and the CloudMonitor plug-in, take effect only on nodes that are newly added to the node pool. For more information, see Modify a node pool.

If you want to use nodes with new configurations, we recommend that you create a node pool based on the new configurations, set the nodes in the old node pool to the Unschedulable state, and then drain the old nodes, as shown in the following sketch. After your workloads run on the new nodes, release the old nodes.
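
The following sketch shows how to cordon and drain a node in the old node pool. The node name is an example, and the drain parameters mirror those used later in this topic.

# Set the old node to the Unschedulable state so that new pods are not scheduled to it.
kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345

# Drain the old node to evict its pods to the nodes in the new node pool.
kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true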

How do I release a specific ECS instance?

You can release an ECS instance by removing a node. After an ECS instance is released, the expected number of nodes automatically changes to the actual number of nodes after the release. You do not need to modify the expected number of nodes. In addition, you cannot release the specified ECS instance by modifying the expected number of nodes.

How do I upgrade the container runtime of a worker node that does not belong to a node pool?

Perform the following operations:

  1. Remove the worker node. When you remove the worker node, the system sets the node to the Unschedulable state and drains the node. If the node fails to be drained, the system stops removing the node. If the node is drained, the system continues to remove the node from the cluster.

  2. Add an existing node. You can add the node to an existing node pool. Alternatively, you can create a node pool that contains no nodes and then add the node to the node pool. After the node is added to a node pool, the container runtime of the node automatically becomes the same as that of the node pool.

    Note

    Node pools are free of charge. However, you are charged for the cloud resources such as ECS instances that are used in node pools. For more information, see Cloud service billing.
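
    After the node is added to the node pool, you can confirm the container runtime of the node, for example, by running the following command. The node name is an example.

    # The CONTAINER-RUNTIME column shows the runtime and version that each node uses.
    kubectl get node cn-beijing.i-2ze19qyi8votgjz12345 -o wide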

What do I do if a timeout error occurs after I add an existing node?

Check whether the node can connect to the Classic Load Balancer (CLB) instance that exposes the API server of the cluster, and check whether the security groups meet the requirements. For more information about the limits on security groups, see Limits on security groups. For more information about other network connectivity issues, see FAQ about network management.
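
For example, you can test connectivity from the node to the API server endpoint of the cluster. The endpoint in the following sketch is a placeholder, and port 6443 is the default API server port; use the API server endpoint that is displayed on the Basic Information tab of the cluster.

# Test whether the node can reach the API server endpoint that is exposed by the CLB instance.
# Replace <API_SERVER_IP> with the endpoint of your cluster.
curl -k https://<API_SERVER_IP>:6443/healthz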

How do I change the hostname of a worker node in an ACK cluster?

After you create an ACK cluster, you cannot directly change the hostnames of worker nodes. If you want to change the hostname of a worker node, modify the node naming rule of the relevant node pool, remove the worker node from the node pool, and then add the worker node to the node pool again.

Note

When you create an ACK cluster, you can modify the hostnames of worker nodes in the Custom Node Name section. For more information, see Create an ACK managed cluster.

  1. Remove the worker node.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane of the details page, choose Nodes > Nodes.

    3. On the Nodes page, find the worker node that you want to remove and choose More > Remove in the Actions column.

    4. In the dialog box that appears, select I understand the above information and want to remove the node(s), and then click OK.

  2. Add the worker node to the node pool again. For more information, see the Manually add nodes section of the "Add existing ECS instances to an ACK cluster" topic.

    Then, the worker node is renamed based on the new node naming rule of the node pool.
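
    You can then verify that the node is registered with the new name, for example, by running the following command.

    # The NAME column shows the new node name that is generated based on the node naming rule.
    kubectl get nodes -o wide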

How do I manually update the kernel version of GPU-accelerated nodes in a cluster?

To manually update the kernel version of GPU-accelerated nodes in a cluster, perform the following steps:

Note

This procedure applies only when the current kernel version is earlier than 3.10.0-957.21.3.

Confirm the kernel version to which you want to update. Proceed with caution when you perform the update.

The following procedure shows how to update the NVIDIA driver. Details about how to update the kernel version are not shown.

  1. Obtain the kubeconfig file of the cluster and use kubectl to connect to the cluster.

  2. Set the GPU-accelerated node that you want to manage to the Unschedulable state. In this example, the node cn-beijing.i-2ze19qyi8votgjz12345 is used.

    kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
  3. Migrate the pods on the GPU-accelerated node to other nodes.

    kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
    WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
    pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
  4. Uninstall the existing nvidia-driver.

    Note

    In this example, the driver version to be uninstalled is 384.111. If your driver version is not 384.111, download the installation package that matches your driver version from the official NVIDIA website and use it in the following steps.

    1. Log on to the GPU-accelerated node and run the nvidia-smi command to check the driver version.

      sudo nvidia-smi -a | grep 'Driver Version'
      Driver Version                      : 384.111
    2. Download the driver installation package.

      cd /tmp/
      sudo curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
      Note

      The installation package is required for uninstalling the NVIDIA driver.

    3. Uninstall the driver.

      sudo chmod u+x NVIDIA-Linux-x86_64-384.111.run
      sudo sh ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
  5. Update the kernel.

    Update the kernel version based on your business requirements.

  6. Restart the GPU-accelerated node.

    sudo reboot
  7. Log on to the GPU node and run the following command to install the kernel-devel package.

    sudo yum install -y kernel-devel-$(uname -r)
  8. Go to the official NVIDIA website to download the required driver and install it on the GPU-accelerated node. In this example, the driver version 410.79 is used.

    cd /tmp/
    sudo curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
    sudo chmod u+x NVIDIA-Linux-x86_64-410.79.run
    sudo sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
    
    # Warm up the GPU.
    sudo nvidia-smi -pm 1 || true
    sudo nvidia-smi -acp 0 || true
    sudo nvidia-smi --auto-boost-default=0 || true
    sudo nvidia-smi --auto-boost-permission=0 || true
    sudo nvidia-modprobe -u -c=0 -m || true
  9. Make sure that the /etc/rc.d/rc.local file includes the following configurations. If the configurations are missing, add them to the file.

    sudo nvidia-smi -pm 1 || true
    sudo nvidia-smi -acp 0 || true
    sudo nvidia-smi --auto-boost-default=0 || true
    sudo nvidia-smi --auto-boost-permission=0 || true
    sudo nvidia-modprobe -u -c=0 -m || true
  10. Restart kubelet and Docker.

    sudo service kubelet stop
    sudo service docker restart
    sudo service kubelet start
  11. Set the GPU-accelerated node to schedulable.

    kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned
  12. Run the following command in the nvidia-device-plugin container to check the version of the driver installed on the GPU-accelerated node.

    kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 -- nvidia-smi
    Thu Jan 17 00:33:27 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Note

    If the output of the docker ps command shows that no container is started on the GPU-accelerated node, see the What do I do if no container is started on a GPU-accelerated node? section of this topic.

What do I do if no container is started on a GPU-accelerated node?

In specific Kubernetes versions, after you restart kubelet and Docker on a GPU-accelerated node, no container is started on the node, as shown in the following example:

sudo service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
sudo service docker stop
Redirecting to /bin/systemctl stop docker.service
sudo service docker start
Redirecting to /bin/systemctl start docker.service
sudo service kubelet start
Redirecting to /bin/systemctl start kubelet.service

sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Run the following command to check the cgroup driver:

sudo docker info | grep -i cgroup
Cgroup Driver: cgroupfs

The output indicates that the cgroup driver is set to cgroupfs.
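
This mismatch typically prevents kubelet from starting containers, because kubelet expects Docker to use the systemd cgroup driver. To check the cgroup driver that kubelet is configured to use, you can, for example, inspect the kubelet configuration. The file paths in the following sketch are assumptions; on some versions, the cgroup driver is passed to kubelet as a startup flag instead.

# Look for the cgroup driver setting on the kubelet side. Paths may differ by version.
sudo grep -ri cgroupdriver /var/lib/kubelet/config.yaml /etc/systemd/system/kubelet.service.d/ 2>/dev/null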

To resolve the issue, perform the following steps:

  1. Back up /etc/docker/daemon.json. Then, run the following command to update /etc/docker/daemon.json and set the cgroup driver to systemd.

    sudo tee /etc/docker/daemon.json > /dev/null <<-EOF
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m",
            "max-file": "10"
        },
        "oom-score-adjust": -1000,
        "storage-driver": "overlay2",
        "storage-opts":["overlay2.override_kernel_check=true"],
        "live-restore": true
    }
    EOF
  2. Run the following commands to restart the Docker runtime and kubelet:

    sudo service kubelet stop
    Redirecting to /bin/systemctl stop kubelet.service
    sudo service docker restart
    Redirecting to /bin/systemctl restart docker.service
    sudo service kubelet start
    Redirecting to /bin/systemctl start kubelet.service
  3. Run the following command to check whether the cgroup driver is set to systemd.

    sudo docker info | grep -i cgroup
    Cgroup Driver: systemd

What is the path of the kubelet in an ACK cluster? Am I able to customize the path?

ACK does not allow you to customize the path of the kubelet. The default path of the kubelet is /var/lib/kubelet. Do not change the path.