When upgrading the cluster Kubernetes version, complete the node pool upgrade promptly after the control plane upgrade, during off-peak hours. A node pool upgrade includes upgrading the kubelet and the container runtime. Before the upgrade, ACK runs a pre-upgrade check to identify risk factors so that the upgrade can proceed smoothly.
Considerations
Pre-upgrade checks
Cluster upgrades use yum to download required software packages. If you manually modified node network configurations or used a custom operating system (OS) image, ensure that yum works correctly on your nodes. Run yum makecache to verify.
ACK does not strictly validate custom OS images. Therefore, a successful upgrade cannot be guaranteed.
If you made configuration changes to the cluster, such as enabling the SWAP partition or modifying kubelet or runtime configurations through command-line operations, the cluster upgrade may fail or your custom configurations may be overwritten.
After you upgrade a cluster to version 1.18, ACK configures Node Resource Reservation Policy by default. If resource reservation is not configured and node resource usage is high, pods may not be scheduled promptly after being evicted. Reserve resources for your nodes. Keep CPU usage at or below 50% and memory usage at or below 70%.
In clusters running version 1.24 or earlier, if a workload's pods are configured only with a Startup Probe, the pods briefly enter the NotReady state after the kubelet restarts. Deploy workloads with multiple replicas across different nodes. This ensures that enough pods remain available if a node restarts.
Keep at least 20% of your disk space free. This prevents pod eviction caused by insufficient disk capacity during an upgrade.
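The numeric thresholds above (CPU usage at or below 50%, memory usage at or below 70%, at least 20% free disk space) can be checked per node before an upgrade. The following is a minimal sketch, not an ACK tool: the function names are hypothetical, and the CPU and memory figures are assumed to come from your own monitoring system.

```python
import shutil

# Thresholds taken from the considerations above.
MAX_CPU_USED_PCT = 50.0
MAX_MEM_USED_PCT = 70.0
MIN_DISK_FREE_PCT = 20.0


def pre_upgrade_warnings(cpu_used_pct, mem_used_pct, disk_free_pct):
    """Return one warning string for each threshold the node misses."""
    warnings = []
    if cpu_used_pct > MAX_CPU_USED_PCT:
        warnings.append(f"CPU usage {cpu_used_pct:.0f}% is above {MAX_CPU_USED_PCT:.0f}%")
    if mem_used_pct > MAX_MEM_USED_PCT:
        warnings.append(f"Memory usage {mem_used_pct:.0f}% is above {MAX_MEM_USED_PCT:.0f}%")
    if disk_free_pct < MIN_DISK_FREE_PCT:
        warnings.append(f"Free disk space {disk_free_pct:.0f}% is below {MIN_DISK_FREE_PCT:.0f}%")
    return warnings


def disk_free_pct(path="/"):
    """Percentage of free space on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


if __name__ == "__main__":
    # The CPU and memory values are placeholders, not measurements.
    for line in pre_upgrade_warnings(65.0, 40.0, disk_free_pct("/")):
        print(line)
```

An empty result means the node meets all three thresholds. It does not replace the precheck that ACK runs.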
Node pool upgrade constraints
During a node pool upgrade, only scale-out operations are supported. Scale-in operations are not supported.
If you have unmanaged worker nodes that do not belong to a node pool, migrate them. For more information, see Migrate unmanaged nodes to a node pool.
Upgrading Lingjun node pools is not supported during an ACK cluster upgrade.
When you upgrade nodes by replacing their disks, ACK reinitializes them based on the current node pool configuration. This includes the logon method, labels, taints, OS image, and runtime version. To update node pool configurations, see Edit a node pool. If you modified nodes in any other way, the upgrade overwrites your changes.
If a pod on a node references a HostPath that points to the system disk, the data in the HostPath directory will be lost after a disk replacement upgrade.
When you upgrade a node pool in a cluster of version 1.31 or earlier, the process also upgrades the NVIDIA Device Plugin and resets any of its non-standard configurations.
Node scaling and scheduling
If the node scaling feature is enabled on the cluster, cluster-autoscaler is automatically updated to the latest version after a successful upgrade to ensure that the auto scaling feature is not affected. After the cluster is upgraded, confirm that the cluster-autoscaler version is correct. For more information, see Enable node autoscaling.
During a cluster upgrade, nodes whose Scaling Mode is set to Swift may fail to upgrade because they are shut down. If any of these nodes are still not upgraded after the upgrade is complete, we recommend that you manually remove them.
Networking and service availability
If a pod uses the SLB address of a LoadBalancer Service to access another pod on the same node, and the Service's externalTrafficPolicy is set to Local, the two pods may no longer be on the same node after node rotation. This can cause the network connection to fail.
When you upgrade nodes by replacing their disks, ACK drains the nodes. This process evicts pods to other active nodes while respecting the Pod Disruption Budget (PDB). To ensure high availability, deploy workloads with multiple replicas across different nodes. Also, configure a PDB for critical services to control the number of pods that can be disrupted simultaneously.
The default timeout for draining a node is 30 minutes. If pod migration is not completed within the timeout period, ACK terminates the upgrade to ensure service stability.
Features
A node pool upgrade includes upgrading the kubelet and the container runtime.
Kubelet upgrade: The kubelet on each node pool node is upgraded to match the control plane version. Default method: in-place upgrade.
Container runtime upgrade: Upgrade the container runtime on nodes when a new version is available.
Migrating from Docker to containerd replaces the system disk on node pool nodes, erasing all system disk data. Back up critical data before the upgrade. See Migrate the node container runtime from Docker to containerd.
Except for ContainerOS nodes, upgrading from one version of containerd to a newer one performs an in-place upgrade by default. The /etc/containerd/config.toml file on the node is replaced with the new version provided by ACK.
Important: ContainerOS nodes only support system disk replacement for containerd upgrades. See Upgrade ContainerOS versions earlier than 3.4 to the latest version.
During a container runtime upgrade, pod probes and lifecycle hooks may fail, and pods may restart in place.
In clusters running Kubernetes 1.24 or earlier, upgrading Docker replaces the system disk on node pool nodes by default, erasing all system disk data. Back up critical data before the upgrade.
Procedure
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click .
On the Node Pools page, find the target node pool, click the icon in the Actions column, and select Kubelet Update. Configure the following parameters.
Parameter
Description
Kubelet Update Information
View the current kubelet version and select the target version.
Runtime Update Information
View the current container runtime version and select the target version.
Migrating from Docker to containerd replaces the system disk on node pool nodes, erasing all system disk content. See Migrate the node container runtime from Docker to containerd.
Clusters on version 1.22 with containerd 1.6.34 do not support runtime upgrades.
Update Nodes
Select the nodes to upgrade: all or specific ones.
Upgrade method
Select In-place Upgrade or Upgrade by Replacing System Disk. See Reference: In-place upgrade and upgrade by replacing system disk.
In-place Upgrade: ACK updates components directly on existing nodes. An in-place upgrade does not replace the system disk or reinitialize the node, and data is not affected.
Upgrade by Replacing System Disk: ACK reinitializes nodes by replacing their system disk. Instance attributes such as name, ID, and IP address remain unchanged, but system disk data is deleted. Attached data disks are not affected.
Ignore Warnings
Whether to proceed if the precheck reports warnings. For example, a pod uses a hostPath pointing to the system disk.
Batch Update Policy
Maximum Number of Nodes per Batch
Nodes are upgraded in batches up to this maximum. See Reference: In-place upgrade and upgrade by replacing system disk.
Automatic Pause Policy
The pause policy for the upgrade process.
Interval Between Batches
Interval between batches when no automatic pause is configured. Valid values: 5 to 120 minutes.
Auto Snapshot
If a node's system disk contains important data, create snapshots before upgrading the node pool. Snapshots incur fees (see Snapshot billing), and the creation progress changes dynamically. After the upgrade, delete unneeded snapshots.
Note: If you select Upgrade by Replacing System Disk, enable automatic snapshot creation. Snapshots incur costs. See Snapshot billing.
Click Precheck. After the precheck succeeds, follow the on-screen instructions to start the upgrade.
Note: If the precheck fails or returns warnings, see Fixes for failed check items or view the Check Report to troubleshoot.
During the upgrade, you can:
Pause: Puts the node pool in an intermediate state. Avoid other cluster operations and complete the upgrade promptly. Upgrades paused for more than seven days are automatically terminated, and related events and logs are cleared.
You cannot roll back kubelet or container runtime versions on already-upgraded nodes.
Cancel: Cancels the upgrade. After clicking Cancel, you cannot roll back kubelet or container runtime versions on already-upgraded nodes.
To verify, go to the Nodes page, click a node name, and check the kubelet and container runtime versions on the Basic Information tab.
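The console check above is the documented method. As an alternative sketch, the same version fields can be read from the standard Kubernetes node object, for example from the output of kubectl get nodes -o json saved to a file. The function name, node names, and version strings below are hypothetical sample data. The field paths status.nodeInfo.kubeletVersion and status.nodeInfo.containerRuntimeVersion are part of the Kubernetes Node API.

```python
import json


def outdated_nodes(node_list, target_kubelet):
    """List (name, kubelet version, runtime version) for nodes not at the target."""
    outdated = []
    for node in node_list.get("items", []):
        info = node["status"]["nodeInfo"]
        if info["kubeletVersion"] != target_kubelet:
            outdated.append(
                (
                    node["metadata"]["name"],
                    info["kubeletVersion"],
                    info["containerRuntimeVersion"],
                )
            )
    return outdated


# Hypothetical sample in the shape returned by: kubectl get nodes -o json
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "node-a"},
     "status": {"nodeInfo": {"kubeletVersion": "v1.30.1",
                             "containerRuntimeVersion": "containerd://1.6.28"}}},
    {"metadata": {"name": "node-b"},
     "status": {"nodeInfo": {"kubeletVersion": "v1.28.3",
                             "containerRuntimeVersion": "containerd://1.6.20"}}}
  ]
}
""")

if __name__ == "__main__":
    for name, kubelet, runtime in outdated_nodes(SAMPLE, "v1.30.1"):
        print(f"{name}: kubelet {kubelet}, runtime {runtime}")
```

To use it against a real cluster, load the saved JSON with json.load and pass the exact kubelet version string that your cluster reports.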
In-place and replacement upgrades
Upgrade processes
Both in-place upgrade and upgrade by replacing system disk follow this process. The node pool upgrade processes nodes in batches, starting at 1 and doubling (1, 2, 4, 8...) until reaching the maximum concurrent count. For example, with a maximum of 4: batch 1 upgrades 1 node, batch 2 upgrades 2 nodes, then all subsequent batches upgrade 4 nodes.
The following figure shows batch execution with N maximum concurrent nodes. Batch sizes are 1, 2, 4, 8... until reaching N.
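The batching rule can be worked through in a few lines of code. This is a minimal sketch of the arithmetic described above, not ACK's implementation. The function name is hypothetical, and it assumes that when the maximum is not a power of two the batch size is capped at the maximum, and that the last batch takes whatever nodes remain.

```python
from __future__ import annotations


def batch_sizes(total_nodes: int, max_per_batch: int) -> list[int]:
    """Return how many nodes each batch upgrades: 1, 2, 4, ... capped at the maximum."""
    if total_nodes < 0 or max_per_batch < 1:
        raise ValueError("total_nodes must be >= 0 and max_per_batch must be >= 1")
    sizes = []
    remaining = total_nodes
    size = 1
    while remaining > 0:
        current = min(size, max_per_batch, remaining)
        sizes.append(current)
        remaining -= current
        # Double for the next batch, but never exceed the configured maximum.
        size = min(size * 2, max_per_batch)
    return sizes


if __name__ == "__main__":
    print(batch_sizes(15, 4))  # [1, 2, 4, 4, 4]
```

With 15 nodes and a maximum of 4, the upgrade takes five batches. The small first batches limit the impact if the new version has a problem.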
In-place upgrade logic
Perform a pre-upgrade check. If a critical exception is found in a container (for example, ttrpc requests cannot be served, or container processes do not respond to signals), the upgrade is stopped.
Save the current state of containers and pods to a temporary directory.
Upgrade containerd, crictl, and related configuration files to the new versions provided by ACK, and then restart containerd. This action does not affect running containers. If you previously modified the /etc/containerd/config.toml configuration file on the node, your changes will be overwritten by this upgrade.
Ensure that the kubelet is running properly and the node is ready.
Replacement upgrade logic
Perform node draining. If the node is schedulable, the system sets it to unschedulable.
Shut down the ECS instance, which stops the node.
Replace the system disk. The system disk ID changes, but the cloud disk type, instance IP address, and elastic network interface MAC address remain the same.
Re-initialize the node.
Restart the node. The node becomes ready and is set to schedulable.
If a node was unschedulable before the upgrade, it remains unschedulable afterward.
FAQ
Rollback after upgrade
You cannot roll back kubelet and container runtime versions after an upgrade. You can roll back the OS, but only if the node pool still supports the original image.
Service impact during upgrade
In-place upgrade: Pods are not restarted, so services are not affected.
Upgrade by replacing system disk: This method performs node draining. Services continue without interruption if pods implement graceful shutdown (see Graceful shutdown and zero downtime deployments in Kubernetes) and have multiple replicas across nodes. Set concurrent upgrades below your replica count to avoid upgrading multiple replicas simultaneously.
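The advice to keep concurrent upgrades below the replica count is simple arithmetic. A minimal sketch follows, assuming each workload spreads its replicas across different nodes. The function name and workload names are hypothetical.

```python
def max_batch_size_for(replica_counts):
    """Largest per-batch node count that stays below every workload's replica count."""
    if not replica_counts:
        raise ValueError("replica_counts must not be empty")
    # Staying below the smallest replica count leaves at least one replica running.
    return max(min(replica_counts.values()) - 1, 0)


if __name__ == "__main__":
    print(max_batch_size_for({"web": 3, "api": 5}))  # 2
```

A result of 0 means at least one workload has a single replica, so no batch size avoids disruption for that workload.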
Upgrade batch duration
In-place upgrade: Less than 5 minutes.
Upgrade by replacing system disk: Typically under 8 minutes without snapshots. With snapshots, the upgrade starts after snapshot completion, and total time depends on snapshot creation time. The node pool allows up to 40 minutes for snapshot creation. If snapshot creation exceeds 40 minutes, the upgrade times out and fails. Skip snapshot creation if no business data is on the system disk.
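The per-batch figures above can be combined with the interval between batches into a rough estimate of the total upgrade window. The following sketch is an assumption-laden estimate, not a guarantee. It treats every batch as taking the stated typical ceiling, excludes snapshot creation time, and uses a hypothetical function name.

```python
def estimate_upgrade_minutes(batches, minutes_per_batch, interval_minutes=0):
    """Rough total: time spent in batches plus the waits between them."""
    if batches <= 0:
        return 0
    return batches * minutes_per_batch + (batches - 1) * interval_minutes


if __name__ == "__main__":
    # Five batches with a 5-minute interval between batches.
    print(estimate_upgrade_minutes(5, 5, 5))  # in-place: 45
    print(estimate_upgrade_minutes(5, 8, 5))  # replacing system disk: 60
```

If snapshots are enabled, add the snapshot creation time separately, because the upgrade starts only after the snapshots are complete.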
Data loss during upgrade
When upgrading the container runtime by replacing the system disk, back up any important system disk data beforehand. Data disks are not affected.
IP address changes after replacement
When the system disk is replaced, its ID changes, but the cloud disk type, instance IP address, and elastic network interface MAC address remain the same. See Replace the system disk (change the OS).
Upgrading unmanaged nodes
Clusters created before the node pool feature was introduced may contain unmanaged nodes. You can migrate these unmanaged nodes to a node pool and then upgrade the node pool. See Migrate unmanaged nodes to a node pool.
Lingering Docker directory after migration
The Docker directory contains Kubernetes-managed files (containers, images, logs) and any custom paths you created. Delete it from the data disk after switching runtimes if no longer needed.
Restoring data from snapshots
When upgrading a node pool, you can create snapshots for nodes. Snapshots are retained for seven days by default but can be deleted sooner. In extreme cases such as data loss, restore data using the following methods.
For an in-place upgrade, such as upgrading only the kubelet version, you can restore data by rolling back the snapshot directly. See Roll back a cloud disk by using a snapshot.
For an upgrade by replacing system disk, such as upgrading the operating system or container runtime, you can restore data by creating a new cloud disk from the snapshot. See Create a data disk from a snapshot.
Resolving the negative dentry issue
Kubelet and containerd upgrades trigger systemctl daemon-reload. The systemd service monitors the directories associated with .path units and their parent directories (by default /, /run, and /run/systemd). If these directories contain a large number of inodes, the reload can cause a kernel soft lockup that affects node operation.
Since inode counts cannot be directly obtained, clear dentries during off-peak hours:
echo 2 > /proc/sys/vm/drop_caches
If directories associated with the .path unit and their parents do not contain many inodes, skip this check.
References
Enable automatic cluster upgrades to reduce maintenance overhead.
For the release history of containerd, see containerd runtime release notes.
ACK managed node pools support automatic OS CVE patching.
As of Kubernetes 1.24, Docker is no longer a supported container runtime. You must migrate the node container runtime from Docker to containerd. See Migrate the node container runtime from Docker to containerd.
Docker and containerd use different command-line tools. See Comparison of common commands for Docker and containerd.
Keep OS images updated. See Change the operating system.