Upgrade an OS image version or replace a node OS type - Container Service for Kubernetes

ACK regularly releases new operating system image versions that provide new features, optimizations, and bug fixes. You should promptly upgrade the operating system image version of your node pools. You can also switch the operating system type as needed, for example, to replace an operating system that has reached its end of life (EOL) with a supported one.

For more information about the operating system types, latest image versions that ACK supports, and the limitations of some operating systems, see Release notes for OS images.

Precautions

This operation updates the operating system in batches by replacing the system disks of nodes. Do not save important data on system disks, or make sure to back up the data in advance. Data disks are not affected during the upgrade. We recommend that you perform this operation during off-peak hours.
When you update a node by replacing system disks, ACK drains the node and evicts the pods from the node to other available nodes based on PodDisruptionBudget (PDB). To ensure high service availability, we recommend using a multi-replica deployment strategy to distribute workloads across multiple nodes. You can also configure PDB for key services to control the number of pods that are interrupted at the same time.
The default timeout period for node draining is 30 minutes. If pod migration is not completed within the timeout period, ACK terminates the update to ensure service stability.
When you update a node by replacing the system disk, ACK reinitializes the node according to the current node pool configurations, including node logon methods, labels, taints, operating system images, and runtime versions. Normally, node pool configurations are updated by editing a node pool. If you made changes to the node in other ways, these changes will be overwritten during the update.
If pods on a node use hostPath volumes that point to a system disk, the data in those hostPath volumes is lost after the node is updated by replacing system disks.
If your cluster uses other custom configurations, such as swap partitions, kubelet configurations modified by using the CLI, or runtime configurations, the cluster may fail to be updated or the custom configurations may be overwritten during the update.
Some ACK operating systems use cgroup v2 by default. For more information about the precautions for cgroup v2, see OS images.
If you have standalone nodes, which are worker nodes not managed by a node pool, you must migrate them to a node pool. For more information, see Migrate standalone nodes to a node pool.
In ContainerOS 3.4.0, the system disk is set to read-only mode. A data disk must be attached to ensure that the system can start. Therefore, when you upgrade to ContainerOS 3.4 or a later version, follow the procedure below. Other versions are not affected.
Select an upgrade solution based on whether data disks are attached to the current node pool:
- A single data disk is attached: The system can start properly. Follow the Procedure below to complete the upgrade.
- Multiple data disks are attached: Create a new node pool for migration. To do this, create a node pool, select ContainerOS 3.4 or a later version, and attach one data disk. Then, scale out the required number of nodes. Gradually migrate applications to the new node pool by disabling scheduling for the old node pool or updating application workloads to be scheduled to the new node pool, for example, using labels. Finally, take the old node pool offline.
- No data disks are attached:
  - Keep the current node pool: Update the node pool configuration to attach one data disk and scale out new nodes. After the new nodes are running properly, gradually drain and remove the old nodes.
  - Create a new node pool for migration: The procedure is the same as when multiple data disks are attached.
For more information about how to create and manage node pools, see Create and manage a node pool. For more information about how to set a node to unschedulable, see Drain a node and manage its scheduling status. For more information about how to remove a node, see Remove a node.
If you customize the GPU driver version for nodes in a node pool by specifying a version number or using an OSS URL, the operating system and the driver version may be incompatible after you upgrade the OS image. See NVIDIA driver versions supported by ACK and select the latest compatible driver.
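As an illustration of the PDB recommendation in the precautions above, the following minimal sketch builds a PodDisruptionBudget manifest and prints it as JSON. The workload name `web-pdb`, the namespace `default`, and the label `app: web` are hypothetical placeholders, and `maxUnavailable: 1` is only an example value. The manifest uses the standard Kubernetes `policy/v1` API, not an ACK-specific one.

```python
import json


def build_pdb(name, namespace, match_labels, max_unavailable=1):
    """Return a PodDisruptionBudget manifest (policy/v1) as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # At most this many matching pods may be evicted at the same time.
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


# Hypothetical workload: pods labeled app=web in the default namespace.
pdb = build_pdb("web-pdb", "default", {"app": "web"}, max_unavailable=1)
pdb_json = json.dumps(pdb, indent=2)
print(pdb_json)
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, because kubectl accepts JSON manifests as well as YAML. With a budget like this, node draining evicts the matching pods one at a time, so a workload that has only a single replica can still block draining until the timeout is reached.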

Procedure

Follow these steps to update the operating system image to the latest version or replace the operating system type. To avoid compatibility risks, run a precheck scan first.

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left navigation pane, choose Nodes > Node Pools.
In the Node Pools list, find the target node pool, and in the Actions column, select > Change Operating System.
Click Precheck to scan for potential risks of replacing the operating system image and view the check results.
- Normal: The precheck is successful. You can proceed to the next step.
- Abnormal: The precheck found issues. The running status of the cluster is not affected. Follow the recommended solutions to fix the issues, and then run the precheck again.

After the precheck is successful, configure the parameters as described in the following table and click Start Replacement.

Configuration Item		Description
Destination Version		Select the target image and version.
Current Version		The current operating system version.
Update Node		Specify the nodes whose operating systems you want to replace. You can select all nodes or some nodes.
Ignore Warnings		Specifies whether to ignore warning-level check items at the node pool level and continue with the upgrade. For example, a warning is reported when a pod in the node pool uses a hostPath volume that points to the system disk.
Batch Replace	Maximum Number of Nodes per Batch	The system updates nodes in sequence based on the specified maximum number of concurrent nodes.
	Automatic Pause Policy	The policy to pause the replacement of operating systems on nodes.
	Interval Between Batches	If you set Automatic Pause Policy to Do Not Pause, you can specify an interval between batches. Valid values: 5 to 120 minutes.
	Auto Snapshot	The upgrade is performed by replacing system disks. If the system disks of nodes contain important business data, create snapshots for the nodes before you update the operating system. This lets you back up and restore data. Using snapshots incurs snapshot fees. If the snapshots are no longer needed after the upgrade, delete them promptly.
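To get a rough feel for how the batch settings in the table interact, the following sketch estimates a lower bound on the number of batches and on the total waiting time between them. It is an assumption-laden estimate, not a description of how ACK schedules batches: the function `plan_batches` is hypothetical, it assumes every batch is filled to the configured maximum, it assumes no automatic pause, and it ignores the time that each batch itself takes. Only the 5 to 120 minute interval range comes from the table above.

```python
import math

# Valid range for Interval Between Batches, as stated in the table.
MIN_INTERVAL_MIN = 5
MAX_INTERVAL_MIN = 120


def plan_batches(node_count, max_per_batch, interval_min):
    """Return (minimum batches, minimum total waiting minutes between batches)."""
    if node_count < 1 or max_per_batch < 1:
        raise ValueError("node_count and max_per_batch must be at least 1")
    if not MIN_INTERVAL_MIN <= interval_min <= MAX_INTERVAL_MIN:
        raise ValueError("interval_min must be between 5 and 120 minutes")
    # No batch can exceed max_per_batch nodes, so this is a lower bound.
    batches = math.ceil(node_count / max_per_batch)
    # Intervals occur only between consecutive batches.
    waiting = (batches - 1) * interval_min
    return batches, waiting


print(plan_batches(10, 3, 10))  # -> (4, 30)
```

In this example, 10 nodes with at most 3 nodes per batch need at least 4 batches, which adds at least 30 minutes of waiting at a 10-minute interval. The actual duration is longer, because each node must also be drained and have its system disk replaced.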

Important

To avoid incompatibility risks when you replace an operating system, see Release notes for OS images.
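Because data in hostPath volumes on the system disk is lost when the system disk is replaced, it can help to list the pods that use hostPath volumes before you start. The following sketch is one possible way to do that and is not part of the ACK precheck. The function `find_hostpath_pods` is hypothetical, and the inline sample only mimics the shape of `kubectl get pods -A -o json` output. The script lists every hostPath volume; it cannot tell whether a path is on the system disk or on a data disk, so you must check that against the mount layout of your nodes.

```python
def find_hostpath_pods(pod_list):
    """Return (namespace, pod, volume, path) for each hostPath volume found."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # spec.volumes may be missing when a pod declares no volumes.
        for volume in pod.get("spec", {}).get("volumes") or []:
            host_path = volume.get("hostPath")
            if host_path:
                findings.append(
                    (
                        meta.get("namespace"),
                        meta.get("name"),
                        volume.get("name"),
                        host_path.get("path"),
                    )
                )
    return findings


# Hypothetical sample in the shape of a Kubernetes PodList.
sample = {
    "items": [
        {
            "metadata": {"namespace": "default", "name": "web-0"},
            "spec": {
                "volumes": [
                    {"name": "logs", "hostPath": {"path": "/var/log/web"}},
                    {"name": "cfg", "configMap": {"name": "web-cfg"}},
                ]
            },
        },
        {
            "metadata": {"namespace": "default", "name": "api-0"},
            "spec": {"volumes": [{"name": "tmp", "emptyDir": {}}]},
        },
    ]
}

findings = find_hostpath_pods(sample)
for namespace, pod, volume, path in findings:
    print(f"{namespace}/{pod}: volume {volume} -> {path}")
```

To run this against a real cluster, you could replace `sample` with the parsed output of `kubectl get pods -A -o json`, for example by reading it from standard input with `json.load(sys.stdin)`.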

References

For more information about how to upgrade the kubelet and container runtime versions of a node pool, see Update a node pool.
For more information about the procedure for and logic behind upgrading nodes by replacing system disks, see Reference: In-place updates and updates by replacing system disks.