Adjust the expected number of nodes in a node pool to scale out for more capacity or scale in to reduce costs. The node pool automatically adds or removes nodes to match the target count.
ACK also supports autoscaling. To automatically scale node resources based on workload demand, see Node scaling.
Prerequisites
Before you begin, make sure that you have:
An ACK managed or dedicated cluster with at least one node pool
A security group that allows access to `100.64.0.0/10` (required for scale-out; otherwise, new nodes cannot join the cluster)
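To sanity-check whether an address is covered by the required rule, the standard library's `ipaddress` module can test membership in `100.64.0.0/10` (a quick illustration, not part of ACK tooling):

```python
import ipaddress

# The CIDR block that the security group must allow for scale-out.
node_join_range = ipaddress.ip_network("100.64.0.0/10")

def is_covered(ip: str) -> bool:
    """Return True if the address falls inside 100.64.0.0/10."""
    return ipaddress.ip_address(ip) in node_join_range

print(is_covered("100.64.0.10"))      # True: inside the required range
print(is_covered("100.127.255.255"))  # True: last address of the /10
print(is_covered("100.128.0.1"))      # False: just outside the range
```

The `/10` spans `100.64.0.0` through `100.127.255.255`, so rules scoped to a narrower block (such as `100.64.0.0/16`) would not satisfy this prerequisite.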
Do not manage node pool nodes outside the ACK console:
- Running `kubectl delete node` does not release the underlying ECS instance. The node pool still counts the instance toward its expected number, and the node appears as Unknown.
- Releasing an ECS instance from the ECS console or through an API call triggers the node pool to automatically create a replacement instance to maintain the expected count, which can cause unexpected costs.
- Modifying the Auto Scaling group directly (for example, removing instances or enabling health checks) can cause the node pool to behave unpredictably.

To remove nodes, use the ACK console. For details, see Remove a node.
Procedure
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster and click its name. In the left navigation pane, choose Nodes > Node Pools.
In the Actions column for the target node pool, click Scale and set Scaling Mode to Manual.
(Optional) If prompted to authorize the CloudOps Orchestration Service (OOS), create the `AliyunOOSLifecycleHook4CSRole` role:
- Alibaba Cloud account: Click AliyunOOSLifecycleHook4CSRole to grant the required permissions.
- RAM user: Verify that your Alibaba Cloud account has the `AliyunOOSLifecycleHook4CSRole` role assigned, then attach the `AliyunRAMReadOnlyAccess` policy to the RAM user. For details, see Grant permissions to a RAM user.
Enter a value for Expected Nodes and submit the configuration.
Verify the result
After submission, the node pool status changes to Updating, followed by Scaling Out or Removing Node.
Scale-out: The Status column shows Scaling Out while nodes are being added. When the status returns to Active, the scale-out is complete.
Scale-in: The Status column shows Removing Node while nodes are being removed. When the status returns to Active, the scale-in is complete.
How scaling works
The expected number of nodes is the target count that the node pool maintains. When you change this value, the node pool automatically triggers a scale-out or scale-in operation to reach the target.
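The reconciliation described above can be sketched in a few lines (a simplified illustration only; the function name is hypothetical and this is not ACK's actual implementation):

```python
def reconcile(expected: int, current: int) -> str:
    """Decide which operation a node pool would trigger to
    converge the current node count toward the expected count."""
    if expected > current:
        return f"scale-out: add {expected - current} node(s)"
    if expected < current:
        return f"scale-in: remove {current - expected} node(s)"
    return "no-op: already at the expected count"

print(reconcile(5, 3))  # scale-out: add 2 node(s)
print(reconcile(2, 4))  # scale-in: remove 2 node(s)
print(reconcile(3, 3))  # no-op: already at the expected count
```

This is also why out-of-band changes (such as releasing an instance from the ECS console) trigger replacement instances: the node pool keeps converging on the expected count regardless of how the current count changed.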
Scale-out
When the expected count exceeds the current node count, the node pool creates new nodes in two steps:
Create ECS instances: ACK uses Auto Scaling as the underlying service. After you adjust the expected count, ACK updates the scaling group and provisions new ECS instances based on the node pool configuration. The node pool status changes from Scaling Out to Active when the instances are created. For more information about the expected number of instances, see Expected number of instances.
Add nodes to the cluster: Each new ECS instance runs the `cloud-init` script (maintained by ACK) to initialize itself and join the node pool. Execution logs are saved to `/var/log/messages` on the node.
Note: If the node joins the cluster successfully, the log entries in `/var/log/messages` are automatically purged; the logs are only available for troubleshooting failed joins. If a node fails to join, key information from the log is extracted and displayed on the Cluster Tasks tab. Click View Cause for details.
The instance type and zone of new nodes are determined by the scaling policy. If a node fails to be added, the system automatically retries until the expected count is reached.
ECS Bare Metal GPU instances (instance families ebmgn7 and ebmgn7e) do not support automatic multi-instance GPU (MIG) cleanup. ACK resets existing MIG settings when adding these nodes, and the reset duration is unpredictable. If the reset takes too long, automatic node addition may fail.
To troubleshoot, see What do I do if adding an ECS Bare Metal instance node fails?
For more information about ebmgn7e, see GPU-accelerated compute-optimized instance families.
Scale-in
When the expected count is lower than the current node count, the node pool removes nodes. Which nodes are removed depends on the scaling policy:
| Scaling policy | Behavior |
|---|---|
| Priority | Removes the most recently created instances first. |
| Distribution Balancing | Selects instances by zone using a balanced release policy, then removes the most recently created instances. This keeps instance counts roughly equal across zones. |
| Cost Optimization | Removes instances with the highest vCPU unit prices first. |
When scaling in by changing the expected number of nodes, nodes are removed even if the drain operation fails. If draining nodes before removal is required, remove specific nodes instead. For details, see Remove a node.
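The selection rules in the table above can be modeled roughly as follows (a simplified sketch of the Priority and Cost Optimization behaviors; the zone-balancing step of Distribution Balancing is omitted for brevity, and field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    created_at: int        # larger value means more recently created
    vcpu_unit_price: float

def select_for_removal(instances, count, policy):
    """Pick which instances a scale-in would remove first,
    per the behavior described for each scaling policy."""
    if policy == "priority":
        # Most recently created instances are removed first.
        ordered = sorted(instances, key=lambda i: i.created_at, reverse=True)
    elif policy == "cost-optimization":
        # Instances with the highest vCPU unit price are removed first.
        ordered = sorted(instances, key=lambda i: i.vcpu_unit_price, reverse=True)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return [i.name for i in ordered[:count]]

pool = [
    Instance("node-a", created_at=1, vcpu_unit_price=0.12),
    Instance("node-b", created_at=2, vcpu_unit_price=0.30),
    Instance("node-c", created_at=3, vcpu_unit_price=0.08),
]
print(select_for_removal(pool, 1, "priority"))           # ['node-c']
print(select_for_removal(pool, 1, "cost-optimization"))  # ['node-b']
```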
Scale-in considerations
Subscription instances are not released during scale-in. To release them, log on to the ECS console, convert the subscription instances to pay-as-you-go, and then release them. For details, see Convert a subscription instance to a pay-as-you-go instance.
Disks are released with the node. The system disk and data disks share the node lifecycle. When a node is released, all disk data is permanently lost and cannot be recovered. To persist data across node removals, use a PersistentVolume (PV) to decouple storage from the node lifecycle.
Billing
During a scale-out, costs are based on the instance types created. The cost for a billing period is calculated as:
Unit price of instance type × Number of nodes × Billing duration
For example, if a node pool uses two instance types with pay-as-you-go billing and the Priority scaling policy, a scale-out might create 2 Node A instances in the first-priority zone and 3 Node B instances in the second-priority zone (if Node A resources are insufficient). The hourly cost is:
(Unit price of Node A × 2 × 1) + (Unit price of Node B × 3 × 1)
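Plugging hypothetical unit prices into the example above (the prices below are illustrative only, not actual ECS pricing):

```python
def hourly_cost(instances):
    """Sum unit price x node count x billing duration (1 hour) per type."""
    return sum(price * count * 1 for price, count in instances)

# Hypothetical pay-as-you-go unit prices in USD/hour, for illustration.
price_a, price_b = 0.50, 0.40
cost = hourly_cost([(price_a, 2), (price_b, 3)])
print(f"{cost:.2f}")  # 2.20 = 0.50*2*1 + 0.40*3*1
```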
Operations to avoid
The expected number of nodes is based on the number of ECS instances in the Auto Scaling group, not the number of nodes in the Kubernetes cluster. Operations performed outside the ACK console can cause the node pool to behave unexpectedly.
Do not perform any of the following operations.
| Operation | Node pool behavior | Recommended action |
|---|---|---|
| Run `kubectl delete node` to remove a node. | The ECS instance is not released. The node count does not change, but the node appears as Unknown in the node pool. | On the node pool page, click the node pool name. On the Nodes tab, remove the node. You do not need to select Drain Node because the node is already removed from the cluster. Choose whether to select Release ECS Instance. Manually added nodes and subscription nodes are not automatically released; manage them in the ECS console. |
| Release an ECS instance from the ECS console or through an API call. | The node pool detects the release and automatically creates a replacement instance to maintain the expected count. | Use the ACK console to remove nodes. Automatic replacement can cause unexpected costs. For details, see Remove a node. Manually added and subscription nodes are not automatically released. |
| Remove an ECS instance from the Auto Scaling group without changing the expected count. | The node pool creates a replacement instance to maintain the expected count. | Do not directly manage the scaling group associated with the node pool. |
| A subscription ECS instance expires and is released. | The node pool creates a replacement instance, which can cause unexpected costs. | Handle expiring instances promptly: either remove the node or renew the subscription. |
| Enable health checks for the Auto Scaling group from the Auto Scaling console or through an API call. | A new ECS instance is created automatically whenever an unhealthy instance (such as a stopped instance) is detected. | By default, ACK does not enable Auto Scaling health checks. Do not directly manage the Auto Scaling group of a node pool. |
Scale-out error codes
If a scale-out fails, click the cluster name on the Clusters page, then click View Cause on the Cluster Tasks tab.
Instance stock and compatibility errors
| Error code | Cause | Solution |
|---|---|---|
| RecommendEmpty.InstanceTypeNoStock | ECS instance inventory in the current zone is insufficient. | Edit the node pool to add vSwitches in different zones and configure multiple instance types. For more information, see View the elastic strength of a node pool. |
| OperationDenied.NoStock | The selected instance type is out of stock in the specified zone. | Change the instance types in the node pool configuration. For more information, see View the elastic strength of a node pool. |
| InvalidResourceType.NotSupported | The instance type is not supported or is unavailable in the zone. | Call the DescribeAvailableResource operation to check instance type availability, then modify the instance type. |
| RecommendEmpty.DiskTypeNoStock | Disk inventory in the specified zone is insufficient. | Add more zones (vSwitches) to the node pool, or change the disk type. |
| RecommendEmpty.InstanceTypeNotAuthorized | The instance type requires authorization before use. | Submit a ticket to ECS to request authorization. |
Image and instance type compatibility errors
| Error code | Cause | Solution |
|---|---|---|
| InvalidParameter.NotMatch (Image bootMode) | The boot mode of the OS image is not compatible with the instance type. | Change the instance type. Click Details for the node pool to view the operating system and image ID on the Overview tab. Call DescribeImageSupportInstanceTypes to check compatible instance types. For supported images, see Operating system. |
| InvalidImage.NotSupported | The OS image does not support security-enhanced (vSGX) instances. | Change the instance type. For supported images, see Create a security-enhanced instance. |
| InvalidParameter.NotMatch (vTPM image) | The OS image does not support the security-enhanced instance family. | Change the instance type. For supported images, see Create a security-enhanced instance. |
| InvalidInstanceType.NotSupported | The instance type is not compatible with the OS image architecture. | Change the instance type. Click Details for the node pool to view the operating system and image ID on the Overview tab. For supported images, see Operating system. |
Disk and encryption errors
| Error code | Cause | Solution |
|---|---|---|
| InvalidParameter.Conflict | The instance type does not support the specified disk category. | Change the instance type or disk type. |
| NotSupportSnapshotEncrypted.DiskCategory | System disk encryption is supported only for Enhanced SSDs (ESSDs). | Select a supported disk type. For details, see Create and manage node pools. |
| InvalidParameter.KmsNotEnabled | The specified KMS key is not enabled. | Check the key status in the Key Management Service (KMS) console. |
| InvalidParameter.KMSKeyId.KMSUnauthorized | ECS is not authorized to access KMS. | Grant ECS the AliyunECSDiskEncryptDefaultRole service role in the ECS console. For details, see Permissions for encryption. |
Quota and billing errors
| Error code | Cause | Solution |
|---|---|---|
| QuotaExceed.ElasticQuota | The number of instances of the selected type exceeds your quota in the region. | Select other instance types, reduce the current instance count, or request a quota increase in the Quota Center. |
| QuotaExceeded.PrivateIpAddress | The vSwitch has insufficient available private IP addresses. | Add more vSwitches to the node pool. |
| InvalidAccountStatus.NotEnoughBalance | Insufficient account balance. | Add funds to your account. |
| InsufficientBalance.CreditPay | Insufficient account balance. | Add funds to your account. |
| Account.Arrearage | Insufficient account balance. | Add funds to your account. |
Cluster and API server errors
| Error code | Cause | Solution |
|---|---|---|
| NodepoolScaleFailed.FailedJoinCluster | The node failed to join the ACK cluster. | Log on to the node and run `grep cloud-init /var/log/messages` to view the execution log. |
| ApiServer.InternalError | The cluster API server is inaccessible. | Check whether the API server is available. For details, see Troubleshoot cluster access issues. |
| Err.QueryEndpoints | API server access failed. | Check whether the API server is available. For details, see Troubleshoot cluster access issues. |
| ApiServer.TooManyRequests | The API server is throttling the scale-out job. | Reduce the number of requests to the API server or retry later. |
Scaling activity errors
| Error code | Cause | Solution |
|---|---|---|
| ScalingActivityInProgress | The node pool is already undergoing a scaling activity. | Wait for the current activity to finish. Do not scale nodes directly from the Auto Scaling console. |
| Instance.StartInstanceFailed | The ECS instance failed to start. | Try the operation again later. If the issue persists, submit a ticket to ECS. |
| NodepoolScaleFailed.WaitForDesiredSizeTimeout | The scale-out task timed out. | View the scaling activity details: in the ACK console, click Clusters, find the cluster, and click its name. Choose Nodes > Node Pools, then click the node pool name. Check the Scaling Activities tab for details. |
| NodepoolScaleFailed.PartialSuccess | Some nodes were created, but others failed due to insufficient inventory. | Select different instance types and retry. For more information, see View the elastic strength of a node pool. |
References
Node pool O&M: upgrade node pools, enable automatic node recovery, and fix OS CVEs
Best practices for nodes and node pools: use deployment sets for high availability or create node pools from spot instances