Container Service for Kubernetes (ACK) allows you to scale a node pool by modifying the expected number of nodes in the node pool. You can scale out node pools to meet the requirements of business development and scale in node pools to reduce resource costs. Node pool scaling can be automated to improve the O&M efficiency. This topic describes how to scale a node pool.
Prerequisites
- An ACK cluster is created.
- A kubectl client is connected to the ACK cluster as expected. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
- A node pool is created in the cluster. For more information, see Create a node pool.
Overview of node pool scaling
The expected number of nodes refers to the number of nodes to be retained in a node pool, that is, the number of nodes in the node pool when it reaches its final state. After you specify the expected number of nodes, the node pool is automatically scaled to that number.
- Scale out the node pool: Set the expected number of nodes to a value that is greater than the current value. The node pool is then automatically scaled out. We recommend this method for scale-out because the system automatically retries when it fails to add nodes to the node pool.
Note: The scale-out configuration varies based on the node pool configuration. The instance type and zone of the new nodes depend on the scaling policy that is used. For more information about node pool scaling policies, see Scaling policies.
The system performs the following steps to scale out a node pool:
- Add ECS instances: Auto Scaling, the underlying service used by ACK to scale node pools, automatically creates ECS instances. After you modify the expected number of nodes, ACK changes the expected number of instances in the scaling group of Auto Scaling accordingly to scale out the node pool. The status of the node pool changes to Expanding. After Auto Scaling creates the ECS instances, the status of the node pool changes to Activated. For more information, see Expected number of instances.
Important: Instances of the GPU-accelerated ECS Bare Metal Instance families ebmgn7 and ebmgn7e cannot automatically delete the Multi-Instance GPU (MIG) configuration. When ACK adds instances of these families, ACK automatically resets the MIG configuration retained on the instances. The reset may be time-consuming, and the instances may fail to be added to the cluster.
- For more information about how to troubleshoot the issue, see What do I do if I fail to add ECS Bare Metal instances that are equipped with NVIDIA A100 GPUs?.
- For more information about the ebmgn7e instance family, see ebmgn7e, GPU-accelerated compute-optimized ECS Bare Metal Instance family.
- For more information about how to enable the MIG feature, see Use a node pool to partition an NVIDIA A100 GPU into multiple GPU instances.
- Add the ECS instances to the cluster: After Auto Scaling creates the ECS instances, the instances automatically run the cloud-init script maintained by ACK to initialize the nodes and add them to the node pool. The operational log is saved to the /var/log/messages file on each node. You can log on to a node and run the grep cloud-init /var/log/messages command to view the log.
Note:
- After a node is added to the node pool, the operational log in the /var/log/messages file is automatically deleted. Therefore, the log records only information about failures to add nodes to the node pool.
- If the system fails to add a node to the node pool, the relevant log data in the /var/log/messages file is synchronized to the task result. You can view the task details on the Cluster Tasks tab of the cluster details page.
- Scale in the node pool: Set the expected number of nodes to a value that is smaller than the current value. The node pool is then automatically scaled in.
Note:
- When the system scales in a node pool:
- If the scaling policy is set to Priority, the most recently created ECS instances are removed from the scaling group first.
- If the scaling policy is set to Distribution Balancing, the zones where the ECS instances are deployed are filtered based on the policy. Then, the most recently created ECS instances are removed from the scaling group first to keep the numbers of ECS instances in the zones of the scaling group identical or close.
- If the scaling policy is set to Cost Optimization, ECS instances are removed from the scaling group in descending order of vCPU prices.
- When a scale-in activity is triggered by changing the expected number of nodes, ACK removes nodes without draining them first. If you want to drain the nodes before they are removed, see Remove a node.
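The scaling behavior described above can be sketched in code. The following is a simplified, hypothetical model for illustration only: the instance fields, policy names, and selection logic are assumptions rather than the actual Auto Scaling implementation. A reconciler compares the expected number of nodes with the current instance count, and a scale-in selector picks instances to remove according to the configured scaling policy.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    id: str
    zone: str
    created_at: int      # creation timestamp; larger = newer
    vcpu_price: float    # hourly vCPU price (hypothetical field)

def reconcile(expected: int, instances: list) -> tuple:
    """Compare the expected node count with the current instance
    count and decide the scaling direction (simplified model)."""
    delta = expected - len(instances)
    if delta > 0:
        return ("scale_out", delta)
    if delta < 0:
        return ("scale_in", -delta)
    return ("steady", 0)

def select_for_scale_in(instances, count, policy):
    """Pick instances to remove, mimicking the documented policies."""
    pool = list(instances)
    removed = []
    for _ in range(count):
        if policy == "Priority":
            # The newest instances are removed first.
            victim = max(pool, key=lambda i: i.created_at)
        elif policy == "Distribution Balancing":
            # Remove the newest instance from the most populated zone
            # so that zone counts stay identical or close.
            zones = {}
            for i in pool:
                zones.setdefault(i.zone, []).append(i)
            busiest = max(zones.values(), key=len)
            victim = max(busiest, key=lambda i: i.created_at)
        elif policy == "Cost Optimization":
            # Remove instances in descending order of vCPU price.
            victim = max(pool, key=lambda i: i.vcpu_price)
        else:
            raise ValueError(f"unknown policy: {policy}")
        pool.remove(victim)
        removed.append(victim.id)
    return removed
```

This model only illustrates the decision logic; the real service also considers factors such as instance health and zone availability that are omitted here.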
Procedure
- Log on to the ACK console and click Clusters in the left-side navigation pane.
- On the Clusters page, click the name of the cluster and choose Nodes > Node Pools in the left-side navigation pane.
- Find the node pool that you want to scale and click Scale in the Actions column.
- Create the AliyunOOSLifecycleHook4CSRole role to grant permissions to access Operation Orchestration Service (OOS).
- Set the Expected Nodes parameter and click Confirm.
- If the node pool is in the Expanding state in the node pool list, the system is scaling out the node pool. When the status changes to Activated, the scale-out is complete.
- If the node pool is in the Removing state in the node pool list, the system is scaling in the node pool. When the status changes to Activated, the scale-in is complete.
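If you prefer automation over the console steps above, the expected number of nodes can also be updated through the ACK OpenAPI (the ModifyClusterNodePool operation). The snippet below only builds a request sketch; the endpoint path and the scaling_group.desired_size field are based on our reading of the ACK API reference, so verify them against the current API documentation before use.

```python
import json

def build_modify_nodepool_request(cluster_id: str, nodepool_id: str,
                                  desired_size: int) -> dict:
    """Build an HTTP request sketch for the ACK ModifyClusterNodePool
    operation that sets the expected number of nodes (desired_size).
    The path and body shape are assumptions to verify against the
    current ACK API reference."""
    if desired_size < 0:
        raise ValueError("desired_size must be non-negative")
    return {
        "method": "PUT",
        "path": f"/clusters/{cluster_id}/nodepools/{nodepool_id}",
        "body": json.dumps({"scaling_group": {"desired_size": desired_size}}),
    }
```

In practice you would send this request through an authenticated Alibaba Cloud SDK or CLI client rather than constructing it by hand.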
Unrecommended operations and solutions
Unrecommended operation | Node pool behavior | Suggestion |
---|---|---|
Remove nodes by running the kubectl delete node command. | ACK compares the expected number of nodes only with the number of ECS instances in the scaling group, not with the actual number of nodes in the cluster. If you use the API server to remove nodes, the ECS instances that host the nodes are not released. As a result, the actual number of nodes in the node pool does not change, but the status of the removed nodes changes to Unknown. | |
Manually release ECS instances in the ECS console or by calling the API. | Node pools are aware of the release of ECS instances and automatically create new ECS instances to reach the expected number of nodes. | |
Remove ECS instances from the scaling group in the Auto Scaling console without changing the expected number of instances. | Node pools are aware of the release of ECS instances and automatically create new ECS instances to reach the expected number of nodes. | Do not modify the scaling groups used by node pools. Otherwise, the node pools may not function as expected. |
ECS instances are automatically released when the subscription expires. | Node pools are aware of the release of ECS instances and automatically create new ECS instances to reach the expected number of nodes. | ACK compares the expected number of nodes with the actual number of nodes in the node pool to detect released ECS instances and create new ones. This helps avoid business losses. We recommend that you remove or renew subscription ECS instances that are about to expire at the earliest opportunity. |
Use the Auto Scaling console or API to enable health checks for the scaling group. | After you enable health checks for the scaling group, the system automatically creates new ECS instances when it identifies unhealthy ECS instances, such as suspended ECS instances. | By default, health checks are disabled for the scaling groups used by ACK. ECS instances are added to ACK clusters only when nodes are released. Do not modify the scaling groups used by node pools. Otherwise, the node pools may not function as expected. |
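The first two rows of the table hinge on what ACK actually compares. The sketch below is a hypothetical model (field names and logic are our assumptions) showing why releasing an ECS instance triggers a replacement while kubectl delete node does not: replacement is driven by the scaling group count, not the cluster node list.

```python
def observed_counts(scaling_group_instances, cluster_nodes):
    """Hypothetical view of the two counts involved: ACK reconciles
    the expected node count against the scaling group, not against
    the cluster's node objects."""
    return {"scaling_group": len(scaling_group_instances),
            "cluster": len(cluster_nodes)}

def replacement_needed(expected, scaling_group_instances):
    """New instances are created only when the scaling group falls
    below the expected number of nodes (simplified model)."""
    return max(0, expected - len(scaling_group_instances))

# kubectl delete node shrinks the cluster node list but leaves the
# scaling group untouched, so no replacement is created and the
# removed node's ECS instance keeps running (status Unknown).
# Releasing an ECS instance shrinks the scaling group, so the node
# pool creates a replacement to reach the expected number again.
```
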
Error codes for scaling failures and solutions
Node pool scaling may fail due to reasons such as insufficient inventory. You can click the name of your ACK cluster on the Clusters page of the ACK console, click the Cluster Tasks tab, and then click View Cause to view the cause of a node pool scaling failure.
Error code | Cause | Solution |
---|---|---|
RecommendEmpty.InstanceTypeNoStock | The inventory of ECS instances in the current zone is insufficient. | Modify the node pool by specifying vSwitches in different zones and selecting multiple ECS instance types to improve the success rate of node creation. Note: On the Node Pools page, click the name of the node pool that you want to manage. The scalability of the node pool is displayed next to Scaling Group on the Overview tab. You can estimate the success rate of scaling the node pool based on the scalability. |
NodepoolScaleFailed.FailedJoinCluster | Nodes fail to be added to the cluster. | You can log on to one of the nodes and run the grep cloud-init /var/log/messages command to view the operational log and check the error message. |
InvalidAccountStatus.NotEnoughBalance | Your account does not have a sufficient balance. | Top up your account first. |
InvalidParameter.NotMatch | The specified parameters do not match. | Select another instance type. |
QuotaExceed.ElasticQuota | The number of ECS instances created based on the specified instance type in the current region has exceeded the quota limit. | Reduce the number of nodes to be created, or request a quota increase in the Quota Center. |
InvalidResourceType.NotSupported | The specified instance type is not supported in the current zone or out of stock. | You can call the DescribeAvailableResource operation to query the instance types supported in the current zone and change the instance type used by the node pool. |
InvalidImage.NotSupported | The specified image is not supported by the instance type. | Select another instance type. |
QuotaExceeded.PrivateIpAddress | The idle private IP addresses provided by the current vSwitch are insufficient. | Specify more vSwitches for the node pool and try again. |
InvalidParameter.KmsNotEnabled | The specified Key Management Service (KMS) key is disabled. | Log on to the KMS console and enable the key. |
InvalidInstanceType.NotSupported | The specified instance type is not supported. | Select another instance type. |
InsufficientBalance.CreditPay | Your account does not have a sufficient balance. | Top up your account first. |
ApiServer.InternalError | An internal error occurred in the API server of the cluster. | Check whether the API server is accessible or available. For more information, see ACK console troubleshooting (cluster access exceptions). |
RecommendEmpty.InstanceTypeNotAuthorized | You do not have the permissions to use the specified instance type. | Submit a ticket to acquire the required permissions on ECS. |
Account.Arrearage | Your account does not have a sufficient balance. | Top up your account first. |
Err.QueryEndpoints | Access to the API server of the ACK cluster fails. | Check whether the API server is accessible or available. For more information, see ACK console troubleshooting (cluster access exceptions). |
RecommendEmpty.DiskTypeNoStock | The inventory of disks is insufficient in the specified zone. | Specify more vSwitches for the node pool or select another disk type. |
InvalidParameter.KMSKeyId.KMSUnauthorized | You do not have the permissions to access KMS. | Log on to the ECS console and assign the AliyunECSDiskEncryptDefaultRole role to ECS. For more information, see AliyunECSDiskEncryptDefaultRole. |
InvalidParameter.Conflict | The specified instance type conflicts with the specified disk type. | Select another instance type or disk type. |
NotSupportSnapshotEncrypted.DiskCategory | System disk encryption supports only Enhanced SSDs (ESSDs). | Select another disk type. For more information about disk types and disk encryption, see Create a node pool. |
InvalidOperation.VpcHasEnabledAdvancedNetworkFeature | ECS instances of low specifications cannot be created in the virtual private cloud (VPC) because advanced features are enabled for the VPC. | For more information about instance types supported by VPCs, see Advanced VPC features. |
ScalingActivityInProgress | Try again later because the node pool is being scaled. | To avoid scaling conflicts, do not scale node pools in the Auto Scaling console. |
Instance.StartInstanceFailed | The ECS instances fail to start. | Try again later. To troubleshoot the issue, submit a ticket to the ECS team. |
OperationDenied.NoStock | The current ECS instance type is out of stock in the specified zone. | Select another instance type. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool. |
RecommendEmpty.InstanceTypeNoStock | The current ECS instance type is out of stock in the specified zone. | Select another instance type. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool. |
NodepoolScaleFailed.WaitForDesiredSizeTimeout | The scale-out task timed out. | View the task details on the Cluster Tasks tab of the cluster details page to identify the cause. |
ApiServer.TooManyRequests | The task is throttled by the Kubernetes API server of the cluster. | Reduce the request frequency or try again later. |
NodepoolScaleFailed.PartialSuccess | Some nodes failed to be created due to insufficient inventory. | Change the instance types used by the node pool and then try again. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool. |
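When automating node pool scaling, it can help to classify the error codes above into those worth an automatic retry and those needing a configuration or account change. The grouping below is a sketch derived from the table; it is our assumption, not an official ACK classification.

```python
# Transient failures: back off and retry the scaling task.
RETRYABLE = {
    "ScalingActivityInProgress",      # wait for the current scaling activity
    "ApiServer.TooManyRequests",      # throttled by the API server
    "Instance.StartInstanceFailed",   # transient start failure
}

# Failures fixed by changing the node pool configuration
# (instance types, disk types, vSwitches, zones).
NEEDS_CONFIG_CHANGE = {
    "RecommendEmpty.InstanceTypeNoStock",
    "OperationDenied.NoStock",
    "InvalidResourceType.NotSupported",
    "QuotaExceeded.PrivateIpAddress",
    "RecommendEmpty.DiskTypeNoStock",
    "NodepoolScaleFailed.PartialSuccess",
}

# Failures fixed at the account level (balance, quotas).
NEEDS_ACCOUNT_ACTION = {
    "InvalidAccountStatus.NotEnoughBalance",
    "InsufficientBalance.CreditPay",
    "Account.Arrearage",
    "QuotaExceed.ElasticQuota",
}

def triage(error_code: str) -> str:
    """Map a scaling failure code to a coarse handling category."""
    if error_code in RETRYABLE:
        return "retry"
    if error_code in NEEDS_CONFIG_CHANGE:
        return "change node pool configuration"
    if error_code in NEEDS_ACCOUNT_ACTION:
        return "account action required"
    return "inspect cluster task details"
```

Codes not listed, such as NodepoolScaleFailed.FailedJoinCluster, fall back to inspecting the task details on the Cluster Tasks tab as described above.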