ACK provides the auto scaling component (cluster-autoscaler) to automatically scale nodes in and out. Regular instances, GPU-accelerated instances, and preemptible instances can be automatically added to or removed from an ACK cluster based on your requirements. Auto scaling supports multiple scaling modes, various instance types, and instances in multiple zones, which meets the scaling requirements of different scenarios.
How it works
The auto scaling mechanism of Kubernetes is different from the traditional scaling model, which is based on resource usage. This difference is also one of the biggest hurdles that developers face when they migrate workloads from traditional data centers or other orchestration systems, such as Swarm, to Kubernetes.
The traditional scaling model is designed based on resource usage. Assume that a cluster has three nodes. When the CPU usage or memory usage exceeds the scaling threshold, new nodes are added to the cluster. However, you need to consider the following issues when you use the traditional scaling model:
- In a Kubernetes cluster, the resource usage of hotspot nodes is higher than that of the other nodes. If the average resource usage of the cluster is used as the scaling threshold, scale-out events may not be triggered in time. If the resource usage of the least loaded node is used as the threshold, the newly added nodes may remain idle after the scale-out event, which wastes resources.
- In a Kubernetes cluster, a pod is the smallest unit that runs an application, and pods are deployed across different nodes. When a scaling event is triggered for the cluster or a node, the pods with high resource usage are neither replicated nor given higher resource limits. As a result, the load on the nodes that host these pods is not reduced, because the existing workloads are not migrated to the newly added nodes.
- If the resource usage of a node is used to determine whether to trigger scale-in events for the node, pods that request large amounts of resources but use only a small portion of them may be evicted. If a cluster contains a large number of such pods, the allocatable resources of the cluster can be fully reserved by resource requests. As a result, some pods cannot be scheduled.
The auto scaling mechanism of Kubernetes addresses these issues by decoupling the scheduling of pods from the scaling of computing resources in a cluster.
In simple terms, pod replicas are scaled based on resource usage, which is how applications are scaled out and in. Node scaling is triggered differently: when the computing resources in the cluster are insufficient to schedule a pod, the pod becomes pending and a scale-out event is triggered. After new nodes are added to the cluster, the pending pods are automatically scheduled to these nodes, which reduces the load on the existing nodes. The following section describes the details of the auto scaling mechanism of Kubernetes:
The cluster-autoscaler component triggers scaling events by monitoring pending pods. A pod becomes pending when the computing resources in the cluster are insufficient to schedule it. In this case, cluster-autoscaler simulates the scheduling of the pending pod to determine which scaling group can provide a node on which the pod can be deployed. If a scaling group meets the requirement, nodes from this scaling group are added to the cluster.
In simple terms, each scaling group is abstracted into a node in the simulation. The instance specification of the scaling group determines the amount of CPU, memory, and GPU resources of the abstract node, and the labels and taints of the scaling group are also applied to the abstract node. The simulation then attempts to schedule the pending pod to the abstract node. If the pending pod can be scheduled to the abstract node, cluster-autoscaler calculates the number of instances that must be added from the scaling group to meet the scaling requirement.
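The following minimal sketch shows the kind of pod spec that the simulation evaluates. The resource requests, image, and the workload=batch taint are assumptions for illustration: the simulation checks whether the abstract node of a scaling group, based on its instance capacity, labels, and taints, can host such a pod.

```yaml
# Minimal sketch (hypothetical values): a pod that requests 2 vCPUs and 4 GiB of
# memory and tolerates a "workload=batch" taint. cluster-autoscaler checks whether
# the abstract node of a scaling group (its instance CPU and memory capacity and
# its configured taints) can host this pod.
apiVersion: v1
kind: Pod
metadata:
  name: demo-batch-pod
spec:
  containers:
  - name: app
    image: nginx:1.25            # example image
    resources:
      requests:
        cpu: "2"                 # must fit on one instance of the scaling group
        memory: 4Gi
  tolerations:
  - key: workload                # hypothetical taint configured on the scaling group
    operator: Equal
    value: batch
    effect: NoSchedule
```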
Only nodes that are added by scale-out events can be removed. Nodes that already exist in the cluster cannot be managed by cluster-autoscaler. Each node is evaluated separately to determine whether a scale-in event needs to be triggered. If the resource usage of a node drops below the scale-in threshold, cluster-autoscaler simulates the eviction of all workloads on the node to determine whether the node can be completely drained. Specific pods, such as non-DaemonSet pods in the kube-system namespace and pods that are protected by PodDisruptionBudgets (PDBs), prevent a node from being drained. If such pods run on a node, the node is not drained or removed, but other nodes that do not run such pods can still be scaled in. A node is drained before it is removed. Only after all pods on the node are evicted to other nodes is the node removed.
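For example, a PodDisruptionBudget like the following minimal sketch can block a scale-in event. The application label and the minAvailable value are assumptions for illustration: if evicting the selected pods during a node drain would drop the number of available pods below the budget, cluster-autoscaler does not remove that node, even if its resource usage is below the scale-in threshold.

```yaml
# Minimal sketch (hypothetical values): require at least two "app: web" pods to
# stay available. If draining a node would violate this budget, the node is not
# removed during a scale-in event.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```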
Different scaling groups represent abstract nodes of different specifications. A scoring mechanism similar to a scheduling policy is used to select among multiple scaling groups. The abstract nodes are first filtered by the scheduling policy, and the abstract nodes that pass the filter are then selected based on the affinity settings. If neither a scheduling policy nor affinity settings are configured, the least-waste policy is used by default to select among the abstract nodes. The least-waste policy selects the scaling group that leaves the fewest idle resources after the scale-out event. By default, if a scaling group of CPU instances and a scaling group of GPU-accelerated instances both meet the requirements, the scaling group of CPU instances is selected.
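As a hypothetical example of the least-waste policy: suppose a pending pod requests 2 vCPUs and 4 GiB of memory, and two scaling groups pass the scheduling simulation, one that provides instances with 4 vCPUs and 8 GiB of memory and one that provides instances with 8 vCPUs and 16 GiB of memory. The group with 4 vCPUs and 8 GiB of memory is selected, because it leaves only 2 vCPUs and 4 GiB of memory idle after the pod is scheduled, compared with 6 vCPUs and 12 GiB of memory for the larger instance type.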
The success rate of auto scaling depends on the following conditions:
- Whether the scaling group conforms to the scheduling policy
When you create a scaling group, you must know which pod scheduling policies the scaling group can satisfy. If you are not sure, you can simulate the scaling activity by comparing the node selectors of the pending pod with the labels of the scaling group, as shown in the sketch below.
- Whether computing resources are sufficient
After the scheduling simulation is complete, an eligible scaling group is used for the scale-out event. However, the scale-out fails if the Elastic Compute Service (ECS) instance types of the scaling group are out of stock. To avoid this issue and improve the success rate of auto scaling, configure instances of multiple zones and multiple instance types for the scaling group.
You can use the following methods to accelerate auto scaling:
- Method 1: Perform auto scaling in swift mode. After a scaling group has experienced a scale-in event and a scale-out event, the scaling group is in swift mode.
- Method 2: Use custom images that are developed based on Alibaba Cloud Linux 2 (previously known as Aliyun Linux 2). This speeds up the delivery of Infrastructure as a Service (IaaS) resources by 50%.
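The following minimal sketch shows how a pending pod can be matched to a scaling group by labels. The label key and value (workload-type: gpu-inference) are assumptions for illustration: the scaling group would be configured to add the same label to the nodes it creates, so that the scheduling simulation can place the pod on the abstract node of that scaling group.

```yaml
# Minimal sketch (hypothetical label): the pod's nodeSelector must be matched by
# a node label that the scaling group adds to the nodes it provisions.
apiVersion: v1
kind: Pod
metadata:
  name: demo-selector-pod
spec:
  nodeSelector:
    workload-type: gpu-inference   # must match a label configured on the scaling group
  containers:
  - name: app
    image: nginx:1.25              # example image
```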
Considerations
- For each account, the default CPU quota of pay-as-you-go instances in each region is 50 vCPUs. You can create up to 48 custom route entries in each route table in a virtual private cloud (VPC). To request a quota increase, submit a ticket.
- The stock availability of a specific ECS instance type greatly fluctuates. We recommend that you specify multiple instance types for a scaling group. This improves the success rate of auto scaling.
- In swift mode, when a node is shut down and reclaimed, it stops running and remains in the NotReady state. When a scale-out event is triggered, the state of the node is changed to Ready.
- When a node is shut down and reclaimed in swift mode, you are charged only for the storage costs of its disks. This rule does not apply to nodes that use local disks, such as instances of the ecs.d1ne.2xlarge type, for which computing costs are also charged. If sufficient resources are in stock, nodes can be launched within a short period of time.
- If elastic IP addresses (EIPs) are bound to pods, we recommend that you do not delete the scaling group or the ECS nodes that are added from the scaling group in the ECS console. Otherwise, these EIPs cannot be automatically released.
Step 1: Go to the Configure Auto Scaling page
Step 2: Authorization
You need to perform authorization in the following scenarios:
The current account has limited permissions on nodes in the cluster
- Activate Auto Scaling (ESS).
  - In the dialog box that appears, click the first hyperlink to go to the ESS console.
  - Click Activate Auto Scaling to go to the Enable Service page.
  - Select the I agree with Auto Scaling Agreement of Service check box and click Enable Now.
  - On the Activated page, click Console to go to the ESS console.
  - Click Go to Authorize to go to the Cloud Resource Access Authorization page, and then authorize ESS to access other cloud resources.
  - Click Agree to Authorization.
- Assign the RAM role.
The current account has unlimited permissions on nodes in the cluster
An auto-scaling node pool in the cluster must be associated with an EIP
If an auto-scaling node pool in the cluster must be associated with an elastic IP address (EIP), perform the following steps to grant permissions.
Step 3: Configure auto scaling
Check the results
FAQ
- Why does the auto scaling component fail to add nodes after a scale-out event is triggered?
Check for the following issues:
- Whether the instance types that are configured for the scaling group can provide the resources requested by the pods. System components are installed on each node by default and reserve a portion of the node resources. Therefore, the resources requested by the pods that are scheduled to a node must be less than the resource capacity of the instance type of the node.
- Whether you have performed the authorization steps as described in this topic. You must perform the authorization for each cluster that is involved in the scale-out event.
- Whether the cluster can access the Internet. Nodes in a scaling group require Internet access. This is because the auto scaling component needs to call Alibaba Cloud APIs over the Internet.
- Why does the auto scaling component fail to remove nodes after a scale-in event is triggered?
Check for the following issues:
- Whether the ratio of the requested resources to the resource capacity of each node is higher than the scale-in threshold.
- Whether pods in the kube-system namespace are running on the nodes.
- Whether the pods are configured with a scheduling policy that forces them to run on the current nodes and prevents them from being scheduled to other nodes.
- Whether the pods on the nodes are protected by a PodDisruptionBudget (PDB) and the number of available pods has reached the minimum value specified by the PDB.
For more frequently asked questions about the auto scaling component, visit the open source community.
- How does the system select from multiple scaling groups for a scale-out event?
When pods cannot be scheduled to nodes, the auto scaling component simulates the scheduling of the pods based on the configuration of each scaling group, including the labels, taints, and instance specifications. If a scaling group meets the requirements, it is selected for the scale-out event. If more than one scaling group meets the requirements, the system selects the scaling group that leaves the fewest idle resources after the simulation.