Container Service for Kubernetes: Auto scaling of nodes

Last Updated: Jan 15, 2024

Container Service for Kubernetes (ACK) provides the auto scaling component (cluster-autoscaler) to automatically scale nodes. Regular instances, GPU-accelerated instances, and preemptible instances can be automatically added to or removed from an ACK cluster to meet your business requirements. This component supports multiple scaling modes, various instance types, and instances that are deployed across zones. This component is applicable to diverse scenarios.

How auto scaling works

The auto scaling model of Kubernetes is different from the traditional scaling model that is based on resource utilization thresholds. Developers must understand the differences between the two scaling models before they migrate workloads from traditional data centers or other orchestration systems, such as Swarm clusters, to ACK clusters.

The traditional scaling model is based on resource utilization. For example, if a cluster contains three nodes and the CPU utilization or memory utilization of the nodes exceeds the scaling threshold, new nodes are automatically added to the cluster. However, you must consider the following issues when you use the traditional scaling model:

Issue 1: How is a resource utilization threshold specified and applied?

In a cluster, hot nodes may have high resource utilization and other nodes may have low resource utilization. If the average resource utilization is specified as the threshold, auto scaling may not be triggered in a timely manner. If the lowest node resource utilization is set as the scaling threshold, the newly added nodes may not be used. This may cause a waste of resources.

Issue 2: How are loads balanced after instances are added?

In Kubernetes, a pod is used as the smallest unit that runs an application on each node of a cluster. When auto scaling is triggered for a cluster or a node in the cluster, pods with high resource utilization are not replicated and the resource limits of these pods are not changed. As a result, the loads cannot be balanced to newly added nodes.

Issue 3: How is a scale-in activity triggered and implemented?

If scale-in activities are triggered based on resource utilization, pods that request large amounts of resources but have low resource utilization may be evicted. If the number of these pods is large within a Kubernetes cluster, resources may be exhausted and some pods may fail to be scheduled.

How does the auto scaling model of Kubernetes fix these issues? Kubernetes provides a two-layer scaling model that decouples pod scheduling from resource scaling.

Pods are scaled based on resource utilization. When pods enter the Pending state due to insufficient resources, a scale-out activity is triggered. After new nodes are added to the cluster, the pending pods are automatically scheduled to the newly added nodes. This way, loads of the application are balanced. The following section describes the auto scaling model of Kubernetes in detail:

1. How are nodes selected during a scale-out activity?

cluster-autoscaler is used to trigger auto scaling by detecting pending pods. When pods enter the Pending state due to insufficient resources, cluster-autoscaler simulates pod scheduling to decide the scaling group that can provide new nodes to accept the pending pods. If a scaling group meets the requirement, nodes from this scaling group are added to the cluster.

A scaling group is treated as a node during the simulation. The instance type of the scaling group specifies the CPU, memory, and GPU resources of the node. The labels and taints of the scaling group are also applied to the node. The node is used to simulate the scheduling of the pending pods. If the pending pods can be scheduled to the node, cluster-autoscaler calculates the number of nodes that are required to be added from the scaling group.
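
As an illustration of how pending pods drive a scale-out, the following is a minimal sketch of a workload whose resource requests exceed the idle capacity of the existing nodes; the name and image are hypothetical. The new replicas stay in the Pending state, which is what cluster-autoscaler detects before it runs the scheduling simulation described above.

```yaml
# Hypothetical workload: if no existing node has 3 idle vCPUs and 4 GiB of idle
# memory, the new replicas stay Pending and cluster-autoscaler starts a
# scale-out simulation against each scaling group.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-out-demo        # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: scale-out-demo
  template:
    metadata:
      labels:
        app: scale-out-demo
    spec:
      containers:
      - name: app
        image: nginx:1.25     # placeholder image
        resources:
          requests:
            cpu: "3"          # request that may exceed the idle CPU of existing nodes
            memory: 4Gi
```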

2. How is a scale-in activity triggered?

Only nodes added by scale-out activities can be removed in scale-in activities. Static nodes cannot be managed by cluster-autoscaler. Each node is separately evaluated to determine whether the node needs to be removed. If the resource utilization of a node drops below the scale-in threshold, a scale-in activity is triggered for the node. In this case, cluster-autoscaler simulates the eviction of all workloads on the node to determine whether the node can be completely drained. cluster-autoscaler does not drain the nodes that contain specific pods, such as non-DaemonSet pods in the kube-system namespace and pods that are controlled by PodDisruptionBudgets (PDBs). A node is drained before it is removed. After pods on the node are evicted to other nodes, the node can be removed.
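
As an example of the PodDisruptionBudget restriction mentioned above, the following minimal sketch (hypothetical name and label) keeps at least two pods of the app=web workload available. If evicting a pod from a drain candidate would violate this budget, cluster-autoscaler skips that node during scale-in.

```yaml
# A PodDisruptionBudget that keeps at least 2 pods with the label app=web
# available. Nodes whose drain would violate this budget are not removed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb               # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web                # hypothetical label
```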

3. How does cluster-autoscaler select among multiple scaling groups?

Each scaling group is regarded as an abstract node. cluster-autoscaler selects a scaling group for auto scaling based on a policy similar to the scheduling policy. Nodes are first filtered by the scheduling policy. Among the filtered nodes, the nodes that conform to policies, such as affinity settings, are selected. If no scheduling policy or affinity settings are configured, cluster-autoscaler selects a scaling group based on the least-waste policy. The least-waste policy selects the scaling group that has the fewest idle resources after simulation. If a scaling group of regular nodes and a scaling group of GPU-accelerated nodes both meet the requirements, the scaling group of regular nodes is selected by default.
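
The resource requests of the pending pods also narrow down the candidate scaling groups. In the following sketch (hypothetical name and placeholder image), the pod requests a GPU, so only a GPU-accelerated scaling group passes the simulation; a CPU-only pod, by contrast, lets cluster-autoscaler prefer the regular-node scaling group by default.

```yaml
# Only a GPU-accelerated scaling group can provide the nvidia.com/gpu resource,
# so regular-node scaling groups are filtered out during the simulation.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference                              # hypothetical name
spec:
  containers:
  - name: inference
    image: registry.example.com/inference:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1                          # GPU request
```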

4. How do I improve the success rate of auto scaling?

The result of auto scaling is dependent on the following factors:

  • Whether the scheduling policy is met

    After you configure a scaling group, you must be aware of the pod scheduling policies that the scaling group supports. If you are unsure, you can simulate a scaling activity by comparing the node selectors and tolerations of the pending pods with the labels and taints of the scaling group (see the example after this list).

  • Whether resources are sufficient

    After the scaling simulation is complete, a scaling group is selected. However, the scaling activity fails if the specified types of Elastic Compute Service (ECS) instances in the scaling group are out of stock. Therefore, you can configure multiple instance types and multiple zones for the scaling group to improve the success rate of auto scaling.
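
For example, the following pod (hypothetical name; the workload-type label key is an assumption) can only trigger a scale-out of a scaling group whose node label configuration contains workload-type=batch. If no scaling group carries this label, the pod stays Pending and no nodes are added.

```yaml
# The nodeSelector must match a node label that is configured on the scaling
# group; otherwise the scaling group cannot accept this pod.
apiVersion: v1
kind: Pod
metadata:
  name: batch-job             # hypothetical name
spec:
  nodeSelector:
    workload-type: batch      # assumed node label of the scaling group
  containers:
  - name: job
    image: busybox:1.36
    command: ["sh", "-c", "echo processing; sleep 3600"]
```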

5. How do I accelerate auto scaling?

  • Method 1: Enable the swift mode to accelerate auto scaling. The swift mode takes effect for a scaling group after the scaling group has completed a scale-out activity and a scale-in activity.

  • Method 2: Use custom images that are created from the base image of Alibaba Cloud Linux 2 (formerly known as Aliyun Linux 2). This ensures that Infrastructure as a Service (IaaS) resources are delivered 50% faster.

Considerations

  • You can add up to 200 custom routes to a route table of a VPC. To increase the quota limit, log on to the Quota Center console and submit an application. For more information about the quotas of other resources and how to increase the quota limits, see Quota limits on underlying cloud resources.

  • The stock of ECS instances may be insufficient for auto scaling if you specify only one ECS instance type for a scaling group. We recommend that you specify multiple ECS instance types with the same specification for a scaling group. This increases the success rate of auto scaling.

  • In swift mode, when a node is shut down and reclaimed, the node stops running and enters the NotReady state. When a scale-out activity is triggered, the status of the node changes to Ready.

  • If a node is shut down and reclaimed in swift mode, you are charged only for the disks. This rule does not apply to nodes that use local disks, such as the ecs.d1ne.2xlarge instance type, for which you are also charged a computing fee. If the stock of nodes is sufficient, nodes can be launched within a short period of time.

  • If elastic IP addresses (EIPs) are bound to pods, we recommend that you do not delete the ECS nodes that are added from the scaling group in the ECS console. Otherwise, these EIPs cannot be automatically released.

  • Auto Scaling can recognize node labels and taints only after they are mapped to scaling group tags. The number of tags that can be added to a scaling group is limited. Make sure that the total number of ECS tags, taints, and node labels configured for a node pool that has the auto scaling feature enabled is fewer than 12.

Step 1: Enable auto scaling for the cluster

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Node Pools in the left-side navigation pane.

  3. On the Node Pools page, click Enable on the right side of Configure Auto Scaling to configure auto scaling.

Step 2: Perform authorization

Note

The following procedure for activating Auto Scaling is for reference only. Follow the instructions on the page when you activate Auto Scaling.

  1. Activate Auto Scaling.

    1. In the dialog box that appears, click the hyperlink next to Auto Scaling to log on to the Auto Scaling console.

    2. Click Activate Auto Scaling to go to the Enable Service page.

    3. Select the I agree with Auto Scaling Agreement of Service check box and click Enable Now.

    4. On the Activated tab, click Console to go to the Auto Scaling console.

    5. Click Go to Authorize to go to the Cloud Resource Access Authorization page. Then, authorize Auto Scaling to access other cloud resources.

    6. Click Confirm Authorization Policy.

  2. Assign a RAM role to ACK.

    1. Click the hyperlink in the auto scaling configuration dialog box that appears in Step 1: Enable auto scaling for the cluster to complete authorization.

      Note

      For an ACK dedicated cluster, follow the instructions on the page to attach the AliyunCSManagedAutoScalerRolePolicy policy to the cluster.

    2. On the Cloud Resource Access Authorization page, click Confirm Authorization Policy.

Step 3: Configure auto scaling

  1. On the Configure Auto Scaling page, set the following parameters and click OK.

    Node Pool Scale-out Policy

    • Random Policy: If multiple node pools meet the requirement, this policy selects a random node pool for the scale-out activity.

    • Default Policy: If multiple node pools meet the requirement, this policy selects the node pool that will have the least idle resources after the scale-out activity is completed.

    • Priority-based Policy: If multiple node pools meet the requirement, this policy selects the node pool with the highest priority for the scale-out activity. For more information about how to specify scale-out priorities for node pools, see Configure a priority-based policy.

      Note

      You can specify a scale-out priority for a node pool only after the node pool is created.

    Scan Interval

    You can set this parameter to configure the interval at which the cluster is evaluated for scaling. Valid values: 15s, 30s, 60s, 120s, 180s, and 300s. Default value: 60s.

    Allow Scale-in

    Specify whether to allow the scale-in of nodes. If you turn off this option, scale-in configurations do not take effect. Proceed with caution.

    Scale-in Threshold

    For a scaling group that is managed by cluster-autoscaler, the scale-in threshold is compared with the ratio of the resources requested on a node to the total resources of the node. A node is removed from the cluster only if this ratio is lower than the threshold. For example, if the threshold is set to 50% and the pods on a node request 1 of 4 vCPUs (25%) and 2 of 8 GiB of memory (25%), the node becomes a scale-in candidate.

    Note

    In auto scaling, scale-out activities are triggered automatically by pod scheduling: nodes are added when pods cannot be scheduled due to insufficient resources. Therefore, you need to set only scale-in parameters.

    • For nodes other than GPU-accelerated nodes, a scale-in activity is triggered only if all of the following conditions are met:

      • The ratio of the requested resources per node to the total resources per node in the scaling group managed by cluster-autoscaler is lower than the value of the Scale-in Threshold parameter.

        Note

        Node resources include CPU and memory resources. The utilization of these resources must be lower than the scale-in threshold.

      • The waiting period specified in the Defer Scale-in For parameter ends.

      • The amount of time that the system waits after performing a scale-out activity exceeds the value specified in the Cooldown parameter.

    • For GPU-accelerated nodes, a scale-in activity is triggered only if all of the following conditions are met:

      • The ratio of the requested resources per node to the total resources per node in the scaling group managed by cluster-autoscaler is lower than the value of the GPU Scale-in Threshold parameter.

        Note

        Node resources include CPU and memory resources. The utilization of these resources must be lower than the GPU scale-in threshold.

      • The waiting period specified in the Defer Scale-in For parameter ends.

      • The amount of time that the system waits after performing a scale-out activity exceeds the value specified in the Cooldown parameter.

    GPU Scale-in Threshold

    The scale-in threshold for GPU-accelerated nodes. A GPU-accelerated node can be removed from the cluster only if the ratio of the resources requested on the node to the total resources of the node is lower than this threshold.

    Defer Scale-in For

    The time to wait after the scale-in threshold is reached and before the scale-in activity starts. Unit: minutes. The default value is 10 minutes.

    Cooldown

    After the system performs a scale-out activity, the system waits for a cooldown period to end before it can perform scale-in activities. The system cannot perform scale-in activities within the cooldown period but can still check whether the nodes meet the scale-in conditions. After the cooldown period ends, if a node meets the scale-in conditions and the waiting period specified in the Defer Scale-in For parameter ends, the node is removed.

    For example, the Cooldown parameter is set to 10 minutes and the Defer Scale-in For parameter is set to 5 minutes. The system cannot perform scale-in activities within the 10-minute cooldown period after performing a scale-out activity. However, the system can still check whether the nodes meet the scale-in conditions within the cooldown period. When the cooldown period ends, the nodes that meet the scale-in conditions are removed after 5 minutes.

    Click Advanced Scale-in Settings and configure the parameters. A reference sketch that maps these options to flags of the open source cluster-autoscaler follows this table.

    Pod Termination Timeout

    The timeout period for terminating pods when cluster-autoscaler removes nodes. Unit: seconds.

    Minimum Number of Replicated Pods

    The minimum number of pods that must be kept for each ReplicaSet during node draining.

    Evict DaemonSet Pods

    Specify whether to evict DaemonSet pods. If you turn on this option, DaemonSet pods on the node are evicted during a scale-in activity.

    Skip Nodes Hosting Kube-system Pods

    Specify whether to remove nodes that host kube-system pods during a scale-in activity.

    Note

    If you turn on Skip Nodes Hosting Kube-system Pods, cluster-autoscaler does not remove nodes that host kube-system pods. Nodes that host DaemonSet pods and mirror pods are still removed.
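
    For reference only, the advanced scale-in options roughly correspond to command-line flags of the open source cluster-autoscaler. The mapping below is an assumption for illustration; ACK manages the component for you and you configure these options in the console, so you normally do not edit the cluster-autoscaler Deployment yourself.

```yaml
# Assumed mapping to open source cluster-autoscaler flags (illustration only):
#   Pod Termination Timeout               -> --max-graceful-termination-sec
#   Minimum Number of Replicated Pods     -> --min-replica-count
#   Skip Nodes Hosting Kube-system Pods   -> --skip-nodes-with-system-pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.example.com/cluster-autoscaler:latest   # placeholder image
        command:
        - ./cluster-autoscaler
        - --max-graceful-termination-sec=14400   # pod termination timeout, in seconds
        - --min-replica-count=1                  # minimum replicas kept per ReplicaSet
        - --skip-nodes-with-system-pods=true     # skip nodes that host kube-system pods
```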

  2. On the right side of the page, click Create Node Pool.

  3. In the Create Node Pool dialog box, set the parameters for the scaling group.

    For more information about the parameters, see Create an ACK managed cluster. The following table describes some of the parameters.

    Region

    The region where you want to deploy the scaling group. The scaling group and the Kubernetes cluster must be deployed in the same region. You cannot change the region after the scaling group is created.

    VPC

    The scaling group and the Kubernetes cluster must be deployed in the same VPC.

    vSwitch

    The vSwitches of the scaling group. You can specify the vSwitches of different zones. The vSwitches allocate pod CIDR blocks to the scaling group.

    Auto Scaling

    Select the node type based on your requirements. You can select Regular Instance, GPU Instance, Shared GPU Instance, or Preemptible Instance. The selected node type must be the same as the node type that you select when you create the cluster.

    Instance Type

    The instance types that are used by the scaling group.

    Selected Types

    The instance types that you selected. You can select at most 10 instance types.

    System Disk

    The system disk of the scaling group.

    Mount Data Disk

    Specify whether to mount data disks to the scaling group. By default, no data disk is mounted.

    Instances

    The number of instances contained in the scaling group.

    Note
    • Existing instances in the cluster are excluded.

    • By default, the minimum number of instances is 0. If you specify one or more instances, the system adds the instances to the scaling group. When a scale-out activity is triggered, the instances in the scaling group are added to the cluster with which the scaling group is associated.

    Operating System

    When you enable auto scaling, you can select an image based on Alibaba Cloud Linux, CentOS, Windows, or Windows Core.

    Note

    If you select an image based on Windows or Windows Core, the system automatically adds the taint { effect: 'NoSchedule', key: 'os', value: 'windows' } to nodes in the scaling group.

    Key Pair

    The key pair that is used to log on to the nodes in the scaling group. You can create key pairs in the ECS console.

    Note

    You can log on to the nodes only by using key pairs.

    RDS Whitelist

    The ApsaraDB RDS instances that can be accessed by the nodes in the scaling group after a scaling activity is triggered.

    Node Label

    Node labels are automatically added to nodes that are added to the cluster by scale-out activities.

    Scaling Policy

    • Priority: The system scales the node pool based on the priorities of the vSwitches that you select for the node pool. The vSwitches that you select are displayed in descending order of priority. If Auto Scaling fails to create ECS instances in the zone of the vSwitch with the highest priority, Auto Scaling attempts to create ECS instances in the zone of the vSwitch with a lower priority.

    • Cost Optimization: The system creates instances based on the vCPU unit prices in ascending order. Preemptible instances are preferentially created when multiple preemptible instance types are specified in the scaling configurations. If preemptible instances cannot be created due to reasons such as insufficient stocks, the system attempts to create pay-as-you-go instances.

      If you select Preemptible Instance for the Billing Method parameter, you must set the following parameters:

      • Percentage of Pay-as-you-go Instances: Specify the percentage of pay-as-you-go instances in the node pool. Valid values: 0 to 100.

      • Enable Supplemental Preemptible Instances: After you enable this feature, Auto Scaling receives a notification 5 minutes before the system reclaims existing preemptible instances and automatically creates the same number of new preemptible instances to replace them.

      • Enable Supplemental Pay-as-you-go Instances: After you enable this feature, Auto Scaling attempts to create pay-as-you-go ECS instances to meet the scaling requirement if Auto Scaling fails to create preemptible instances for reasons such as that the unit price is too high or preemptible instances are out of stock.

    • Distribution Balancing: The even distribution policy takes effect only when you select multiple vSwitches. This policy ensures that ECS instances are evenly distributed among the zones (the vSwitches) of the scaling group. If ECS instances are unevenly distributed across the zones due to reasons such as insufficient stocks, you can perform a rebalancing operation.

    Scaling Mode

    You can select Standard or Swift.

    • Standard: the standard mode. Auto scaling is implemented by creating and releasing ECS instances based on resource requests and usage.

    • Swift: the swift mode. Auto scaling is implemented by creating, stopping, and starting ECS instances. This mode accelerates scaling activities.

      Note

      If a stopped ECS instance fails to be restarted in swift mode, the ECS instance is not released. You can manually release the ECS instance.

    Taints

    After you add taints to a node, ACK no longer schedules pods to the node unless the pods tolerate the taints. See the toleration example after this table.
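
    As noted for the Operating System and Taints parameters above, pods are scheduled to tainted nodes only if they tolerate the taints. The following sketch (hypothetical pod name, placeholder image) tolerates the os=windows:NoSchedule taint that is added to Windows nodes and selects Windows nodes through the standard kubernetes.io/os label.

```yaml
# A pod that tolerates the os=windows:NoSchedule taint added to Windows nodes.
apiVersion: v1
kind: Pod
metadata:
  name: windows-app           # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/os: windows # standard OS label on Windows nodes
  tolerations:
  - key: os
    operator: Equal
    value: windows
    effect: NoSchedule
  containers:
  - name: app
    image: mcr.microsoft.com/windows/servercore/iis   # placeholder Windows image
```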

  4. Click Confirm Order to create the scaling group.

  5. Optional. Configure a priority-based policy. On the Node Pools page, click Edit on the right side of Configure Auto Scaling. Set Node Pool Scale-out Priority and click OK.

    Note

    The priority must be an integer from 1 to 100.

Expected result

  1. On the Node Pools page, you can view the newly created scaling group below Regular Instance.

  2. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.

  3. In the left-side navigation pane of the details page, choose Workloads > Deployments.

  4. On the Deployments page, select the kube-system namespace. You can find the cluster-autoscaler component. This indicates that the scaling group is created.

FAQ

Why does cluster-autoscaler fail to add nodes after a scale-out activity is triggered?

Check whether the following situations exist:

  • The instance types in the scaling group cannot fulfill the resource requests of the pods. Some of the resources provided by the specified ECS instance type are reserved or occupied by system components, so the allocatable resources of a node are smaller than the instance specifications.

  • Cross-zone scale-out activities cannot be triggered for pods that have limits on zones.

  • The RAM role does not have the permissions to manage the Kubernetes cluster. You must configure RAM roles for each Kubernetes cluster that is involved in the scale-out activity. For more information about the authorization, see Step 2: Perform authorization.

  • The following issues occur when you activate Auto Scaling:

    • The instance fails to be added to the cluster and a timeout error occurs.

    • The node is not ready and a timeout error occurs.

    To ensure that nodes can be accurately scaled, cluster-autoscaler does not perform any scaling activities before it fixes the abnormal nodes.

Why does cluster-autoscaler fail to remove nodes after a scale-in activity is triggered?

Check whether the following situations exist:

  • The ratio of the resources requested on the node to the total resources of the node is higher than the specified scale-in threshold.

  • Pods that belong to the kube-system namespace are running on the node.

  • A scheduling policy forces the pods to run on the current node. Therefore, the pods cannot be scheduled to other nodes.

  • A PodDisruptionBudget (PDB) is configured for the pods on the node, and evicting the pods would violate the minimum availability defined by the PDB.

For answers to more frequently asked questions, see the FAQ of the open source cluster-autoscaler component.

How does the system choose a scaling group for a scaling activity?

When pods cannot be scheduled to nodes, cluster-autoscaler simulates the scheduling of the pods based on the configurations of scaling groups. The configurations include labels, taints, and instance specifications. If a scaling group meets the requirements, this scaling group is selected for the scale-out activity. If more than one scaling group meet the requirements, the system selects the scaling group that has the fewest idle resources after simulation.

What types of pods can prevent cluster-autoscaler from removing nodes?

A node is not removed if it hosts pods that cannot be safely evicted, such as non-DaemonSet pods in the kube-system namespace, pods that are restricted by PodDisruptionBudgets (PDBs), and pods that are forced to run on the node by scheduling constraints.
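
In addition, the open source cluster-autoscaler honors the cluster-autoscaler.kubernetes.io/safe-to-evict annotation. The following minimal sketch (hypothetical name) marks a pod as not safe to evict, which prevents the node that hosts it from being removed during scale-in.

```yaml
# A pod annotated as not safe to evict; cluster-autoscaler does not remove
# the node that hosts this pod.
apiVersion: v1
kind: Pod
metadata:
  name: stateful-worker       # hypothetical name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
```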

What scheduling policies does cluster-autoscaler use to determine whether the unschedulable pods can be scheduled to a node pool that has the auto scaling feature enabled?

The following list describes the scheduling policies used by cluster-autoscaler. An example of a rule that is evaluated by one of these policies is shown after the list.

  • PodFitsResources

  • GeneralPredicates

  • PodToleratesNodeTaints

  • MaxGCEPDVolumeCount

  • NoDiskConflict

  • CheckNodeCondition

  • CheckNodeDiskPressure

  • CheckNodeMemoryPressure

  • CheckNodePIDPressure

  • CheckVolumeBinding

  • MaxAzureDiskVolumeCount

  • MaxEBSVolumeCount

  • ready

  • MatchInterPodAffinity

  • NoVolumeZoneConflict
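
As an illustration of how these policies are applied during the simulation, the following sketch (hypothetical name and label) defines a rule that is evaluated by MatchInterPodAffinity: the pod may only be placed, and therefore only trigger a scale-out, in a zone that already runs a pod labeled app=cache.

```yaml
# Evaluated by MatchInterPodAffinity: the pod must be co-located, at the zone
# level, with a pod that carries the label app=cache (hypothetical label).
apiVersion: v1
kind: Pod
metadata:
  name: cache-client          # hypothetical name
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: client
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
```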