
Configure an auto scaling policy

Last Updated: Jul 01, 2022

Elastic High Performance Computing (E-HPC) provides the auto scaling feature that can dynamically allocate compute nodes based on the configured auto scaling policy. The system can automatically add or remove compute nodes based on real-time workloads to improve cluster availability and save costs. This topic describes how to configure an auto scaling policy.

Prerequisites

Before you use the auto scaling feature, make sure that the following requirements are met:

  • The operating system of all nodes in the cluster is Linux.

  • The scheduler is PBS, Slurm, or Deadline.

Benefits

The auto scaling feature provides the following benefits:

  • Adds compute nodes based on the real-time workloads of your cluster to improve cluster availability.

  • Reduces the number of compute nodes to save costs without compromising cluster availability.

  • Stops faulty compute nodes and creates new nodes to replace them, which improves fault tolerance.

Procedure

  1. Log on to the E-HPC console.

  2. In the top navigation bar, select a region.

  3. In the left-side navigation pane, choose Elasticity > Auto Scale.

  4. From the Cluster drop-down list on the Auto Scale page, select the cluster for which you want to configure the auto scaling policy.

  5. In the Global Configurations section, set the parameters.

    • Enable Autoscale: Enables Auto Grow and Auto Shrink for all queues in the cluster.

      Note: If the settings in the Queue Configuration section are different from the settings in the Global Configurations section, the settings in the Queue Configuration section prevail.

    • Compute Nodes: The range for the number of compute nodes that can be added to scale out the cluster. The upper limit is the sum of the maximum numbers of compute nodes configured for all queues in the cluster. The lower limit is the sum of the minimum numbers of compute nodes configured for all queues in the cluster.
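    The relationship between the per-queue limits and the cluster-wide range can be illustrated with a short sketch. The queue names and numbers below are made-up examples, not values from the console:

```python
# Hypothetical per-queue node limits; the cluster-wide range is the sum
# of the per-queue minimums and maximums, as described above.
queues = {
    "comp": {"min_nodes": 0, "max_nodes": 500},
    "gpu":  {"min_nodes": 1, "max_nodes": 50},
}

cluster_min = sum(q["min_nodes"] for q in queues.values())
cluster_max = sum(q["max_nodes"] for q in queues.values())

print(cluster_min, cluster_max)  # 1 550
```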

    • Scale-in Time (Minute): If the continuous idle duration of a compute node exceeds the scale-in time, the node is released.

      The continuous idle duration is the scale-in check interval multiplied by the number of consecutive scale-in checks in which the node is found idle. By default, the check interval is 2 minutes.
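    The scale-in rule above can be sketched as follows. This is a minimal illustration of the described logic, not the actual E-HPC implementation; the check is performed internally by the service:

```python
CHECK_INTERVAL_MIN = 2  # default scale-in check interval, in minutes

def should_release(consecutive_idle_checks: int, scale_in_time_min: int) -> bool:
    """Release a node when its continuous idle duration exceeds the
    configured scale-in time."""
    idle_duration = CHECK_INTERVAL_MIN * consecutive_idle_checks
    return idle_duration > scale_in_time_min

# With a 4-minute scale-in time, a node found idle in 3 consecutive
# checks (3 x 2 = 6 minutes) is released:
print(should_release(3, 4))  # True
```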

    • Image Type: The image type of the compute nodes that you want to add to the cluster. Only images that are compatible with the image of the existing compute nodes in the cluster are supported.

    • Exceptional Nodes: The nodes that you want to exclude from auto scaling. If you want to retain a compute node, set it as an exceptional node. The node is then not released even if it is idle.

  6. In the Queue Configuration section, click Edit to set the parameters.

    • Auto Grow and Auto Shrink: Specifies whether to enable Auto Grow and Auto Shrink for the queue. By default, both switches are turned off.

      Note: If the settings in the Queue Configuration section are different from the settings in the Global Configurations section, the settings in the Queue Configuration section prevail.

    • Queue Compute Nodes: The range of the number of compute nodes in the queue.

      • Maximum Nodes: the maximum number of compute nodes that can be added to the queue. Valid values: 0 to 500.

      • Minimal Nodes: the minimum number of compute nodes that must be retained in the queue. Valid values: 0 to 50.

    • Prefix of Hostnames: The hostname prefix of the compute nodes in the queue. The prefix is used to distinguish between the nodes of different queues.

    • Maximum Scale-out Nodes in Each Round: The maximum number of compute nodes that can be added in each round of scale-out. The default value 0 indicates that the number is not limited. We recommend that you specify this parameter to control the costs of compute nodes.

      If you set the parameter to A and B nodes need to be added, nodes are added based on the following rules:

      • If B is less than or equal to A, B nodes are added.

      • If B is greater than A, A nodes are added.

      Note: In addition to this parameter, the number of nodes that are added is also limited by the maximum number of nodes allowed in the queue and the maximum number of nodes allowed in the cluster.

    • Minimum Scale-out Nodes in Each Round: The minimum number of compute nodes that must be added in each round of scale-out. The default value 1 indicates that at least one node must be added.

      In some scenarios, at least a certain number of nodes must be added for your business to run as expected. In this case, you can set the minimum number of nodes that must be added in each round. If the number of available ECS instances is less than both the required number of nodes and this minimum, the cluster is not scaled out. This prevents wasted resources.

      If you set the parameter to A and B nodes need to be added, nodes are added based on the following rules:

      • If B is less than or equal to A: If the number of available ECS instances is greater than or equal to B, B nodes are added. Otherwise, the cluster is not scaled out.

      • If B is greater than A: If the number of available ECS instances is greater than or equal to B, B nodes are added. If the number of available ECS instances is less than B but greater than or equal to A, A nodes are added. If the number of available ECS instances is less than A, the cluster is not scaled out.
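    The per-round sizing rules for Maximum Scale-out Nodes in Each Round and Minimum Scale-out Nodes in Each Round can be sketched together. This is an illustrative reading of the rules described above, not the service's actual implementation:

```python
def nodes_to_add(requested: int, available: int,
                 max_per_round: int = 0, min_per_round: int = 1) -> int:
    """Return the number of nodes added in one scale-out round.

    requested:      nodes that need to be added (B in the description)
    available:      available ECS instances
    max_per_round:  0 means the per-round maximum is not limited
    min_per_round:  minimum nodes that must be added per round (A)
    """
    # Cap the request at the per-round maximum, if one is set.
    target = requested if max_per_round == 0 else min(requested, max_per_round)
    if available >= target:
        return target
    # Not enough instances for the full target: fall back to the
    # per-round minimum if that many instances are available.
    if target > min_per_round and available >= min_per_round:
        return min_per_round
    # Otherwise the cluster is not scaled out in this round.
    return 0

print(nodes_to_add(10, 10))                     # 10: enough instances
print(nodes_to_add(10, 6, min_per_round=4))     # 4: fall back to minimum
print(nodes_to_add(10, 2, min_per_round=4))     # 0: below the minimum
print(nodes_to_add(10, 10, max_per_round=5))    # 5: capped per round
```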

    • Hostname Suffix: The hostname suffix of the compute nodes in the queue. The suffix is used to distinguish between the nodes of different queues.

    • Image Type: The image type of the nodes that you want to add to the queue. You can specify different image types for different queues.

    • Image ID: The ID of the image for the nodes that you want to add to the queue. You can specify different image IDs for different queues.

      Note: These parameters are valid only for the current queue. If the image type or image ID is unspecified, the image type specified in the Global Configurations section is used. If the image type is also unspecified in the Global Configurations section, the default image type of the cluster is used.

    • Configuration List: Each configuration list includes the configurations of the compute nodes that you want to add. The following configurations are displayed:

      • Zone: a zone in the region where the cluster resides.

      • vSwitch ID: the vSwitch that is bound to the VPC of the cluster in the selected zone.

      • Instance Type: the instance type of the compute nodes that you want to add to the queue.

        Note: If multiple instance types are configured for the queue, the cluster is scaled out based on the available instance types, task quantity, and GPU quantity, in that order. For example, assume that each node in a queue must have at least 16 cores to meet your business requirements, and the queue is configured with 8-core, 16-core, and 32-core instance types. ECS instances with 16 cores are automatically added to the queue. If no ECS instances with 16 cores are available, instances with 32 cores are added instead.

      • Bid Strategy: the bidding method of the nodes that you want to add.

      • Maximum Price per Hour: the maximum hourly price of the nodes. You must set this parameter only when Bid Strategy is set to Preemptible instance with maximum bid price.
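    The instance-type selection order in the note above can be sketched as follows. This is a simplified model of the described behavior (pick the smallest configured type that meets the core requirement and has available stock), not the actual E-HPC selection logic:

```python
from typing import Optional

def pick_instance_type(configured_cores: list[int],
                       required_cores: int,
                       available: set[int]) -> Optional[int]:
    """Return the core count of the smallest configured instance type
    that meets the requirement and is in stock, or None if none fits.

    configured_cores: core counts of the instance types in the queue
    available:        core counts with available ECS instances
    """
    for cores in sorted(configured_cores):
        if cores >= required_cores and cores in available:
            return cores
    return None

# Queue configured with 8-, 16-, and 32-core types; nodes need >= 16 cores.
print(pick_instance_type([8, 16, 32], 16, {8, 16, 32}))  # 16
# If no 16-core instances are available, fall back to 32 cores:
print(pick_instance_type([8, 16, 32], 16, {8, 32}))      # 32
```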

  7. Read and select Alibaba Cloud International Website Product Terms of Service, and click OK.

  8. Optional. View the auto scaling diagram of the cluster.

    The auto scaling diagram shows the changes in the number of nodes over time during the auto scaling process based on the auto scaling policy that you configured. The diagram also shows the time consumed by node scale-in and scale-out at key points in time.

    Note

    You can set the number of simulated concurrent nodes in the auto scaling diagram to simulate how the number of compute nodes changes during auto scaling.