Elastic High Performance Computing: Configure auto scaling

Last Updated: Apr 17, 2024

Elastic High Performance Computing (E-HPC) provides an auto scaling feature that dynamically allocates compute nodes based on the auto scaling policy that you configure. The system automatically adds or removes compute nodes based on real-time workloads to improve cluster availability and save costs. This topic describes how to configure auto scaling.

Benefits

  • Adds compute nodes based on the real-time workloads of your cluster to improve cluster availability.

  • Reduces the number of compute nodes to save costs without compromising cluster availability.

  • Stops faulty nodes and creates new nodes to improve fault tolerance.

Limits

  • You can configure auto scaling only for clusters in which all nodes run Linux operating systems.

  • You can configure auto scaling only for clusters with PBS, Slurm, Deadline, or SGE schedulers.

  • E-HPC does not support auto scaling based on memory usage.

    Important

    To effectively implement auto scaling, we recommend that you specify the number of required vCPUs when you submit a job. Note that the memory size that you specify for the job cannot exceed the memory capacity of Elastic Compute Service (ECS) instances.

Usage notes

Before you use auto scaling, make sure that the scheduler service and the domain account service work as expected. After you enable auto scaling, the management node must remain in the running state.

Important

If you need to shut down or restart the management node, wait until idle nodes are released and no jobs are running on the compute nodes. We recommend that you disable auto scaling before you shut down or restart the management node, and re-enable auto scaling after the management node is restarted.

Procedure

  1. Open the Auto Scale page.

    1. Log on to the E-HPC console.

    2. In the top navigation bar, select a region.

    3. In the left-side navigation pane, choose Elasticity > Auto Scale.

  2. From the Cluster drop-down list on the Auto Scale page, select the cluster for which you want to configure auto scaling.

  3. In the Global Configurations section, configure the parameters. The following table describes the parameters that you can configure.

    Parameter

    Description

    Enable Autoscale

    Enables Auto Grow and Auto Shrink for all queues in the cluster.

    Note

    If the settings in the Queue Configuration section are different from the settings in the Global Configurations section, the settings in the Queue Configuration section take precedence.

    Compute Nodes

    The range of the number of compute nodes in the cluster after auto scaling. The upper limit is the sum of the maximum numbers of compute nodes configured for the queues in the cluster. The lower limit is the sum of the minimum numbers of compute nodes configured for the queues in the cluster.
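To illustrate how the cluster-wide range follows from the per-queue limits, the sketch below sums the per-queue minimums and maximums. The queue names and values are invented for the example, and the function is not part of any E-HPC API.

```python
from typing import Dict, Tuple


def cluster_node_range(queues: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (lower, upper) for the cluster from per-queue (min, max) limits."""
    lower = sum(q_min for q_min, _ in queues.values())
    upper = sum(q_max for _, q_max in queues.values())
    return lower, upper


# Hypothetical queues: name -> (Minimum Nodes, Maximum Nodes)
example_queues = {"workq": (0, 100), "gpuq": (2, 20)}
print(cluster_node_range(example_queues))  # (2, 120)
```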

    Scale-in Time (Minute)

    If the continuous idle duration of a compute node exceeds the specified scale-in time, the node is released.

    The continuous idle duration is the scale-in interval multiplied by the number of consecutive idle checks. By default, the scale-in interval is 2 minutes. The number of consecutive idle checks is the number of consecutive scale-in checks in which the compute node is found to be idle.
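The arithmetic behind this parameter can be sketched as follows. This is only an illustration of the rule described above: the function names are hypothetical, and the 2-minute interval is the default stated in this topic.

```python
DEFAULT_SCALE_IN_INTERVAL_MIN = 2  # default scale-in interval, in minutes


def continuous_idle_minutes(consecutive_idle_checks: int,
                            interval_min: int = DEFAULT_SCALE_IN_INTERVAL_MIN) -> int:
    """Continuous idle duration = scale-in interval x consecutive idle checks."""
    return consecutive_idle_checks * interval_min


def should_release(consecutive_idle_checks: int,
                   scale_in_time_min: int,
                   interval_min: int = DEFAULT_SCALE_IN_INTERVAL_MIN) -> bool:
    """A node is released only when its idle duration exceeds Scale-in Time."""
    idle = continuous_idle_minutes(consecutive_idle_checks, interval_min)
    return idle > scale_in_time_min


# With Scale-in Time set to 5 minutes, a node idle for 3 consecutive checks
# has been idle for 6 minutes and is released; after 2 checks it is kept.
print(should_release(3, 5))  # True
print(should_release(2, 5))  # False
```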

    Image Type

    The image type of the compute nodes that you want to add to the cluster. Only the images that are compatible with the image of the existing compute nodes in the cluster are supported.

    Exceptional Nodes

    Select the nodes that you want to exclude from auto scaling.

    If you want to retain a compute node, you can configure the node as an exceptional node. Then, the node is not released regardless of whether it is idle.

    Hyper-threading

    By default, Hyper-Threading (HT) is enabled for all ECS instances. For specific ECS instance types, you can disable HT for better performance. For more information, see Instance type limits and Disable HT for compute nodes.

  4. In the Queue Configuration section, select a queue and click Edit to configure the parameters.

    Parameter

    Description

    Auto Grow and Auto Shrink

    Specifies whether to enable Auto Grow and Auto Shrink. By default, both switches are turned off.

    Note

    If the settings in the Queue Configuration section are different from the settings in the Global Configurations section, the settings in the Queue Configuration section take precedence.

    Queue Compute Nodes

    The range of the number of compute nodes in the queue.

    • Maximum Nodes: the maximum number of compute nodes in the queue. Valid values: 0 to 5000. This value may limit scale-out.

    • Minimum Nodes: the minimum number of compute nodes in the queue. Valid values: 0 to 1000. This value may limit scale-in.

      Important

      If you set the Minimum Nodes parameter to a non-zero value, the queue retains at least that number of nodes during cluster scale-in, even if the nodes are idle. We recommend that you set the Minimum Nodes parameter with caution to avoid wasting resources and incurring costs for idle nodes in the queue.
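A minimal sketch of how a non-zero minimum acts as a floor during scale-in, assuming that the number of released nodes is bounded both by the idle nodes eligible for release and by the gap between the current node count and the minimum. The function and parameter names are hypothetical.

```python
def releasable_nodes(current_nodes: int, minimum_nodes: int, idle_eligible: int) -> int:
    """Number of idle nodes that can be released without going below the minimum."""
    headroom = current_nodes - minimum_nodes
    return max(0, min(idle_eligible, headroom))


# A queue with 10 nodes, 8 of them idle long enough to be released:
print(releasable_nodes(10, 0, 8))  # 8: no floor, all eligible idle nodes go
print(releasable_nodes(10, 4, 8))  # 6: four nodes are retained, idle or not
```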

    Prefix of Hostnames

    The hostname prefix of the compute nodes. The prefix is used to distinguish between the nodes of different queues.

    Maximum Nodes in Each Round of Scale-out

    The maximum number of compute nodes that can be added in each round of scale-out. The default value is 0, which specifies that the number of compute nodes that can be added in each round is not limited.

    We recommend that you configure this parameter to control your costs on compute nodes.

    If you set this parameter to A and you want to add B nodes, nodes are added based on the following rules:

    • If B is less than or equal to A, B nodes are added.

    • If B is greater than A, A nodes are added.

    Note

    In addition to this parameter, the number of nodes in a cluster is also limited by the specified maximum number of nodes that can be added in a single queue and the specified maximum number of nodes that can be added in the cluster.
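The per-round cap can be sketched as below. The first function follows the rule stated for A and B, with 0 meaning no limit. The second function is an assumption about how the queue and cluster maximums might further bound the result; the exact behavior is not specified in this topic, and all names are hypothetical.

```python
def per_round_cap(max_per_round: int, wanted: int) -> int:
    """Apply Maximum Nodes in Each Round of Scale-out (A) to a request for B nodes."""
    if max_per_round == 0:  # 0 means the round is not limited
        return wanted
    return min(wanted, max_per_round)


def bounded_by_limits(count: int, queue_current: int, queue_max: int,
                      cluster_current: int, cluster_max: int) -> int:
    """Assumed behavior: never exceed the remaining queue or cluster capacity."""
    queue_room = max(0, queue_max - queue_current)
    cluster_room = max(0, cluster_max - cluster_current)
    return min(count, queue_room, cluster_room)


print(per_round_cap(10, 4))   # 4: B <= A, so B nodes are added
print(per_round_cap(10, 25))  # 10: B > A, so A nodes are added
print(bounded_by_limits(per_round_cap(10, 25), 95, 100, 200, 500))  # 5
```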

    Minimum Scale-out Nodes in Each Round

    The minimum number of compute nodes that must be added in each round of scale-out. The default value is 1, which specifies that at least one node must be added.

    In specific scenarios, you may need to add at least a specific number of nodes to ensure that the business can run as expected. In this case, you can specify the minimum number of nodes that must be added in each round. If the number of available ECS instances is less than both the specified minimum number of nodes and the number of required nodes, the cluster is not scaled out to avoid wasting resources.

    If you set this parameter to A and you want to add B nodes, nodes are added based on the following rules:

    • If B is less than or equal to A: If the number of available ECS instances is greater than or equal to B, B nodes are added. If the number of available ECS instances is less than B, the cluster is not scaled out.

    • If B is greater than A: If the number of available ECS instances is greater than or equal to B, B nodes are added. If the number of available ECS instances is less than B and greater than or equal to A, A nodes are added. If the number of available ECS instances is less than A, the cluster is not scaled out.
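The two cases above can be written as one small decision function. This is a sketch of the documented rules only; the function name and the way the available instance count is obtained are hypothetical.

```python
def nodes_to_add(min_per_round: int, required: int, available: int) -> int:
    """Decide how many nodes to add given A (minimum per round), B (required),
    and the number of ECS instances currently available."""
    a, b = min_per_round, required
    if available >= b:
        return b  # enough stock for the full request
    if b > a and available >= a:
        return a  # partial scale-out that still meets the minimum
    return 0      # too few instances: do not scale out


print(nodes_to_add(4, 2, 3))    # 2: B <= A and stock covers B
print(nodes_to_add(4, 2, 1))    # 0: B <= A but stock is below B
print(nodes_to_add(4, 10, 6))   # 4: B > A and stock is between A and B
print(nodes_to_add(4, 10, 3))   # 0: stock is below A
```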

    Automatic Configuration of the Minimum Node Number for Each Scale-out

    If you turn on this switch, the minimum number of nodes for each scale-out is equal to the number of nodes required by the job. The minimum node number cannot be greater than 99.

    Hostname Suffix

    The suffix of the hostname. The suffix is used to distinguish between the nodes of different queues.

    Image Type

    The image type of the nodes that you want to add in a queue. You can specify different image types for different queues.

    Image ID

    The ID of the image that is used by the nodes that you want to add to the queue. You can specify different image IDs for different queues.

    Note

    This parameter is valid only for the current queue. If you do not specify the image type or image ID, the nodes that you want to add use the image type that is specified in the global configurations. If no image type is specified in the global configurations, the nodes use the default image type of the cluster.
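The fallback order described in this note (queue, then global configurations, then cluster default) can be sketched as follows. The function name and the sample values are hypothetical and only illustrate the order of precedence.

```python
from typing import Optional


def resolve_image(queue_image: Optional[str],
                  global_image: Optional[str],
                  cluster_default_image: str) -> str:
    """Return the first image setting that is specified, in order of precedence."""
    if queue_image:
        return queue_image
    if global_image:
        return global_image
    return cluster_default_image


print(resolve_image("queue-image", "global-image", "default-image"))  # queue-image
print(resolve_image(None, "global-image", "default-image"))           # global-image
print(resolve_image(None, None, "default-image"))                     # default-image
```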

    Whether instance types are unordered

    If you turn on this switch, the system selects instance types in descending order of the number of instances in stock during auto scaling to ensure the delivery of resources.
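As a rough illustration of what this switch implies, the sketch below orders instance types by stock, largest first. The instance type names and stock figures are invented, and how E-HPC obtains stock information is not covered in this topic.

```python
from typing import Dict, List


def order_by_stock(stock: Dict[str, int]) -> List[str]:
    """Return instance type names in descending order of available stock."""
    return sorted(stock, key=lambda name: stock[name], reverse=True)


# Hypothetical stock per configured instance type
print(order_by_stock({"type-a": 3, "type-b": 10, "type-c": 0}))
# ['type-b', 'type-a', 'type-c']
```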

    Configuration List

    Configure the compute nodes that you want to add. Each configuration list includes the following configurations:

    • Zone: a zone in the region where the cluster resides.

    • vSwitch ID: the vSwitch that is bound to the VPC of the cluster in the selected zone.

    • Instance Type: the instance type of the compute nodes that you want to add in a queue.

      Note

      If multiple instance types are configured in the queue, the cluster is scaled out based on the available instance types, task quantity, and GPU quantity in sequence. For example, each node in a queue must have at least 16 cores to meet your business requirements, and the queue is configured with instance types that have 8 cores, 16 cores, and 32 cores. In this case, ECS instances with 16 cores are automatically added to the queue. If no ECS instances with 16 cores are available, instances with 32 cores are automatically added to the queue.
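The 16-core example in this note can be sketched as a walk over the configured instance types in order, skipping types that are too small or out of stock. This assumes that the configured order is the selection order; the type names, core counts, and stock values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def pick_instance_type(configured: List[Tuple[str, int]],
                       required_cores: int,
                       stock: Dict[str, int]) -> Optional[str]:
    """Return the first configured type with enough cores that is in stock."""
    for name, cores in configured:
        if cores >= required_cores and stock.get(name, 0) > 0:
            return name
    return None  # nothing suitable is available


configured = [("type-8c", 8), ("type-16c", 16), ("type-32c", 32)]

# 16-core instances are in stock, so they are chosen.
print(pick_instance_type(configured, 16, {"type-8c": 5, "type-16c": 3, "type-32c": 4}))  # type-16c
# 16-core instances are sold out, so 32-core instances are used instead.
print(pick_instance_type(configured, 16, {"type-8c": 5, "type-16c": 0, "type-32c": 4}))  # type-32c
```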

    • Bid Strategy: the bidding method configured for the nodes that you want to add.

    • Maximum Price per Hour: the maximum hourly price of a node. This parameter is required only when Bid Strategy is set to Preemptible instance with maximum bid price.

    System Disk

    The system disk of the compute nodes that you want to add.

    Data disk

    The data disk that is attached to the compute nodes that you want to add. Configure the type, size, and performance level of the data disk, and specify whether to release the data disk with the compute nodes and whether to encrypt the data disk based on your business requirements.

  5. In the upper-right corner of the page, read and select Alibaba Cloud International Website Product Terms of Service, and click OK.

  6. Optional. View the auto scaling diagram of the cluster.

    The auto scaling diagram shows the changes in the number of nodes over time during the auto scaling process based on the auto scaling policy that you configured. The diagram also shows the time consumed by node scale-in and scale-out at key points in time.

    Note

    You can specify the number of simulated concurrent nodes in the auto scaling diagram to simulate how the number of compute nodes changes during auto scaling.