You can configure auto scaling for an Elastic High Performance Computing (E-HPC) cluster to dynamically allocate compute nodes without the need for manual operations. The system can automatically add or remove compute nodes based on real-time workloads to improve cluster availability and save costs. This topic describes how to configure auto scaling.
Benefits
Adds compute nodes based on the real-time workloads of your cluster to improve cluster availability.
Reduces the number of compute nodes to save costs without compromising cluster availability.
Stops faulty nodes and creates replacement nodes to improve fault tolerance.
Limits
Auto scaling is supported only for clusters in which all nodes run Linux.
Auto scaling can be configured for all clusters except custom clusters.
Memory-based auto scaling is supported only for clusters that use the Slurm scheduler.
We recommend that you specify the number of required vCPUs when you submit a job so that auto scaling can take effect. In addition, the memory size that a job requests cannot exceed the memory capacity of the ECS instance type.
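For a Slurm cluster, the vCPU and memory requests are typically declared with standard `sbatch` options. The following sketch shows a hypothetical pre-submission check (this helper is not an E-HPC tool; the instance specifications are assumed values) that confirms the request fits the instance type configured for the queue before the job is submitted:

```shell
#!/bin/bash
# Hypothetical pre-submission check: confirm that the job's resource request
# fits the ECS instance type configured for the queue, so that the auto
# scaling service can provision a matching compute node.
REQ_VCPUS=16          # vCPUs the job will request (sbatch --cpus-per-task)
REQ_MEM_GB=32         # memory the job will request (sbatch --mem)
INSTANCE_VCPUS=16     # vCPUs of the queue's instance type (assumed value)
INSTANCE_MEM_GB=64    # memory of the queue's instance type (assumed value)

if [ "$REQ_VCPUS" -le "$INSTANCE_VCPUS" ] && [ "$REQ_MEM_GB" -le "$INSTANCE_MEM_GB" ]; then
    # The request fits, so the job can be submitted, for example:
    echo "request fits: sbatch --cpus-per-task=$REQ_VCPUS --mem=${REQ_MEM_GB}G job.sh"
else
    echo "request too large: lower --mem/--cpus-per-task or configure a larger instance type"
fi
```

If the requested memory exceeds every configured instance type, the job cannot trigger a successful scale-out, so checking the request against the queue configuration first avoids jobs that wait indefinitely.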
Usage notes
Before you use the auto scaling service, make sure that the scheduler service and the domain account service work as expected. After you enable auto scaling, the management node must be in the running state.
If you need to shut down or restart the management node, do so only after idle nodes are released and no jobs are running on the compute nodes. We recommend that you disable auto scaling before you shut down or restart the management node, and re-enable auto scaling after the management node restarts.
Procedure
Go to the Cluster List page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click Auto Scale.
In the Cluster Auto Scaling dialog box, configure the parameters in the Cluster Global Configuration section.
Parameter
Description
Auto Grow/Auto Shrink
Enable Auto Grow and Auto Shrink for all queues in the cluster.
Note: If the configurations of a queue are different from the global configurations of the cluster, the configurations of the queue take precedence.
Scale-out Waiting Time
The estimated interval between the time when a job is submitted and the time when the system starts the scale-out operation. Default value: 2 minutes.
Scale-in Waiting Time
The period of time for which a node must remain idle, without receiving job requests, before the system automatically releases the node. Default value: 4 minutes.
Maximum number of cluster nodes
The maximum number of nodes that can be created in the cluster.
Maximum number of cores in the cluster
The maximum number of cores that the cluster can contain.
Configure auto scaling in the queue.
Click the name of the target cluster and go to the queue list in the left-side navigation pane. Find the queue that you want to manage, and click Edit in the Actions column of the queue. On the Edit Queue page, configure the following parameters.
Basic Settings
Parameter
Description
Automatic queue scaling
Automatic queue scaling is turned off by default. After you turn on the switch, you can select Auto Grow and Auto Shrink based on your business requirements.
Note: If the configurations of a queue are different from the global configurations of the cluster, the configurations of the queue take precedence.
Queue Compute Nodes
The range of the number of compute nodes in the queue.
Minimum Nodes: The minimum number of compute nodes. Valid values: 0 to 1000. This value affects scale-in behavior.
Maximum Nodes: The maximum number of compute nodes. Valid values: 0 to 5000. This value affects scale-out behavior.
Important: If you set the Minimum Nodes parameter to a non-zero value, the queue retains at least that number of nodes during cluster scale-in, and those idle nodes are not released. Specify the Minimum Nodes parameter with caution to prevent resource waste and unnecessary costs caused by idle nodes in the queue.
The maximum number of nodes in the queue cannot exceed the maximum number of nodes in the cluster.
Select Queue Node Configuration
If you enable Automatic queue scaling or set Initial Number of Nodes to a value larger than 0, you must configure the following parameters to enable the system to create compute nodes for the queue:
Parameter
Description
Inter-node interconnection
Select a mode to interconnect nodes. Valid values:
VPCNetwork: The compute nodes communicate with each other over virtual private clouds (VPCs).
eRDMANetwork: If the instance types of compute nodes support eRDMA interfaces (ERIs), the compute nodes communicate with each other over eRDMA networks.
Note: Only compute nodes of specific instance types support ERIs. For more information, see Overview and Configure eRDMA on an enterprise-level instance.
Use Preset Node Pool
Select an existing reserved node pool. The system automatically selects IP addresses and hostnames from the unassigned reserved nodes in the pool to create compute nodes.
Note: You can quickly reuse pre-allocated resources when you scale out by using a reserved node pool. For more information, see Use reserved node pools in clusters.
Virtual Switch
Specify a vSwitch for the nodes to use. The system automatically assigns IP addresses to the compute nodes from the available vSwitch CIDR block.
Instance type Group
Click Add Instance and select an instance type in the panel that appears.
If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.
Important: You can select multiple vSwitches and instance types as alternatives in case instances fail to be created due to insufficient inventory. When the system creates a compute node, it tries the specified instance types in order within the zone of the first vSwitch, and then repeats the attempts in the zone of the next vSwitch. The specifications of the created instances may vary based on inventory.
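The attempt order above can be sketched as two nested loops: the outer loop walks the vSwitches (zones) in the configured order, and the inner loop walks the instance types. This is illustrative only; the actual logic runs inside the E-HPC service, and the vSwitch and instance type names below are hypothetical placeholders:

```shell
#!/bin/bash
# Illustrative sketch of the node-creation attempt order.
# vSwitch and instance type names are hypothetical placeholders.
VSWITCHES="vsw-zone-a vsw-zone-b"              # configured order of vSwitches
INSTANCE_TYPES="ecs.c7.4xlarge ecs.c7.8xlarge" # configured order of instance types

ATTEMPTS=""
for vsw in $VSWITCHES; do            # outer loop: zone of each vSwitch, in order
  for itype in $INSTANCE_TYPES; do   # inner loop: instance types, in order
    ATTEMPTS="$ATTEMPTS $itype@$vsw"
  done
done
echo "attempt order:$ATTEMPTS"
```

Under this scheme, all instance types are exhausted in the first zone before the system moves to the next vSwitch, which is why the instance that is eventually created may not be the first type you listed.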
Auto Scale
Parameter
Description
Scaling Policy
Select a scaling policy. Only Supply Priority Strategy is supported. With this policy, compute nodes that meet the specification requirements are created in the specified zones in the order of the configured vSwitches.
Maximum number of single expansion nodes
Specify the maximum number of nodes that can be added or removed in each scale-out or scale-in cycle. The default value 0 indicates that the number is unlimited.
We recommend that you configure this parameter to control the costs of compute nodes.
Prefix of Hostnames
Specify the hostname prefix for the compute nodes. The prefix is used to distinguish between the nodes of different queues.
Hostname Suffix
Specify the hostname suffix for the compute nodes. The suffix is used to distinguish between the nodes of different queues.
Instance RAM role
Bind a Resource Access Management (RAM) role to the nodes to enable the nodes to access Alibaba Cloud services.
We recommend that you select the default role AliyunECSInstanceForEHPCRole.
Confirm the configurations and click Save.
Scaling policy
If multiple instance types are configured for the queue, the cluster scales out by matching, in order, the available instance types against the vCPU, task, and GPU requirements of the jobs. For example, assume that each node in a queue must have at least 16 cores to meet your business requirements, and the queue is configured with 8-core, 16-core, and 32-core instance types. ECS instances with 16 cores are automatically added to the queue. If no 16-core ECS instances are available, 32-core instances are added instead.
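The core-matching rule in the example above amounts to picking the smallest configured instance type that satisfies the job requirement, skipping sizes that are out of stock. The following sketch is illustrative only; the real selection is performed by the E-HPC auto scaling service, and the availability data here is an assumed scenario:

```shell
#!/bin/bash
# Illustrative sketch of the core-matching rule: pick the smallest configured
# instance size that satisfies the job requirement and is in stock.
REQUIRED_CORES=16
AVAILABLE_CORES="8 16 32"   # core counts of the instance types configured for the queue
UNAVAILABLE="16"            # assume 16-core instances are out of stock

chosen=""
for cores in $AVAILABLE_CORES; do
  [ "$cores" -lt "$REQUIRED_CORES" ] && continue           # too small for the job
  case " $UNAVAILABLE " in *" $cores "*) continue ;; esac  # skip out-of-stock sizes
  chosen="$cores"
  break
done
echo "scale out with ${chosen}-core instances"
```

With the assumed scenario, the 8-core type is rejected as too small and the 16-core type is out of stock, so the queue falls back to 32-core instances, mirroring the example in the text.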
Reference
After you configure auto scaling, we recommend that you monitor the health status and resource usage of the cluster to evaluate whether the auto scaling configuration is appropriate. For more information, see View the monitoring information.