You can configure auto scaling for an Elastic High Performance Computing (E-HPC) cluster to dynamically allocate compute nodes without the need for manual operations. The system can automatically add or remove compute nodes based on real-time workloads to improve cluster availability and save costs. This topic describes how to configure auto scaling.
Benefits
Adds compute nodes based on the real-time workloads of your cluster to improve cluster availability.
Reduces the number of compute nodes to save costs without compromising cluster availability.
Stops faulty nodes and creates replacement nodes to improve fault tolerance.
Limits
Auto scaling is supported only for clusters in which all nodes run Linux.
Auto scaling can be configured for all clusters except custom clusters.
Memory-based auto scaling is supported only for clusters that use the Slurm scheduler.
We recommend that you specify the number of required vCPUs when you submit a job so that auto scaling can take effect. In addition, the memory size that a job requests cannot exceed the memory capacity of the ECS instance type.
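For a Slurm cluster, the vCPU and memory requests are typically declared with standard `sbatch` options. The following sketch shows a hypothetical pre-submission check (this helper is not an E-HPC tool; the instance specifications are assumed values) that confirms the request fits the instance type configured for the queue before the job is submitted:

```shell
#!/bin/bash
# Hypothetical pre-submission check: confirm that the job's resource request
# fits the ECS instance type configured for the queue, so that the auto
# scaling service can provision a matching compute node.
REQ_VCPUS=16          # vCPUs the job will request (sbatch --cpus-per-task)
REQ_MEM_GB=32         # memory the job will request (sbatch --mem)
INSTANCE_VCPUS=16     # vCPUs of the queue's instance type (assumed value)
INSTANCE_MEM_GB=64    # memory of the queue's instance type (assumed value)

if [ "$REQ_VCPUS" -le "$INSTANCE_VCPUS" ] && [ "$REQ_MEM_GB" -le "$INSTANCE_MEM_GB" ]; then
    # The request fits, so the job can be submitted, for example:
    echo "request fits: sbatch --cpus-per-task=$REQ_VCPUS --mem=${REQ_MEM_GB}G job.sh"
else
    echo "request too large: lower --mem/--cpus-per-task or configure a larger instance type"
fi
```

If the requested memory exceeds every configured instance type, the job cannot trigger a successful scale-out, so checking the request against the queue configuration first avoids jobs that wait indefinitely.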
Usage notes
Before you use the auto scaling service, make sure that the scheduler service and the domain account service work as expected. After you enable auto scaling, the management node must be in the running state.
If you need to shut down or restart the management node, do so only after idle nodes are released and no jobs are running on the compute nodes. We recommend that you disable auto scaling before you shut down or restart the management node, and re-enable auto scaling after the management node restarts.
Procedure
Go to the Cluster List page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click Auto Scale.
In the Cluster Auto Scaling dialog box, configure the parameters in the Cluster Global Configuration section.
Parameter
Description
Auto Grow/Auto Shrink
Enable Auto Grow and Auto Shrink for all queues in the cluster.
Note: If the configurations of a queue are different from the global configurations of the cluster, the configurations of the queue take precedence.
Scale-out Waiting Time
The estimated interval between the time when a job is submitted and the time when the system starts the scale-out operation. Default value: 2 minutes.
Scale-in Waiting Time
The period of time for which a node must remain idle, without receiving job requests, before the system automatically releases the node. Default value: 4 minutes.
Maximum number of cluster nodes
The maximum number of nodes that can be created in the cluster.
Maximum number of cores in the cluster
The maximum number of cores that the cluster can contain.
Configure auto scaling in the queue.
Click the name of the target cluster and go to the queue list in the left-side navigation pane. Find the queue that you want to manage, and click Edit in the Actions column of the queue. On the Edit Queue page, configure the following parameters.
Basic Settings
Parameter
Description
Automatic queue scaling
Automatic queue scaling is turned off by default. After you turn on the switch, you can select Auto Grow and Auto Shrink based on your business requirements.
Note: If the configurations of a queue are different from the global configurations of the cluster, the configurations of the queue take precedence.
Queue Compute Nodes
The range of the number of compute nodes in the queue.
Minimum Nodes: The minimum number of compute nodes. Valid values: 0 to 1000. This value affects scale-in behavior.
Maximum Nodes: The maximum number of compute nodes. Valid values: 0 to 5000. This value affects scale-out behavior.
Important: If you set the Minimum Nodes parameter to a non-zero value, the queue retains at least that number of nodes during cluster scale-in, and those idle nodes are not released. Specify the Minimum Nodes parameter with caution to prevent resource waste and unnecessary costs caused by idle nodes in the queue.
The maximum number of nodes in the queue cannot exceed the maximum number of nodes in the cluster.
Select Queue Node Configuration
If you enable Automatic queue scaling or set Initial Number of Nodes to a value larger than 0, you must configure the following parameters to enable the system to create compute nodes for the queue:
Parameter
Description
Inter-node interconnection
Select a mode to interconnect nodes. Valid values:
VPCNetwork: The compute nodes communicate with each other over virtual private clouds (VPCs).
eRDMANetwork: If the instance types of compute nodes support eRDMA interfaces (ERIs), the compute nodes communicate with each other over eRDMA networks.
Note: Only compute nodes of specific instance types support ERIs. For more information, see Overview and Configure eRDMA on an enterprise-level instance.
Use Preset Node Pool
Select an existing reserved node pool. The system automatically selects IP addresses and hostnames from the unassigned reserved nodes in the pool to create compute nodes.
Note: You can quickly reuse pre-allocated resources when you scale out by using a reserved node pool. For more information, see Use reserved node pools in clusters.
Virtual Switch
Specify a vSwitch for the nodes to use. The system automatically assigns IP addresses to the compute nodes from the available vSwitch CIDR block.
Instance type Group
Click Add Instance and select an instance type in the panel that appears.
If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.
Important: You can select multiple vSwitches and instance types as alternatives in case instances fail to be created due to insufficient inventory. When the system creates a compute node, it tries the specified instance types in order within the zone of the first vSwitch, and then repeats the attempts in the zone of the next vSwitch. The specifications of the created instances may vary based on inventory.
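The attempt order above can be sketched as two nested loops: the outer loop walks the vSwitches (zones) in the configured order, and the inner loop walks the instance types. This is illustrative only; the actual logic runs inside the E-HPC service, and the vSwitch and instance type names below are hypothetical placeholders:

```shell
#!/bin/bash
# Illustrative sketch of the node-creation attempt order.
# vSwitch and instance type names are hypothetical placeholders.
VSWITCHES="vsw-zone-a vsw-zone-b"              # configured order of vSwitches
INSTANCE_TYPES="ecs.c7.4xlarge ecs.c7.8xlarge" # configured order of instance types

ATTEMPTS=""
for vsw in $VSWITCHES; do            # outer loop: zone of each vSwitch, in order
  for itype in $INSTANCE_TYPES; do   # inner loop: instance types, in order
    ATTEMPTS="$ATTEMPTS $itype@$vsw"
  done
done
echo "attempt order:$ATTEMPTS"
```

Under this scheme, all instance types are exhausted in the first zone before the system moves to the next vSwitch, which is why the instance that is eventually created may not be the first type you listed.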
Auto Scale
Parameter
Description
Scaling Policy
Select a scaling policy. Only Supply Priority Strategy is supported. With this policy, compute nodes that meet the specification requirements are created in the specified zones in the order of the configured vSwitches.
Maximum number of single expansion nodes
Specify the maximum number of nodes that can be added or removed in each scale-out or scale-in cycle. The default value 0 indicates that the number is unlimited.
We recommend that you configure this parameter to control the costs of compute nodes.
Prefix of Hostnames
Specify the hostname prefix for the compute nodes. The prefix is used to distinguish between the nodes of different queues.
Hostname Suffix
Specify the hostname suffix for the compute nodes. The suffix is used to distinguish between the nodes of different queues.
Instance RAM role
Bind a Resource Access Management (RAM) role to the nodes to enable the nodes to access Alibaba Cloud services.
We recommend that you select the default role AliyunECSInstanceForEHPCRole.
Confirm the configurations and click Save.
Scaling policy
If multiple instance types are configured for the queue, the cluster scales out by matching, in order, the available instance types against the vCPU, task, and GPU requirements of the jobs. For example, assume that each node in a queue must have at least 16 cores to meet your business requirements, and the queue is configured with 8-core, 16-core, and 32-core instance types. ECS instances with 16 cores are automatically added to the queue. If no 16-core ECS instances are available, 32-core instances are added instead.
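The core-matching rule in the example above amounts to picking the smallest configured instance type that satisfies the job requirement, skipping sizes that are out of stock. The following sketch is illustrative only; the real selection is performed by the E-HPC auto scaling service, and the availability data here is an assumed scenario:

```shell
#!/bin/bash
# Illustrative sketch of the core-matching rule: pick the smallest configured
# instance size that satisfies the job requirement and is in stock.
REQUIRED_CORES=16
AVAILABLE_CORES="8 16 32"   # core counts of the instance types configured for the queue
UNAVAILABLE="16"            # assume 16-core instances are out of stock

chosen=""
for cores in $AVAILABLE_CORES; do
  [ "$cores" -lt "$REQUIRED_CORES" ] && continue           # too small for the job
  case " $UNAVAILABLE " in *" $cores "*) continue ;; esac  # skip out-of-stock sizes
  chosen="$cores"
  break
done
echo "scale out with ${chosen}-core instances"
```

With the assumed scenario, the 8-core type is rejected as too small and the 16-core type is out of stock, so the queue falls back to 32-core instances, mirroring the example in the text.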
Reference
After you configure auto scaling, we recommend that you monitor the health status and resource usage of the cluster to evaluate whether the auto scaling configuration is appropriate. For more information, see View the monitoring information.