If an Elastic High Performance Computing (E-HPC) cluster has too few or too many compute nodes, you can manually scale the cluster out or in, or configure auto scaling for the cluster. This topic describes how to scale out and scale in an E-HPC cluster.
Manual scaling
In an E-HPC cluster, manual scaling means manually creating or deleting compute nodes. You scale out a cluster by increasing the number of compute nodes and scale in a cluster by decreasing the number of compute nodes. Scaling out increases the computing power of the cluster, whereas scaling in reduces resource waste and costs.
For more information about the procedure and usage notes, see Manage nodes.
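If you prefer to script these operations instead of using the console, compute nodes can also be added and deleted through the E-HPC API. The following is a minimal sketch that uses the generic CommonRequest interface of the Alibaba Cloud Python SDK core; the AddNodes and DeleteNodes action names and their parameters are based on the 2018-04-12 E-HPC API version and should be verified against the current API reference, and all IDs are placeholders.

```python
# Minimal sketch of manual scaling through the E-HPC API, using the generic
# CommonRequest interface from aliyun-python-sdk-core. The AddNodes and
# DeleteNodes actions and their parameters should be verified against the
# E-HPC API reference; the IDs below are placeholders.
from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

client = AcsClient("<access-key-id>", "<access-key-secret>", "cn-hangzhou")

def ehpc_request(action: str) -> CommonRequest:
    """Build a request against the E-HPC endpoint (API version 2018-04-12)."""
    req = CommonRequest()
    req.set_domain("ehpc.cn-hangzhou.aliyuncs.com")
    req.set_version("2018-04-12")
    req.set_action_name(action)
    return req

# Scale out: add two compute nodes to the cluster.
grow = ehpc_request("AddNodes")
grow.add_query_param("ClusterId", "<cluster-id>")
grow.add_query_param("Count", "2")
print(client.do_action_with_exception(grow))

# Scale in: release one specific compute node.
shrink = ehpc_request("DeleteNodes")
shrink.add_query_param("ClusterId", "<cluster-id>")
shrink.add_query_param("Instance.1.Id", "<node-instance-id>")
print(client.do_action_with_exception(shrink))
```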
Auto scaling
E-HPC provides an auto scaling feature that you can configure for all queues or for specific queues in a cluster. The system automatically adjusts the number of compute nodes in a queue based on the number of queued jobs and the resources, such as vCPUs and GPUs, that the jobs request. When a large number of computing tasks are submitted, the cluster automatically adds compute nodes to accelerate processing. When no tasks are queued, the cluster automatically removes idle compute nodes to reduce resource consumption. Auto scaling enables E-HPC clusters to respond more efficiently to changing workloads, which improves overall performance and resource utilization.
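Conceptually, the scaling policy can be pictured as the following Python sketch. This is only an illustration of the behavior described above, not E-HPC's actual implementation; all names and types in it are hypothetical, and the thresholds match the default waiting times described in the next section.

```python
# Hypothetical illustration of the auto scaling policy described above.
# This is NOT E-HPC's implementation; all names here are made up.
from dataclasses import dataclass

@dataclass
class QueueState:
    pending_jobs: int             # jobs waiting for resources
    idle_minutes: dict[str, int]  # node ID -> minutes without any job

GROW_WAIT_MINUTES = 2     # default "Scale-out Waiting Time"
SHRINK_IDLE_MINUTES = 4   # default "Scale-in Waiting Time"

def scaling_decision(queue: QueueState, minutes_since_submit: int,
                     node_count: int, max_nodes: int):
    """Return ("grow", n), ("shrink", [node IDs]), or ("noop", None)."""
    # Scale out: jobs have waited past the grow threshold and the
    # cluster-wide node cap has not been reached.
    if queue.pending_jobs > 0 and minutes_since_submit >= GROW_WAIT_MINUTES:
        headroom = max_nodes - node_count
        if headroom > 0:
            return ("grow", min(queue.pending_jobs, headroom))
    # Scale in: release nodes that stayed idle past the shrink threshold.
    idle = [n for n, m in queue.idle_minutes.items() if m >= SHRINK_IDLE_MINUTES]
    if idle:
        return ("shrink", idle)
    return ("noop", None)
```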
Global configurations
In the global configurations of a cluster, you can enable auto scaling for the cluster and configure the maximum number of nodes and the maximum number of cores in the cluster.
1. Go to the Cluster List page.
   a. Log on to the E-HPC console.
   b. In the left part of the top navigation bar, select a region.
   c. In the left-side navigation pane, click Cluster.
2. On the Cluster List page, find the cluster that you want to manage and click Auto Scale.
3. In the Cluster Auto Scaling dialog box, configure the parameters in the Cluster Global Configuration section. The parameters are described below:
Enable Autoscale: Enables Auto Grow and Auto Shrink for all queues in the cluster. Note: If the configurations of a queue differ from the global configurations of the cluster, the queue configurations take precedence.

Scale-out Waiting Time: The estimated delay between the time a job is submitted and the time the system starts the scale-out operation. Default value: 2 minutes.

Scale-in Waiting Time: The period for which a node must remain idle, without receiving any job requests, before the system automatically releases its resources. Default value: 4 minutes.

Maximum number of cluster nodes: The maximum number of compute nodes that can be created in the cluster.

Maximum number of cores in the cluster: The maximum number of cores that can be created in the cluster.
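The same global settings can also be applied programmatically. The following is a minimal sketch that calls the SetAutoScaleConfig action through the generic CommonRequest interface of the Alibaba Cloud Python SDK core; the action and parameter names are based on the 2018-04-12 E-HPC API version and should be verified against the current API reference.

```python
# Minimal sketch of applying the global auto scaling settings through the
# SetAutoScaleConfig API action instead of the console. Parameter names are
# based on the 2018-04-12 E-HPC API version; verify them against the current
# API reference before use.
from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

client = AcsClient("<access-key-id>", "<access-key-secret>", "cn-hangzhou")

req = CommonRequest()
req.set_domain("ehpc.cn-hangzhou.aliyuncs.com")
req.set_version("2018-04-12")
req.set_action_name("SetAutoScaleConfig")
req.add_query_param("ClusterId", "<cluster-id>")
req.add_query_param("EnableAutoGrow", "true")    # allow automatic scale-out
req.add_query_param("EnableAutoShrink", "true")  # allow automatic scale-in
req.add_query_param("MaxNodesInCluster", "100")  # maximum number of cluster nodes
print(client.do_action_with_exception(req))
```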
Automatic queue scaling
You can configure auto scaling for each queue in a cluster. For more information about the procedure and usage notes, see Configure auto scaling of nodes.
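If you manage queue-level settings through the API as well, the SetAutoScaleConfig action also accepts per-queue overrides as repeated parameters. The Queues.N.* parameter names in the sketch below are assumptions and must be verified against the E-HPC API reference.

```python
# Hedged sketch of per-queue auto scaling overrides via SetAutoScaleConfig.
# The Queues.N.* parameter names are assumptions; verify them against the
# E-HPC API reference before use.
from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

client = AcsClient("<access-key-id>", "<access-key-secret>", "cn-hangzhou")

req = CommonRequest()
req.set_domain("ehpc.cn-hangzhou.aliyuncs.com")
req.set_version("2018-04-12")
req.set_action_name("SetAutoScaleConfig")
req.add_query_param("ClusterId", "<cluster-id>")
# Queue-level settings take precedence over the cluster-global configuration.
req.add_query_param("Queues.1.QueueName", "workq")
req.add_query_param("Queues.1.EnableAutoGrow", "true")
req.add_query_param("Queues.1.MaxNodesInQueue", "10")
print(client.do_action_with_exception(req))
```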