Manage queues for compute nodes - Elastic High Performance Computing

Elastic High Performance Computing (E-HPC) allows you to manage compute nodes that run different jobs or perform different tasks by adding them to different queues. By grouping compute nodes in queues, you can filter and schedule compute nodes more flexibly to optimize job execution efficiency. This topic describes how to use queues to manage compute nodes by group, including creating queues, deleting queues, and editing queue configurations.

Note

Queues are an important dimension in resource monitoring. You can view the overall load and performance of compute nodes by queue on the Monitoring page. For more information, see View the monitoring information.

Prerequisites

The cluster is in the Running state.
Before you delete a queue, make sure that the queue contains no compute nodes.

Create a queue

Go to the Cluster Details page.
1. Log on to the E-HPC console.
2. In the left part of the top navigation bar, select a region.
3. In the left-side navigation pane, click Cluster.
4. On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Nodes and Queues > Queue.

Click Create Queue. On the Create Queue page, configure the parameters as required.

The following tables describe the parameters.

Basic Settings

Parameter	Description
Queue Name	Enter a queue name. The name must meet the following requirements: The name is 1 to 15 characters in length. The name contains one or more of the following character types: uppercase letters (A to Z), lowercase letters (a to z), digits (0 to 9), and underscores (_).
Automatic queue scaling	Specify whether to enable automatic scaling for the queue. After you turn on Automatic queue scaling, you can select Auto Grow and Auto Shrink based on your business requirements. After you enable automatic scaling for the queue, the system automatically adds or removes compute nodes based on the configurations and real-time workloads.
Queue Compute Nodes	Specify the number of compute nodes in the queue. If you do not enable automatic scaling for the queue, configure the initial number of compute nodes in the queue. If you enable automatic scaling for the queue, configure the minimum and maximum numbers of compute nodes in the queue. Important: If you set the Minimum Nodes parameter to a non-zero value, the queue keeps at least that number of nodes during cluster scale-in, and idle nodes within that number are not released. Set the Minimum Nodes parameter with caution to avoid resource waste and unnecessary costs from idle nodes in the queue.
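
The queue name rules above can be checked before you enter a name in the console. The following is a minimal sketch, not part of E-HPC: the function name is hypothetical, and it assumes that characters outside the listed types are rejected.

```python
import re

# Rules from the Queue Name row: 1 to 15 characters, using uppercase letters,
# lowercase letters, digits, and underscores.
# Assumption: any other character makes the name invalid.
_QUEUE_NAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")


def is_valid_queue_name(name: str) -> bool:
    """Return True if name satisfies the documented queue name rules."""
    return _QUEUE_NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_queue_name("gpu_queue_01")` returns True, whereas an empty string, a 16-character name, or a name that contains a hyphen returns False.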

Select Queue Node Configuration

If you enable Automatic queue scaling or set Initial Number of Nodes to a value larger than 0, you must configure the following parameters to enable the system to create compute nodes for the queue:

Configuration item	Description
Inter-node interconnection	Select a mode to interconnect nodes. Valid values: VPCNetwork: the compute nodes communicate with each other over virtual private clouds (VPCs); eRDMANetwork: if the instance types of compute nodes support eRDMA interfaces (ERIs), the compute nodes communicate with each other over eRDMA networks. Note: Only compute nodes of specific instance types support ERIs. For more information, see Overview and Enable eRDMA on an enterprise-level instance.
Virtual Switch	Specify a vSwitch for the nodes to use. The system automatically assigns IP addresses to the compute nodes from the available vSwitch CIDR block.
Instance type Group	Click Add Instance and select an instance type in the panel that appears. If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.

Important

You can select multiple vSwitches and instance types as alternatives in case instances cannot be created because of insufficient inventory. When the system creates a compute node, it tries the specified instance types and zones in sequence. For example, the system first tries the instance types in the order that you specify in the zone where the first vSwitch resides. The specifications of the created instance may therefore vary based on inventory.
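
The fallback order described above can be pictured as a nested loop. The following sketch is only one reading of that description, not the actual E-HPC implementation: it assumes that the system goes through the vSwitches (zones) in order and, within each zone, tries the instance types in order. The function names, the `has_stock` callback, and the IDs in the usage example are hypothetical.

```python
from typing import Callable, Iterator, List, Optional, Tuple


def creation_attempts(
    vswitches: List[str], instance_types: List[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (vSwitch, instance type) pairs in the assumed attempt order."""
    for vswitch in vswitches:            # zones, in the order of the vSwitches
        for instance_type in instance_types:  # instance types, in listed order
            yield vswitch, instance_type


def first_available(
    vswitches: List[str],
    instance_types: List[str],
    has_stock: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Return the first pair for which has_stock is True, or None."""
    for vswitch, instance_type in creation_attempts(vswitches, instance_types):
        if has_stock(vswitch, instance_type):
            return vswitch, instance_type
    return None
```

With `vswitches=["vsw-a", "vsw-b"]` and `instance_types=["type-1", "type-2"]`, the sketch tries both instance types in the zone of `vsw-a` before it moves on to `vsw-b`.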

Auto Scale

Configuration item	Description
Scaling Policy	Select a scaling policy. Currently, only Supply Priority Strategy is supported. With this policy, compute nodes that meet the specification requirements are created in the specified zones in the order of the configured vSwitches.
Maximum number of single expansion nodes	Specify the maximum number of nodes that can be added or removed in each scale-out or scale-in cycle. The default value 0 specifies that the number is unlimited. We recommend that you configure this parameter to control your compute node costs.
Prefix of Hostnames	Specify the hostname prefix for the compute nodes. The prefix is used to distinguish between the nodes of different queues.
Hostname Suffix	Specify the hostname suffix for the compute nodes. The suffix is used to distinguish between the nodes of different queues.
Instance RAM role	Bind a Resource Access Management (RAM) role to the nodes to enable the nodes to access Alibaba Cloud services. We recommend that you select the default role AliyunECSInstanceForEHPCRole.
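
To see how the per-cycle limit interacts with the minimum and maximum numbers of nodes in the queue, consider the following sketch. It is an illustration under stated assumptions, not the scheduler's actual logic: the function names and the `wanted` and `idle` inputs are hypothetical, and a per-cycle limit of 0 is treated as unlimited, as described in the table.

```python
def scale_out_step(current: int, wanted: int, max_nodes: int,
                   per_cycle_cap: int = 0) -> int:
    """Number of nodes to add in one scale-out cycle."""
    needed = max(0, wanted - current)   # nodes the workload asks for
    room = max(0, max_nodes - current)  # nodes allowed before the maximum is hit
    step = min(needed, room)
    if per_cycle_cap > 0:               # 0 means no per-cycle limit
        step = min(step, per_cycle_cap)
    return step


def scale_in_step(current: int, idle: int, min_nodes: int,
                  per_cycle_cap: int = 0) -> int:
    """Number of idle nodes to remove in one scale-in cycle."""
    removable = max(0, current - min_nodes)  # never go below the minimum
    step = min(idle, removable)
    if per_cycle_cap > 0:                    # 0 means no per-cycle limit
        step = min(step, per_cycle_cap)
    return step
```

In this sketch, a queue with 5 idle nodes and a minimum of 2 releases at most 3 nodes, which matches the note that a non-zero minimum keeps idle nodes in the queue.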

Click Save.
Click the refresh icon to refresh the queue list. If the queue appears in the list, the queue is created.

Configure a queue

Important

To prevent impacts on ongoing business, we recommend that you configure queues when your business is idle.

Go to the Cluster Details page.
1. Log on to the E-HPC console.
2. In the left part of the top navigation bar, select a region.
3. In the left-side navigation pane, click Cluster.
4. On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Nodes and Queues > Queue.
Find the queue that you want to manage and click Edit in the Actions column.

On the Edit Queue page, configure the parameters that are described in the following tables.

Basic Settings

Parameter	Description
Automatic queue scaling	Automatic queue scaling is turned off by default. After you turn on the switch, you can select Auto Grow and Auto Shrink based on your business requirements. Note: If the configurations of a queue are different from the global configurations of a cluster, the configurations of the queue take precedence.
Queue Compute Nodes	The range of the number of compute nodes in the queue. Minimum Nodes: Valid values: 0 to 1000. This value limits how far the queue can scale in. Maximum Nodes: Valid values: 0 to 5000. This value limits how far the queue can scale out. Important: If you set the Minimum Nodes parameter to a non-zero value, the queue keeps at least that number of nodes during cluster scale-in, and idle nodes within that number are not released. Set the Minimum Nodes parameter with caution to avoid resource waste and unnecessary costs from idle nodes in the queue. The maximum number of nodes in the queue cannot exceed the maximum number of nodes in the cluster.
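
The limits on the numbers of nodes in a queue can be checked before you save the settings. The following sketch collects the documented ranges in one place. The function name is hypothetical, and the check that the minimum does not exceed the maximum is an assumption that the source does not state explicitly.

```python
from typing import List


def validate_node_range(min_nodes: int, max_nodes: int,
                        cluster_max_nodes: int) -> List[str]:
    """Return a list of problems; an empty list means the values look valid."""
    errors = []
    if not 0 <= min_nodes <= 1000:
        errors.append("Minimum Nodes must be between 0 and 1000.")
    if not 0 <= max_nodes <= 5000:
        errors.append("Maximum Nodes must be between 0 and 5000.")
    if min_nodes > max_nodes:
        # Assumption: not stated in the source, but a range needs min <= max.
        errors.append("Minimum Nodes must not exceed Maximum Nodes.")
    if max_nodes > cluster_max_nodes:
        errors.append("Maximum Nodes must not exceed the cluster maximum.")
    return errors
```

Returning all problems at once, rather than stopping at the first one, makes it easier to correct several fields in one pass.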

Select Queue Node Configuration

If you enable Automatic queue scaling or set Initial Number of Nodes to a value larger than 0, you must configure the following parameters to enable the system to create compute nodes for the queue:

Configuration item	Description
Inter-node interconnection	Select a mode to interconnect nodes. Valid values: VPCNetwork: the compute nodes communicate with each other over virtual private clouds (VPCs); eRDMANetwork: if the instance types of compute nodes support eRDMA interfaces (ERIs), the compute nodes communicate with each other over eRDMA networks. Note: Only compute nodes of specific instance types support ERIs. For more information, see Overview and Enable eRDMA on an enterprise-level instance.
Virtual Switch	Specify a vSwitch for the nodes to use. The system automatically assigns IP addresses to the compute nodes from the available vSwitch CIDR block.
Instance type Group	Click Add Instance and select an instance type in the panel that appears. If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.

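Important

You can select multiple vSwitches and instance types as alternatives in case instances cannot be created because of insufficient inventory. When the system creates a compute node, it tries the specified instance types and zones in sequence. The specifications of the created instance may therefore vary based on inventory.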

Auto Scale

Configuration item	Description
Scaling Policy	Select a scaling policy. Currently, only Supply Priority Strategy is supported. With this policy, compute nodes that meet the specification requirements are created in the specified zones in the order of the configured vSwitches.
Maximum number of single expansion nodes	Specify the maximum number of nodes that can be added or removed in each scale-out or scale-in cycle. The default value 0 specifies that the number is unlimited. We recommend that you configure this parameter to control your compute node costs.
Prefix of Hostnames	Specify the hostname prefix for the compute nodes. The prefix is used to distinguish between the nodes of different queues.
Hostname Suffix	Specify the hostname suffix for the compute nodes. The suffix is used to distinguish between the nodes of different queues.
Instance RAM role	Bind a Resource Access Management (RAM) role to the nodes to enable the nodes to access Alibaba Cloud services. We recommend that you select the default role AliyunECSInstanceForEHPCRole.

Click Save.
Click the refresh icon to refresh the queue list and check the information in the Auto Scaling Configurations column. If the information is updated, the scaling configurations are edited.

Delete queues

Important

Before you delete a queue, make sure that the queue does not have compute nodes. Otherwise, you cannot delete the queue.
To prevent impacts on ongoing business, we recommend that you delete queues when your business is idle.

Go to the Cluster Details page.
1. Log on to the E-HPC console.
2. In the left part of the top navigation bar, select a region.
3. In the left-side navigation pane, click Cluster.
4. On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Nodes and Queues > Queue.
Use one of the following methods to delete queues:
- To delete a single queue, find the queue and click Delete in the Actions column.
- To delete multiple queues at a time, select the queues and click Batch Delete in the lower part of the page.
In the message that appears, confirm the queue information and click OK.
Click the refresh icon to refresh the queue list. If the queue no longer appears in the list, the queue is deleted.
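
Before a batch delete, it can help to separate the selected queues that are empty from those that still contain compute nodes, because only empty queues can be deleted. The following is a minimal sketch with hypothetical names. The node counts are assumed to be read from the queue list and supplied by the caller.

```python
from typing import Dict, List, Tuple


def split_deletable(node_counts: Dict[str, int],
                    selected: List[str]) -> Tuple[List[str], List[str]]:
    """Split selected queues into (deletable, blocked) by node count."""
    deletable = [q for q in selected if node_counts[q] == 0]  # empty queues
    blocked = [q for q in selected if node_counts[q] > 0]     # still have nodes
    return deletable, blocked
```

Queues in the second list must have their compute nodes removed before they can be deleted.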