In an Elastic High Performance Computing (E-HPC) cluster, compute nodes are Elastic Compute Service (ECS) instances. You can perform operations on created nodes, such as remotely connecting to them, sending commands to them, restarting them, and releasing them. This topic describes how to manage compute nodes in a cluster.
Prerequisites
The cluster is in the Running state.
The following items describe the prerequisites for creating nodes:
A queue is created in the cluster. For more information, see Manage queues.
A vSwitch and an IP address are available in the region where you want to create nodes. For more information, see Create and manage vSwitches.
Sufficient unused ECS instance quotas are available in the region where you want to create nodes. For more information, see View or increase the quotas of ECS instance types.
The Node page of the E-HPC console displays only compute nodes. Management and logon nodes are not displayed.
Create nodes
Go to the Cluster Details page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Node.
Click Add Node. On the Add Node page, configure the following parameters.
Basic Settings
Parameter
Description
Destination Queue
Select a queue that you created in the cluster.
Nodes
Specify the number of nodes that you want to create in the cluster.
Node Configurations
Parameter
Description
Select Node Type
You can select only Create Node.
Inter-node interconnection
Select a mode to interconnect nodes. Valid values:
VPCNetwork: The compute nodes communicate with each other over virtual private clouds (VPCs).
eRDMANetwork: If the instance types of compute nodes support Elastic RDMA interfaces (ERIs), the compute nodes communicate with each other over elastic Remote Direct Memory Access (eRDMA) networks.
Note: Only compute nodes of specific instance types support ERIs. For more information, see Overview and Configure eRDMA on an enterprise-level instance.
Use Preset Node Pool
Select a reserved node pool that you created. The system automatically selects IP addresses and hostnames from the unassigned reserved nodes in the pool to create compute nodes.
Note: You can quickly reuse pre-allocated resources when you scale out by using a reserved node pool. For more information, see Use reserved node pools in clusters.
Virtual Switch
Specify a vSwitch for the nodes to use. The system automatically assigns an IP address to the compute nodes from the available vSwitch CIDR block.
Instance Type Group
Click Add Instance and select an instance type on the panel that appears.
If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.
Prefix of Hostnames
Specify the hostname prefix for the compute nodes. The prefix is used to distinguish between the nodes of different queues.
Hostname Suffix
Specify the hostname suffix for the compute nodes. The suffix is used to distinguish between the nodes of different queues.
Instance RAM role
Bind a Resource Access Management (RAM) role to the nodes to enable the nodes to access Alibaba Cloud services.
We recommend that you select the default role AliyunECSInstanceForEHPCRole.
Read and select the acknowledgment "I have learned that deletion protection is enabled by default for added nodes to prevent the nodes from being affected by queue scaling activities. I understand that I can disable deletion protection for the nodes or manually delete the nodes to avoid unnecessary costs." Then, click Confirm Add.
Note: The applications in the cluster are automatically installed on the added compute nodes, and the compute nodes are automatically initialized. Existing compute nodes are not affected.
You can view the states of the scaled-out nodes in the node list on the Node page. If the nodes are in the Running state, the cluster is scaled out.
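The hostname prefix and suffix configured above determine how the new compute nodes are named. As a minimal sketch, assuming E-HPC numbers nodes with zero-padded sequential indexes between the prefix and suffix (the exact numbering scheme may differ in your cluster), the resulting hostnames could be composed like this:

```python
def build_hostnames(prefix, count, suffix="", start=0, width=3):
    """Illustrative sketch: compose compute node hostnames from a
    queue's hostname prefix and suffix. The zero-padded sequential
    numbering is an assumption, not the documented E-HPC scheme."""
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(start, start + count)]

# A prefix of "compute" with 3 nodes and no suffix:
print(build_hostnames("compute", 3))  # ['compute000', 'compute001', 'compute002']
```

Because the prefix and suffix are what distinguish the nodes of different queues, choose values that make the owning queue obvious at a glance, for example `q1-` as the prefix.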
Restart nodes
You can restart a node if exceptions occur on the node. You can restart a node in one of the following modes:
Normal Reboot: If you set Reboot Mode to Normal Reboot, the restart command is sent to the node. Then, the operating system terminates all processes and restarts.
Force Reboot: If you set Reboot Mode to Force Reboot, the node is directly powered off. Data loss may occur. We recommend that you perform a force reboot only when a normal reboot fails.
When a node is restarted, the jobs that are running on the node are stopped. Make sure that no jobs are running on the node before you restart the node.
Go to the Cluster Details page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Node.
Select one or more compute nodes that you want to restart in the node list.
Click Reboot below the node list.
In the dialog box that appears, select a restart mode and click OK.
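Because compute nodes are ECS instances, the two restart modes map naturally onto the ECS RebootInstance operation, where the ForceStop parameter controls whether the instance is powered off directly. The mapping below is an assumption for illustration; it is not documented E-HPC behavior:

```python
def reboot_request(instance_id, mode="Normal Reboot"):
    """Build parameters for an ECS RebootInstance call.
    Assumption: the console's Force Reboot corresponds to
    ForceStop=True (direct power-off, data loss possible), and
    Normal Reboot to ForceStop=False (graceful OS restart)."""
    if mode not in ("Normal Reboot", "Force Reboot"):
        raise ValueError(f"unknown reboot mode: {mode}")
    return {
        "Action": "RebootInstance",
        "InstanceId": instance_id,
        "ForceStop": mode == "Force Reboot",
    }
```

As the section notes, prefer the graceful path: only fall back to `ForceStop=True` when a normal reboot fails.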
Connect to a node
By default, a remote Workbench session persists for six hours. If the session is idle for more than six hours, it is closed and you must reconnect to continue using the node.
Go to the Cluster Details page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Node.
On the Node page, find the node to which you want to connect and click Remote Connection in the Actions column.
In the Remote connection dialog box, click Sign in now in the Workbench section.
In the Instance Login dialog box, configure parameters. For more information about the parameters, see Connect to an instance by using Workbench.
Send commands to nodes
If you want to quickly maintain nodes, for example, to install software or execute O&M scripts, you can send remote commands to the nodes.
Go to the Cluster Details page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Node.
Select one or more compute nodes to which you want to send commands in the node list.
Click Send Command below the node list.
In the dialog box that appears, configure parameters and type your commands.
For more information about the parameters, see Send remote commands.
Click Run.
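Under the hood, remote commands on ECS instances are delivered by Cloud Assistant through the ECS RunCommand operation. Assuming this is what the Send Command feature uses (check the Send remote commands topic for the authoritative parameter list), the request parameters could be assembled like this:

```python
def run_command_request(region_id, instance_ids, script):
    """Sketch of Cloud Assistant RunCommand parameters (ECS API).
    Assumption: the console's Send Command feature maps onto this
    operation; verify against the Send remote commands topic."""
    params = {
        "Action": "RunCommand",
        "RegionId": region_id,
        "Type": "RunShellScript",  # shell script for Linux nodes
        "CommandContent": script,
    }
    # RunCommand accepts repeated InstanceId.N parameters,
    # one per target node.
    for n, inst in enumerate(instance_ids, start=1):
        params[f"InstanceId.{n}"] = inst
    return params
```

This is also why you can select multiple nodes in the list: one command run fans out to every selected instance.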
Enable or disable deletion protection
The deletion protection feature prevents unintended or malicious deletion of nodes. After the feature is enabled for a node, the node cannot be deleted until the feature is disabled. This helps ensure the stable running of the cluster.
Go to the Cluster Details page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Node.
Select one or more compute nodes for which you want to configure deletion protection in the node list.
Click More below the node list and select Enable Deletion Protection or Disable Deletion Protection based on your business requirements.
In the message that appears, click OK.
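At the ECS level, deletion protection is a per-instance attribute that can be toggled with the ModifyInstanceAttribute operation. Assuming the console flips that flag for each selected node (an assumption about the implementation, not documented here), the request would look like:

```python
def set_deletion_protection(instance_id, enabled):
    """Sketch of toggling the DeletionProtection flag on one ECS
    instance via ModifyInstanceAttribute. Assumption: this is the
    mechanism behind the console's enable/disable actions."""
    return {
        "Action": "ModifyInstanceAttribute",
        "InstanceId": instance_id,
        "DeletionProtection": enabled,
    }
```

Remember that a protected node silently blocks deletion: if a later delete or scale-in fails, check this flag first.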
Delete nodes
You can delete the compute nodes that you no longer need in a cluster to scale down the cluster.
Pay-as-you-go nodes are released immediately after you delete them. Subscription nodes are retained until they expire; to release them earlier, request a refund or change their billing method to pay-as-you-go. For more information, see Release or unsubscribe from an ECS instance.
Data stored on a node cannot be restored after the node is released. If you want to retain the data on the node, we recommend that you create a snapshot to back up the data before you delete the node. For more information, see Create a snapshot.
Go to the Cluster Details page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Node.
Select one or more compute nodes that you want to delete in the node list.
Click Delete below the node list.
Confirm the message and click OK.
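The snapshot-then-delete order recommended above can be sketched at the ECS API level: CreateSnapshot backs up a disk, and DeleteInstance releases the instance. Pairing the two calls this way is an illustration of the workflow, not the E-HPC delete implementation, and the snapshot name below is made up:

```python
def backup_then_delete_requests(disk_id, instance_id):
    """Sketch of the recommended order when deleting a node:
    snapshot the node's disk first (ECS CreateSnapshot), then
    release the instance (ECS DeleteInstance). The pairing and
    the snapshot name are illustrative assumptions."""
    return [
        {
            "Action": "CreateSnapshot",
            "DiskId": disk_id,
            "SnapshotName": f"backup-before-delete-{instance_id}",
        },
        {
            "Action": "DeleteInstance",
            "InstanceId": instance_id,
        },
    ]
```

Wait for the snapshot to finish before issuing the delete; once the instance is released, the disk data can no longer be recovered.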