This topic describes how to create and manage a Standard Edition public cloud cluster in the Elastic High Performance Computing (E-HPC) console to help you get started with E-HPC.
Prerequisites
A service-linked role for E-HPC is created. The first time you log on to the E-HPC console, you are prompted to create a service-linked role for E-HPC.
A virtual private cloud (VPC) and a vSwitch are created. For more information, see Create and manage a VPC and Create a vSwitch.
Apsara File Storage NAS (NAS) is activated. A NAS file system and a mount target are created. For more information, see Create a file system and Manage mount targets.
Create a cluster
Go to the Create Cluster page.
On the Create Cluster page, configure the parameters in the following steps:
Cluster Configuration
Basic Settings
Parameter
Example
Description
Region
China (Hangzhou)
The region where you want to create a cluster.
Network and Availability Zone
VPC: vpc-bp1opxu1zkhn00g****
vSwitch: vsw-bp1ljgg5tjrs62n64****
The VPC in which your cluster is deployed and the vSwitch to which the cluster belongs.
Note: The nodes in the cluster use IP addresses from the vSwitch CIDR block. Make sure that the number of available IP addresses in the vSwitch is greater than the number of cluster nodes. A quick way to check the available IP addresses is sketched after this table.
Security Group
Select Automatically create a normal security group.
A security group is used to manage the inbound and outbound traffic of the nodes in a cluster. If you let the system create the security group, rules that enable communication between the nodes in the cluster are added automatically.
Select the type of the security group that is automatically created based on your business requirements. For more information about the differences between basic and advanced security groups, see Basic security groups and advanced security groups.
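The following sketch shows one way to check the number of available IP addresses in the vSwitch before you create the cluster. It assumes that the Alibaba Cloud CLI (aliyun) is installed and configured; the vSwitch ID and region are the example values used in this topic.

```bash
# Query the vSwitch and read the AvailableIpAddressCount field in the
# response. The value must be greater than the number of cluster nodes.
aliyun vpc DescribeVSwitchAttributes \
  --RegionId cn-hangzhou \
  --VSwitchId vsw-bp1ljgg5tjrs62n64****
```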
Select a cluster type
This section describes how to create a public cloud cluster of Standard Edition. A cluster of this type consists of one management node and multiple compute nodes. You must select the type of the scheduler and configure the management node.
Parameter
Example
Description
Series
Select Standard Edition.
The series of the cluster.
Deployment Mode
Select Public cloud cluster.
The deployment mode of the cluster.
Cluster Type
Select SLURM.
The scheduler type of the cluster. Common schedulers in HPC scenarios are supported. Examples: Slurm and OpenPBS.
Management node
Instance Family: General-purpose Type g6
Instance Type: ecs.g6.large
Image: CentOS 7.6 64 bit
Storage: System Disk 40 GiB ESSD PL0
Hyper-Threading: Enable
The ECS instance in which the scheduler and domain account service are deployed. Select appropriate configurations for the management node based on your business scenario and cluster size.
Payment Details
The billing method of the management node. For more information, see Instance types.
Pay-as-you-go: You are charged based on the actual usage duration. Preemptible instances are not supported.
Subscription: You are charged on a monthly or yearly basis.
Instance Type
The instance specifications of the management node. We recommend that you choose the specifications based on the number of compute nodes in the cluster:
If the number of compute nodes in the cluster is less than or equal to 100, we recommend that you select 16 or more vCPUs and 64 GiB or more of memory.
If the number of compute nodes in the cluster is less than or equal to 500, we recommend that you select 32 or more vCPUs and 128 GiB or more of memory.
If the number of compute nodes in the cluster is greater than 500, we recommend that you select 64 or more vCPUs and 256 GiB or more of memory.
Image
The image used to deploy the management node. Different images support different schedulers. The images that are actually displayed in the console prevail.
Storage
The system disk specification of the management node and whether to attach a data disk to the management node. For more information about the disk type and performance level, see Disks.
Hyper-Threading
By default, Hyper-Threading is enabled. If your business requires better performance, you can disable Hyper-Threading.
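If you are unsure whether the Hyper-Threading setting took effect, you can log on to the management node after it is created and check the vCPU topology with standard Linux tools. A minimal check:

```bash
# "Thread(s) per core: 2" indicates that Hyper-Threading is enabled;
# "Thread(s) per core: 1" indicates that it is disabled.
lscpu | grep -i "thread(s) per core"
```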
Compute Node and Queue
Basic Settings
Parameter
Example
Description
Automatic queue scaling
Off
Specifies whether to enable Automatic queue scaling. If you turn on Automatic queue scaling, you can select Auto Grow and Auto Shrink based on your business requirements. The system then automatically adds or removes compute nodes based on the configurations and the real-time load.
Queue Compute Nodes
5
The number of nodes in the queue.
If you do not enable Automatic queue scaling, configure the initial number of compute nodes in the queue.
If you enable Automatic queue scaling, configure the minimum and maximum number of compute nodes in the queue.
Important: If you set the Minimal Nodes parameter to a non-zero value, the queue retains at least the specified number of nodes during cluster scale-in, and idle nodes among them are not released. Specify the Minimal Nodes parameter with caution to prevent resource waste and unnecessary costs caused by idle nodes in the queue.
Select Queue Node Configuration
Parameter
Example
Description
Inter-node interconnection
Select VPCNetwork.
The network connection mode between compute nodes.
VPCNetwork: The compute nodes communicate with each other over VPCs.
eRDMANetwork: If the instance types of compute nodes support Elastic RDMA interfaces (ERIs), the compute nodes communicate with each other over elastic Remote Direct Memory Access (eRDMA) networks.
Note: Only compute nodes of specific instance types support ERIs. For more information, see Overview and Configure eRDMA on an enterprise-level instance. A quick way to verify eRDMA on a running node is sketched after this table.
Virtual Switch
vsw-bp1ljgg5tjrs62n64****
The vSwitch to which the node belongs. The system automatically assigns an IP address to the compute node from the available vSwitch CIDR block.
Instance type Group
Instance Family: General-purpose Type g6
Instance Type: ecs.g6.large
Image: CentOS 7.6 64 bit
Storage: System Disk 40 GiB ESSD PL0
Hyper-Threading: Enable
Click Add Instance and select Instance Type.
If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.
Low Latency Deployment Set
Select Disable.
A deployment set provides a deployment strategy for deploying instances on physical servers. For more information, see Deployment set.
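If you select eRDMANetwork, you can verify on a running compute node that the eRDMA driver is loaded and an RDMA device is present. A minimal check, assuming a CentOS or Alibaba Cloud Linux image with the eRDMA driver and the libibverbs utilities installed:

```bash
# Check that the erdma kernel module is loaded.
lsmod | grep erdma

# List RDMA devices. An erdma device indicates that the ERI is available.
ibv_devices
```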
Shared File Storage
Parameter
Example
Description
Type
Select General-purpose NAS.
The type of the file system that you want to mount.
General-purpose NAS
Extreme NAS
Parallel file CPFS
File System
0e9104**** (Capacity NFS)
The ID and mount point of the file system that you want to mount. Make sure that the file system has an available mount target.
File System Directory
0e9104****-tpd33.cn.hangzhou.nas.aliyuncs.com
The directory of the file system that you want to mount.
Mount Options
Select Mount over NFSv3.
The mount protocol.
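The cluster mounts the shared file system for you, but you can also mount the same NAS file system manually, for example on a separate ECS instance for testing. The following is a minimal sketch that matches the Mount over NFSv3 option in this example; the mount target domain name is the example value from this topic, and /mnt is an assumed local directory:

```bash
# Install the NFS client (CentOS).
sudo yum install -y nfs-utils

# Mount the file system over NFSv3 with the mount options recommended for NAS.
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport \
  0e9104****-tpd33.cn.hangzhou.nas.aliyuncs.com:/ /mnt

# Verify the mount.
df -h /mnt
```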
Software and Service Component
You do not need to specify this parameter. By default, a logon node is configured.
Confirm configuration
Confirm the configurations and configure the cluster name and logon credentials.
Parameter
Example
Description
Cluster Name
E-HPC-test
The name of the cluster. The cluster name is displayed on the Cluster page to facilitate identification.
Login Credentials
Select Custom Password.
The credentials used to log on to the cluster. Only Custom Password is supported.
Set Password and Repeat Password
Ehpc12****
The password of the cluster. By default, the root user uses this password to log on to all nodes in the cluster.
Check the billing information, read and select Services and Agreements, and then click Create Cluster.
If a cluster named E-HPC-test appears on the Cluster page and is in the Running status, the cluster is created.
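After the cluster is running, you can log on to the management node and confirm that the scheduler is up. A minimal check for the Slurm example in this topic:

```bash
# List Slurm partitions (queues) and node states on the management node.
sinfo

# Show each compute node and its state in detail.
sinfo -N -l
```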
Create a user
After you create the cluster, you must create a user to submit jobs in the cluster.
On the User Management page, click Add User.
In the Add User dialog box, configure the parameters and click Confirm. The following table describes the parameters.
Parameter
Example
Description
Username
test.user
The user name.
The name can contain 6 to 30 characters.
The name must start with a letter.
The name can contain letters, digits, and periods (.).
Role Permissions
Sudo Permissions Group
Regular Permissions Group: suitable for regular users who only need to submit and debug jobs.
Sudo Permissions Group: suitable for administrators who need to manage clusters. In addition to submitting and debugging jobs, users who have sudo permissions can run sudo commands to install software and restart nodes.
Important: Exercise caution when you grant sudo permissions to users. A cluster may not run as expected if a user who has sudo permissions performs a misoperation, such as deleting an E-HPC software stack module by mistake.
Password and Repeat Password
Ehpc12****
The password that the user uses to log on to the cluster. Follow the on-screen instructions to specify the parameters.
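After the user is created, you can verify it from the management node. A quick check using standard Linux commands; test.user is the example username in this topic:

```bash
# Confirm that the domain account resolves on the node.
id test.user

# Switch to the user and confirm that the home directory on the shared
# file system is accessible.
su - test.user -c pwd   # expected output: /home/test.user
```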
Scale out a cluster
On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, go to the Node page.
Click Add Node. On the Add Node page, configure the following parameters.
Basic Settings
Parameter
Example
Description
Destination Queue
comp
Select a queue that you created in the cluster.
Nodes
10
Specify the number of nodes that you want to add to the cluster.
Node Configurations
Parameter
Example
Description
Select Node Type
Create Node
Valid value: Create Node.
Inter-node interconnection
VPCNetwork
The network connection mode between nodes.
VPCNetwork: The compute nodes communicate with each other over VPCs.
eRDMANetwork: If the instance types of compute nodes support ERIs, the compute nodes communicate with each other over eRDMA networks.
Note: Only compute nodes of specific instance types support ERIs. For more information, see Overview and Configure eRDMA on an enterprise-level instance.
Virtual Switch
vsw-bp1ljgg5tjrs62n64****
The vSwitch to which the node belongs. The system automatically assigns an IP address to the compute node from the available vSwitch CIDR block.
Instance type Group
Instance Family: General-purpose Type g6
Instance Type: ecs.g6.large
Image: CentOS 7.6 64 bit
Storage: System Disk 40 GiB ESSD PL0
Hyper-Threading: Enable
Click Add Instance and select Instance Type.
If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.
Select I have learned that "deletion protection" is enabled by default for added nodes to prevent the nodes from being affected by queue scaling activities. I understand that I can disable deletion protection for the nodes or manually delete the nodes to avoid unnecessary costs. and click Confirm Add.
You can view the status of the scaled-out nodes in the node list on the Node page. If the nodes are in the Running status, the cluster is scaled out.
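You can also confirm the scale-out from the scheduler side. For the Slurm example in this topic, comp is the example destination queue:

```bash
# The comp partition should now report the added nodes.
sinfo -p comp

# Count the nodes that the scheduler knows about in the queue.
sinfo -N -p comp --noheader | wc -l
```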
Submit a job
On the details page of the cluster, click Job Management in the left-side navigation pane.
Click Create Job.
On the Create Job page, configure the parameters and click Confirm Create.
Note: Specify the parameters in the following table and retain the default settings for the other parameters. For more information, see Submit a job.
Parameter
Required
Example
Description
Job Name
Yes
testjob
The name of the job.
Scheduler Queue
Yes
comp
The name of the queue in which the job is run.
Run Command
Yes
/home/test.user/testjob.slurm
The job execution command that you want to submit to the scheduler. You can enter a command or the path of a script file. A sample script is sketched after this table.
If the script file is executable, enter its path. Example: /home/test.user/testjob.slurm.
If the script file is not executable, enter the execution command. Example: /opt/mpi/bin/mpirun /home/test/job.slurm.
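The Run Command in this example points to a Slurm batch script. The following is a minimal sketch of what /home/test.user/testjob.slurm might contain; the job name and queue mirror the examples in this topic, and the resource numbers are assumptions that you should adjust to your workload:

```bash
#!/bin/bash
#SBATCH --job-name=testjob        # job name shown by squeue
#SBATCH --partition=comp          # the queue (partition) used in this topic
#SBATCH --nodes=2                 # assumed node count; adjust as needed
#SBATCH --ntasks-per-node=2       # assumed tasks per node; adjust as needed
#SBATCH --output=testjob_%j.log   # stdout and stderr; %j is the job ID

# Print the hostname of each allocated task; replace with your workload.
srun hostname
```

If you prefer the command line over the console, you can also submit the script directly with sbatch /home/test.user/testjob.slurm and track it with squeue -u test.user.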
Delete a compute node
You can delete the compute nodes that you no longer require in a cluster.
Select one or more compute nodes that you want to delete from the node list.
Click Delete in the lower part of the node list.
Read the displayed message and then click Confirm.
Release a cluster
If you no longer need a cluster, you can release the cluster.
On the Cluster Details page, click More in the upper-right corner, and then select Release the cluster.
In the message that appears, click OK.
References
You can use a cluster template to quickly create a cluster in which GROMACS is pre-installed and submit jobs by using the E-HPC Portal. For more information, see Use GROMACS to analyze jobs.