Create an Elastic High Performance Computing (E-HPC) managed cluster to run HPC workloads on Alibaba Cloud. In a managed cluster, E-HPC provisions and maintains the management node. You only manage compute nodes and job queues.
Creating an E-HPC cluster automatically provisions resources such as ECS instances, which incur fees. For details, see Billing overview.
Cluster architecture
A managed cluster consists of three components:
Compute nodes: ECS instances that run jobs. Compute nodes belong to scalable queues. The number of compute nodes can grow or shrink based on workload demand.
Logon node: A single ECS instance with the Login addon deployed and an elastic IP address (EIP) bound for remote access.
Shared file system: A NAS or Cloud Parallel File Storage (CPFS) file system shared across all nodes for job and application data.
Do not use the ECS console to manage nodes in an E-HPC cluster unless necessary. Use the E-HPC console instead.
For more information, see Cluster overview.
Prerequisites
Before you begin, make sure that you have:
An E-HPC service-linked role (created automatically the first time you log on to the E-HPC console)
A VPC and a vSwitch. See Create and manage a VPC and Create a vSwitch.
Apsara File Storage NAS (NAS) activated, with a file system and mount target created. See Create a file system and Create a mount target.
Procedure
Step 1: Open the Create Cluster page
Go to the Create Cluster page in the E-HPC console.
Step 2: Configure the cluster
On the Cluster Configuration step, configure network, cluster type, and scheduler settings.
Basic settings
| Parameter | Description |
|---|---|
| Region | Region where the cluster is created. |
| Network and Availability Zone | VPC and vSwitch for the cluster. Nodes use IP addresses from the vSwitch. Make sure the vSwitch has more available IP addresses than the number of cluster nodes. |
| Security group | Controls inbound and outbound traffic for cluster nodes. Select one of the following options: Automatically create a normal security group, Automatically create enterprise security groups, or Select Existing Security Group. The system automatically creates rules for inter-node communication. A single basic (normal) security group can contain up to 2,000 nodes. For larger clusters, use an advanced (enterprise) security group. See Basic security groups and advanced security groups. |
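The IP address requirement for the vSwitch can be checked before you create the cluster. The following is a minimal sketch that uses Python's standard `ipaddress` module. The `reserved` default of 4 is an assumption about how many addresses the platform keeps in each vSwitch (verify it in the VPC documentation), and `in_use` is a hypothetical input for addresses already taken by other resources.

```python
import ipaddress


def available_ips(cidr: str, reserved: int = 4, in_use: int = 0) -> int:
    """Estimate how many addresses in a vSwitch CIDR block are still free.

    reserved: addresses kept by the platform (assumed value, not from this page).
    in_use: addresses already assigned to other resources (hypothetical input).
    """
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - reserved - in_use, 0)


def vswitch_fits(cidr: str, planned_nodes: int, **kwargs) -> bool:
    # The table asks for more free addresses than cluster nodes.
    return available_ips(cidr, **kwargs) > planned_nodes
```

When you estimate `planned_nodes`, count every node that takes an address from the vSwitch, not only the initial compute nodes.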
Cluster type
A managed cluster separates the management node from compute nodes. E-HPC creates and maintains the management node.
| Parameter | Description |
|---|---|
| Series | Select Managed Edition. |
| Deployment Mode | Select Public cloud cluster. |
| Cluster Type | Select Slurm (only supported option). |
Custom options
| Parameter | Description |
|---|---|
| Scheduler | Scheduler software to deploy. Only Slurm 22 is supported. |
| Domain Account | Domain account service for the cluster. Only NIS (Network Information Service) is supported for managed clusters. |
| Domain name resolution | Use the default value. |
| Maximum number of cluster nodes | Maximum number of nodes the cluster can contain. Works with Maximum number of cores in the cluster to control cluster size. |
| Maximum number of cores in the cluster | Maximum number of vCPUs available to compute nodes. Works with Maximum number of cluster nodes to control cluster size. |
| Cluster Deletion Protection | Prevents accidental cluster deletion. When enabled, the cluster cannot be released until you disable this setting. |
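Because the node limit and the core limit apply together, a planned queue layout has to satisfy both. The following sketch illustrates that check. The `QueuePlan` type and its fields are hypothetical, and the sketch assumes that only compute nodes count toward both limits, which this page does not state for the node limit.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class QueuePlan:
    name: str
    max_nodes: int       # largest node count planned for the queue
    vcpus_per_node: int  # vCPUs of the chosen instance type


def fits_cluster_limits(
    queues: Iterable[QueuePlan],
    max_cluster_nodes: int,
    max_cluster_cores: int,
) -> bool:
    """Return True only if the plan stays within both cluster-wide limits."""
    queues = list(queues)
    total_nodes = sum(q.max_nodes for q in queues)
    total_cores = sum(q.max_nodes * q.vcpus_per_node for q in queues)
    return total_nodes <= max_cluster_nodes and total_cores <= max_cluster_cores
```

A plan with few large nodes can hit the core limit long before the node limit, so it helps to size both values from the same plan.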
Resource group
Assign the cluster to a resource group. By default, clusters belong to the default resource group. For more information, see Resource groups.
Step 3: Configure compute nodes and queues
On the Compute Node and Queue step, set up queues and compute nodes.
Compute nodes are organized into queues. When you submit a job, specify the target queue. Each cluster has a default queue named comp. To add queues, click Add more queues.
Configure the following parameters for each queue:
Basic settings
| Parameter | Description |
|---|---|
| Automatic queue scaling | Enable or disable automatic scaling. After you enable this feature, select Auto Grow, Auto Shrink, or both to automatically add or remove compute nodes based on workload. |
| Queue Compute Nodes | Set the initial, maximum, and minimum node counts. Without auto-scaling: set the initial number. With auto-scaling: set the minimum and maximum. |
Setting Minimal Nodes to a non-zero value retains that number of nodes during scale-in, even when idle. Set this value carefully to avoid unnecessary costs.
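The effect of the minimum and maximum node counts can be pictured as a clamp on the number of nodes that the workload needs. The sketch below is an illustration only, not the algorithm that E-HPC uses; the function and parameter names are hypothetical.

```python
def target_node_count(nodes_needed: int, min_nodes: int, max_nodes: int) -> int:
    """Clamp the demanded node count to the queue's configured range."""
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    # An idle queue (nodes_needed == 0) still keeps min_nodes running.
    return max(min_nodes, min(nodes_needed, max_nodes))
```

With a minimum of 2, an idle queue keeps two billable nodes. With a minimum of 0, it can shrink to none.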
Queue node configuration
Configure node specifications if auto-scaling is enabled or the initial node count is greater than 0.
| Parameter | Description |
|---|---|
| Inter-node interconnection | Communication mode between compute nodes. Options: VPC Network, which uses standard VPC networking, or eRDMA Network, which uses elastic Remote Direct Memory Access (eRDMA) and requires instance types that support Elastic RDMA Interfaces (ERIs). See eRDMA overview and Configure eRDMA on an enterprise-level instance. |
| Use Preset Node Pool | Select a reserved node pool to reuse pre-allocated resources during scale-out. See Use reserved node pools in clusters. |
| Virtual Switch | vSwitch for compute nodes. The system assigns IP addresses from the vSwitch CIDR block. |
| Instance type Group | Click Add Instance to select instance types. Without auto-scaling: one instance type. With auto-scaling: multiple instance types. |
Specify multiple vSwitches and instance types as fallbacks for inventory shortages. The system attempts to create nodes in the order of specified instance types and zones. The first vSwitch determines the initial zone.
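The fallback behavior can be sketched as an ordered search over vSwitch and instance type combinations. The exact order is an assumption: this sketch tries every instance type in the first vSwitch's zone before it moves to the next vSwitch, which this page does not confirm. The `has_stock` callback is a hypothetical stand-in for an inventory check.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def first_available(
    vswitches: Iterable[str],
    instance_types: Iterable[str],
    has_stock: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Return the first (vSwitch, instance type) pair that has inventory."""
    # product() varies the last argument fastest, so each vSwitch is
    # paired with every instance type before the next vSwitch is tried.
    for vswitch, instance_type in product(vswitches, instance_types):
        if has_stock(vswitch, instance_type):
            return vswitch, instance_type
    return None  # no combination currently has inventory
```

Under this assumption, list the preferred zone and the preferred instance type first, and add alternatives after them.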
Auto scale
Configure the following parameters when automatic scaling is enabled.
| Parameter | Description |
|---|---|
| Scaling Policy | Only Supply Priority Strategy is supported. Nodes are created in specified zones in the order of configured vSwitches. |
| Maximum number of single expansion nodes | Maximum number of nodes to add or remove in a single scaling cycle. Default: 99. Set this parameter to control compute node costs. |
| Prefix of Hostnames | Hostname prefix that distinguishes nodes in different queues. |
| Hostname Suffix | Hostname suffix that distinguishes nodes in different queues. |
| Instance RAM role | RAM role that grants nodes access to Alibaba Cloud services. Select a role from the dropdown. The default AliyunECSInstanceForEHPCRole role is recommended. |
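The per-cycle limit means that a large scale-out is spread over several scaling cycles. The following sketch shows how a gap between the current and the target node count would be split under that limit. It is an illustration with hypothetical names; the actual scaler may re-evaluate demand in every cycle.

```python
from typing import List


def scale_out_steps(current: int, target: int, max_per_cycle: int = 99) -> List[int]:
    """Split a scale-out into per-cycle batches no larger than max_per_cycle."""
    if max_per_cycle <= 0:
        raise ValueError("max_per_cycle must be positive")
    steps = []
    remaining = max(target - current, 0)  # nothing to add if already at target
    while remaining > 0:
        step = min(remaining, max_per_cycle)
        steps.append(step)
        remaining -= step
    return steps
```

A smaller per-cycle value slows growth, which gives you more time to notice and stop an unexpectedly expensive scale-out.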
Step 4: Configure shared file storage
On the Shared File Storage step, configure the file system shared across cluster nodes.
By default, the file system is mounted to the /home and /opt directories of the management node as shared storage. To mount a file system to another directory, click Add more storage.
You cannot mount different file system directories to /home and /opt.
| Parameter | Description |
|---|---|
| Type | File system type: General-purpose NAS, Extreme NAS, or Parallel file CPFS. |
| File System | ID and mount target of the file system. Make sure the file system has sufficient mount targets. |
| File System Directory | Directory of the file system to mount. |
| Mount Options | Mount protocol settings. |
Step 5: Configure software and addons
On the Software and Service Component step, install software and configure addons.
Click Add software. In the dialog box, select the HPC applications to install.
Click Add Service Component. In the dialog box, select and configure an addon.
Only the Login addon is supported. It is enabled by default for public cloud clusters to allow remote access over the internet.
The Login addon has the following parameters:
| Category | Parameter | Description |
|---|---|---|
| Custom parameters | SSH | Port number, protocol, and allowed CIDR blocks for SSH connections. |
| Custom parameters | VNC | Port number, protocol, and allowed CIDR blocks for VNC connections. |
| Custom parameters | Web Portal | Port number, protocol, and allowed CIDR blocks for client connections. |
| Addon deployment resources | EIP | EIP bound to the Login addon ECS instance for internet access. Select an existing EIP or create a new one. |
| Addon deployment resources | ECS Instance | Instance type for the ECS instance that runs the Login addon. |
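Before you enter allowed CIDR blocks for SSH, VNC, or the web portal, you can check that they cover the client addresses you expect. The following sketch uses Python's standard `ipaddress` module. The function name is hypothetical, and the check only mirrors the usual meaning of a CIDR allowlist, not the addon's implementation.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR blocks."""
    address = ipaddress.ip_address(client_ip)
    return any(
        # strict=False accepts blocks written with host bits set.
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in allowed_cidrs
    )
```

A block such as 0.0.0.0/0 matches every IPv4 client, so a narrower block that covers only your office or VPN range is usually the safer choice.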
Step 6: Confirm and create
On the Confirm configuration step, verify the cluster settings and specify a name and credentials.
| Parameter | Description |
|---|---|
| Cluster Name | Name displayed on the Cluster page for identification. |
| Login Credentials | Authentication method. Only Custom Password is supported. |
| Set Password and Repeat Password | Password for the root user to log on to all nodes in the cluster. |
Read the service agreement, confirm the fees, and click Create Cluster.
What's next
After the cluster is created, create a cluster user to submit jobs. See Manage users and Job overview.
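As an illustration of submitting a job to a queue, the following sketch builds a Slurm `sbatch` command for the default comp queue. It assumes that an E-HPC queue is exposed to Slurm as a partition with the same name, which this page does not state. The script name `job.sh` and the function name are hypothetical.

```python
from typing import List


def build_sbatch_command(script: str, queue: str = "comp", nodes: int = 1) -> List[str]:
    """Build an sbatch argument list that targets one queue.

    Assumption: the queue name maps directly to a Slurm partition name.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return ["sbatch", f"--partition={queue}", f"--nodes={nodes}", script]
```

On a node where `sbatch` is installed, a cluster user could pass the result to `subprocess.run(build_sbatch_command("job.sh"), check=True)`.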