Standard Edition clusters for the public cloud are deployed in a cloud environment. They consist of components such as ECS (Elastic Compute Service) instances and shared file systems. You are responsible for maintaining the availability of cluster services. This topic describes how to create a cluster for the public cloud in the console.
Background information
A Standard Edition E-HPC cluster for the public cloud consists of the following components:
-
Control plane node: An ECS instance that deploys the scheduler and domain account service to manage job scheduling and user information.
-
Compute nodes: Multiple ECS instances that can be managed in queues. These nodes support scaling and are used to run jobs.
-
Logon node: An ECS instance that deploys the Login component and is bound to an EIP (Elastic IP address). It is used for remote connections to the cluster.
-
Shared storage: Supports mounting NAS and CPFS file systems to share data, such as job data and software data.
-
When you create an E-HPC cluster, the system automatically creates resources such as ECS instances. This may incur fees. For more information, see Billing overview.
-
After you create an E-HPC cluster, do not adjust individual cluster nodes in the ECS console except in special cases. Perform operations in the E-HPC console.
For more information about E-HPC clusters, see Cluster overview.
Prerequisites
-
A service-linked role is created. The first time you log on to the E-HPC console, the system prompts you to create a service-linked role for E-HPC.
-
A VPC (Virtual Private Cloud) and a vSwitch are created. For more information, see Create a VPC and Create a vSwitch.
-
The NAS service is activated, and a NAS file system and a mount target are created. For more information, see Create a file system and Add a mount target.
Manual creation
Step 1: Go to the Create Cluster page
Go to the Create Cluster page.
Step 2: Configure the cluster
On the Cluster Configuration page, configure the cluster network, type, scheduler, and other settings.
-
Basic settings
Parameter
Description
Region
Select the region where the cluster resides.
Network and Zone
Select the VPC and vSwitch for the cluster.
Note: The nodes in the cluster use IP addresses from the selected vSwitch. Make sure that the number of available IP addresses in the vSwitch is greater than the number of required nodes.
Security Group
A security group controls the inbound and outbound traffic of the cluster and its nodes. A security group that the system creates automatically includes rules that allow the nodes in the cluster to communicate with each other.
Select the type of security group to create automatically as needed. For information about the differences between basic and advanced security groups, see Basic security groups and advanced security groups.
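The vSwitch capacity check described in the note for Network and Zone can be sketched as follows. This is a minimal illustration, not part of E-HPC: the function names are hypothetical, and the number of addresses that the platform reserves in each vSwitch (`RESERVED_ADDRESSES`) is an assumption that you should confirm in the VPC documentation.

```python
import ipaddress

# Assumption: addresses the platform reserves in every vSwitch.
# Confirm the real figure in the VPC documentation before relying on it.
RESERVED_ADDRESSES = 4


def usable_addresses(cidr: str, in_use: int = 0,
                     reserved: int = RESERVED_ADDRESSES) -> int:
    """Return how many addresses in the vSwitch CIDR block are still free."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - reserved - in_use, 0)


def vswitch_has_capacity(cidr: str, required_nodes: int, in_use: int = 0) -> bool:
    """The available addresses must be greater than the number of required nodes."""
    return usable_addresses(cidr, in_use) > required_nodes
```

For example, `vswitch_has_capacity("192.168.0.0/24", 100)` checks whether a /24 vSwitch with no addresses in use can hold 100 nodes under the assumed reservation.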
-
Cluster type
This type of cluster consists of one control plane node and multiple compute nodes. You can select the scheduler type for the cluster deployment and configure the control plane node.
Configuration Item
Description
Series
Select Standard Edition.
Deployment mode
Select Public Cloud Cluster.
Cluster Type
Select the scheduler type for the cluster. Schedulers commonly used in HPC scenarios are supported, including Slurm and OpenPBS.
Control Plane Node
The control plane node is an ECS instance that has the scheduler and domain account service deployed. Select the appropriate configurations for the control plane node based on your business scenario and cluster size.
-
Billing method
Select how to pay for the control plane node. For more information about billing, see Instance type billing.
-
Pay-as-you-go: This is a post-paid method. You are billed based on the actual usage duration. Spot instances are not supported.
-
Subscription: This is a pre-paid method. You are billed on a monthly or yearly basis.
-
Instance type
Select an appropriate instance type for the control plane node. The recommended instance types for control plane nodes vary based on the cluster size:
-
If the number of compute nodes is less than or equal to 100, we recommend an instance type with at least 16 vCPUs and 64 GiB of memory.
-
If the number of compute nodes is greater than 100 and less than or equal to 500, we recommend an instance type with at least 32 vCPUs and 128 GiB of memory.
-
If the number of compute nodes is greater than 500, we recommend an instance type with at least 64 vCPUs and 256 GiB of memory.
-
Image
After you select an image type, you can select an image. Different images correspond to different operating systems. The system deploys cluster nodes based on the image you select.
Note: Custom images have the following limits:
-
Custom images created from official Alibaba Cloud images and imported CentOS images are supported. When you import an image, select Run Detection After Import. Otherwise, the image cannot be detected in the E-HPC console.
-
You cannot use custom images created from existing E-HPC cluster nodes. Otherwise, an error occurs when you create compute nodes for the cluster.
-
Do not modify the yum source configuration of the operating system in a custom image. Otherwise, you cannot create or scale out a cluster.
-
The mount paths of a custom image (the paths where NAS file systems are mounted using the mount command) cannot include the /home and /opt directories.
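The mount path limit for custom images can be checked before you import an image. The following sketch assumes that "cannot include" means a mount path must not be /home or /opt or any path beneath them. That reading and the function name are assumptions, not documented E-HPC behavior.

```python
import posixpath

# Directories that the cluster mounts as shared storage by default.
FORBIDDEN_ROOTS = ("/home", "/opt")


def conflicts_with_shared_dirs(mount_path: str) -> bool:
    """Return True if the mount path is /home or /opt, or lies beneath them."""
    normalized = posixpath.normpath(mount_path)
    return any(
        normalized == root or normalized.startswith(root + "/")
        for root in FORBIDDEN_ROOTS
    )
```

Paths are normalized first, so `/data/../opt/app` is treated as `/opt/app`, while `/homework` is not mistaken for a path under `/home`.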
-
Storage
Select the system disk specifications for the control plane node and whether to attach a data disk. For more information about disk types and performance, see Cloud disk overview.
-
Hyper-threading
CPU hyper-threading is enabled by default. If your business scenario requires better performance, you can disable CPU hyper-threading.
Note: After the cluster is created, the instance RAM role AliyunECSInstanceForEHPCRole is automatically attached to the control plane node. This role supports core features such as automatic scaling. Do not detach or replace this role in the ECS console. To grant more API call permissions, see E-HPC service role.
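The instance type recommendations for the control plane node can be expressed as a small lookup. This is a sketch of the thresholds listed under Instance type. The function name is hypothetical, and the returned values are minimums, not specific instance types.

```python
from typing import Tuple


def recommended_control_plane_size(compute_nodes: int) -> Tuple[int, int]:
    """Return the recommended minimum (vCPUs, memory in GiB) for the control plane node."""
    if compute_nodes < 0:
        raise ValueError("compute_nodes must not be negative")
    if compute_nodes <= 100:
        return (16, 64)
    if compute_nodes <= 500:
        return (32, 128)
    return (64, 256)
```

For a cluster planned for 300 compute nodes, `recommended_control_plane_size(300)` returns `(32, 128)`.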
-
Custom options
Parameter
Description
Scheduler
Select the scheduler software to deploy based on the selected cluster type and the image configured for the control plane node.
Domain account
Select the domain account service to deploy for the cluster.
Domain name resolution
Keep the default value.
Cluster post-processing script
This script is used to process result data or perform other subsequent operations after a cluster compute job is complete.
Maximum number of nodes
The maximum number of nodes that the cluster can contain. This parameter and the maximum number of cores control the cluster size.
Maximum number of cores
The maximum number of cores that the cluster can contain. This parameter and the maximum number of nodes control the cluster size.
Cluster deletion protection
Set whether to enable deletion protection for the cluster. If you enable this feature, you must disable it before you can release the cluster. This prevents accidental cluster releases.
-
Resource group
Resource groups are used to manage resources in groups. For more information, see Resource groups. By default, the cluster belongs to the default resource group. You can change this as needed.
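The maximum number of nodes and the maximum number of cores in the custom options act together as a cap on cluster size. The following sketch shows one way to test a planned node list against both limits. The class and function names are hypothetical, and how E-HPC itself counts cores (for example, whether the control plane node is included) is not stated here, so treat the counting rule as an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterLimits:
    max_nodes: int  # Maximum number of nodes
    max_cores: int  # Maximum number of cores


def fits_within_limits(limits: ClusterLimits, node_cores: List[int]) -> bool:
    """Check a planned list of per-node core counts against both limits.

    Assumption: every planned node counts toward both limits.
    """
    return (
        len(node_cores) <= limits.max_nodes
        and sum(node_cores) <= limits.max_cores
    )
```

A plan can satisfy one limit and still fail the other: ten 64-core nodes fit a 500-node limit but exceed a 512-core limit.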
Step 3: Configure compute nodes and queues
On the Compute Nodes and Queues page, configure the queues.
Queues are used to manage compute nodes in groups. You can specify a queue when you run a job. By default, a cluster has one queue (the comp queue). You can click Add More Queues to add more queues. Configure the following information for a single queue:
-
Basic settings
Configuration Item
Description
Queue auto scaling
Select whether to enable Auto Scaling. If you enable it, you can then choose whether to enable Auto Scale-out and Auto Scale-in as needed.
After you enable automatic scaling, the system automatically adds or removes compute nodes based on the configuration and real-time workload.
Number of nodes in queue
Set the number of nodes in the queue.
-
If queue auto scaling is disabled, configure the initial number of compute nodes for the queue.
-
If queue auto scaling is enabled, configure the minimum and maximum number of nodes allowed in the queue.
Important: If you change the minimum number of nodes to a non-zero value, the queue retains that minimum number of nodes during a scale-in, even if the nodes are idle. Set the minimum number of nodes with caution to avoid resource waste and unnecessary costs from idle nodes remaining after a scale-in.
-
Select queue node configuration
If queue auto scaling is enabled, or if it is disabled but the initial number of nodes is not 0, you must configure the following information so that the system can create compute nodes.
Configuration item
Description
Node interconnect
Select the network connection method between nodes.
-
VPC Network: Nodes communicate with each other over the VPC network.
-
eRDMA Network: If the nodes use instance types that support Elastic RDMA Interface (ERI), they can communicate over the elastic Remote Direct Memory Access (eRDMA) network.
Note: Only some node instance types support ERI. For more information, see eRDMA overview and Enable eRDMA on an enterprise-level instance.
Use preset node pool
Select a created preset node pool. The system automatically selects IP addresses and hostnames from the unassigned preset nodes in the pool to create compute nodes.
Note: Using a preset node pool for scale-out allows for the rapid reuse of pre-allocated resources. For more information, see Use a preset node pool in a cluster.
vSwitch
Select the vSwitch to which the nodes belong. The system automatically assigns IP addresses to the nodes from the available vSwitch CIDR blocks.
Instance type group
Click Add Instance Type to select the instance types for the nodes.
If automatic scaling is disabled, you can add only one instance type. If automatic scaling is enabled, you can add multiple instance types.
Important: You can select multiple vSwitches and multiple instance types as backups to avoid instance creation failures due to inventory issues. When creating compute nodes, the system starts from the zone of the first vSwitch and tries to create instances in the order of the specified instance types until the required number of nodes is met. The instance types of the successfully created instances may vary with inventory changes.
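The fallback order across vSwitches and instance types can be pictured with a small simulation. This is only an illustration of the ordering described for the instance type group. The `inventory` mapping, the identifiers in it, and the function name are hypothetical, and the real service may interleave attempts differently.

```python
from itertools import product
from typing import Dict, List, Tuple


def plan_node_creation(
    vswitches: List[str],
    instance_types: List[str],
    required: int,
    inventory: Dict[Tuple[str, str], int],
) -> Tuple[List[Tuple[str, str, int]], int]:
    """Walk vSwitches in order and, within each, instance types in order.

    Returns the planned (vswitch, instance_type, count) entries and the
    number of nodes that could not be placed.
    """
    plan = []
    remaining = required
    # product() varies the instance type fastest, so every instance type is
    # tried in the first vSwitch before the second vSwitch is considered.
    for vswitch, instance_type in product(vswitches, instance_types):
        if remaining == 0:
            break
        take = min(inventory.get((vswitch, instance_type), 0), remaining)
        if take > 0:
            plan.append((vswitch, instance_type, take))
            remaining -= take
    return plan, remaining
```

Because the plan depends on the inventory at the time of the request, the same queue configuration can yield different instance types on different days.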
-
Auto scaling
Configuration item
Description
Scaling policy
Select a scaling policy. Currently, only the Supply-prioritized Policy is supported. With this policy, the system tries the zones in the configured vSwitch order and creates compute nodes that meet the specified instance type requirements.
Maximum number of nodes per scaling activity
The maximum number of nodes to add or remove in each scale-out or scale-in cycle. The default value is 0, which means there is no limit.
If you have cost requirements, you can set this value to ensure that the number of scaled-out nodes does not exceed your expectations.
Hostname prefix
The starting characters of the node hostname, used to mark and distinguish nodes.
Hostname suffix
The ending characters of the node hostname, used to mark and distinguish nodes.
Host RAM role
Attach a RAM role to the nodes so they can get permissions to access Alibaba Cloud services.
We recommend that you select the default role AliyunECSInstanceForEHPCRole created by the system.
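The queue limits interact as follows: the minimum and maximum number of nodes bound the queue size, and the maximum number of nodes per scaling activity bounds each step, with 0 meaning no limit. The sketch below illustrates only these constraints. The function name and the `desired` input (the node count the workload would call for) are hypothetical, and the actual scaling decision logic of E-HPC is not described in this topic.

```python
def next_node_count(
    current: int,
    desired: int,
    min_nodes: int,
    max_nodes: int,
    max_per_activity: int = 0,
) -> int:
    """Return the queue size after one scaling activity."""
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    # Keep the target inside the queue's minimum and maximum node counts.
    target = max(min_nodes, min(desired, max_nodes))
    delta = target - current
    # A value of 0 means that a single activity is not limited.
    if max_per_activity > 0:
        delta = max(-max_per_activity, min(delta, max_per_activity))
    return current + delta
```

With a minimum of 3 nodes, a queue of 10 idle nodes shrinks to 3 rather than 0, which is the cost effect that the note on the minimum number of nodes warns about.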
Step 4: Configure shared file storage
On the Shared File Storage page, complete the storage configuration.
By default, the /home and /opt directories of the control plane node have a file system mounted as a shared storage directory. If you want to mount a file system for other directories, click Add More Storage and complete the relevant configurations. The following file system information needs to be configured for a single directory:
The /home and /opt directories do not currently support mounting different file system directories.
| Parameter | Description |
| Type | Select the type of file system to mount. |
| File system | Select the file system ID and mount target to mount. Make sure the file system has available mount targets. |
| File system directory | Enter the file system directory to mount. |
| Mount option | Select the mount protocol. |
Step 5: Configure software and service components
On the Software and Service Components page, configure the software and service components.
-
Click Add Software. In the dialog box that appears, select the software to install. E-HPC provides software commonly used in the HPC industry. You can select as needed.
-
Click Add Service Component. In the dialog box that appears, select a service component and configure its parameters.
Note: Currently, only the Login component is supported.
Public cloud clusters are configured with the Login component by default for remote connection to the cluster over the public network. The component parameters are described as follows:
Configuration
Configuration Item
Description
Custom parameters for the Login component
SSH
Set the port number, protocol, and allowed IP CIDR block for connecting to the cluster through Secure Shell (SSH).
VNC
Set the port number, protocol, and allowed IP CIDR block for connecting to the cluster through VNC.
CLIENT
Set the port number, protocol, and allowed IP CIDR block for connecting to the cluster through a client.
Component deployment resources
EIP Instance
Bind an EIP to the ECS instance where the Login component is deployed so you can connect to the cluster over the public network. You can automatically create or select an existing EIP.
ECS Instance
Set the instance type for the ECS instance used to deploy the Login component.
Note: After the logon node is created, the instance RAM role AliyunECSInstanceForEHPCRole is automatically attached to it. This role allows features such as the Web Portal to function correctly. Do not detach or replace this role in the ECS console. To grant more API call permissions, see E-HPC service role.
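Each custom parameter of the Login component (SSH, VNC, CLIENT) combines a port number, a protocol, and an allowed IP CIDR block. The following sketch models such a rule and checks whether a connection would match it. The class, the function, and the example values are illustrative assumptions, not the format that the console or API uses.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRule:
    name: str          # For example "SSH", "VNC", or "CLIENT"
    port: int
    protocol: str
    allowed_cidr: str  # Source addresses that are allowed to connect


def is_allowed(rule: AccessRule, source_ip: str, port: int,
               protocol: str = "TCP") -> bool:
    """Return True if the connection matches the rule's port, protocol, and CIDR block."""
    network = ipaddress.ip_network(rule.allowed_cidr, strict=False)
    return (
        port == rule.port
        and protocol.upper() == rule.protocol.upper()
        and ipaddress.ip_address(source_ip) in network
    )
```

Restricting `allowed_cidr` to the address range of your office or VPN, rather than 0.0.0.0/0, limits who can reach the logon node over the public network.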
Step 6: Confirm configuration
On the Confirm Configuration page, confirm the configuration information and set the cluster name and logon credential.
| Configuration | Description |
| Cluster name | Enter a name. This name is displayed in the cluster list to help you find and identify the cluster. |
| Passwordless logon | Set whether the root user can log on without a password when switching from the control plane node to a compute node. Important: Enabling this feature configures a one-way passwordless logon from the control plane node to all compute nodes for the root user. It does not support passwordless logon from compute nodes to the control plane node. Proceed with caution. |
| Logon credential | Select the credential for logging on to the cluster. Currently, only Custom Password is supported. |
| Set password, Confirm password | Enter the password for logging on to the cluster. All nodes in the cluster use this password as the logon password for the root user by default. |
After completing the configuration, read the Terms of Service, confirm the fee information, and then click Create Cluster.
Template creation
E-HPC supports creating clusters quickly and in batches using templates. A template defines the basic parameters required to create a cluster. You can choose a cluster template provided by E-HPC or write your own custom template.
Use a public template to create a cluster
Go to the Cluster List page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
-
On the Cluster List page, click Cluster Template.
-
In the dialog box that appears, select the template to use and click Create Cluster for that template.

-
Confirm the configuration information and enter the cluster name and other details.
-
In the Configuration Summary section, the default configuration provided by the template is displayed. If you want to modify the configuration, click Edit and modify the corresponding configuration items.
-
In the Management Settings section, complete the configuration as prompted on the page.
-
Read the terms of service, confirm the fee information, and then click Create Cluster.
Use a custom template to create a cluster
-
Write a custom template locally.
Go to the Cluster List page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
-
On the Cluster List page, click Cluster Template.
-
In the dialog box that appears, click Import Local Template to upload the template file you edited locally.
-
In the Cluster Template Edit dialog box that appears, confirm that the custom template information is correct, and then click Confirm Template and Create.
-
On the Create Cluster page, confirm that the configuration information is correct, and then click Create Cluster.
References
After you create a cluster, create users to submit jobs. For more information, see User management and Job overview.