
Elastic High Performance Computing: Create a managed cluster on Alibaba Cloud Public Cloud

Last Updated:May 13, 2025

Elastic High Performance Computing (E-HPC) creates and maintains the management node of a managed cluster, so you only need to manage the compute nodes and can focus on your business. This topic describes how to create a managed cluster in the E-HPC console.

Background information

A managed cluster consists of the following parts:

  • A variable number of compute nodes. Each compute node is an ECS instance. Compute nodes are managed in scalable queues and are used to run jobs.

  • A logon node. An ECS instance works as the logon node. The Login addon is deployed on this node, and an Elastic IP Address (EIP) is bound to it so that you can remotely connect to the cluster.

  • A shared file system. A File Storage NAS file system or a Cloud Parallel File Storage (CPFS) file system is attached to and shared by the cluster to store job data and application data.

Important
  • When you create an E-HPC cluster, the system automatically creates resources such as ECS instances, which may incur fees. For more information, see Billing overview.

  • To manage the nodes of an E-HPC cluster, use the E-HPC console. Do not use the ECS console for this purpose unless necessary.

For more information about E-HPC clusters, see Overview.

Prerequisites

  • A service-linked role for E-HPC is created. The first time you log on to the E-HPC console, you are prompted to create a service-linked role for E-HPC.

  • A virtual private cloud (VPC) and a vSwitch are created. For more information, see Create and manage a VPC and Create a vSwitch.

  • Apsara File Storage NAS (NAS) is activated. A NAS file system and a mount target are created. For more information, see Create a file system and Create a mount target.

Procedure

Step 1: Go to the Create Cluster page

Go to the Create Cluster page.

Step 2: Configure the cluster

In the Cluster Configuration step, configure the cluster network, cluster type, and scheduler.

  • Basic Settings

    Region: The region in which you want to create the cluster.

    Network and Availability Zone: The VPC in which the cluster is deployed and the vSwitch to which the cluster belongs.

    Note: The nodes in the cluster use IP addresses from the vSwitch. Make sure that the number of available IP addresses in the vSwitch is greater than the number of cluster nodes.

    Security group: A security group manages the inbound and outbound traffic of the nodes in the cluster. If the system automatically creates a security group, it also creates rules that allow the nodes in the cluster to communicate with each other. Select a security group type based on your business requirements. For more information about the differences between basic and advanced security groups, see Basic security groups and advanced security groups.

  • Cluster type

    A managed cluster consists of a management node and multiple compute nodes. The management node is created and maintained by E-HPC, so you do not need to configure or manage it.

    Series: Select Managed Edition.

    Deployment Mode: Select Public cloud cluster.

    Cluster Type: Select the type of scheduler for the cluster. Only Slurm can be selected.

  • Custom options

    Scheduler: The scheduler software to deploy. Only Slurm 22 is supported.

    Domain Account: The domain account service to use in the cluster. Only NIS is supported for managed clusters.

    Domain name resolution: Use the default value.

    Maximum number of cluster nodes: The maximum number of nodes that the cluster can contain. This parameter and the Maximum number of cores in the cluster parameter jointly limit the cluster size.

    Maximum number of cores in the cluster: The maximum number of vCPUs that the compute nodes in the cluster can use. This parameter and the Maximum number of cluster nodes parameter jointly limit the cluster size.

    Cluster Deletion Protection: Specifies whether to enable deletion protection for the cluster. If this feature is enabled, the cluster cannot be released until you disable the feature. This helps prevent accidental release.

  • Resource Group

    Resources are managed in groups. For more information, see Resource groups. By default, an E-HPC cluster belongs to the default resource group. You can change this setting based on your business requirements.
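The node limit and the core limit described above constrain the cluster together: scaling stops as soon as either limit is reached. The following sketch illustrates how the two combine; the values are hypothetical, not console defaults:

```shell
# Hypothetical limits: at most 100 nodes and 480 vCPUs, with compute
# instances that provide 32 vCPUs each.
max_nodes=100
max_cores=480
vcpus_per_node=32

# The core limit alone would allow this many nodes.
node_cap_by_cores=$(( max_cores / vcpus_per_node ))

# The cluster can grow only while both limits hold, so the effective
# node cap is the smaller of the two constraints.
effective_nodes=$(( max_nodes < node_cap_by_cores ? max_nodes : node_cap_by_cores ))
echo "$effective_nodes"   # 15
```

In this sketch the core limit is the binding constraint, so the cluster tops out at 15 nodes even though 100 are nominally allowed.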

Step 3: Configure compute nodes and queues

In the Compute Node and Queue step, configure queues and compute nodes for the cluster.

Compute nodes are managed in queues. When you submit a job, you can specify to which queue you want to submit the job. Each cluster has a default queue named comp. You can click Add more queues to create more queues in the cluster. You need to configure the following parameters for each queue:

  • Basic Settings

    Automatic queue scaling: Specifies whether to enable automatic queue scaling. After you turn on Automatic queue scaling, you can further select Auto Grow and/or Auto Shrink based on your business requirements. After you enable this feature, the system automatically adds or removes compute nodes based on the configurations and the real-time load.

    Queue Compute Nodes: The initial, maximum, and minimum numbers of nodes in the queue.

    • If you do not enable Automatic queue scaling, configure the initial number of compute nodes in the queue.

    • If you enable Automatic queue scaling, configure the minimum and maximum numbers of compute nodes in the queue.

      Important: If you set the Minimal Nodes parameter to a non-zero value, the queue retains that number of nodes during cluster scale-in, even when the nodes are idle. Specify the Minimal Nodes parameter with caution to avoid wasted resources and unnecessary costs from idle nodes in the queue.

  • Select Queue Node Configuration

    If you enable Automatic queue scaling or set Initial Number of Nodes to a value larger than 0, you must configure the following parameters to enable the system to create compute nodes for the queue:

    Inter-node interconnection: The mode in which the nodes interconnect. Valid values:

    • VPCNetwork: The compute nodes communicate with each other over a virtual private cloud (VPC).

    • eRDMANetwork: If the instance types of the compute nodes support eRDMA interfaces (ERIs), the compute nodes communicate with each other over an eRDMA network.

      Note: Only specific instance types support ERIs. For more information, see Overview and Configure eRDMA on an enterprise-level instance.

    Use Preset Node Pool: Select an existing reserved node pool. The system automatically selects IP addresses and hostnames from the unassigned reserved nodes in the pool to create compute nodes.

    Note: A reserved node pool lets you quickly reuse pre-allocated resources when you scale out. For more information, see Use reserved node pools in clusters.

    Virtual Switch: The vSwitch for the nodes to use. The system automatically assigns IP addresses to the compute nodes from the available CIDR block of the vSwitch.

    Instance type Group: Click Add Instance and select an instance type in the panel that appears. If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.

    Important: You can select multiple vSwitches and instance types as alternatives in case instances fail to be created due to insufficient inventory. When the system creates a compute node, it tries the specified instance types and zones in sequence: it first attempts each specified instance type, in order, in the zone of the first vSwitch. As a result, the specifications of a created instance may vary based on inventory.

  • Auto Scale

    Scaling Policy: The scaling policy. Only Supply Priority Strategy is supported. With this policy, compute nodes that meet the specification requirements are created in the specified zones in the order of the configured vSwitches.

    Maximum number of single expansion nodes: The maximum number of nodes that can be added or removed in each scale-out or scale-in cycle. The default value 0 specifies that the number is unlimited. We recommend that you configure this parameter to control your compute node costs.

    Prefix of Hostnames: The hostname prefix for the compute nodes. The prefix is used to distinguish between the nodes of different queues.

    Hostname Suffix: The hostname suffix for the compute nodes. The suffix is used to distinguish between the nodes of different queues.

    Instance RAM role: The Resource Access Management (RAM) role to bind to the nodes so that the nodes can access Alibaba Cloud services. We recommend that you select the default role AliyunECSInstanceForEHPCRole.
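After the queues are configured and the cluster is created, jobs are submitted to a queue by name. As a minimal illustration, a Slurm batch script that targets the default comp queue might look like the following; the job name and output path are illustrative, not values from the console:

```shell
#!/bin/bash
#SBATCH --job-name=hello        # illustrative job name
#SBATCH --partition=comp        # submit to the cluster's default queue, comp
#SBATCH --nodes=1               # request a single compute node
#SBATCH --output=hello_%j.out   # write output to a file named after the job ID

# The job body runs on the allocated compute node.
msg="Hello from $(hostname)"
echo "$msg"
```

You would save this as, say, hello.sh, submit it with sbatch hello.sh, and check its status with squeue.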

Step 4: Configure the shared file storage

In the Shared File Storage step, configure the storage resource shared by the nodes in the cluster.

By default, the configured file system is mounted to the /home and /opt directories of the management node and used as shared storage. To mount a file system to another directory, click Add more storage. Configure the following parameters:

Note

You cannot mount different file system directories to /home and /opt.

Type: The type of the file system that you want to mount. Valid values:

  • General-purpose NAS

  • Extreme NAS

  • Parallel file CPFS

File System: The ID and mount point of the file system that you want to mount. Make sure that the file system has sufficient mount targets.

File System Directory: The directory of the file system that you want to mount.

Mount Configurations: The mount protocol.
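As an illustration, once the cluster is created, a General-purpose NAS file system mounted over NFS typically appears on a node as an /etc/fstab entry similar to the following. The mount target address is a placeholder, and the options shown (vers=3,nolock,proto=tcp,noresvport) are the ones commonly used for General-purpose NAS over NFSv3; E-HPC configures the actual mount for you.

```
# Placeholder mount target address; E-HPC performs the real mount.
example-mount-target.cn-hangzhou.nas.aliyuncs.com:/  /home  nfs  vers=3,nolock,proto=tcp,noresvport  0 0
```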

Step 5: Configure software and addons

In the Software and Service Component step, configure the software and addons that you want to install in the cluster.

  • Click Add software. In the Add software dialog box, select the software applications that you want to install in the cluster. E-HPC provides commonly used HPC software applications for you to install based on your business requirements.

  • Click Add Service Component. In the dialog box that appears, select a service component addon and configure the addon parameters.

    Note

    Currently, only the Login addon is supported.

    By default, a public cloud cluster is configured with the Login addon to enable remote connection from the Internet. The following table describes the addon parameters:

    Custom parameters of the Login addon:

    • SSH: Set the port number, protocol, and allowed CIDR blocks that are used when you connect to the cluster by using SSH.

    • VNC: Set the port number, protocol, and allowed CIDR blocks that are used when you connect to the cluster by using VNC.

    • CLIENT: Set the port number, protocol, and allowed CIDR blocks that are used when you connect to the cluster by using a client.

    Addon deployment resources:

    • EIP: Bind an EIP to the ECS instance on which the Login addon is deployed so that the cluster can be connected over the Internet. You can select an existing EIP or let the system create one for you.

    • ECS Instance: The instance type of the ECS instance on which the Login addon is deployed.
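Once the Login addon is deployed and the EIP is bound, any SSH client can reach the cluster. A sketch of an ~/.ssh/config entry follows; the host alias, address, and port are placeholders that you would replace with the EIP bound to the logon node and the SSH port configured in the Login addon:

```
# All values below are placeholders.
Host my-ehpc-cluster
    # The EIP bound to the logon node
    HostName 203.0.113.10
    # The SSH port configured in the Login addon
    Port 22
    User root
```

With this entry in place, ssh my-ehpc-cluster connects you to the logon node.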

Step 6: Confirm the configurations

In the Confirm configuration step, confirm the cluster configurations and specify a name and logon credentials for the cluster.

Cluster Name: The name of the cluster. The cluster name is displayed on the Cluster page to facilitate identification.

Cluster Password-free: Specifies whether to enable password-free logon for the root user from the management node to the compute nodes.

Important: If you enable this feature, one-way password-free logon is configured for the root user. The root user can log on from the management node to all compute nodes in the cluster without entering a password, but not the other way around. Exercise caution when you use this feature.

Logon Credentials: The credentials that are used to log on to the cluster. Only Custom Password is supported.

Set Password and Repeat Password: The password that is used to log on to the cluster. By default, the password is used by the root user to log on to all nodes in the cluster.

Read the service agreement, confirm the fees, and click Create Cluster.

References

After you create a cluster, you must create a cluster user to submit jobs to the cluster. For more information, see Manage users and Overview.