
Elastic High Performance Computing: Create a standard cluster on Alibaba Cloud Public Cloud

Last Updated: May 27, 2025

A standard Elastic High Performance Computing (E-HPC) cluster is deployed in a cloud environment and consists of Elastic Compute Service (ECS) instances and a shared file system. Users can manage the availability of a standard E-HPC cluster on their own. This topic describes how to create a standard cluster in the E-HPC console.

Background information

A standard E-HPC public cloud cluster consists of the following parts:

  • A management node. An ECS instance works as a management node, on which a scheduler and a domain account service are deployed to schedule jobs and manage user information.

  • A variable number of compute nodes. Each ECS instance works as a compute node. Compute nodes are managed in scalable queues and are used to run jobs.

  • A logon node. An ECS instance works as a logon node, on which the Login addon is deployed and to which an Elastic IP Address (EIP) is bound for remote connection to the cluster.

  • A shared file system. A File Storage NAS file system or a Cloud Parallel File Storage (CPFS) file system is attached to and shared by the cluster to store job data and application data.

Important
  • When you create an E-HPC cluster, the system automatically creates resources such as ECS instances, which may incur fees. For more information, see Billing overview.

  • If you want to manage the nodes of a created E-HPC cluster, use the E-HPC console. Do not use the ECS console for that purpose unless necessary.

For more information about E-HPC clusters, see Overview.

Prerequisites

  • A service-linked role for E-HPC is created. The first time you log on to the E-HPC console, you are prompted to create a service-linked role for E-HPC.

  • A virtual private cloud (VPC) and a vSwitch are created. For more information, see Create and manage a VPC and Create a vSwitch.

  • Apsara File Storage NAS (NAS) is activated. A NAS file system and a mount target are created. For more information, see Create a file system and Create a mount target.

Manually create a cluster

Step 1: Go to the Create Cluster page

Log on to the E-HPC console, select a region in the top navigation bar, and then go to the Create Cluster page.

Step 2: Configure the cluster

In the Cluster Configuration step, configure the network, type, and scheduler of the cluster.

  • Basic Settings

    Region: The region where you want to create a cluster.

    Network and Availability Zone: The VPC in which your cluster is deployed and the vSwitch to which the cluster belongs.

      Note: The nodes in the cluster use IP addresses from the vSwitch. Make sure that the number of available IP addresses in the vSwitch is greater than the number of cluster nodes.

    Security Group: A security group manages the inbound and outbound traffic of the nodes in a cluster. For a security group that is automatically created, the system also creates rules that allow the nodes in the cluster to communicate with each other. Select a security group type based on your business requirements. For more information about the differences between basic and advanced security groups, see Basic security groups and advanced security groups.

  • Select a cluster type

    A standard cluster consists of a management node and multiple compute nodes. You must select a scheduler type and configure a management node for the cluster.

    Series: Select Standard Edition.

    Deployment Mode: Select Public cloud cluster.

    Cluster Type: Select a scheduler type for the cluster. Mainstream HPC job schedulers, including Slurm and OpenPBS, are supported.

    Management node: The ECS instance on which the scheduler and the domain account service are deployed. Configure the following settings for the management node based on your business scenario and cluster size:

    • Billing method

      Specify how you want to pay for the management node. For more information, see Instance types.

      • Pay-as-you-go: You pay for the instance based on the actual usage after you use it. Preemptible instances are not supported for the management node.

      • Subscription: You pay a subscription fee by month or year before you use the instance.

    • Instance Type

      Select an instance type for the management node based on your business requirements. The following items list the management node instance specifications recommended for different cluster sizes:

      • Compute node quantity ≤ 100: 16 or more vCPUs and 64 GiB or larger memory

      • 100 < Compute node quantity ≤ 500: 32 or more vCPUs and 128 GiB or larger memory

      • Compute node quantity > 500: 64 or more vCPUs and 256 GiB or larger memory

    • Image

      After you select an image type, you can select the image that you want to use. Different images apply to different operating systems. The system deploys cluster nodes based on the image that you select.

      Note

      If you want to use custom images, take note of the following limits:

      • E-HPC supports CentOS images and custom images that are created based on Alibaba Cloud images. When you import an image, make sure that Check After Import is selected. Otherwise, the image cannot be identified in the E-HPC console.

      • You cannot use an existing image that was generated for another cluster. Otherwise, compute nodes may not run as expected after the current cluster is created.

      • You cannot modify the yum repository configurations of the operating system in a custom image. Otherwise, the cluster cannot be created or scaled out.

      • The mount directory of the custom image cannot be the /home directory or /opt directory.

    • Storage

      Specify the system disk specification of the management node and whether to attach a data disk to the management node. For more information about disk types and performance levels, see Disks.

    • Other Settings

      Specify whether to enable hyper-threading. By default, hyper-threading is enabled. If your business requires higher performance, you can disable hyper-threading.
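The management node sizing guidance above can be summarized as a small lookup. The following sketch only encodes the recommendations listed in this topic; the function name is illustrative and not part of any E-HPC API:

```python
def recommended_manager_spec(compute_nodes: int) -> tuple:
    """Return the recommended (min_vcpus, min_memory_gib) for the
    management node, based on the number of compute nodes."""
    if compute_nodes <= 100:
        return (16, 64)
    if compute_nodes <= 500:
        return (32, 128)
    return (64, 256)
```

For example, a cluster planned for 300 compute nodes falls into the middle tier, so a management node with at least 32 vCPUs and 128 GiB of memory is recommended.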

  • Custom Options

    Scheduler: Select the scheduler software to be deployed based on the selected cluster type and the configured management node image.

    Domain Account: Select the domain account service that you want to use in the cluster.

    Domain name resolution: Use the default value.

    Cluster post-processing script: Select a script that processes result data or performs subsequent operations after a computing job is complete.

    Maximum number of cluster nodes: Specify the maximum number of nodes that the cluster can contain. This parameter and the Maximum number of cores in the cluster parameter jointly limit the cluster size.

    Maximum number of cores in the cluster: Specify the maximum total number of vCPUs that can be used by the compute nodes in the cluster. This parameter and the Maximum number of cluster nodes parameter jointly limit the cluster size.

    Cluster Deletion Protection: Specify whether to enable deletion protection for the cluster. When this feature is enabled, the cluster cannot be released. To release the cluster, you must first disable this feature. This feature helps prevent accidental release.

  • Resource Group

    Resources are managed in groups. For more information, see Resource groups. By default, E-HPC clusters belong to the default resource group. You can modify the setting based on your business requirements.
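The two cluster size limits described above act jointly: a scale-out request must stay within both the node cap and the vCPU cap. The following is a minimal sketch of that check; the function and parameter names are illustrative, not part of any E-HPC API:

```python
def within_cluster_limits(current_nodes, current_vcpus, add_nodes,
                          vcpus_per_node, max_nodes=1000, max_vcpus=100000):
    """Return True if adding the requested compute nodes keeps the
    cluster within both the node cap and the vCPU cap."""
    return (current_nodes + add_nodes <= max_nodes and
            current_vcpus + add_nodes * vcpus_per_node <= max_vcpus)
```

For example, with the caps above, adding 20 nodes of 64 vCPUs each to a cluster that already contains 990 nodes is rejected by the node cap even if the vCPU cap is still satisfied.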

Step 3: Configure compute nodes and queues

In the Compute Node and Queue step, configure queues and compute nodes for the cluster.

Compute nodes are managed in queues. When you submit a job, you can specify to which queue you want to submit the job. Each cluster has a default queue named comp. You can click Add more queues to create more queues in the cluster. You need to configure the following parameters for each queue:

  • Basic Settings

    Automatic queue scaling: Specify whether to enable automatic queue scaling. After you turn on Automatic queue scaling, you can select Auto Grow and/or Auto Shrink based on your business requirements. When automatic queue scaling is enabled, the system automatically adds or removes compute nodes based on the configurations and the real-time load.

    Queue Compute Nodes: Set the initial, minimum, and maximum numbers of nodes in the queue.

    • If you do not enable Automatic queue scaling, configure the initial number of compute nodes in the queue.

    • If you enable Automatic queue scaling, configure the minimum and maximum numbers of compute nodes in the queue.

      Important: If you set the Minimal Nodes parameter to a non-zero value, the queue retains that number of nodes during cluster scale-in, even when the nodes are idle. Set the Minimal Nodes parameter with caution to avoid unnecessary costs caused by idle nodes in the queue.

  • Select Queue Node Configuration

    If you enable Automatic queue scaling or set Initial Number of Nodes to a value larger than 0, you must configure the following parameters to enable the system to create compute nodes for the queue:

    Inter-node interconnection: Select a mode in which the nodes interconnect. Valid values:

    • VPCNetwork: The compute nodes communicate with each other over virtual private clouds (VPCs).

    • eRDMANetwork: If the instance types of the compute nodes support elastic RDMA interfaces (ERIs), the compute nodes communicate with each other over eRDMA networks.

      Note: Only compute nodes of specific instance types support ERIs. For more information, see Overview and Configure eRDMA on an enterprise-level instance.

    Use Preset Node Pool: Select a created reserved node pool. The system automatically selects IP addresses and hostnames from the unassigned reserved nodes in the pool to create compute nodes.

      Note: A reserved node pool allows you to quickly reuse pre-allocated resources during scale-out. For more information, see Use reserved node pools in clusters.

    Virtual Switch: Specify a vSwitch for the nodes. The system automatically assigns IP addresses to the compute nodes from the available CIDR block of the vSwitch.

    Instance type Group: Click Add Instance and select an instance type in the panel that appears. If you do not enable Automatic queue scaling, you can add only one instance type. If you enable Automatic queue scaling, you can add multiple instance types.

      Important: You can select multiple vSwitches and instance types as alternatives in case instances fail to be created due to insufficient inventory. When the system creates a compute node, it first tries the specified instance types in order in the zone of the first vSwitch, and then moves on to the next vSwitch. As a result, the specifications of a created instance may vary based on the inventory.
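The fallback order described above is a nested iteration: vSwitch (zone) first, then instance type. The following sketch makes the order explicit; the function is illustrative, not part of any E-HPC API:

```python
from itertools import product

def creation_attempt_order(vswitches, instance_types):
    """Return (vswitch, instance_type) pairs in the order the system
    tries them: every instance type in the first vSwitch's zone,
    then the next vSwitch, and so on."""
    return list(product(vswitches, instance_types))
```

For example, with vSwitches ["vsw-a", "vsw-b"] and instance types ["ecs.c7.large", "ecs.c7.xlarge"], both instance types are tried in the zone of vsw-a before any attempt is made in the zone of vsw-b.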

  • Auto Scale

    Scaling Policy: Select a scaling policy. Currently, only Supply Priority Strategy is supported. With this policy, compute nodes that meet the specification requirements are created in the specified zones in the order of the configured vSwitches.

    Maximum number of single expansion nodes: Specify the maximum number of nodes that can be added or removed in each scale-out or scale-in cycle. The default value 0 specifies that the number is unlimited. We recommend that you configure this parameter to control your compute node costs.

    Prefix of Hostnames: Specify the hostname prefix of the compute nodes. The prefix is used to distinguish the nodes of different queues.

    Hostname Suffix: Specify the hostname suffix of the compute nodes. The suffix is used to distinguish the nodes of different queues.

    Instance RAM role: Bind a Resource Access Management (RAM) role to the nodes so that the nodes can access other Alibaba Cloud services. We recommend that you select the default role AliyunECSInstanceForEHPCRole.

Step 4: Configure the shared file storage

In the Shared File Storage step, configure the storage resource shared by the nodes in the cluster.

By default, the configured file system is mounted to the /home and /opt directories of the management node for use as shared storage directories. If you want to mount a file system to another directory, click Add more storage. The parameters are described as follows:

Note

You cannot mount different file system directories to /home and /opt.

Type: The type of the file system that you want to mount. Valid values:

  • General-purpose NAS

  • Extreme NAS

  • Parallel file CPFS

File System: The ID and mount target of the file system that you want to mount. Make sure that the file system has sufficient available mount targets.

File System Directory: The directory of the file system that you want to mount.

Mount Configurations: The mount protocol.

Step 5: Configure software and addons

In the Software and Service Component step, configure the software and addons that you want to install in the cluster.

  • Click Add software. In the Add software dialog box, select the software applications that you want to install in the cluster. E-HPC provides commonly used HPC software applications for you to install based on your business requirements.

  • Click Add Service Component. In the dialog box that appears, select a service component addon and configure the addon parameters.

    Note

    Currently, only the Login addon is supported.

    By default, a public cloud cluster is configured with the Login addon to enable remote connection from the Internet. The addon parameters are described as follows:

    Custom parameters of the Login addon:

    • SSH: Set the port number, protocol, and allowed CIDR blocks that are used when you connect to the cluster by using SSH.

    • VNC: Set the port number, protocol, and allowed CIDR blocks that are used when you connect to the cluster by using VNC.

    • CLIENT: Set the port number, protocol, and allowed CIDR blocks that are used when you connect to the cluster by using a client.

    Addon deployment resources:

    • EIP: Bind an EIP to the ECS instance on which the Login addon is deployed so that the cluster can be accessed over the Internet. You can select an existing EIP or let the system create one for you.

    • ECS Instance: Specify the instance type of the ECS instance on which the Login addon is deployed.

Step 6: Confirm the configurations

In the Confirm configuration step, confirm the cluster configurations and specify a name and logon credentials for the cluster.

Cluster Name: The name of the cluster. The cluster name is displayed on the Cluster page to facilitate identification.

Cluster Password-free: Specifies whether to enable password-free logon from the management node to the compute nodes for the root user.

  Important: If you enable this feature, one-way password-free logon is configured for the root user. The root user can log on from the management node to all compute nodes in the cluster without entering a password, but not the other way around. Exercise caution when you use this feature.

Logon Credentials: The credentials that are used to log on to the cluster. Only Custom Password is supported.

Set Password and Repeat Password: The password that is used to log on to the cluster. By default, the root user uses this password to log on to all nodes in the cluster.

Read the service agreement, confirm the fees, and click Create Cluster.

Create a cluster by using a template

E-HPC allows you to quickly create multiple clusters by using a template. The template defines the basic parameters that are required to create a cluster. You can use a template provided by E-HPC or create your own template.

Create a cluster by using a built-in template

  1. Go to the Cluster List page.

    1. Log on to the E-HPC console.

    2. In the left part of the top navigation bar, select a region.

    3. In the left-side navigation pane, click Cluster.

  2. On the Cluster page, click Cluster Templates.

  3. In the dialog box that appears, select a template and click Create Cluster.


  4. Confirm the configurations and enter information such as the cluster name.

    • The Configuration Summary section shows the configurations specified in the template. You can click Edit next to specific configurations to modify the configurations based on your business requirements.

    • In the Manage Settings section, supplement information as prompted.

  5. Read the service agreement, confirm the fees, and click Create Cluster.

Create a cluster by using a custom template

  1. Write a custom local template.

    This topic provides an example. You must change the parameter values based on your actual scenario.

    ### Basic cluster settings
    Region: cn-hangzhou                            # Optional. The region in which you want to create a cluster. If you leave this parameter empty, the current region is used by default.
    ClusterName: "TestClusterName"                 # Optional. The cluster name. If you leave this parameter empty, the system generates a name based on the cluster type, for example, SLURM-Region-DATESTAMP.
    ClusterDescription: "XXXXX"                    # Optional. The cluster description.
    ClusterCategory: "Standard"                    # Required. The cluster edition. Valid values: ['Standard', 'Serverless', 'SuperComputing'].
    ClusterVpcId: ""                               # Optional. The VPC ID. If you leave this parameter empty, a valid VPC ID in the current region is used by default.
    ClusterVSwitchId: ""                           # Optional. The vSwitch ID of the management node of the cluster. If you leave this parameter empty, a valid vSwitch ID in the current VPC is used by default.
    IsEnterpriseSecurityGroup: true                # Optional. Specifies whether to use an advanced (enterprise-class) security group. Default value: false, which specifies that a basic security group is used. This parameter takes effect only when SecurityGroupId is left empty.
    SecurityGroupId: sg-bp1gje9ip78z7v6zy203       # Optional. The security group ID. This parameter is left empty by default, which specifies that a security group is automatically created. 
    ClusterCustomConfiguration:                    # Optional. The custom PostInstall script of the cluster.  
      Script: oss://                               # The URL of the Object Storage Service (OSS) bucket that stores the script.
      Args: arg1 arg2                              # The parameters passed in the script.
    MaxCount: 1000                                 # Optional. The maximum number of compute nodes that the cluster can contain. Default value: 1000.
    MaxCoreCount: 100000                           # Optional. The maximum combined number of vCPUs that can be used by all the compute nodes in the cluster. Default value: 10000.
    DeletionProtection: true                       # Optional. Specifies whether to enable deletion protection for the cluster. Default value: true, which specifies that deletion protection is enabled for the cluster.
    ResourceGroupId: rg-acfm2xumdifd3ri            # Optional. The ID of the resource group to which the cluster belongs. If you leave this parameter empty, the default resource group is used.
    Tags:                                          # Optional. The cluster tags.  
      - Key: String
        Value: String
    
    ### Management service settings
    Manager:                                     # The management node configurations.
      Scheduler:                                 # The scheduler configurations.
        Type: "SLURM"                            # Optional. The scheduler type. Default value: SLURM.
        Version: "22"                            # Optional. The scheduler version. Default value: 22.
      DirectoryService:                          # The domain account service configurations.
        Type: "NIS"                              # Optional. The domain account service type. Default value: NIS.
        Version: "x.x.x"                         # Optional. The domain account service version.
      DNS:                                       # The domain name resolution configurations.
        Type: "NIS"                              # Optional. The domain name resolution service type. Default value: NIS.
        Version: "x.x.x"                         # Optional. The domain name resolution service version.
      ManagerNode:                               # The management node instance configurations.
        InstanceType: "ecs.c7.xlarge"            # Optional. The instance type. This parameter is required for non-managed clusters.
        ImageId: "m-xxxxxx"                      # Optional. The image to be used by the instance. This parameter is required for non-managed clusters.
        InstanceChargeType: "PostPaid"           # Optional. The billing method of the instance. Valid values: PostPaid and Subscription. Default value: PostPaid.
        PeriodUnit: "Month"                      # Optional. The unit of subscription duration of the instance. This parameter is required only when InstanceChargeType is set to Subscription.
        Period: 1                                # Optional. The subscription duration of the instance. This parameter is required only when InstanceChargeType is set to Subscription.
        AutoRenew: false                         # Optional. Specifies whether to enable auto-renewal for the instance. This parameter is required only when InstanceChargeType is set to Subscription.
        AutoRenewPeriod: 1                       # Optional. The auto-renewal subscription duration of the instance. This parameter is required only when InstanceChargeType is set to Subscription.
        SpotStrategy: "SpotWithPriceLimit"       # Optional. The purchase mode of the preemptible instance. This parameter is invalid for the management node instance.
        SpotPriceLimit: 0.5                      # The price upper limit for the preemptible instance. This parameter is invalid for the management node instance.
        Duration: 1                              # The period for which the preemptible instance is retained. This parameter is invalid for the management node instance.
        SystemDisk:                              # Optional. The system disk configurations.
          Category: "cloud_essd"                 # Optional. The disk type. Default value: cloud_essd.
          Size: 40                               # Optional. The disk size. Default value: 40.
          Level: "PL0"                           # Optional. The disk PL. Default value: PL0.
        DataDisks:                               # Optional. The data disk configurations.
          - Category: "cloud_essd"               # Optional. The disk type. Default value: cloud_essd.
            Size: 40                             # Optional. The disk size. Default value: 40.
            Level: "PL0"                         # Optional. The disk PL. Default value: PL0.
            DeleteWithInstance: false            # Optional. Specifies whether to delete the disk along with the instance. Default value: false.
        EnableHT: false                          # Optional. Specifies whether to enable hyper-threading. Default value: true.
    
    ### Queue and node configurations                      # Optional. The queue configurations.
    Queues:                                      # 
      - Name: workq                              # Optional. The queue name.
        EnableScaleOut: false                     # Optional. Specifies whether to enable automatic queue scale-out. Default value: false.
        EnableScaleIn: false                      # Optional. Specifies whether to enable automatic queue scale-in. Default value: false.
        MinCount: 0                              # Optional. The minimum number of compute nodes that the queue must contain.
        MaxCount: 500                            # Optional. The maximum number of compute nodes that the queue can contain.
        InitialCount: 0                          # Optional. The initial number of compute nodes in the queue.
        InterConnect: erdma                      # Optional. The type of network between nodes in the queue. Valid values: VPC and eRDMA.
        VSwitchIds:                              # Optional. The vSwitches used in the queue.
          - "vsw-xxxxxxx"
          - "vsw-xxxxxxx"        
        ComputeNodes:                            # Optional. The compute node configurations.
          InstanceType: "ecs.c7.xlarge"          # Optional. The instance type. This parameter is required for non-managed clusters.
          ImageId: "m-xxxxxx"                    # Optional. The image to be used by the instance. This parameter is required for non-managed clusters.
          InstanceChargeType: "PostPaid"         # Optional. The billing method of the instance. Valid values: PostPaid and Subscription. Default value: PostPaid.
          PeriodUnit: "Month"                    # Optional. The unit of subscription duration of the instance. This parameter is required only when InstanceChargeType is set to Subscription.
          Period: 1                              # Optional. The subscription duration of the instance. This parameter is required only when InstanceChargeType is set to Subscription.
          AutoRenew: false                       # Optional. Specifies whether to enable auto-renewal for the instance. This parameter is required only when InstanceChargeType is set to Subscription.
          AutoRenewPeriod: 1                     # Optional. The auto-renewal subscription duration of the instance. This parameter is required only when InstanceChargeType is set to Subscription.
          SpotStrategy: "SpotWithPriceLimit"     # Optional. The purchase mode of the preemptible instance.
          SpotPriceLimit: 0.5                    # The price upper limit for the preemptible instance. This parameter is valid only when SpotStrategy is set to SpotWithPriceLimit.
          Duration: 1                            # The period for which the preemptible instance is retained.
          SystemDisk:                            # Optional. The system disk configurations.
            Category: "cloud_essd"               # Optional. The disk type. Default value: cloud_essd.
            Size: 40                             # Optional. The disk size. Default value: 40.
            Level: "PL0"                         # Optional. The disk PL. Default value: PL0.
          DataDisks:                             # Optional. The data disk configurations.
            - Category: "cloud_essd"             # Optional. The disk type. Default value: cloud_essd.
              Size: 40                           # Optional. The disk size. Default value: 40.
              Level: "PL0"                       # Optional. The disk PL. Default value: PL0.
              DeleteWithInstance: false          # Optional. Specifies whether to delete the disk along with the instance. Default value: false.
          EnableHT: false                        # Optional. Specifies whether to enable hyper-threading. Default value: true.
        AllocationStrategy: "PriorityInstanceType"   # Optional. The auto scaling policy, which may be a supply-prioritized or cost-prioritized one.
        RamRole: "xxxxxx"                            # Optional. The RAM role assumed by the node.
        HostNamePrefix: "xxxxx"                        # Optional. The hostname prefix of the node.
        HostNameSuffix: "xxxxx"                        # Optional. The hostname suffix of the node.
        KeepAliveNodes:                                # Optional. The keep-alive nodes.
          - compute000
          - compute001
          - compute002
    ### Shared storage configurations
    SharedStorage:
      - MountDirectory: "/home"                    # Optional. The mount directory of the cluster.
        FileSystemId: "xxxx"                       # Optional. The file system ID.
        NASDirectory: "/"                          # Optional. The file directory of the file system.
        MountTargetDomain: "xxxxxx"                # Optional. The mount targets.
        ProtocolType: "NFS"                        # Optional. The storage protocol to be used.
        MountOptions: "xxxxx"                      # Optional. The mount options.
    ### Software applications
    AdditionalPackages:                            # Optional. The software applications to be installed in the cluster.
      - Name: "LAMMPS"                             # The application name.
        Version: "xxxx"                            # The application version.
      - Name: "Gromacs"
        Version: "xxx"
    ### Cluster addons
    Addons:
      - Name: "LoginNode"                          # The addon name.    
        Version: "xxxxxx"                          # The addon version.
        ServicesSpec: "JSON String"                # The custom parameters for the addon.
        ResourcesSpec: "JSON String"               # The custom resources for the addon.
  2. Go to the Cluster List page.

    1. Log on to the E-HPC console.

    2. In the left part of the top navigation bar, select a region.

    3. In the left-side navigation pane, click Cluster.

  3. On the Cluster page, click Cluster Templates.

  4. In the dialog box that appears, click Import Local Template to upload the prepared template file.

  5. In the Edit Cluster Template dialog box, confirm the information and click Confirm template and create.

  6. On the Create Cluster page, confirm the configurations and click Create Cluster.
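As a starting point, the full template in step 1 can be pared down to its core sections. The following minimal template is a sketch that uses only fields shown in this topic; all IDs are placeholders, and whether each omitted field may actually be left out depends on your scenario:

```yaml
### Minimal custom template: a Standard cluster with one auto-scaling queue.
### All values are illustrative; replace the placeholder IDs with your own resources.
ClusterCategory: "Standard"              # Required. The cluster edition.
ClusterName: "demo-cluster"
Manager:
  Scheduler:
    Type: "SLURM"
  ManagerNode:
    InstanceType: "ecs.c7.xlarge"
    ImageId: "m-xxxxxx"
Queues:
  - Name: workq
    EnableScaleOut: true
    EnableScaleIn: true
    MinCount: 0
    MaxCount: 10
    ComputeNodes:
      InstanceType: "ecs.c7.xlarge"
      ImageId: "m-xxxxxx"
SharedStorage:
  - MountDirectory: "/home"
    FileSystemId: "xxxx"
    MountTargetDomain: "xxxxxx"
    ProtocolType: "NFS"
```

Omitted optional fields fall back to their documented defaults, for example PostPaid billing for all nodes.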

Reference

After you create a cluster, you must create a cluster user to submit jobs to the cluster. For more information, see Manage users and Overview.