Elastic High Performance Computing: Overview

Last Updated: Nov 30, 2023

An Elastic High Performance Computing (E-HPC) cluster is a group of Elastic Compute Service (ECS) instances that deliver high-performance computing capabilities. Compared with typical ECS instances, E-HPC clusters offer higher performance, scalability, reliability, and availability. This topic describes the terms and features of an E-HPC cluster.

Nodes

Each node in an E-HPC cluster is an ECS instance. The nodes are classified into logon nodes, management nodes, and compute nodes. The following table describes each type of node and its role in an E-HPC cluster.

| Node | Description |
| --- | --- |
| Logon node | Used to log on to the E-HPC cluster. You can also debug, compile, and install software, and submit jobs through the logon node. |
| Management node | Used to manage the cluster. It hosts two services: the scheduling service, which runs a scheduler such as PBS or Slurm and processes and schedules jobs, and the domain account service, which manages the user information of the cluster. |
| Compute node | Used to run high-performance computing jobs. |

Important

The management node schedules jobs and resolves domain accounts. To ensure business continuity, we recommend that you do not use management nodes to compile software or to upload or download compressed data.

We recommend that you choose the instance specifications of management nodes and schedule jobs based on the number of compute nodes. The following table lists the recommended instance specifications and quantity of jobs.

| Number of compute nodes | Specifications of management nodes | Job quantity |
| --- | --- | --- |
| 100 or fewer | 16 or more vCPUs; 64 GiB or more of memory | Fewer than 5,000 queued jobs; fewer than 10,000 uncompleted jobs |
| 500 or fewer | 32 or more vCPUs; 128 GiB or more of memory | Fewer than 10,000 queued jobs; fewer than 20,000 uncompleted jobs |
| More than 500 | 64 or more vCPUs; 256 GiB or more of memory | Fewer than 10,000 queued jobs; fewer than 20,000 uncompleted jobs |
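The sizing recommendations above can be expressed as a small lookup. The following Python sketch is illustrative only: the class and function names are hypothetical and are not part of any E-HPC API, and the thresholds are copied directly from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagementNodeRecommendation:
    min_vcpus: int
    min_memory_gib: int
    queued_jobs_limit: int       # keep queued jobs below this number
    uncompleted_jobs_limit: int  # keep uncompleted jobs below this number


# (largest compute-node count covered by the tier, recommendation).
# None means the tier has no upper bound.
_TIERS = [
    (100, ManagementNodeRecommendation(16, 64, 5_000, 10_000)),
    (500, ManagementNodeRecommendation(32, 128, 10_000, 20_000)),
    (None, ManagementNodeRecommendation(64, 256, 10_000, 20_000)),
]


def recommend_management_node(compute_nodes: int) -> ManagementNodeRecommendation:
    """Return the recommended tier for the given number of compute nodes."""
    if compute_nodes < 0:
        raise ValueError("compute_nodes must be non-negative")
    for upper_bound, recommendation in _TIERS:
        if upper_bound is None or compute_nodes <= upper_bound:
            return recommendation
    raise AssertionError("unreachable: the last tier has no upper bound")
```

For example, `recommend_management_node(250)` returns the tier with at least 32 vCPUs and 128 GiB of memory.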

Images

An image includes the operating system and configuration data for your business. It is used to create the ECS instances that make up an E-HPC cluster. E-HPC supports the following types of images:

  • Public images: images provided by Alibaba Cloud.

  • Custom images: images created from ECS instances or snapshots, or images imported from your computer.

  • Shared images: images shared by other Alibaba Cloud accounts.

  • Alibaba Cloud Marketplace images: images provided by independent software vendors (ISVs) that are licensed by Alibaba Cloud Marketplace.

  • Community images: images that are released on the image platform of Alibaba Cloud Community.

Important
  • The image types that you can select vary based on the specified region, the instance type specified for the node, and whether the current Alibaba Cloud account has available image resources. All available image types are displayed in the console.

  • The supported schedulers, domain account services, shared storage, and software vary based on the image.

For more information, see Image overview.

Scheduler

Schedulers are used to schedule jobs on a cluster. The following table describes the schedulers that are supported by E-HPC:

Type

Scheduler

Displayed in the console

PBS

PBS Pro19

pbs19

PBS Pro18

pbs

Note

The version of the scheduler software to install depends on the image that you use.

OpenPBS 20

OpenPBS 22

Slurm

Slurm 22

slurm22

Slurm 20

slurm20

Slurm 19

slurm19

Slurm 17

slurm

GridEngine

Open Grid Scheduler (SGE)

opengridscheduler

Others

Deadline

deadline

Note

The supported schedulers vary based on images. For more information, see the "Schedulers, domain account services, and shared storage supported by images" section in this topic.
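If you handle the console identifiers in scripts, the scheduler table above can be kept as a simple mapping. This Python sketch is only an illustration: the names `CONSOLE_SCHEDULERS` and `describe_scheduler` are hypothetical, and the OpenPBS entries are left out because the table does not make their console identifiers clear.

```python
from typing import Optional, Tuple

# Console identifier -> (scheduler type, scheduler), copied from the table.
# Per the note in the table, the scheduler version that is actually
# installed for "pbs" depends on the image that you use.
CONSOLE_SCHEDULERS = {
    "pbs19": ("PBS", "PBS Pro19"),
    "pbs": ("PBS", "PBS Pro18"),
    "slurm22": ("Slurm", "Slurm 22"),
    "slurm20": ("Slurm", "Slurm 20"),
    "slurm19": ("Slurm", "Slurm 19"),
    "slurm": ("Slurm", "Slurm 17"),
    "opengridscheduler": ("GridEngine", "Open Grid Scheduler (SGE)"),
    "deadline": ("Others", "Deadline"),
}


def describe_scheduler(console_name: str) -> Optional[Tuple[str, str]]:
    """Return (type, scheduler) for a console identifier, or None if unknown."""
    return CONSOLE_SCHEDULERS.get(console_name.strip().lower())
```

Returning `None` for unknown identifiers, instead of raising an error, lets a caller decide how to treat schedulers that this sketch does not list.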

Domain account services

The domain account service is used to manage cluster users. E-HPC supports the following domain account services:

  • Network Information Service (NIS) provides centralized identity management. You can create a user on the NIS server. After a new node is added to NIS, you can use the user to log on to the node without the need to create a user on each node.

  • Lightweight Directory Access Protocol (LDAP) is used to authenticate E-HPC users. You can authorize and group users by using LDAP to simplify permission management within your organization.

Note

The supported domain account services vary based on images. For more information, see the "Schedulers, domain account services, and shared storage supported by images" section in this topic.

Shared storage

The user data, scheduler information, and shared job data of E-HPC clusters are stored in the file system for shared access by all nodes in the cluster. E-HPC supports the following types of file systems:

  • Apsara File Storage NAS: includes General-purpose NAS and Extreme NAS.

  • Cloud Parallel File Storage (CPFS) file system: supports CPFS-NFS and CPFS-POSIX mounting methods.

  • Others: file storage that is not hosted by Alibaba Cloud, such as your self-managed NAS file system.

Note

The supported storage varies based on images. For more information, see the "Schedulers, domain account services, and shared storage supported by images" section in this topic.

Schedulers, domain account services, and shared storage supported by images

The following table describes the schedulers, domain account services, and shared storage that each image supports.

Note
  • If you create an E-HPC cluster in the E-HPC console, the supported image types, schedulers, and domain account services are displayed in the console.

  • For images that are labeled Custom in the scheduler, domain account service, or shared storage column of the table, the corresponding component is not provided with the image. You must install it yourself.

  • CentOS 6 and CentOS 8 have reached their end of life (EOL), which means that the Linux community no longer maintains these operating system versions. For security and reliability reasons, we recommend that you switch to other operating systems. For more information, see Change the CentOS 6 source address and Change CentOS 8 repository addresses.

| Public image | Scheduler | Domain account service | Shared storage |
| --- | --- | --- | --- |
| CentOS 7.2 64-bit, CentOS 7.3 64-bit, CentOS 7.4 64-bit, CentOS 7.5 64-bit, CentOS 7.6 64-bit, CentOS 7.8 64-bit, CentOS 7.9 64-bit, CentOS 7.9 64-bit (UEFI) | PBS Pro18, PBS Pro19, Slurm 17, Slurm 19, Slurm 20, Slurm 22, Open Grid Scheduler (SGE), Deadline | NIS, LDAP | General-purpose NAS, Extreme NAS, CPFS-NFS, CPFS-POSIX |
| CentOS 8.0 64-bit | OpenPBS 20 | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| CentOS 6.9 64-bit | PBS Pro18, Deadline | NIS, LDAP | General-purpose NAS, Extreme NAS |
| CentOS 6.10 64-bit | Custom | Custom | General-purpose NAS, Extreme NAS |
| Alibaba Cloud Linux 2.1903 LTS 64-bit | PBS Pro18 | NIS, LDAP | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Alibaba Cloud Linux 3.2104 LTS 64-bit | Open Grid Scheduler (SGE) | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Alibaba Cloud Linux 3.2104 LTS 64-bit for ARM | Open Grid Scheduler (SGE) | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Ubuntu 20.04 64-bit | Slurm 22 | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Ubuntu 20.04 64-bit for ARM | Slurm 22 | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Windows Server 2022, Windows Server 2019, Windows Server 2016, Windows Server 2012 R2, Windows Server 2008 R2 | Custom | Custom | Custom |
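A compatibility table like the one above lends itself to a quick check before you create a cluster. The following Python sketch encodes a few rows of the table and reports every mismatch for a planned configuration. All names in it (`ImageSupport`, `SUPPORT`, `check_cluster_config`) are hypothetical, only a subset of images is included, and images labeled Custom are not modeled.

```python
from typing import List, NamedTuple


class ImageSupport(NamedTuple):
    schedulers: frozenset
    account_services: frozenset
    storage: frozenset


NAS = frozenset({"General-purpose NAS", "Extreme NAS"})
NAS_AND_CPFS_NFS = NAS | {"CPFS-NFS"}

# A subset of the rows in the table above; extend it as needed.
SUPPORT = {
    "CentOS 7.9 64-bit": ImageSupport(
        frozenset({"PBS Pro18", "PBS Pro19", "Slurm 17", "Slurm 19", "Slurm 20",
                   "Slurm 22", "Open Grid Scheduler (SGE)", "Deadline"}),
        frozenset({"NIS", "LDAP"}),
        NAS_AND_CPFS_NFS | {"CPFS-POSIX"},
    ),
    "Alibaba Cloud Linux 2.1903 LTS 64-bit": ImageSupport(
        frozenset({"PBS Pro18"}), frozenset({"NIS", "LDAP"}), NAS_AND_CPFS_NFS),
    "Alibaba Cloud Linux 3.2104 LTS 64-bit": ImageSupport(
        frozenset({"Open Grid Scheduler (SGE)"}), frozenset({"NIS"}), NAS_AND_CPFS_NFS),
    "Ubuntu 20.04 64-bit": ImageSupport(
        frozenset({"Slurm 22"}), frozenset({"NIS"}), NAS_AND_CPFS_NFS),
}


def check_cluster_config(image: str, scheduler: str,
                         account_service: str, storage: str) -> List[str]:
    """Return a list of problems; an empty list means the combination is listed."""
    support = SUPPORT.get(image)
    if support is None:
        return [f"no support data recorded for image {image!r}"]
    problems = []
    if scheduler not in support.schedulers:
        problems.append(f"{image} does not list scheduler {scheduler!r}")
    if account_service not in support.account_services:
        problems.append(f"{image} does not list domain account service {account_service!r}")
    if storage not in support.storage:
        problems.append(f"{image} does not list shared storage {storage!r}")
    return problems
```

Collecting all problems instead of stopping at the first one makes it easier to correct a configuration in a single pass. The E-HPC console remains the authoritative source for what is currently available.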

E-HPC cluster users

You must create a user to submit, debug, and run jobs on an E-HPC cluster. When you create a user, you can grant one of the following types of permissions:

  • Ordinary permissions: suitable for ordinary users who only need to submit and debug jobs.

  • Sudo permissions: suitable for administrative users who need to manage the E-HPC cluster. In addition to ordinary permissions, sudo permissions allow users to install software and restart nodes by running sudo commands.

    Important

    You can create a root user only when you create an E-HPC cluster. We recommend that you do not use the root user for day-to-day operations. This minimizes the risk of damage to cluster data due to improper or accidental operations.

For more information, see Create a user.

Software

E-HPC provides access to major computing applications, runtime libraries, and Message Passing Interface (MPI) libraries. You can install the software based on your business requirements. For more information, see Software overview.

E-HPC cluster status

  • Creating: The cluster is being created. The ECS instances that make up the cluster are created in this stage.

  • Uninitialized: The image is being installed on the instances in the cluster.

  • Initializing: The cluster is being initialized. The root user is initialized in this stage.

  • Running: The cluster is up and running.

  • Exception: A cluster enters the Exception state when management nodes are deleted or stopped, or the scheduler is logged off. You can try to restore the cluster. If the cluster fails to be restored, submit a ticket.

  • Releasing: The cluster is being shut down and will be released.
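The states above can be modeled as an enumeration when you write tooling around a cluster. The following Python sketch is a minimal illustration: the names `ClusterStatus` and `next_action` are hypothetical, and it assumes that status values are spelled exactly as in the list above, which may differ from what an API returns.

```python
from enum import Enum


class ClusterStatus(Enum):
    CREATING = "Creating"
    UNINITIALIZED = "Uninitialized"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    EXCEPTION = "Exception"
    RELEASING = "Releasing"


# States in which the cluster is still being set up or torn down.
_IN_PROGRESS = {
    ClusterStatus.CREATING,
    ClusterStatus.UNINITIALIZED,
    ClusterStatus.INITIALIZING,
    ClusterStatus.RELEASING,
}


def next_action(status: ClusterStatus) -> str:
    """Suggest what an operator should do for a given cluster status."""
    if status is ClusterStatus.RUNNING:
        return "ready"
    if status is ClusterStatus.EXCEPTION:
        return "restore the cluster; submit a ticket if the restore fails"
    if status in _IN_PROGRESS:
        return "wait"
    raise ValueError(f"unhandled status: {status}")
```

For example, `next_action(ClusterStatus("Initializing"))` returns `"wait"`.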