An Elastic High Performance Computing (E-HPC) cluster is a group of Elastic Compute Service (ECS) instances purpose-built for high-performance computing workloads. Each cluster consists of three node types—logon, management, and compute—connected through shared storage and a job scheduler that distributes work across compute nodes.
Nodes
Each node in an E-HPC cluster is an ECS instance. Nodes fall into three categories based on their role.
| Node | Role |
|---|---|
| Logon node | Entry point to the cluster. Log in, debug code, compile software, install packages, and submit jobs from this node. |
| Management node | Runs the scheduling service (PBS, Slurm, or another supported scheduler) and the domain account service. Do not use management nodes to compile software or transfer compressed data; they must remain available for job scheduling and domain account resolution. |
| Compute node | Executes high-performance computing jobs. |
Management node sizing
Size management nodes based on the number of compute nodes in your cluster.
| Compute nodes | Minimum management node specs | Job limits |
|---|---|---|
| 100 or fewer | 16+ vCPUs, 64+ GiB memory | Fewer than 5,000 queued; fewer than 10,000 uncompleted |
| 101 to 500 | 32+ vCPUs, 128+ GiB memory | Fewer than 10,000 queued; fewer than 20,000 uncompleted |
| More than 500 | 64+ vCPUs, 256+ GiB memory | Fewer than 10,000 queued; fewer than 20,000 uncompleted |
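The sizing table can be expressed as a small lookup, which may be handy when a provisioning script has to choose a management node specification. This is a minimal sketch: `ManagementSpec` and `recommend_management_spec` are hypothetical names used for illustration and are not part of any E-HPC SDK. The thresholds are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagementSpec:
    """Minimum management node specs and job limits for one sizing tier."""
    min_vcpus: int
    min_memory_gib: int
    queued_jobs_below: int       # queued jobs should stay below this value
    uncompleted_jobs_below: int  # uncompleted jobs should stay below this value


# (upper bound on compute nodes, spec); None means "no upper bound".
# Values mirror the sizing table; the names are illustrative assumptions.
_SIZING_TIERS = [
    (100, ManagementSpec(16, 64, 5_000, 10_000)),
    (500, ManagementSpec(32, 128, 10_000, 20_000)),
    (None, ManagementSpec(64, 256, 10_000, 20_000)),
]


def recommend_management_spec(compute_nodes: int) -> ManagementSpec:
    """Return the minimum management node spec for a given cluster size."""
    if compute_nodes < 0:
        raise ValueError("compute_nodes must not be negative")
    for upper_bound, spec in _SIZING_TIERS:
        if upper_bound is None or compute_nodes <= upper_bound:
            return spec
    raise AssertionError("unreachable: the last tier has no upper bound")
```

For example, `recommend_management_spec(250)` falls into the second tier and returns a spec with at least 32 vCPUs and 128 GiB of memory.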
Images
An image provides the operating system and configuration data for the ECS instances in your cluster. E-HPC supports five image types.
- Public images: images provided by Alibaba Cloud.
- Custom images: images created from ECS instances or snapshots, or imported from your own machine.
- Shared images: images shared by other Alibaba Cloud accounts.
- Alibaba Cloud Marketplace images: images from independent software vendors (ISVs) licensed through Alibaba Cloud Marketplace.
- Community images: images published on the Alibaba Cloud Community image platform.
Available image types depend on the selected region, node instance type, and your account's image resources. All available types are shown in the console. The scheduler, domain account service, and shared storage options also vary by image.
For details, see Overview.
Scheduler
A scheduler distributes and runs jobs across compute nodes. E-HPC supports the following schedulers.
The version you can install depends on the image you select. For the full compatibility matrix, see Compatibility matrix.
| Scheduler family | Scheduler | Console identifier |
|---|---|---|
| PBS | PBS Pro19 | pbs19 |
| PBS | PBS Pro18 | pbs |
| PBS | OpenPBS 20 | |
| PBS | OpenPBS 22 | |
| Slurm | Slurm 22 | slurm22 |
| Slurm | Slurm 20 | slurm20 |
| Slurm | Slurm 19 | slurm19 |
| Slurm | Slurm 17 | slurm |
| Grid Engine | Open Grid Scheduler (SGE) | opengridscheduler |
| Other | Deadline | deadline |
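The console identifiers in the table above can be kept in a lookup so that scripts and configuration files resolve them consistently. The sketch below is illustrative only: `CONSOLE_IDENTIFIERS` and `scheduler_name` are hypothetical names, and the OpenPBS versions are left out because the table lists no identifier for them.

```python
# Console identifier -> scheduler name, copied from the scheduler table.
# Note that the unversioned identifiers refer to the oldest versions:
# "pbs" is PBS Pro18 and "slurm" is Slurm 17.
CONSOLE_IDENTIFIERS = {
    "pbs19": "PBS Pro19",
    "pbs": "PBS Pro18",
    "slurm22": "Slurm 22",
    "slurm20": "Slurm 20",
    "slurm19": "Slurm 19",
    "slurm": "Slurm 17",
    "opengridscheduler": "Open Grid Scheduler (SGE)",
    "deadline": "Deadline",
}


def scheduler_name(identifier: str) -> str:
    """Resolve a console identifier to its scheduler name."""
    try:
        return CONSOLE_IDENTIFIERS[identifier]
    except KeyError:
        raise ValueError(f"unknown console identifier: {identifier!r}") from None
```

Raising on an unknown identifier, rather than returning a default, keeps a mistyped value from silently selecting the wrong scheduler.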
Domain account services
A domain account service centralizes user identity management across all nodes in the cluster. E-HPC supports two options.
- Network Information Service (NIS): provides centralized identity management. Create a user once on the NIS server; after adding a new node to NIS, that user can immediately log in to the new node without per-node account creation.
- Lightweight Directory Access Protocol (LDAP): authenticates E-HPC users and supports grouping and permission delegation, which simplifies access management at organizational scale.
Supported domain account services vary by image. See Compatibility matrix.
Shared storage
All nodes in a cluster share access to a common file system that stores user data, scheduler state, and job data. E-HPC supports the following storage types.
- Apsara File Storage NAS: General-purpose NAS and Extreme NAS.
- Cloud Parallel File Storage (CPFS): supports CPFS-NFS and CPFS-POSIX mounting methods.
- Self-managed storage: NAS or other file systems not hosted by Alibaba Cloud.
Supported storage options vary by image. See Compatibility matrix.
Compatibility matrix
The table below shows which schedulers, domain account services, and shared storage options are available for each public image.
The console displays the image types, schedulers, and domain account services available for your cluster configuration.
Images labeled "Custom" in the table do not bundle a scheduler, domain account service, or shared storage. Install these components yourself.
CentOS 6 and CentOS 8 have reached end of life (EOL). The Linux community no longer maintains these versions. Switch to a supported operating system. For migration guidance, see How do I change CentOS 6 repository addresses? and Change CentOS 8 repository addresses.
| Public image | Scheduler | Domain account service | Shared storage |
|---|---|---|---|
| CentOS 7.2 64-bit, CentOS 7.3 64-bit, CentOS 7.4 64-bit, CentOS 7.5 64-bit, CentOS 7.6 64-bit, CentOS 7.8 64-bit, CentOS 7.9 64-bit, CentOS 7.9 64-bit (UEFI) | PBS Pro18, PBS Pro19, Slurm 17, Slurm 19, Slurm 20, Slurm 22, Open Grid Scheduler (SGE), Deadline | NIS, LDAP | General-purpose NAS, Extreme NAS, CPFS-NFS, CPFS-POSIX |
| CentOS 8.0 64-bit | OpenPBS 20 | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| CentOS 6.9 64-bit | PBS Pro18, Deadline | NIS, LDAP | General-purpose NAS, Extreme NAS |
| CentOS 6.10 64-bit | Custom | Custom | General-purpose NAS, Extreme NAS |
| Alibaba Cloud Linux 2.1903 LTS 64-bit | PBS Pro18 | NIS, LDAP | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Alibaba Cloud Linux 3.2104 LTS 64-bit | Open Grid Scheduler (SGE) | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Alibaba Cloud Linux 3.2104 LTS 64-bit for ARM | Open Grid Scheduler (SGE) | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Ubuntu 20.04 64-bit | Slurm 22 | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Ubuntu 20.04 64-bit for ARM | Slurm 22 | NIS | General-purpose NAS, Extreme NAS, CPFS-NFS |
| Windows Server 2022, Windows Server 2019, Windows Server 2016, Windows Server 2012 R2, Windows Server 2008 R2 | Custom | Custom | Custom |
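A planned configuration can be checked against the matrix before you create a cluster. The sketch below encodes three rows of the table as plain data and reports every choice that the selected image does not list. `COMPATIBILITY` and `unsupported_choices` are hypothetical names for illustration, and the data covers only a subset of images; the console remains the authoritative source.

```python
# A subset of the compatibility matrix, copied from the table above.
COMPATIBILITY = {
    "CentOS 7.9 64-bit": {
        "scheduler": {
            "PBS Pro18", "PBS Pro19", "Slurm 17", "Slurm 19", "Slurm 20",
            "Slurm 22", "Open Grid Scheduler (SGE)", "Deadline",
        },
        "domain account service": {"NIS", "LDAP"},
        "shared storage": {
            "General-purpose NAS", "Extreme NAS", "CPFS-NFS", "CPFS-POSIX",
        },
    },
    "Alibaba Cloud Linux 3.2104 LTS 64-bit": {
        "scheduler": {"Open Grid Scheduler (SGE)"},
        "domain account service": {"NIS"},
        "shared storage": {"General-purpose NAS", "Extreme NAS", "CPFS-NFS"},
    },
    "Ubuntu 20.04 64-bit": {
        "scheduler": {"Slurm 22"},
        "domain account service": {"NIS"},
        "shared storage": {"General-purpose NAS", "Extreme NAS", "CPFS-NFS"},
    },
}


def unsupported_choices(image, scheduler, domain_account, storage):
    """Return a message for each choice the image does not list.

    An empty list means the combination appears in the matrix.
    Raises KeyError if the image is not in this illustrative subset.
    """
    row = COMPATIBILITY[image]
    chosen = {
        "scheduler": scheduler,
        "domain account service": domain_account,
        "shared storage": storage,
    }
    return [
        f"{kind} '{value}' is not listed for {image}"
        for kind, value in chosen.items()
        if value not in row[kind]
    ]
```

For example, `unsupported_choices("Ubuntu 20.04 64-bit", "Slurm 22", "LDAP", "CPFS-NFS")` returns one message, because the matrix lists only NIS for that image.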
Cluster users
To submit, debug, and run jobs on an E-HPC cluster, you need a cluster user account. Two permission levels are available.
- Ordinary permissions: submit and debug jobs. Suitable for end users who run workloads but do not administer the cluster.
- Sudo permissions: everything in ordinary permissions, plus the ability to install software and restart nodes using `sudo` commands. Suitable for cluster administrators.
The root user can only be created at cluster creation time. Avoid using the root user for day-to-day operations to reduce the risk of accidental data loss or configuration errors.
For details, see Manage users.
Software
E-HPC provides access to major computing applications, runtime libraries, and Message Passing Interface (MPI) libraries. For a full list, see Software overview.
E-HPC cluster status
An E-HPC cluster moves through the following states.
| Status | Description |
|---|---|
| Creating | The cluster is being created. ECS instances are provisioned in this phase. |
| Uninitialized | The selected image is being installed on the instances. |
| Initializing | The cluster is being initialized. The root user is set up in this phase. |
| Running | The cluster is up and running. |
| Exception | A management node was deleted or stopped, or the scheduler went offline. Try to restore the cluster. If restoration fails, submit a ticket. |
| Releasing | The cluster is shutting down and will be released. |
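Automation that creates a cluster typically has to wait until the cluster reaches the Running state before it submits jobs. The sketch below models the states from the table as an enum and polls a caller-supplied function. `ClusterStatus`, `wait_until_running`, and the `get_status` callable are hypothetical; how the status is actually fetched (console, API, or SDK) is left to the caller and is not shown here.

```python
import time
from enum import Enum
from typing import Callable


class ClusterStatus(Enum):
    """Cluster states as listed in the status table."""
    CREATING = "Creating"
    UNINITIALIZED = "Uninitialized"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    EXCEPTION = "Exception"
    RELEASING = "Releasing"


def wait_until_running(
    get_status: Callable[[], ClusterStatus],
    poll_seconds: float = 10.0,
    max_polls: int = 60,
) -> None:
    """Poll get_status until the cluster reports Running.

    Raises RuntimeError if the cluster enters Exception or Releasing,
    and TimeoutError if Running is not reached within max_polls polls.
    """
    for _ in range(max_polls):
        status = get_status()
        if status is ClusterStatus.RUNNING:
            return
        if status in (ClusterStatus.EXCEPTION, ClusterStatus.RELEASING):
            raise RuntimeError(f"cluster entered the {status.value} state")
        time.sleep(poll_seconds)
    raise TimeoutError("cluster did not reach the Running state in time")
```

Treating Exception and Releasing as terminal for the wait is a design choice of this sketch: a cluster in either state will not become usable without intervention, so failing fast is preferable to polling until the timeout.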