Alibaba Cloud Elastic High Performance Computing E-HPC Product Introduction

Introduction: On September 20, 2022, the program "Alibaba Cloud EDA Cloud Solution" was officially launched. Three Alibaba Cloud experts explained, from several perspectives, how Alibaba Cloud helps chip design move onto the "cloud highway". He Ronghui, a senior technical expert in high-performance computing at Alibaba Cloud, gave a talk titled "Alibaba Cloud Elastic High Performance Computing E-HPC Product Introduction". The following is a summary of his talk:

1、 Elastic High Performance Computing E-HPC Product Overview

1. E-HPC product design starting point

Elastic high-performance computing E-HPC products are designed to help customers quickly build HPC environments on the cloud and fully experience the advantages of cloud services.

E-HPC product design mainly considers the following three aspects:

• Build the cloud supercomputing environment the way HPC customers understand it: to customers, an HPC environment is a tightly coupled cluster with the relevant software deployed, not a collection of separate servers, storage and networks;

• Provide supercomputing services that match HPC customers' habits: customers have long-established, fixed workflows for using HPC, and cloud services need to fit those existing workflows;

• Combine cloud service capabilities to offer new models and experiences: cloud features enable usage models that are uncommon or unavailable in offline HPC, which gives cloud supercomputing its distinctive character.

2. E-HPC Product Solution View

• Cluster resource lifecycle management: including cluster creation, cluster expansion, automatic scaling based on application load, and cluster management and O&M;

• Cluster job management and scheduling: including scheduler management and configuration, job scheduling, workload monitoring and reporting, third-party scheduler integration, and hybrid scheduling across on-cloud and off-cloud resources;

• ISV application software and runtime environment management: including application software deployment, cluster account management, and Estack;

• E-HPC performance services: including performance data visualization, performance analysis and optimization, and data caching.

Finally, the service entry point is provided through OpenAPI, and users can access E-HPC services from a cloud desktop or a local client.

2、 E-HPC product functions

E-HPC products mainly have four functions: cluster management, automatic scaling, business reports and performance analysis.

Cluster management: helps customers organize and use cloud IaaS resources as HPC clusters, including:

• Cluster resource management;

• Cluster user management;

• Cluster job and scheduler management;

• Access to on-cloud and off-cloud resources;

Automatic scaling: dynamically scales cluster resources according to the actual requirements of the HPC workload, including:

• Job load detection and statistics;

• Adding resources to and removing them from the scheduler;

• Multi-dimensional scaling strategies;

Business reports: monitors business data at the HPC cluster level and produces time-series charts and statistical reports, including:

• Real-time monitoring and statistics of HPC jobs;

• HPC cluster resource monitoring and statistics;

• Notification of job and resource events, linked to alarms;

Performance analysis: hardware-level, process-level and function-level performance analysis and performance reports for HPC jobs, including:

• HPC job performance monitoring;

• HPC job performance analysis;

1. E-HPC cluster management

E-HPC provides cluster management services such as cluster resource creation, deployment, and cluster node status management.

As shown in the figure, the left side is a typical cluster diagram with graphics nodes, head nodes, compute nodes and file storage; the right side maps that cluster onto the cloud. By combining Alibaba Cloud service components such as ECS instances, GPU instances and the Wuying ("shadowless") cloud desktop, you can create a cluster from the E-HPC console with one click.

The E-HPC management and control service helps customers manage one or more clusters in a region, monitor cluster status, and use reports, alarms, performance analysis and other services.

2. E-HPC cluster resource expansion

There are two ways to scale E-HPC cluster resources: manual scaling and automatic scaling.

• Manual scaling: the user specifies the type and quantity of resources to add. E-HPC creates the resources, deploys the required software, adds the nodes to the corresponding scheduler queue, and marks them as available. When scaling in, E-HPC likewise updates the scheduler configuration and removes the released compute nodes from the scheduler;

• Automatic scaling: this works together with the scheduler, and triggering it needs no direct user involvement. The user only configures an automatic scaling policy; E-HPC connects to the scheduler to sense the workload and scales out or in according to that policy.

Best Practices for Auto Scaling:

Load sensing:

• Sense parallel resource requirements: number of cores, nodes, memory, GPUs, etc.;

• Sense parallel management requirements: the queues and vnodes that need to be expanded;

• Threshold limits: per-user, per-cluster and per-queue resource upper limits, etc.;

• Scheduling strategy: scheduling priority, inter-job dependency restrictions, etc.;
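The load-sensing inputs above can be turned into a scale-out decision roughly as follows. This is a minimal sketch, not the E-HPC implementation: `PendingJob`, `QueuePolicy` and `nodes_to_add` are hypothetical names, only core counts are considered, and idle nodes, memory, GPUs, priorities and job dependencies are ignored for brevity.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class PendingJob:
    queue: str   # scheduler queue the job is waiting in
    cores: int   # cores requested by the job


@dataclass
class QueuePolicy:
    cores_per_node: int  # size of the node type used for this queue (assumed)
    max_nodes: int       # queue resource upper limit


def nodes_to_add(jobs, policies, running_nodes, cluster_max_nodes):
    """Return {queue: nodes to add}, honouring queue and cluster limits."""
    # Load sensing: total pending cores per queue.
    demand = {}
    for job in jobs:
        demand[job.queue] = demand.get(job.queue, 0) + job.cores

    plan = {}
    cluster_headroom = cluster_max_nodes - sum(running_nodes.values())
    for queue in sorted(demand):
        policy = policies[queue]
        wanted = ceil(demand[queue] / policy.cores_per_node)
        queue_headroom = policy.max_nodes - running_nodes.get(queue, 0)
        # Threshold limits: never exceed the queue or the cluster cap.
        granted = max(0, min(wanted, queue_headroom, cluster_headroom))
        if granted:
            plan[queue] = granted
            cluster_headroom -= granted
    return plan


if __name__ == "__main__":
    jobs = [PendingJob("cpu", 96), PendingJob("cpu", 40), PendingJob("gpu", 8)]
    policies = {"cpu": QueuePolicy(32, 4), "gpu": QueuePolicy(8, 2)}
    print(nodes_to_add(jobs, policies, {"cpu": 1, "gpu": 0}, 5))
```

In the usage example the "cpu" queue would need five nodes for its 136 pending cores, but its queue limit leaves room for only three, and the cluster limit then leaves one node for the "gpu" queue.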

E-HPC provides multiple capacity expansion strategies, including:

• Expansion instance type priority;

• Cross-AZ/cross-region;

• Capacity expansion by queue;

• Capacity expansion in batches;

• Wait time for capacity expansion;

• Reserved instances;

• Wait time/wait strategy for automatic release;

• Cost optimization;

• Inventory strategy;

• ……
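Two of the strategies listed above, instance type priority and expansion in batches, can be combined with an inventory check as in the sketch below. It is an illustration under stated assumptions: `plan_expansion`, the instance type names and the inventory dictionary are hypothetical, and real inventory would come from the cloud platform rather than a local mapping.

```python
def plan_expansion(needed, type_priority, inventory, batch_size):
    """Pick instance types in priority order, then split the picks into batches.

    Returns (batches, shortfall), where shortfall is the number of nodes
    that could not be satisfied from the available inventory.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    picks = []
    remaining = needed
    for instance_type in type_priority:
        if remaining == 0:
            break
        # Inventory strategy: fall through to the next type when stock runs out.
        take = min(remaining, inventory.get(instance_type, 0))
        picks.extend([instance_type] * take)
        remaining -= take

    # Batch strategy: create nodes a few at a time instead of all at once.
    batches = [picks[i:i + batch_size] for i in range(0, len(picks), batch_size)]
    return batches, remaining


if __name__ == "__main__":
    batches, shortfall = plan_expansion(
        5, ["type-a", "type-b"], {"type-a": 2, "type-b": 10}, batch_size=2
    )
    print(batches, shortfall)
```

Here the preferred type has only two instances in stock, so the remaining three nodes fall back to the second type, and the five nodes are created in batches of two, two and one.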

3. E-HPC cluster event monitoring and business report

Because the E-HPC service is tightly integrated with the HPC scheduler, it can generate monitoring and report data from fine-grained scheduler-level and job-level events, which helps users analyze business status, identify bottlenecks and optimize business processes.
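As an illustration of how job-level events can be rolled up into report data, the sketch below computes per-queue job counts, average wait time and average run time. The `JobEvent` record and its field names are hypothetical and do not reflect the actual E-HPC event schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobEvent:
    job_id: str
    queue: str
    kind: str         # "submitted", "started" or "finished" (assumed event kinds)
    timestamp: float  # seconds since an arbitrary origin


def queue_report(events):
    """Return {queue: {"jobs", "avg_wait", "avg_run"}} for finished jobs only."""
    times = defaultdict(dict)  # job_id -> {kind: timestamp}
    queue_of = {}              # job_id -> queue
    for event in events:
        times[event.job_id][event.kind] = event.timestamp
        queue_of[event.job_id] = event.queue

    totals = defaultdict(lambda: {"jobs": 0, "wait": 0.0, "run": 0.0})
    for job_id, t in times.items():
        # Skip jobs that have not gone through the full lifecycle yet.
        if not all(kind in t for kind in ("submitted", "started", "finished")):
            continue
        row = totals[queue_of[job_id]]
        row["jobs"] += 1
        row["wait"] += t["started"] - t["submitted"]
        row["run"] += t["finished"] - t["started"]

    return {
        queue: {
            "jobs": row["jobs"],
            "avg_wait": row["wait"] / row["jobs"],
            "avg_run": row["run"] / row["jobs"],
        }
        for queue, row in totals.items()
    }
```

A long average wait on one queue compared with its average run time is the kind of bottleneck signal such a report is meant to surface.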

Event-based alarm when the capacity expansion threshold is exceeded:

4. E-HPC hybrid cluster scheme (mainly off-cloud control)

In the networking scheme where management and control stay mainly off the cloud, E-HPC connects to the head node and scheduler in the offline machine room so that cloud resources can be expanded and released according to the scheduler load.

Similarly, customers can choose to expand or shrink capacity manually. E-HPC creates compute nodes on the cloud and adds them to the offline scheduler according to customer needs.

Agent Mode Architecture Diagram

a. Scheme features:

• Consistent habits: the existing local HPC cluster needs no changes, and users keep their existing habits and scheduler scripts;

• One-click creation: create an E-HPC cluster on the cloud with one click; a proxy node manages the cloud resources, and cloud automatic scaling and reporting services are integrated;

• Elastic scalability: expand cloud resources during business peaks and pay on demand; automatically release cloud resources during business troughs to save costs;

b. Applicable scenarios:

Customers who continue to use the existing HPC cluster in their offline machine room and want to expand and release cloud resources flexibly as business demand peaks and falls, so that resources are supplied quickly and costs are saved.

Create a hybrid cloud cluster:

5. E-HPC hybrid cluster scheme (mainly on-cloud control)

The networking scheme where management and control are mainly on the cloud brings offline nodes under cloud management. The cloud HPC cluster is the main cluster: the head node and login node are both on the cloud, and the offline nodes serve as auxiliary, legacy nodes.

Master control mode architecture diagram

a. Scheme features:

• Cloud-managed O&M: E-HPC on the cloud handles management and control, including resource management at the HPC scheduler level, which saves the O&M cost of offline clusters;

• Cloud hosting: create an E-HPC cluster on the cloud with one click, bring local offline computing resources under its management, and keep using old offline equipment;

• Elastic scalability: expand cloud resources during business peaks and pay on demand; automatically release cloud resources during business troughs to save costs;

b. Applicable scenarios:

Customers whose HPC cluster resources in the offline machine room are aging and who want to use cloud resources flexibly, move to the cloud gradually, and bring the original offline computing resources under management, so as to save costs effectively.

Best practice of hybrid cloud master mode:

6. E-HPC Cluster Scheduler Compatibility Scheme

E-HPC provides scheduler plug-ins as an extension component of the platform. When the scheduler types or versions that E-HPC already supports do not meet current business requirements, customers can build a customized scheduler plug-in and connect their scheduler to the E-HPC platform.

Third-party and commercial schedulers are integrated through the plug-in mechanism:

• E-HPC management and control performs cluster management through the plug-in framework interface, including job management, resource management, load monitoring, capacity expansion, etc.;

• Customized plug-in code implements the functions defined by the plug-in framework to adapt to the scheduler;

• The E-HPC cluster creation process supports installing and deploying customized plug-ins;

• The functions that a plug-in supports can be configured through configuration files;

• E-HPC provides plug-in templates and sample scheduler plug-ins such as PBS and LSF.
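The plug-in mechanism described above can be pictured as an abstract interface that each scheduler adapter implements. The sketch below is illustrative only: `SchedulerPlugin`, its method names and the in-memory `InMemoryPlugin` are hypothetical and are not the actual E-HPC plug-in framework interface. A real plug-in would call the scheduler's own commands or API where the toy adapter updates dictionaries.

```python
from abc import ABC, abstractmethod


class SchedulerPlugin(ABC):
    """Hypothetical contract that the control plane calls for any scheduler."""

    @abstractmethod
    def submit_job(self, queue: str, cores: int) -> str:
        """Job management: submit a job and return its identifier."""

    @abstractmethod
    def pending_cores(self, queue: str) -> int:
        """Load monitoring: cores requested by jobs waiting in a queue."""

    @abstractmethod
    def add_node(self, hostname: str, queue: str) -> None:
        """Capacity expansion: register a new compute node with a queue."""

    @abstractmethod
    def remove_node(self, hostname: str) -> None:
        """Capacity reduction: clear a compute node from the scheduler."""


class InMemoryPlugin(SchedulerPlugin):
    """Toy adapter that keeps scheduler state in dictionaries."""

    def __init__(self):
        self.nodes = {}  # hostname -> queue
        self.jobs = {}   # job id -> (queue, cores)

    def submit_job(self, queue, cores):
        job_id = f"job-{len(self.jobs) + 1}"
        self.jobs[job_id] = (queue, cores)
        return job_id

    def pending_cores(self, queue):
        return sum(cores for q, cores in self.jobs.values() if q == queue)

    def add_node(self, hostname, queue):
        self.nodes[hostname] = queue

    def remove_node(self, hostname):
        self.nodes.pop(hostname, None)
```

Keeping the control plane dependent only on the abstract interface is what would let a different scheduler be swapped in without changing the scaling or reporting logic.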
