Elastic High Performance Computing E-HPC Product Introduction

1、 Elastic High Performance Computing (E-HPC) product overview

1. Starting point of E-HPC product design

Elastic High Performance Computing (E-HPC) is designed to help customers quickly build HPC environments on the cloud and take full advantage of cloud services.

The E-HPC product design rests on three considerations:

• Build the cloud supercomputing environment the way HPC customers think of it: to these customers, an HPC environment is a tightly coupled cluster with the corresponding software already deployed, not a set of discrete servers, storage, and networks;

• Provide supercomputing services that match the habits of HPC customers: customers have long-established business workflows for using HPC, and cloud services need to fit their existing processes;

• Combine cloud service capabilities to offer new models and new experiences: use the characteristics of cloud services to provide usage models that are uncommon or unavailable in on-premises (offline) HPC, giving cloud supercomputing its own distinctive features.

2. E-HPC product solution view

• Cluster resource lifecycle management: cluster creation, manual scale-out and scale-in, automatic scaling based on application load, and cluster management and operations and maintenance (O&M);

• Cluster job management and scheduling: scheduler management and configuration, job scheduling, job load monitoring and reporting, third-party scheduler integration, and hybrid scheduling across cloud and on-premises resources;

• ISV application software and runtime environment management: application software deployment, cluster account management, and the software stack;

• E-HPC performance services: performance data visualization, performance analysis and optimization, and data caching.

Finally, a business portal is provided through OpenAPI, and users can access E-HPC services from a cloud desktop or a local client.

2、 E-HPC product functions

E-HPC has four major functions: cluster management, automatic scaling, business reports, and performance analysis.

Cluster management: helps customers orchestrate and use cloud IaaS resources as an HPC cluster, including:

• Cluster resource management;

• Cluster user management;

• Cluster job and scheduler management;

• Access to resources on and off the cloud;

Automatic scaling: dynamically scales cluster resources according to the actual demand of the HPC workload, including:

• Job load detection and statistics;

• Adding resources to and removing them from the scheduler;

• Multi-dimensional scaling policies;

Business reports: monitor business data at the HPC cluster level and produce time series charts and statistical reports, including:

• Real-time monitoring and statistics of HPC jobs;

• HPC cluster resource monitoring and statistics;

• Job/resource runtime event notifications and alarm linkage;

Performance analysis: hardware-level, process-level, and function-level performance analysis and performance reports for HPC jobs, including:

• HPC job performance monitoring;

• HPC job performance analysis;

1. E-HPC cluster management

E-HPC provides cluster management services such as cluster resource creation, deployment, cluster node status management, etc.

As shown in the figure, the left side is a typical cluster layout with graphics nodes, head nodes, compute nodes, and file storage. The right side maps that cluster to the cloud: using Alibaba Cloud service components such as ECS instances, GPU instances, and Wuying cloud desktops, the cluster is created with a few clicks in the E-HPC console.

The E-HPC control service helps customers manage one or more clusters in a region, monitor cluster status, and provide reports, alarms, performance analysis, and other services.

2. E-HPC cluster resource scaling

E-HPC cluster resource scaling can be divided into two types: manual scaling and automatic scaling.

• Manual scaling: the user directly specifies the type and quantity of resources to add. E-HPC creates the resources, deploys the related software, adds the nodes to the corresponding scheduler queue, and marks them as available. When scaling in, E-HPC likewise updates the scheduler configuration and removes the released compute nodes' information from the scheduler;

• Automatic scaling: works together with the scheduler and is triggered without direct user involvement. The user only configures the automatic scaling policy; E-HPC connects to the scheduler to sense the workload and scales according to that policy.
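The manual scaling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Scheduler class is a hypothetical in-memory stand-in for a real scheduler, and Node, scale_out, scale_in, and the "type-a" instance label are invented names, not part of the E-HPC API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    instance_type: str
    state: str = "creating"


@dataclass
class Scheduler:
    """Hypothetical in-memory stand-in for a real scheduler such as PBS."""
    queues: dict = field(default_factory=dict)  # queue name -> list of Node
    next_id: int = 0                            # counter for unique node names

    def add_node(self, queue: str, node: Node) -> None:
        self.queues.setdefault(queue, []).append(node)

    def remove_node(self, queue: str, name: str) -> None:
        nodes = self.queues.get(queue, [])
        self.queues[queue] = [n for n in nodes if n.name != name]


def scale_out(sched: Scheduler, queue: str, instance_type: str, count: int) -> list:
    """Manual scale-out: create, deploy, join the queue, mark available."""
    added = []
    for _ in range(count):
        node = Node(f"compute{sched.next_id}", instance_type)  # 1. create the resource
        sched.next_id += 1
        node.state = "deployed"      # 2. software deployment (placeholder step)
        sched.add_node(queue, node)  # 3. join the scheduler queue
        node.state = "available"     # 4. mark the resource as available
        added.append(node)
    return added


def scale_in(sched: Scheduler, queue: str, names: list) -> None:
    """Manual scale-in: clear the node information from the scheduler."""
    for name in names:
        sched.remove_node(queue, name)


sched = Scheduler()
new_nodes = scale_out(sched, "workq", "type-a", 3)
scale_in(sched, "workq", [new_nodes[0].name])
print([n.name for n in sched.queues["workq"]])
```

In a real deployment each numbered step would call the cloud provider and the scheduler; here they are reduced to state changes so that the order of operations is visible.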

Load sensing covers:

• Parallel resource requirements: number of cores, nodes, memory, GPUs, etc.;

• Parallel management requirements: the queues and vnodes to be scaled out;

• Threshold limits: user resource limits, cluster resource limits, queue resource limits, etc.;

• Scheduling policy: scheduling priority, dependencies between jobs, etc.
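To make the load-sensing inputs concrete, the following Python sketch turns pending core demand into a node count and caps it with queue and cluster limits. It is a simplified illustration, not the E-HPC algorithm: PendingJob, nodes_to_add, and all parameter names are assumptions, and scheduling priority and job dependencies are deliberately ignored.

```python
import math
from dataclasses import dataclass


@dataclass
class PendingJob:
    queue: str
    cores: int


def nodes_to_add(pending, queue, cores_per_node, idle_cores,
                 queue_nodes, queue_max_nodes,
                 cluster_nodes, cluster_max_nodes):
    """Return how many nodes to add to `queue` for the pending demand."""
    # Resource requirement sensed from the scheduler: cores requested per queue.
    demand = sum(job.cores for job in pending if job.queue == queue)
    # Cores that the currently idle capacity cannot cover.
    shortfall = max(0, demand - idle_cores)
    wanted = math.ceil(shortfall / cores_per_node)
    # Threshold limits: never exceed the queue limit or the cluster limit.
    headroom = min(queue_max_nodes - queue_nodes,
                   cluster_max_nodes - cluster_nodes)
    return max(0, min(wanted, headroom))


pending = [PendingJob("workq", 16), PendingJob("workq", 24), PendingJob("gpu", 8)]
print(nodes_to_add(pending, "workq", cores_per_node=8, idle_cores=8,
                   queue_nodes=2, queue_max_nodes=5,
                   cluster_nodes=6, cluster_max_nodes=20))
```

With these sample numbers the queue needs 4 more nodes for 32 uncovered cores, but the queue limit leaves room for only 3, so the limit wins.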

E-HPC provides a variety of scaling strategies, including:

• Instance-type priority for scale-out;

• Cross-zone (AZ) and cross-region scale-out;

• Per-queue scale-out;

• Batch scale-out;

• Scale-out wait time;

• Retained instances;

• Wait time and wait policy before automatic reclamation;

• Cost optimization;

• Inventory strategy;

• ……
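As an illustration of how a few of these strategies could interact, the Python sketch below combines an instance-type priority list, an inventory check, and a per-round batch limit. The function plan_scale_out, its parameters, and the instance labels are hypothetical and chosen only for this example; they do not describe how E-HPC implements its strategies.

```python
def plan_scale_out(needed, priority, stock, batch_size):
    """Plan one scale-out round as a list of (instance_type, count) pairs.

    needed     -- nodes the workload currently requires
    priority   -- instance types in preferred order
    stock      -- available inventory per instance type
    batch_size -- maximum nodes to create in a single round
    """
    remaining = min(needed, batch_size)  # batch strategy caps each round
    plan = []
    for instance_type in priority:       # try preferred types first
        if remaining == 0:
            break
        take = min(remaining, stock.get(instance_type, 0))  # inventory strategy
        if take > 0:
            plan.append((instance_type, take))
            remaining -= take
    return plan


print(plan_scale_out(
    needed=10,
    priority=["type-a", "type-b", "type-c"],
    stock={"type-a": 3, "type-b": 0, "type-c": 20},
    batch_size=8,
))
```

Nodes that do not fit in the current batch would simply be requested in a later round, which is one way a wait-time strategy could come into play.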

3. E-HPC cluster event monitoring and business reports

Because the E-HPC service is tightly integrated with the HPC scheduler, it can generate monitoring and report data from fine-grained scheduler-level and job-level events. This helps users analyze business status, locate bottlenecks, and optimize business processes.
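A small Python sketch shows how job-level events could be folded into report data. The event tuples, the event names (submitted, started, finished), and the job_report function are assumptions made for this example rather than the actual E-HPC event schema.

```python
from collections import defaultdict
from statistics import mean

REQUIRED = ("submitted", "started", "finished")


def job_report(events):
    """Build per-job wait/run times and a summary from (time, job_id, kind) events."""
    times = defaultdict(dict)
    for timestamp, job_id, kind in events:
        times[job_id][kind] = timestamp

    rows = {}
    for job_id, t in times.items():
        # Only jobs with a complete lifecycle are included in the report.
        if all(kind in t for kind in REQUIRED):
            rows[job_id] = {
                "wait": t["started"] - t["submitted"],
                "run": t["finished"] - t["started"],
            }

    summary = {
        "jobs": len(rows),
        "avg_wait": mean(r["wait"] for r in rows.values()) if rows else 0,
        "avg_run": mean(r["run"] for r in rows.values()) if rows else 0,
    }
    return rows, summary


events = [
    (0, "j1", "submitted"), (5, "j1", "started"), (65, "j1", "finished"),
    (10, "j2", "submitted"), (25, "j2", "started"), (55, "j2", "finished"),
    (30, "j3", "submitted"),  # still waiting, so excluded from the report
]
rows, summary = job_report(events)
print(rows)
print(summary)
```

A long average wait relative to run time is the kind of bottleneck signal such a report is meant to surface.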

Events can also drive alarms, for example when scale-out pushes resource usage above its configured water level (threshold).
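An event-based water-level alarm might be sketched as follows. ScaleEvent, water_level_alarms, and the 80% default are illustrative assumptions; the source does not specify the event format or the threshold that E-HPC uses.

```python
from dataclasses import dataclass


@dataclass
class ScaleEvent:
    queue: str
    nodes_after: int  # node count in the queue after the scale-out


def water_level_alarms(events, max_nodes, water_level=0.8):
    """Return an alarm message for each event that exceeds the water level."""
    alarms = []
    for event in events:
        usage = event.nodes_after / max_nodes[event.queue]
        if usage > water_level:
            alarms.append(
                f"ALARM queue={event.queue} usage={usage:.0%} > {water_level:.0%}"
            )
    return alarms


events = [ScaleEvent("workq", 7), ScaleEvent("workq", 9), ScaleEvent("gpu", 2)]
for message in water_level_alarms(events, {"workq": 10, "gpu": 4}):
    print(message)
```

Only the second event crosses the 80% level of the workq limit, so a single alarm is produced.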

4. E-HPC hybrid cluster scheme (on-premises control as primary)

In the networking scheme based on on-premises (offline) control, the head node stays in the on-premises data center. E-HPC connects to the scheduler in that data center so that cloud resources can be scaled out and in according to the scheduler load.

Customers can also choose to scale manually: E-HPC creates compute nodes on the cloud and adds them to the on-premises scheduler as the customer requires.

A. Solution features:

• Consistent habits: the existing local HPC cluster needs no changes, and users keep their current usage habits and scheduler scripts;

• One-click creation: create an E-HPC cluster on the cloud with one click. A proxy node manages the cloud resources on behalf of the local cluster and integrates the cloud auto-scaling and reporting services;

• Elastic scaling: scale out cloud resources during business peaks with pay-as-you-go billing, and automatically release them during business troughs to save costs.

B. Applicable scenarios:

Customers who want to keep using the existing HPC cluster in their on-premises data center while elastically scaling out and releasing cloud resources as business load peaks and falls, so that capacity is supplied quickly and costs are saved.

To create a hybrid cloud cluster, see https://help.aliyun.com/document_detail/84850.html

5. E-HPC hybrid cluster scheme (cloud control as primary)

The networking scheme based on cloud control is called the managed-node scheme. The cluster is built on a cloud HPC cluster: the head node and the login node are both on the cloud, and on-premises nodes serve as auxiliary resources so that existing equipment can be reused.

A. Solution features:

• Cloud-hosted O&M: E-HPC control runs on the cloud and manages resources at the HPC scheduler level, saving the O&M cost of the on-premises cluster;

• Managed on-premises resources: create an E-HPC cluster on the cloud with one click, bring local on-premises compute resources under its management, and keep using older on-premises equipment;

• Elastic scaling: scale out cloud resources during business peaks with pay-as-you-go billing, and automatically release them during business troughs to save costs.

B. Applicable scenarios:

The HPC cluster resources in the on-premises data center are aging, and the customer wants to use cloud resources elastically and move to the cloud gradually, while still managing the existing on-premises compute resources to save costs.

Best practices of hybrid cloud master mode

6. Compatibility scheme of E-HPC cluster scheduler

E-HPC provides a scheduler plug-in as an extension component of the platform. When the scheduler types or versions that E-HPC already supports do not meet current business needs, customers can use this plug-in to build a customized scheduler and connect it to the E-HPC platform.

Third-party and commercial schedulers are integrated through the plug-in mechanism:

• E-HPC control performs cluster management through the plug-in framework interface, including job management, resource management, load monitoring, and scaling;

• Custom plug-in code implements the functions defined by the plug-in framework and handles adaptation to the scheduler;

• The E-HPC cluster creation process supports installing and deploying custom plug-ins;

• The functions a plug-in supports can be declared in a configuration file;

• E-HPC provides plug-in templates and sample plug-ins for PBS, LSF, and other schedulers.
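The plug-in idea can be pictured as an abstract interface that a custom adapter fills in. The Python sketch below is a hypothetical contract written for illustration: SchedulerPlugin, its method names, and the feature strings are assumptions and do not reproduce the actual E-HPC plug-in framework or its templates.

```python
from abc import ABC, abstractmethod


class SchedulerPlugin(ABC):
    """Hypothetical plug-in contract; not the actual E-HPC framework interface."""

    def __init__(self, features):
        # In practice the supported features would come from a configuration file.
        self.features = set(features)

    def supports(self, feature: str) -> bool:
        return feature in self.features

    @abstractmethod
    def submit_job(self, script: str, queue: str) -> str: ...

    @abstractmethod
    def list_pending_jobs(self) -> list: ...

    @abstractmethod
    def add_node(self, name: str, queue: str) -> None: ...

    @abstractmethod
    def remove_node(self, name: str) -> None: ...


class InMemoryPlugin(SchedulerPlugin):
    """Toy adapter that keeps state in memory instead of calling a scheduler."""

    def __init__(self, features):
        super().__init__(features)
        self.pending = []
        self.nodes = {}

    def submit_job(self, script, queue):
        job_id = f"job{len(self.pending) + 1}"
        self.pending.append(job_id)
        return job_id

    def list_pending_jobs(self):
        return list(self.pending)

    def add_node(self, name, queue):
        self.nodes[name] = queue

    def remove_node(self, name):
        self.nodes.pop(name, None)


plugin = InMemoryPlugin(features=["job_management", "resource_management"])
plugin.submit_job("run.sh", "workq")
plugin.add_node("compute0", "workq")
print(plugin.list_pending_jobs(), plugin.nodes, plugin.supports("load_monitoring"))
```

A real adapter would replace the in-memory lists with calls to the target scheduler's own commands or API, while the control service keeps talking to the same interface.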
