Overview - Elastic High Performance Computing - Alibaba Cloud Documentation Center

Elastic High Performance Computing (E-HPC) provides scheduler plug-ins in addition to mainstream schedulers. If the types or versions of the existing schedulers do not meet your business requirements, you can use a scheduler plug-in to build a custom scheduler, and then connect the scheduler to the E-HPC console. This topic provides an overview of scheduler plug-ins.

What is a scheduler plug-in?

E-HPC is a platform as a service (PaaS) tool that integrates mainstream open source schedulers. In most cases, if you migrate your services to the cloud, you also need to migrate your on-premises scheduler to the cloud. However, the built-in scheduler of E-HPC cannot connect with some third-party schedulers due to compatibility issues.

To meet your requirements for scheduler types and versions, E-HPC provides scheduler plug-ins that you can use to build a custom scheduler. For example, if you need to use electronic design automation (EDA) software in the cloud, you must purchase a license to run the required scheduler. However, E-HPC does not provide such licenses. In this case, you can build a custom scheduler and connect the scheduler to E-HPC.

E-HPC scheduler plug-ins provide plug-in templates, configuration files, and modularized features. You can build a custom scheduler based on the scheduler features and your business requirements. After you build a custom scheduler plug-in, you can use the E-HPC console to create a cluster and deploy the plug-in on the cluster. This way, you can manage your nodes, jobs, and auto scaling settings.

Process

This section uses the submission of a job in the E-HPC console as an example to describe how a scheduler plug-in is used in a cluster.

Log on to the E-HPC console, select a cluster, and initiate a request to submit a job.
E-HPC receives the request from the console and sends a command to the cluster.
The scheduling node identifies the type of the scheduler plug-in, downloads the plug-in to a local path, and parses the value of the JobSubmit parameter. If the JobSubmit parameter is set to false, an error is returned to indicate that the plug-in cannot submit jobs. If the JobSubmit parameter is set to true, the scheduler plug-in can be used as expected.
The job submission command of the scheduler plug-in is called. For example, the qsub command is called for the PBS scheduler, and the bsub command is called for the LSF scheduler. After the job is submitted, a result is returned.
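
The check on the JobSubmit parameter and the choice of submission command can be sketched in Python as follows. This is an illustration under stated assumptions, not the E-HPC implementation: it assumes that ehpc_custom.conf is an INI-style file and that JobSubmit is set in the [SchedulerCapability] section. The function name resolve_submit_command and the SUBMIT_COMMANDS table are hypothetical. The sketch only resolves the command name and does not run it.

```python
import configparser

# Submission commands named in the text: qsub for PBS, bsub for LSF.
SUBMIT_COMMANDS = {"pbs": "qsub", "lsf": "bsub"}


def resolve_submit_command(conf_text, scheduler):
    """Return the submission command for the scheduler.

    Raises RuntimeError if the plug-in configuration disables job submission.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Assumption: JobSubmit lives in [SchedulerCapability]; a missing value
    # is treated as false.
    can_submit = parser.getboolean(
        "SchedulerCapability", "JobSubmit", fallback=False
    )
    if not can_submit:
        raise RuntimeError("the scheduler plug-in cannot submit jobs")
    return SUBMIT_COMMANDS[scheduler.lower()]
```

With JobSubmit set to true, resolve_submit_command(conf_text, "PBS") returns "qsub". With JobSubmit set to false, the function raises an error, which mirrors the error described in the third step.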

Scheduler plug-in files

The following figure shows the directories of scheduler plug-in files.
[Figure: directories of scheduler plug-in files]
Scheduler plug-in files include the following files:

ehpc_custom.conf: The configuration file, which specifies the scheduler information of the plug-in and the available scheduler features. For more information, see Configuration files.
*.py: The script files, which implement specific features of the scheduler based on the scheduler template. The files must be stored in the second-level directory /<scheduler name>/<scheduler version>, for example, /LSF/10.1.0.
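
The directory rule for script files can be checked with a short Python sketch. The plug-in root directory passed as root and both function names are hypothetical. The sketch only assumes the /<scheduler name>/<scheduler version> layout described above.

```python
from pathlib import PurePosixPath


def plugin_script_dir(root, scheduler_name, scheduler_version):
    """Build the second-level directory that holds the *.py script files."""
    return PurePosixPath(root) / scheduler_name / scheduler_version


def is_valid_script_path(root, script_path):
    """Return True if script_path is a .py file directly inside
    <root>/<scheduler name>/<scheduler version>."""
    try:
        relative = PurePosixPath(script_path).relative_to(root)
    except ValueError:
        # The path is not located under the plug-in root at all.
        return False
    # Expect exactly three parts: name, version, and the script file.
    return len(relative.parts) == 3 and relative.suffix == ".py"
```

For example, str(plugin_script_dir("/", "LSF", "10.1.0")) gives /LSF/10.1.0, which matches the example path in the list above.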

Configuration files

The configuration file of a scheduler plug-in specifies the scheduler information of the plug-in and the available scheduler features. The following figure shows the features.
[Figure: features in the configuration file of a scheduler plug-in]
[Scheduler] specifies the scheduler information, including the scheduler name and version number.
[SchedulerCapability] specifies the available plug-in features, including the features supported by E-HPC. Each item in the [SchedulerCapability] section names a feature and specifies whether the feature is enabled. The following features are supported:

Scheduling service detection (priority: 3): configures the statuses of the cluster nodes displayed in the console.
Node operations (priority: 2): adds or removes nodes in the console for manual scale-in or scale-out.
Resource query (priority: 2): obtains resources based on node information and visualizes the resources in the console.
Node status query (priority: 1): monitors node statuses to implement auto scaling in the console.
Job operations (priority: 1): submits or queries jobs in the console.
Queue operations (priority: 1): adds or queries queues in the console.

Note

The higher the priority value, the more significant the feature. Scheduling service detection has the highest priority and is therefore the most significant feature. All other features are available only when scheduling service detection is enabled.
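
The dependency described in the note can be sketched in Python as follows. The sample configuration is hypothetical: only the section names [Scheduler] and [SchedulerCapability] and the JobSubmit parameter come from this topic. The other keys, including ServiceCheck as the switch for scheduling service detection, are placeholders for illustration, and the file is assumed to be INI-style.

```python
import configparser

# Hypothetical sample; the real key names may differ.
SAMPLE_CONF = """
[Scheduler]
Name = LSF
Version = 10.1.0

[SchedulerCapability]
ServiceCheck = true
NodeOperation = true
JobSubmit = false
"""

# Hypothetical key that enables scheduling service detection.
GATE_KEY = "ServiceCheck"


def effective_capabilities(conf_text):
    """Map each declared feature to whether it is usable in practice.

    A feature is usable only if it is enabled itself and scheduling
    service detection is enabled as well.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the original case of the keys
    parser.read_string(conf_text)
    section = parser["SchedulerCapability"]
    declared = {key: section.getboolean(key) for key in section}
    gate_on = declared.get(GATE_KEY, False)
    return {key: value and gate_on for key, value in declared.items()}
```

With the sample above, NodeOperation stays enabled and JobSubmit stays disabled. If ServiceCheck is changed to false, every feature is reported as unavailable, regardless of its own setting.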