
Elastic High Performance Computing: Deploy a hybrid cloud cluster in master mode

Last Updated: Mar 27, 2024

A hybrid cloud cluster in master mode is deployed with the Open Grid Scheduler (SGE) scheduler and consists of an existing on-premises cluster and a new cluster on the cloud. The cluster on the cloud contains management nodes and multiple compute nodes, and the on-premises cluster contains compute nodes. This topic describes how to deploy a hybrid cloud cluster in master mode and manage on-premises compute nodes in the cluster.

Important

This example requires the SGE scheduler. Before you can use the SGE scheduler, you must submit a ticket to apply to add your Alibaba Cloud account to the whitelist.

Background information

A large number of life science enterprises have deployed data centers to perform large-scale high-performance computing. However, the outdated IT infrastructure of these data centers cannot keep up with the sharp rise in computing power requirements in life science. In some scenarios, life science enterprises need to use the resources of their data centers and also build high-performance computing clusters on the cloud. To meet these requirements, Elastic High Performance Computing (E-HPC) provides a hybrid cloud solution that helps you manage on-premises and cloud computing resources in a unified and efficient manner. Cloud resources allow E-HPC clusters to scale flexibly.

Hybrid cloud clusters can be deployed in the following modes:

  • Master mode: Management nodes reside in a new cluster on the cloud.

  • Proxy mode: Management nodes reside in an existing cluster.

    Note

    In proxy mode, an existing cluster resides in a data center or in a different region on the cloud.

Scenario

Assume that the compute nodes of the existing on-premises cluster have the following specifications:

  • Operating system: CentOS 7.6

    Important

    The operating system of on-premises compute nodes must be CentOS 7.

  • Specifications: 2 vCPUs and 4 GiB of memory

Procedure

This topic includes the following operations:

  • Step 1: Create an E-HPC cluster on the cloud. Prepare an E-HPC cluster on the cloud and configure the network.

  • Step 2: Obtain the configurations of the E-HPC cluster. Obtain the configurations of the E-HPC cluster to configure the on-premises compute nodes.

  • Step 3: Deploy the on-premises compute nodes. Deploy a scheduler client and a user management client on each on-premises compute node, and then mount a file system on each compute node.

  • Step 4: Add the on-premises compute nodes to the E-HPC cluster. Create a queue in the E-HPC cluster to manage the on-premises compute nodes, and then add the on-premises compute nodes to the queue.

  • Step 5: Configure auto scaling. Configure the auto scaling settings for the E-HPC cluster to automatically add compute nodes to or remove compute nodes from the cluster.

  • Step 6: Submit a test job. Submit a test job to the queue to check whether the nodes are running as expected.

  • (Optional) Remove one or more on-premises compute nodes from the E-HPC cluster. Remove the on-premises compute nodes that you no longer need from the E-HPC cluster and stop their services.

Step 1: Create an E-HPC cluster on the cloud

  1. Create an E-HPC cluster.

    For more information, see Create a cluster by using the wizard. Configure the following settings:

    • Deploy a tiny cluster that consists of one management node and one compute node (the compute node is optional). We recommend that you use an ECS instance that has 4 vCPUs and 8 GiB of memory for the management node.

    • Scheduler: Select opengridscheduler (SGE).

    • Domain account service: Select nis.

  2. Establish a network connection between the E-HPC cluster and the data center.

    You can establish a network connection between the E-HPC cluster and the data center with a VPN gateway or an Express Connect circuit.

  3. Configure the security group.

    You need to configure an inbound rule for the security group to which the E-HPC cluster belongs. This allows on-premises compute nodes to access the E-HPC cluster and facilitates the unified management of the on-premises compute nodes and the E-HPC cluster.

    On the details page of the E-HPC cluster, click the ID of the security group to go to the Security Groups page and add an inbound rule. Set the source to the CIDR blocks of the on-premises compute nodes. You can also add the rule by using the Alibaba Cloud CLI, as sketched below.

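    The following is a minimal CLI sketch, assuming that the Alibaba Cloud CLI is installed and configured. The region, security group ID, port range, and CIDR block are placeholders that you must replace with your own values:

    # Allow the on-premises CIDR block to access the cluster over TCP.
    # Narrow the port range as needed.
    aliyun ecs AuthorizeSecurityGroup \
      --RegionId cn-hangzhou \
      --SecurityGroupId sg-bp1********** \
      --IpProtocol tcp \
      --PortRange 1/65535 \
      --SourceCidrIp 10.18.XX.XX/24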

The following configurations are used in this topic:

  • Management node: one ECS instance of the ecs.c7a.xlarge instance type.

  • Compute nodes: one ECS instance of the ecs.c7a.xlarge instance type and two local machines. Operating system: CentOS 7.6.

  • Network: SSL-VPN is used to establish the network connection between the E-HPC cluster and the data center.

Step 2: Obtain the configurations of the E-HPC cluster

After you create the E-HPC cluster, you need to obtain the configurations of the E-HPC cluster to configure the on-premises compute nodes.

  1. Log on to the E-HPC cluster. For more information, see Log on to a cluster.

  2. Run the following command to download the script that is used to obtain the configurations:

    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/scheduler/offpremise_config_fetch.sh
  3. Run the following command to execute the script:

    source offpremise_config_fetch.sh

    The output file OFFPREMISE_PARAMS contains the configurations of the E-HPC cluster. Run the cat command to view the configurations, as shown in the following example.

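    For example, the following commands display the configurations and optionally keep them in a shell variable. This is a sketch; the exact layout of the file depends on the script, and the values correspond to the <Cluster configurations> arguments that are used in the later steps:

    cat OFFPREMISE_PARAMS
    # Optionally keep the configurations in a variable for the later deployment commands
    CLUSTER_CONFIG=$(cat OFFPREMISE_PARAMS)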

Step 3: Deploy the on-premises compute nodes

You need to deploy a scheduler client and a user management client on each on-premises compute node, and then mount a file system on each compute node. After the deployment, you can verify the clients by running the checks that are sketched at the end of this step.

Note

Before you deploy the on-premises compute nodes, make sure that the network between the on-premises compute nodes and the E-HPC cluster is connected. If you use SSL-VPN to establish the network connection, you must install an SSL client on each on-premises compute node, as sketched below.
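For reference, one common way to set up an SSL client on a CentOS 7 node is to install OpenVPN and start it with the client configuration that you download from the VPN Gateway console. This is a minimal sketch; the configuration file path is a placeholder, and the openvpn package is installed from the EPEL repository:

  yum install -y epel-release
  yum install -y openvpn
  # Start the client with the configuration downloaded from the VPN Gateway console
  openvpn --config /etc/openvpn/client/sslvpn-client.ovpn --daemon
  # Confirm that the tun0 interface has obtained an IP address
  ifconfig tun0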

  • Single on-premises compute node

    Log on to an on-premises compute node and perform the following operations:

    1. Run the following command to download the script that is used to deploy the on-premises compute node:

      wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/scheduler/deploy_nis_sge_client.sh 
    2. Run the following command to execute the script:

      sh deploy_nis_sge_client.sh <Cluster configurations>

      Replace <Cluster configurations> with the cluster configurations. Example:

      sh deploy_nis_sge_client.sh ehpc-hz-e5WF****** manager 10.18.XX.XX 000a******-vm***.cn-hangzhou.nas.aliyuncs.com:/ehpc-hz-e5WF****** manager 10.18.XX.XX
  • Multiple on-premises compute nodes

    Log on to the on-premises configuration node or a compute node and perform the following operations. Make sure that the network between the configuration node or the compute node and other on-premises compute nodes is connected.

    1. Run the following command to download the script that is used to deploy the on-premises compute nodes:

      wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/scheduler/batch_install_clients.sh
    2. Create a file named hosts.

      Each row in the hosts file indicates an on-premises compute node, in the [host_name] [host_ip] format.

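      The following is an example with placeholder hostnames and masked IP addresses; replace them with the actual values of your nodes:

      node001 10.18.XX.XX
      node002 10.18.XX.XX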

    3. Run the following command to execute the script:

      sh batch_install_clients.sh <Cluster configurations> ./hosts

      Replace <Cluster configurations> with the cluster configurations. Example:

      sh batch_install_clients.sh ehpc-hz-e5WF****** manager 10.18.XX.XX 000a******-vm***.cn-hangzhou.nas.aliyuncs.com:/ehpc-hz-e5WF****** manager 10.18.XX.XX ./hosts
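After the deployment script finishes on a node, you can run a few quick checks before you add the node to the cluster. The following is a minimal sketch; the test user name is a placeholder:

  # Confirm that the SGE execution daemon is running
  ps aux | grep "[s]ge_execd"
  # Confirm that the NAS file system is mounted
  df -h | grep nas
  # Confirm that the node can resolve cluster users through NIS
  ypwhich
  id testuser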

Step 4: Add the on-premises compute nodes to the E-HPC cluster

On-premises compute nodes cannot be moved among queues. Therefore, you must create a queue in the E-HPC cluster to manage the on-premises compute nodes, and then add the on-premises compute nodes to the queue.

  1. Create a queue. For more information, see Manage a queue.

  2. Add the on-premises compute nodes to the E-HPC cluster.

    1. On the Clusters page of the E-HPC console, find the E-HPC cluster and click Resize.

    2. In the Resize panel, click the On-premises Node tab.

    3. Select the created queue, enter or import the information of the on-premises compute nodes based on your needs, and then click OK.


      Enter or import the following information (a command sketch after this list shows one way to collect it on each node):

      • Hostname: the hostname of the compute node.

      • Node ID/IP Address: the IP address of the compute node.

      • CPU: the number of CPUs of the compute node. You can view this value by running the lscpu command.

      • Memory: the memory size of the compute node.

      Important

      Make sure that the hostname and host IP address of each on-premises compute node are valid. Otherwise, the compute node cannot be added to the cluster.

      For on-premises compute nodes that connect over SSL-VPN, enter the IP address of the tun0 interface that is created by the SSL-VPN connection. You can run the ifconfig command on the node to view this IP address.
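      The following commands show one way to collect this information on each on-premises compute node. This is a sketch; the tun0 check applies only to nodes that connect over SSL-VPN:

      hostname                       # hostname of the compute node
      lscpu | grep "^CPU(s):"        # number of CPUs
      free -h | grep Mem             # memory size
      ifconfig tun0                  # tun0 IP address (SSL-VPN connections only)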

  3. On the Nodes page, find the E-HPC cluster and query the on-premises compute nodes by node type or queue.

    Wait 1 to 2 minutes and refresh the page to update the status of the on-premises compute nodes. If the nodes enter the Running state, they have been added to the cluster.

Step 5: Configure auto scaling

E-HPC clusters support the auto scaling feature. You can configure the auto scaling settings for the E-HPC cluster to automatically add compute nodes to or remove compute nodes from the cluster based on the real-time workloads. For more information, see Configure an auto scaling policy.


Note

On-premises compute nodes and the queue to which they belong do not support the auto scaling feature.

Step 6: Submit a test job

  1. Run the following command to submit a test job to the queue to which the nodes on the cloud belong and check whether the nodes are running as expected.

    qsub sample1.sh

    The following script file sample1.sh is an example. Set the -q parameter to the queue to which the nodes on the cloud belong, and set the -o and -e parameters to the actual output and error paths:

    #!/bin/bash
    
    # SGE directives: run the job in the submission directory, request 1 GB of
    # virtual memory, submit to the workq queue, name the job sge-demo1, and
    # write standard output and standard error to /home/test/.
    #$ -cwd
    #$ -l vf=1G
    #$ -q workq
    #$ -N sge-demo1
    #$ -o /home/test/
    #$ -e /home/test/
    
    echo "demo start, sleep 10s and ping localhost"
    sleep 10
    ping -c 10 localhost
    echo "demo complete"
  2. Run the following command to submit a test job to the queue to which the on-premises compute nodes belong:

    qsub sample2.sh

    The following script file sample2.sh is an example. Set the -q parameter to the queue to which the on-premises compute nodes belong, and set the -o and -e parameters to the actual output and error paths:

    #!/bin/bash
    
    #$ -cwd
    #$ -l vf=1G
    #$ -q local
    #$ -N sge-demo2
    #$ -o /home/test/
    #$ -e /home/test/
    
    echo "demo start, sleep 10s and ping localhost"
    sleep 10
    ping -c 10 localhost
    echo "demo complete"

(Optional) Remove one or more on-premises compute nodes from the E-HPC cluster

  1. Remove one or more on-premises compute nodes from the E-HPC cluster.

    For more information, see Restart, reset, and delete a node.

  2. Stop the services of the on-premises compute nodes.

    • Single on-premises compute node

      Log on to an on-premises compute node and perform the following operations:

      1. Run the following command to download the script that is used to stop the services:

        wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/scheduler/nis_sge_client_stop.sh 
      2. Run the following command to execute the script:

        sh nis_sge_client_stop.sh <Cluster configurations>

        Replace <Cluster configurations> with the cluster configurations. Example:

        sh nis_sge_client_stop.sh ehpc-hz-e5WF****** manager 10.18.XX.XX 000a******-vm***.cn-hangzhou.nas.aliyuncs.com:/ehpc-hz-e5WF****** manager 10.18.XX.XX
    • Multiple on-premises compute nodes

      Log on to the on-premises configuration node or a compute node and perform the following operations. Make sure that the network between the configuration node or the compute node and other on-premises compute nodes is connected.

      1. Run the following command to download the script that is used to stop the services:

        wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/scheduler/batch_stop_clients.sh
      2. Create a file named hosts.

        Each row in the hosts file indicates an on-premises compute node, in the [host_name] [host_ip] format.

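        The following is an example with placeholder hostnames and masked IP addresses; use the same hosts file that you created when you deployed the nodes:

        node001 10.18.XX.XX
        node002 10.18.XX.XX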

      3. Run the following command to execute the script:

        sh batch_stop_clients.sh <Cluster configurations> ./hosts

        Replace <Cluster configurations> with the cluster configurations. Example:

        sh batch_stop_clients.sh ehpc-hz-e5WF****** manager 10.18.XX.XX 000a******-vm***.cn-hangzhou.nas.aliyuncs.com:/ehpc-hz-e5WF****** manager 10.18.XX.XX ./hosts