
Auto scale

Last Updated: Aug 08, 2018

Auto scale enables Elastic High Performance Computing (E-HPC) to dynamically allocate cloud resources based on your customized scaling rules. For example, you can resize your computing capacity based on the workload of your cluster. The Auto scale feature helps you:

  • Reduce human intervention and the cost of maintenance.

  • Minimize the cost of your fleet while ensuring availability.

  • Improve fault tolerance. Auto scale runs health checks, stops unhealthy instances, and adds new ones into your cluster.

  • Improve availability by making sure that you have enough computing capacity.

Configure the Auto scale service

The Auto scale service is disabled by default. To enable it, log on to the E-HPC console and click Auto scale from the left-side navigation pane.


Parameters

  • EnableGrow: Whether to enable Auto scale out.

  • GrowIntervalInMinutes: The interval at which the workload is checked to determine whether and how to scale out. The default value is 2 minutes, and the value ranges from 2 to 10 minutes.

  • GrowTimeoutInMinutes: The timeout period for starting a node. The default value is 20 minutes, and the value ranges from 10 to 60 minutes. If a node does not enter the running status within this period, it is released and placed in the next resize queue.

  • ExtraNodesGrowRatio: The percentage of extra nodes to add to the cluster on top of the required nodes. The default value is 0, and the value ranges from 0 to 100. If 100 nodes are required to handle the workload and the ExtraNodesGrowRatio value is 2, then 102 nodes are added.

    Take an MPI job that requires 32 nodes as an example. If exactly 32 nodes are added and one of them malfunctions, the entire job cannot start, which leaves the other 31 nodes ready but not running. With this option, the cluster is scaled out to more than 32 nodes (35 in this example), which reduces the probability of failure. The extra nodes that are not needed are released soon afterwards, so the increase in cost is almost negligible while availability is ensured.

  • GrowRatio (based on workload): The percentage of the nodes required by the workload that are actually added. The default value is 100, and the value ranges from 1 to 100. For example, if the workload requires 10 new nodes and the GrowRatio value is 50, then 5 nodes are added.

    For example, 10 jobs are ready to be executed, and each job requires one node for only several minutes. Based on the workload alone, 10 nodes would be added, but each node takes several minutes to enter the running status. With GrowRatio, you can add only 5 nodes to run the first 5 jobs simultaneously, and then run the remaining 5 jobs on the same nodes. You can use GrowRatio to improve cluster utilization.

  • MaxNodesInCluster: The maximum number of nodes that a cluster can have. The default value is 100. The minimum value is 1.
  • EnableShrink: Whether to enable Auto scale in.

  • ShrinkIntervalInMinutes: The interval at which the workload is checked to determine whether and how to scale in. The default value is 2 minutes, and the value ranges from 2 to 10 minutes.

    Note: The ShrinkIntervalInMinutes value must be greater than or equal to the GrowIntervalInMinutes value.

  • ShrinkIdleTimes: The number of consecutive scale-in checks in which a node must be in the not running status before it is released. The default value is 3, and the value ranges from 2 to 5. By default, a node that is in the not running status for three consecutive scale-in checks is released. With the default check interval of 2 minutes, this means that a node that is not running for six consecutive minutes is released.

  • ExcludeNodes: The list of nodes to which auto scaling is not applied. Separate multiple nodes with commas (,). You can use this option to keep a minimum number of nodes running in your cluster.
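
The defaults, ranges, and arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not the E-HPC API: the AutoScaleConfig class, its snake_case field names, and the helper functions are hypothetical, and rounding fractional nodes up is an assumption that the parameter descriptions do not confirm.

```python
from dataclasses import dataclass


@dataclass
class AutoScaleConfig:
    """Hypothetical container; fields mirror the documented parameters."""
    grow_interval_in_minutes: int = 2    # documented range: 2-10
    grow_timeout_in_minutes: int = 20    # documented range: 10-60
    extra_nodes_grow_ratio: int = 0      # documented range: 0-100
    grow_ratio: int = 100                # documented range: 1-100
    max_nodes_in_cluster: int = 100      # documented minimum: 1
    shrink_interval_in_minutes: int = 2  # documented range: 2-10
    shrink_idle_times: int = 3           # documented range: 2-5

    def validate(self) -> None:
        """Raise ValueError if a value is outside its documented range."""
        ranges = {
            "grow_interval_in_minutes": (2, 10),
            "grow_timeout_in_minutes": (10, 60),
            "extra_nodes_grow_ratio": (0, 100),
            "grow_ratio": (1, 100),
            "shrink_interval_in_minutes": (2, 10),
            "shrink_idle_times": (2, 5),
        }
        for name, (low, high) in ranges.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")
        if self.max_nodes_in_cluster < 1:
            raise ValueError("max_nodes_in_cluster must be at least 1")
        # Documented note: ShrinkInterval >= GrowInterval.
        if self.shrink_interval_in_minutes < self.grow_interval_in_minutes:
            raise ValueError("shrink interval must be >= grow interval")


def percent_of(nodes: int, ratio: int) -> int:
    """Return ratio percent of nodes, rounded up (rounding is an assumption)."""
    return (nodes * ratio + 99) // 100


def nodes_with_extra(required: int, cfg: AutoScaleConfig) -> int:
    """Required nodes plus the ExtraNodesGrowRatio share."""
    return required + percent_of(required, cfg.extra_nodes_grow_ratio)


def nodes_after_grow_ratio(required: int, cfg: AutoScaleConfig) -> int:
    """Nodes actually added when GrowRatio limits the scale-out."""
    return percent_of(required, cfg.grow_ratio)


def idle_minutes_before_release(cfg: AutoScaleConfig) -> int:
    """Minutes a node stays idle before scale-in releases it."""
    return cfg.shrink_interval_in_minutes * cfg.shrink_idle_times


print(nodes_with_extra(100, AutoScaleConfig(extra_nodes_grow_ratio=2)))  # 102
print(nodes_after_grow_ratio(10, AutoScaleConfig(grow_ratio=50)))        # 5
print(idle_minutes_before_release(AutoScaleConfig()))                    # 6
```

The three printed values reproduce the examples given in the parameter descriptions: 102 nodes for ExtraNodesGrowRatio, 5 nodes for GrowRatio, and six idle minutes for the default scale-in settings.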

Scenarios and configurations

Generally, the Auto scale service is suited to users who do not use subscribed resources or services. For example:

  • You need the E-HPC cluster to handle multiple large-scale computing jobs for only several hours a day, and you want to release the computing resources afterwards.

  • The workload fluctuates throughout the day, so the cluster does not have to handle a heavy workload 24 hours a day.

You can select and configure parameters based on the job types and the utilization of your cluster. For example, use GrowRatio when you want to run a large number of computing jobs that each run for a short period of time. If you want to run 1,000 jobs, and each job requires one node for one minute, set the GrowRatio value to 10 so that 100 nodes are added.
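
As a rough illustration of this scenario, the following Python sketch estimates how many nodes are added and how many rounds of jobs those nodes must run. The plan_scale_out function is hypothetical, and the sketch assumes that each job needs exactly one node and that fractional nodes are rounded up.

```python
from typing import Tuple


def plan_scale_out(pending_jobs: int, grow_ratio: int) -> Tuple[int, int]:
    """Return (nodes_added, rounds) for one-node jobs under a GrowRatio.

    Assumes one node per job and that fractional nodes are rounded up.
    """
    # GrowRatio percent of the nodes that the workload requires.
    nodes_added = max(1, (pending_jobs * grow_ratio + 99) // 100)
    # Number of rounds needed if every node runs one job per round.
    rounds = (pending_jobs + nodes_added - 1) // nodes_added
    return nodes_added, rounds


nodes, rounds = plan_scale_out(pending_jobs=1000, grow_ratio=10)
print(nodes, rounds)  # 100 10
```

With one-minute jobs, the 100 nodes work through the 1,000 jobs in about 10 rounds, instead of 1,000 nodes being started for a single minute of work each.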

Run LAMMPS in the E-HPC cluster

Procedure

  1. Create a cluster and select the applications to install.


  2. Configure auto scaling: enable Auto scale out and Auto scale in.

  3. If the workload does not increase, extra nodes are released in a few minutes.

  4. Create a user for the cluster.

  5. Select Network Attached Storage (NAS) shared storage to store your data.

  6. Access your cluster through the console or SSH, and then submit jobs. For example, the content of job1.sh shows that two computing nodes are required.

  7. The two nodes are automatically added to the cluster within two minutes and can be viewed on the console.

  8. Several minutes later, the nodes are ready and you can view your jobs being processed.

  9. After the jobs are completed, you can view the job details on the console.

  10. After several minutes, the extra nodes are released.

  11. After a couple of minutes, you can also view the auto scaling logs.