All Products
Search
Document Center

Elastic High Performance Computing:Test the performance of an SCC-based cluster

Last Updated:Jun 07, 2023

Instances in the Super Computing Cluster (SCC) instance family have no virtualization overheads, and are ideal for applications that require high parallelization, high bandwidth, and low latency. Such applications include high-performance computing (HPC) applications such as Artificial Intelligence (AI) and machine learning. This topic describes how to create and test the performance of an SCC-based cluster.

Background

SCC instances are based on Elastic Compute Service (ECS) Bare Metal Instance family, using high-speed Remote Direct Memory Access (RDMA) based interconnections to significantly improve the network performance and acceleration ratio of large-scale clusters. In short, SCC instances have all the benefits of ECS Bare Metal Instance and can provide high-quality network performance with high bandwidth and low latency. For more information, see Overview.

SCC-based clusters leverage RDMA networks to provide low-latency networking to meet the parallel computing requirements. You can also isolate individual clusters by virtual private clouds (VPCs). SCC-based clusters have no virtualization overheads, providing your applications with direct access to hardware resources. SCC-based clusters are suitable for scenarios such as manufacturing simulations, life sciences, machine learning, large-scale molecular dynamics (MD) simulations, and weather predictions.

Compared with ECS instances, SCC instances are connected over high-bandwidth and low-latency RDMA networks. Therefore, SCC instances provide more powerful communication capabilities than ECS instances. The following figure shows the port information that is displayed when you access an SCC instance. Eth0 represents the RDMA port and lo represents the VPC port.

SCC..png

Limits

SCC instances are only available in some regions. For more information, see ECS Instance Types Available for Each Region.

Create an SCC cluster

  1. Log on to the E-HPC console.

  2. Create an SCC-based E-HPC cluster. For more information, see Create a cluster by using the wizard.

    Configure the following settings.

    Parameter

    Description

    Hardware settings

    Deploy a tiny cluster with one management node and zero compute nodes. The compute node uses an SCC instance.

    Important

    If you use an SCC instance as the compute node, you cannot add the instance to a cluster when you create the cluster. You need to add the SCC instance that is used as the compute node to the cluster by resizing the cluster after the cluster is created.

    Software settings

    Deploy a CentOS 7.6 public image and the PBS scheduler.

    SCC-创建集群..png
  3. Scale out the cluster and add compute nodes that use SCC instances. For more information, see Scale out a cluster.

    In this example, two SCC instances of the ecs.scch5s.16xlarge type are used.

    SCC-扩容集群..png
  4. Create a cluster user. For more information, see Create a user.

    The user is used to log on to the cluster, compile LAMMPS, and submit jobs. Grant the sudo permissions to the user.

  5. Install software. For more information, see Install software.

    Install the following software:

    • LINPACK 2018

    • Intel MPI 2018

Test the network performance of the SCC cluster

Test the peak bandwidth of the RDMA network

  1. Log on to the compute000 node and the compute001 node, respectively.

  2. Perform the following steps to test the peak read BPS:

    1. Run the following command on the compute000 node:

      ib_read_bw -a -q 20 --report_gbits   ## Run the command on the compute000 node.
    2. Run the following command on the compute001 node:

      ib_read_bw -a -q 20 --report_gbits compute000  ## Run the command on the compute001 node.

      The following code provides an example of the returned output.

      读带宽测试峰值
  3. Perform the following steps to test the peak write BPS:

    1. Run the following command on the compute000 node:

      ib_write_bw -a -q 20 --report_gbits  ## Run the command on the compute000 node.
    2. Run the following command on the compute001 node:

      ib_write_bw -a -q 20 --report_gbits compute000  ## Run the command on the compute001 node.

      The following code provides an example of the returned output.

      写带宽测试峰值

Test the latency of the RDMA network

  1. Connect to your Kubernetes cluster. For more information, see Log on to an E-HPC cluster.

  2. Perform the following steps to test the read latency of the RDMA network:

    1. Run the following command on the compute000 node:

      ib_read_lat -a   ## Run the command on the compute000 node.
    2. Run the following command on the compute001 node:

      ib_read_lat -F -a compute000  ## Run the command on the compute001 node.

      The following code provides an example of the returned output.

      SCC-读延迟..png
  3. Perform the following steps to test the write latency of the RDMA network:

    1. Run the following command on the compute000 node:

      ib_write_lat -a  ## Run the command on the compute000 node.
    2. Run the following command on the compute001 node:

      ib_write_lat -F -a compute000  ## Run the command on the compute001 node.

      The following code provides an example of the returned output.

      RoCE网络的写延迟

Monitor the bandwidth utilization of the RDMA network

  1. Log on to a compute node (such as the compute000 node) as the root user.

  2. Run the following command to monitor the bandwidth utilization of the RDMA network:

    rdma_monitor -s

    The following figure provides a sample response.

    SCC-网络监测..png

View the performance data of the SCC instances

  1. In the left-side navigation pane of the E-HPC console, choose Job and Performance Management > E-HPC Tune.

  2. On the ClusterView page, find the cluster that you want to view and click Node in the Operation column.

  3. On the NodeView tab, select nodes, time period, and metrics to view the performance data of the nodes in the instance.

    SCC-集群性能..png