This topic describes how to create an Elastic High Performance Computing (E-HPC) Cluster (formerly E-HPC NEXT) that supports elastic Remote Direct Memory Access (eRDMA). It also shows how to configure runtime parameters for the OSU-Benchmark application to accelerate communication for multi-node High Performance Computing (HPC) applications.
Background information
Using eRDMA technology, multi-node parallel HPC tasks in an E-HPC Cluster (formerly E-HPC NEXT) achieve high-speed network performance comparable to on-premises clusters. These tasks, such as climate forecasting, industrial simulation, and molecular dynamics, benefit from high bandwidth and low latency, which significantly improves the efficiency of numerical simulations. You can experience the benefits of RDMA on your existing network without deploying additional network interface controllers (NICs), which ensures seamless integration and ease of use.
Preparations
Go to the Create Cluster page to create an E-HPC cluster. For more information, see Create a Standard Edition cluster.
The following table shows an example cluster configuration.
| Configuration Item | | Configuration |
| --- | --- | --- |
| Cluster Configuration | Region | Shanghai |
| | Network and Zone | Select Zone L |
| | Series | Standard Edition |
| | Deployment Mode | Public Cloud Cluster |
| | Cluster Type | SLURM |
| | Control Plane Node | Instance type: ecs.c7.xlarge (4 vCPUs, 8 GiB of memory). Image: aliyun_2_1903_x64_20G_alibase_20240628.vhd. Note: The osu-benchmark installation package is built on the Alibaba Cloud Linux 2.1903 LTS 64-bit image. |
| Compute Nodes and Queues | Number of Queue Nodes | Initial nodes: |
| | Inter-node Interconnection | eRDMA Network. Note: Only some instance types support the Elastic RDMA Interface (ERI). For more information, see elastic Remote Direct Memory Access (eRDMA) and Enable eRDMA on enterprise-level instances. |
| | Instance Type Group | Instance type: ecs.c8ae.xlarge or other AMD instances of the same generation. Image: aliyun_2_1903_x64_20G_alibase_20240628.vhd |
| Shared File Storage | /home cluster mount directory | By default, the |
| | /opt cluster mount directory | |
| Software and Service Components | Software to Install | erdma-installer, mpich-aocc |
| | Installable Service Components | Logon Node: Instance type: ecs.c7.xlarge (4 vCPUs, 8 GiB of memory). Image: aliyun_2_1903_x64_20G_alibase_20240628.vhd |

Create a cluster user. For more information, see User management.
Check the eRDMA environment
Check if the eRDMA configuration of the compute nodes is correct.
Log on to the Elastic High Performance Computing console and click the destination cluster.
On the Nodes page of the cluster, select all compute nodes and click Send Command.

Check the eRDMA network status and the RDMA hardware and software support on the compute nodes.
Send the following command to all compute nodes.
hpcacc erdma check
If the following result is returned, the eRDMA configuration is correct.

If an abnormal message is returned, run the following command to fix the issue.
hpcacc erdma repair

After the issue is fixed, run the check command again to confirm that the eRDMA configuration is correct.
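To cross-check the result outside of hpcacc, you can send the following commands to the compute nodes in the same way. This is a minimal sketch that assumes the rdma-core user-space utilities (which provide ibv_devices and ibv_devinfo) are installed on the nodes.

lsmod | grep erdma   # the erdma kernel module should be loaded
ibv_devices          # an RDMA device (for example, erdma_0) should be listed
ibv_devinfo          # the port state should be PORT_ACTIVE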
OSU-Benchmark test
OSU-Benchmark (the OSU Micro-Benchmarks suite) is used to evaluate the communication performance of HPC clusters and distributed systems. This topic uses the following two benchmarks to compare communication performance over different network protocols (TCP versus RDMA); a sketch of the relevant command-line options follows the list:
Network latency test (osu_latency): Measures the one-way latency of point-to-point communication, which is the time taken to send a message from one process to another, excluding the response time. This test focuses on the communication efficiency of small messages, from 1 B to several kilobytes. Small message latency reflects the underlying performance of network hardware, such as RDMA acceleration, and the optimization level of the MPI library. It is a core indicator of HPC system responsiveness. For example, low latency significantly reduces communication overhead in real-time simulations or machine learning parameter synchronization.
Network bandwidth test (osu_bw): Measures the sustainable bandwidth of point-to-point communication, which is the amount of data transferred per unit of time. This test focuses on the transfer efficiency of large messages, from several kilobytes to several megabytes. Bandwidth performance directly affects the efficiency of large data transfers, such as matrix exchanges in scientific computing or file I/O scenarios. If the measured bandwidth is much lower than the theoretical value, optimize the MPI configuration for multi-threaded communication or check network settings such as MTU and throttling.
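If you want to focus a run on one of the two regimes described above, the following is a hedged sketch that uses the -m (message size range) and -i (iteration count) options found in recent OSU Micro-Benchmarks releases. Check ./osu_latency --help to confirm the options supported by the precompiled binaries used later in this topic.

# Restrict osu_latency to the small-message regime (1 B to 8 KB) and raise the iteration count.
mpirun -np 2 -ppn 1 ./osu_latency -m 1:8192 -i 10000
# Restrict osu_bw to the large-message regime (64 KB to 4 MB).
mpirun -np 2 -ppn 1 ./osu_bw -m 65536:4194304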
The test procedure is as follows:
Connect to the E-HPC cluster as the user you created. For more information, see Connect to a cluster.
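For example, if you connect over SSH, the command looks like the following sketch. The user name testuser and the logon node address are placeholders; use the cluster user and logon node address of your own cluster.

ssh testuser@<logon-node-address>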
Run the following command to check if the required environment components are installed correctly.
module avail
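The module avail output should list the software selected during cluster creation, such as mpich-aocc. As an optional sanity check, you can load the modules interactively and confirm that mpirun resolves to the MPICH build. This is a sketch that assumes the module names and versions match those used in the job script later in this topic.

module load aocc/4.0.0 gcc/12.3.0 libfabric/1.16.0 mpich-aocc/4.0.3
which mpirun        # should point to the mpich-aocc installation
mpirun --version    # prints the MPICH (Hydra) version information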
Run the following command to download and decompress the precompiled osu-benchmark installation package.

cd ~ && wget https://ehpc-perf.oss-cn-hangzhou.aliyuncs.com/AMD-Genoa/osu-bin.tar.gz && tar -zxvf osu-bin.tar.gz

Run the following command to navigate to the test working directory and edit the Slurm job script.
cd ~/pt2pt
vim slurm.job

The test script is as follows:
#!/bin/bash
#SBATCH --job-name=osu-bench
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --partition=comp
#SBATCH --output=%j.out
#SBATCH --error=%j.out

# Load environment parameters
module purge
module load aocc/4.0.0 gcc/12.3.0 libfabric/1.16.0 mpich-aocc/4.0.3

# Run MPI latency test: eRDMA
echo -e "++++++ use erdma for osu_lat: START"
mpirun -np 2 -ppn 1 -genv FI_PROVIDER="verbs;ofi_rxm" ./osu_latency
echo -e "------ use erdma for osu_lat: END\n"

# Run MPI latency test: TCP
echo -e "++++++ use tcp for osu_lat: START"
mpirun -np 2 -ppn 1 -genv FI_PROVIDER="tcp;ofi_rxm" ./osu_latency
echo -e "------ use tcp for osu_lat: END\n"

# Run MPI bandwidth test: eRDMA
echo -e "++++++ use erdma for osu_bw: START"
mpirun -np 2 -ppn 1 -genv FI_PROVIDER="verbs;ofi_rxm" ./osu_bw
echo -e "------ use erdma for osu_bw: END\n"

# Run MPI bandwidth test: TCP
echo -e "++++++ use tcp for osu_bw: START"
mpirun -np 2 -ppn 1 -genv FI_PROVIDER="tcp;ofi_rxm" ./osu_bw
echo -e "------ use tcp for osu_bw: END\n"

Note
-np 2: Specifies the total number of processes. A value of 2 means the MPI job starts two processes.
-ppn 1: Specifies the number of processes per node. A value of 1 means one process runs on each node.
-genv: Sets an environment variable that applies to all processes.
FI_PROVIDER="tcp;ofi_rxm": Uses the TCP protocol and enhances communication reliability through the RXM framework.
FI_PROVIDER="verbs;ofi_rxm": Prioritizes the high-performance Verbs protocol (based on RDMA) and optimizes message transmission through the RXM framework. Alibaba Cloud eRDMA provides a high-performance elastic RDMA network.
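Optionally, before submitting the job, you can confirm which libfabric providers are visible on a compute node. The following sketch assumes that the comp partition from the job script is used and that the libfabric/1.16.0 module provides the fi_info utility; if no verbs provider is listed, the FI_PROVIDER="verbs;ofi_rxm" runs cannot use eRDMA.

module load libfabric/1.16.0
srun -N 1 -p comp fi_info -p verbs | head -n 20   # verbs provider entries indicate that RDMA (eRDMA) is usable
srun -N 1 -p comp fi_info -p tcp | head -n 20     # tcp provider entries are used by the TCP comparison runs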
Run the following command to submit the test job.
sbatch slurm.job

The command line displays the job ID.

Run the following command to view the job status. During the test, you can also view monitoring information for E-HPC in the console, such as storage, job, and node status. For more information, see View monitoring information.
squeue
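Two optional variations, as a sketch: replace <jobID> with the ID printed by sbatch, and note that sacct requires Slurm accounting to be enabled on the cluster.

squeue -u $USER                                          # show only jobs that belong to the current user
sacct -j <jobID> --format=JobID,State,Elapsed,NodeList   # state, runtime, and nodes after the job finishes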
In the current directory, you can view the log file that corresponds to the job ID. The output is as follows:
Network latency test results: These results show the relationship between message size (in bytes, from 1 B to 4 MB) and average latency. The following are example test results:
Using the Verbs protocol (based on eRDMA)

Using the TCP protocol

The test data shows that for small messages (1 B to 8 KB), the latency of eRDMA is significantly lower than that of TCP.
Network bandwidth test results: These results show the relationship between message size (in bytes, from 1 B to 4 MB) and bandwidth. The following are example test results:
Using the Verbs protocol (based on eRDMA)

Using the TCP protocol

The test data shows that for message sizes from 16 KB to 64 KB, eRDMA fully utilizes the network bandwidth, whereas the TCP protocol stack introduces additional processing overhead.
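Because the job script writes all four runs to the same %j.out log file, you can also pull each section out of the log for a quick side-by-side comparison. A minimal sketch: replace <jobID> with the ID printed by sbatch; -A 30 simply prints enough lines after each marker to cover the result table.

grep -A 30 "use erdma for osu_lat" <jobID>.out   # eRDMA latency section
grep -A 30 "use tcp for osu_lat" <jobID>.out     # TCP latency section
grep -A 30 "use erdma for osu_bw" <jobID>.out    # eRDMA bandwidth section
grep -A 30 "use tcp for osu_bw" <jobID>.out      # TCP bandwidth section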