To further optimize the network performance of GPU-accelerated servers that use the SHENLONG architecture, Alibaba Cloud provides GPU-accelerated compute-optimized Super Computing Cluster (SCC) instance families, which are named sccgn instance families. sccgn instances provide superior computing power and strong network communication capabilities. This topic describes how to use sccgn instances and verify their performance.
Usage notes
sccgn instances are equipped with GPUs and high-performance NVIDIA Mellanox ConnectX SmartNICs to deliver superior computing power and strong network communication capabilities. sccgn instances are suitable for scenarios that require high-intensity computing and communication, such as deep learning and high-performance computing. Take note of the following items when you use sccgn instances:
If your business requires only the remote direct memory access (RDMA) feature, select Auto-install RDMA Software Stack when you select an image to create an sccgn instance.
If your business requires the GPUDirect RDMA feature, select Auto-install GPU Driver to install the required software stacks and toolkits when you select an image to create an sccgn instance.
Note: GPUDirect RDMA is a technology introduced with Kepler-class GPUs and Compute Unified Device Architecture (CUDA) 5.0 that allows direct data exchange between GPUs and third-party devices over standard Peripheral Component Interconnect Express (PCIe) features. Examples of third-party devices include GPUs, network interfaces, video acquisition devices, and storage adapters. For more information, see NVIDIA documentation.
If the network interface controller (NIC) driver that you want to install is an OpenFabrics Enterprise Distribution (OFED) open source version (download URL), install the NIC driver first, and then install the GPU driver and CUDA.
Note: As of CUDA 11.4 and the R470 GPU driver, the nvidia-peermem kernel module is integrated into the GPU driver, so you do not need to install the separate nv_peer_mem module. For more information, visit nv_peer_memory.
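The following is a minimal sketch of how you might confirm that a peer-memory kernel module required by GPUDirect RDMA is available. It assumes a driver stack that ships either the integrated nvidia_peermem module or the legacy nv_peer_mem module; adjust the module name to match your driver version.
# Check whether a peer-memory kernel module is loaded (the name varies by driver version).
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'
# If the integrated module exists but is not loaded, it can usually be loaded as follows:
modprobe nvidia_peermem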
Functional verification and bandwidth verification
Functional verification
This verification checks whether the RDMA software stack is installed and configured as expected on an sccgn instance.
Run the following command to check the installation of the RDMA software stack.
For information about the issues that you may encounter during the check, see the FAQ section of this topic.
rdma_qos_check -V
The following command output indicates that the RDMA software stack is installed as expected:
===========================================================
* rdma_qos_check
-----------------------------------------------------------
* ITEM DETAIL RESULT
===========================================================
* link_up eth1: yes ok
* mlnx_device eth1: 1 ok
* drv_ver eth1: 5.2-2.2.3 ok
...
* pci 0000:c5:00.1 ok
* pci 0000:e1:00.0 ok
* pci 0000:e1:00.1 ok
===========================================================
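If you want to run this check as part of an automated health check, the following is a minimal sketch that flags any item whose RESULT column is not ok. It assumes the tabular output format shown above, which may differ across rdma_qos_check versions.
# Print every detail line whose last column is not "ok" and exit non-zero if any is found (assumed output format).
rdma_qos_check -V | awk '/^\*/ && NF >= 3 && $NF != "ok" && $NF != "RESULT" {print; bad=1} END {exit bad}'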
Bandwidth verification
This verification checks whether the RDMA network bandwidth meets the expected hardware specifications.
Run the following command on the server side:
ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0
The following code shows an example command output:
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 20           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x11ca PSN 0x6302b0 OUT 0x10 RKey 0x17fddc VAddr 0x007f88e1e5d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
 local address: LID 0000 QPN 0x11cb PSN 0x99aeda OUT 0x10 RKey 0x17fddc VAddr 0x007f88e265d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
 local address: LID 0000 QPN 0x11cc PSN 0xf0d01c OUT 0x10 RKey 0x17fddc VAddr 0x007f88e2e5d000
 ...
 remote address: LID 0000 QPN 0x11dd PSN 0x8efe92 OUT 0x10 RKey 0x17fddc VAddr 0x007f672004b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
 8388608    20000          165.65             165.63                0.002468
---------------------------------------------------------------------------------------
Run the following command on the client side. Replace {server_ip} with the IP address of the server:
ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0 {server_ip}
The following code shows an example command output:
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 20           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x11ca PSN 0x787f05 OUT 0x10 RKey 0x17fddc VAddr 0x007f671684b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
 local address: LID 0000 QPN 0x11cb PSN 0x467042 OUT 0x10 RKey 0x17fddc VAddr 0x007f671704b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
 local address: LID 0000 QPN 0x11cc PSN 0xac262e OUT 0x10 RKey 0x17fddc VAddr 0x007f671784b000
 ...
 remote address: LID 0000 QPN 0x11dd PSN 0xeb1c3f OUT 0x10 RKey 0x17fddc VAddr 0x007f88eb65d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
Conflicting CPU frequency values detected: 800.000000 != 3177.498000. CPU Frequency is not max.
 2          20000          0.058511           0.058226              3.639132
Conflicting CPU frequency values detected: 799.996000 != 3384.422000. CPU Frequency is not max.
 ...
Conflicting CPU frequency values detected: 800.000000 != 3166.731000. CPU Frequency is not max.
 4194304    20000          165.55             165.55                0.004934
Conflicting CPU frequency values detected: 800.000000 != 2967.226000. CPU Frequency is not max.
 8388608    20000          165.65             165.63                0.002468
---------------------------------------------------------------------------------------
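The "Conflicting CPU frequency values detected" messages indicate that the CPU frequency governor is not fixed at the maximum frequency, which the perftest tools warn about because they use CPU cycle counters for timing. The bandwidth columns are still usable. If you want to eliminate the warnings, the following is a minimal sketch that pins the governor to performance by using the standard cpupower tool; the kernel-tools package name is an assumption for Alibaba Cloud Linux 2 and other CentOS-compatible systems.
# Install the cpupower utility (package name assumed for CentOS-compatible systems).
yum install -y kernel-tools
# Set the CPU frequency governor to performance on all cores.
cpupower frequency-set -g performance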
While the preceding commands are running, run the rdma_monitor -s -t -G command on the instance to monitor the bandwidth of each port on the NICs. The following code shows an example command output:
------
2022-2-18 09:48:59 CST
tx_rate: 81.874 (40.923/40.951)
rx_rate: 0.092 (0.055/0.037)
tx_pause: 0 (0/0)
rx_pause: 0 (0/0)
tx_pause_duration: 0 (0/0)
rx_pause_duration: 0 (0/0)
np_cnp_sent: 0
rp_cnp_handled: 4632
num_of_qp: 22
np_ecn_marked: 0
rp_cnp_ignored: 0
out_of_buffer: 0
out_of_seq: 0
packet_seq_err: 0
tx_rate_prio0: 0.000 (0.000/0.000)
rx_rate_prio0: 0.000 (0.000/0.000)
tcp_segs_retrans: 0
tcp_retrans_rate: 0
cpu_usage: 0.35%
free_mem: 1049633300 kB
------
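If you want to keep the monitoring data for later analysis instead of watching it interactively, one simple approach, sketched below, is to run rdma_monitor in the background and redirect its output to a file for the duration of the bandwidth test. The log file path is only an example.
# Record NIC statistics to a log file in the background while the bandwidth test runs.
nohup rdma_monitor -s -t -G > /tmp/rdma_monitor.log 2>&1 &
MONITOR_PID=$!
# ... run the ib_read_bw commands described above ...
# Stop monitoring and inspect the log.
kill $MONITOR_PID
less /tmp/rdma_monitor.log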
NCCL test cases
To test and verify the performance that your application can achieve over the RDMA network, the following section provides an example of how to use the RDMA feature of sccgn instances to accelerate your application. In the example, NVIDIA Collective Communication Library (NCCL) test cases are used.
For more information about NCCL tests, visit nccl-tests.
#!/bin/sh
# Use instances that run Alibaba Cloud Linux 2.
# Install the build dependencies, then download and compile Open MPI.
yum install -y gcc-c++ make wget
wget http://mirrors.cloud.aliyuncs.com/opsx/ecs/linux/binary/rdma/sccgn7ex/Alinux2/openmpi-4.1.3.tar.gz
tar -xzf openmpi-4.1.3.tar.gz
cd openmpi-4.1.3
./configure --prefix=/usr/local/openmpi
make -j && make install
# Set the environment variables. Also add the following lines to ~/.bashrc so that they persist across sessions.
export PATH=/usr/local/cuda/bin:/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH
# Download and compile the test code.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests/
make MPI=1 MPI_HOME=/usr/local/openmpi CUDA_HOME=/usr/local/cuda
# Replace host1 and host2 with the IP addresses of the instances.
mpirun --allow-run-as-root -np 16 -npernode 8 -H {host1}:8,{host2}:8 \
--bind-to none \
-mca btl_tcp_if_include bond0 \
-x PATH \
-x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-x NCCL_SOCKET_IFNAME=bond0 \
-x NCCL_IB_HCA=mlx5 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_DEBUG=INFO \
-x NCCL_NSOCKS_PERTHREAD=8 \
-x NCCL_SOCKET_NTHREADS=8 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_DEBUG_SUBSYS=NET,GRAPH \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
./build/all_reduce_perf -b 4M -e 4M -f 2 -g 1 -t 1 -n 20
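mpirun starts the remote processes over SSH, so the two instances must be able to reach each other with passwordless SSH, and the compiled nccl-tests binaries must exist at the same path on both hosts. The following is a minimal sketch of that setup, assuming the root user is used on both instances (the mpirun command above uses --allow-run-as-root):
# On host1, create an SSH key pair if one does not already exist and authorize it on host2.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id root@{host2}
# Copy the compiled test binaries to the same path on host2.
scp -r nccl-tests root@{host2}:~/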