To further optimize the network performance of GPU-accelerated servers that use the SHENLONG architecture, Alibaba Cloud provides GPU-accelerated compute-optimized Super Computing Cluster (SCC) instance families, which are named sccgn instance families. sccgn instances provide superior computing power and strong network communication capabilities. This topic describes how to use sccgn instances and verify their performance.
Usage notes
sccgn instances are equipped with GPUs and high-performance NVIDIA Mellanox ConnectX SmartNICs to deliver superior computing power and strong network communication capabilities. sccgn instances are suitable for scenarios that require high-intensity computing and communication, such as deep learning and high-performance computing. Take note of the following items when you use sccgn instances:
If your business requires only the remote direct memory access (RDMA) feature, select Auto-install RDMA Software Stack when you select an image to create an sccgn instance.
If your business requires the GPUDirect RDMA feature, select Auto-install GPU Driver to install the required software stacks and toolkits when you select an image to create an sccgn instance.
Note: GPUDirect RDMA is a technology introduced with Kepler-class GPUs and Compute Unified Device Architecture (CUDA) 5.0 that allows direct data exchange between GPUs and third-party devices over standard Peripheral Component Interconnect Express (PCIe) features. Examples of third-party devices include GPUs, network interfaces, video acquisition devices, and storage adapters. For more information, see NVIDIA documentation.
If the network interface controller (NIC) driver that you want to install is an OpenFabrics Enterprise Distribution (OFED) open source version (download URL), install the NIC driver first, and then install the GPU driver and CUDA.
Note: As of CUDA 11.4 and the R470 GPU driver, the nvidia-peermem kernel module is integrated into the GPU driver, so you do not need to install the separate nv_peer_mem module. For more information, visit nv_peer_memory.
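The following is a minimal sketch of how you might confirm that a peer-memory kernel module required by GPUDirect RDMA is available. It assumes a driver stack that ships either the integrated nvidia_peermem module or the legacy nv_peer_mem module; adjust the module name to match your driver version.
# Check whether a peer-memory kernel module is loaded (the name varies by driver version).
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'
# If the integrated module exists but is not loaded, it can usually be loaded as follows:
modprobe nvidia_peermem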
Functional verification and bandwidth verification
Functional verification
This verification checks whether the RDMA software stack is installed and configured as expected on an sccgn instance.
Run the following command to check the installation of the RDMA software stack.
For information about the issues that you may encounter during the check, see the FAQ section of this topic.
rdma_qos_check -V
The following command output indicates that the RDMA software stack is installed as expected:
===========================================================
* rdma_qos_check
-----------------------------------------------------------
* ITEM DETAIL RESULT
===========================================================
* link_up eth1: yes ok
* mlnx_device eth1: 1 ok
* drv_ver eth1: 5.2-2.2.3 ok
...
* pci 0000:c5:00.1 ok
* pci 0000:e1:00.0 ok
* pci 0000:e1:00.1 ok
===========================================================
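If you want to run this check as part of an automated health check, the following is a minimal sketch that flags any item whose RESULT column is not ok. It assumes the tabular output format shown above, which may differ across rdma_qos_check versions.
# Print every detail line whose last column is not "ok" and exit non-zero if any is found (assumed output format).
rdma_qos_check -V | awk '/^\*/ && NF >= 3 && $NF != "ok" && $NF != "RESULT" {print; bad=1} END {exit bad}'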
Bandwidth verification
This verification checks whether the RDMA network bandwidth meets the expected hardware specifications.
Run the following command on the server side:
ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0
The following code shows an example command output:
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 20           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x11ca PSN 0x6302b0 OUT 0x10 RKey 0x17fddc VAddr 0x007f88e1e5d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
 local address: LID 0000 QPN 0x11cb PSN 0x99aeda OUT 0x10 RKey 0x17fddc VAddr 0x007f88e265d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
 local address: LID 0000 QPN 0x11cc PSN 0xf0d01c OUT 0x10 RKey 0x17fddc VAddr 0x007f88e2e5d000
 ...
 remote address: LID 0000 QPN 0x11dd PSN 0x8efe92 OUT 0x10 RKey 0x17fddc VAddr 0x007f672004b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
 8388608    20000          165.65             165.63                0.002468
---------------------------------------------------------------------------------------
Run the following command on the client side. Replace {server_ip} with the IP address of the server:
ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0 {server_ip}
The following code shows an example command output:
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 20           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x11ca PSN 0x787f05 OUT 0x10 RKey 0x17fddc VAddr 0x007f671684b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
 local address: LID 0000 QPN 0x11cb PSN 0x467042 OUT 0x10 RKey 0x17fddc VAddr 0x007f671704b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
 local address: LID 0000 QPN 0x11cc PSN 0xac262e OUT 0x10 RKey 0x17fddc VAddr 0x007f671784b000
 ...
 remote address: LID 0000 QPN 0x11dd PSN 0xeb1c3f OUT 0x10 RKey 0x17fddc VAddr 0x007f88eb65d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
Conflicting CPU frequency values detected: 800.000000 != 3177.498000. CPU Frequency is not max.
 2          20000          0.058511           0.058226              3.639132
Conflicting CPU frequency values detected: 799.996000 != 3384.422000. CPU Frequency is not max.
 ...
Conflicting CPU frequency values detected: 800.000000 != 3166.731000. CPU Frequency is not max.
 4194304    20000          165.55             165.55                0.004934
Conflicting CPU frequency values detected: 800.000000 != 2967.226000. CPU Frequency is not max.
 8388608    20000          165.65             165.63                0.002468
---------------------------------------------------------------------------------------
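The "Conflicting CPU frequency values detected" messages indicate that the CPU frequency governor is not fixed at the maximum frequency, which the perftest tools warn about because they use CPU cycle counters for timing. The bandwidth columns are still usable. If you want to eliminate the warnings, the following is a minimal sketch that pins the governor to performance by using the standard cpupower tool; the kernel-tools package name is an assumption for Alibaba Cloud Linux 2 and other CentOS-compatible systems.
# Install the cpupower utility (package name assumed for CentOS-compatible systems).
yum install -y kernel-tools
# Set the CPU frequency governor to performance on all cores.
cpupower frequency-set -g performance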
While the preceding commands are running, run the rdma_monitor -s -t -G command on the instance to monitor the bandwidth of each port on the NICs. The following code shows an example command output:
------
2022-2-18 09:48:59 CST
tx_rate: 81.874 (40.923/40.951)
rx_rate: 0.092 (0.055/0.037)
tx_pause: 0 (0/0)
rx_pause: 0 (0/0)
tx_pause_duration: 0 (0/0)
rx_pause_duration: 0 (0/0)
np_cnp_sent: 0
rp_cnp_handled: 4632
num_of_qp: 22
np_ecn_marked: 0
rp_cnp_ignored: 0
out_of_buffer: 0
out_of_seq: 0
packet_seq_err: 0
tx_rate_prio0: 0.000 (0.000/0.000)
rx_rate_prio0: 0.000 (0.000/0.000)
tcp_segs_retrans: 0
tcp_retrans_rate: 0
cpu_usage: 0.35%
free_mem: 1049633300 kB
------
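If you want to keep the monitoring data for later analysis instead of watching it interactively, one simple approach, sketched below, is to run rdma_monitor in the background and redirect its output to a file for the duration of the bandwidth test. The log file path is only an example.
# Record NIC statistics to a log file in the background while the bandwidth test runs.
nohup rdma_monitor -s -t -G > /tmp/rdma_monitor.log 2>&1 &
MONITOR_PID=$!
# ... run the ib_read_bw commands described above ...
# Stop monitoring and inspect the log.
kill $MONITOR_PID
less /tmp/rdma_monitor.log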
NCCL test cases
To test and verify the performance that your application can achieve over the RDMA network, the following section provides an example of how to use the RDMA feature of sccgn instances to accelerate your application. In the example, NVIDIA Collective Communication Library (NCCL) test cases are used.
For more information about NCCL tests, visit nccl-tests.
#!/bin/sh
# Use instances that run Alibaba Cloud Linux 2.
# Install the build dependencies, then download and compile Open MPI.
yum install -y gcc-c++ make wget
wget http://mirrors.cloud.aliyuncs.com/opsx/ecs/linux/binary/rdma/sccgn7ex/Alinux2/openmpi-4.1.3.tar.gz
tar -xzf openmpi-4.1.3.tar.gz
cd openmpi-4.1.3
./configure --prefix=/usr/local/openmpi
make -j && make install
# Set the environment variables. Also add the following lines to ~/.bashrc so that they persist across sessions.
export PATH=/usr/local/cuda/bin:/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH
# Download and compile the test code.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests/
make MPI=1 MPI_HOME=/usr/local/openmpi CUDA_HOME=/usr/local/cuda
# Replace host1 and host2 with the IP addresses of the instances.
mpirun --allow-run-as-root -np 16 -npernode 8 -H {host1}:8,{host2}:8 \
--bind-to none \
-mca btl_tcp_if_include bond0 \
-x PATH \
-x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-x NCCL_SOCKET_IFNAME=bond0 \
-x NCCL_IB_HCA=mlx5 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_DEBUG=INFO \
-x NCCL_NSOCKS_PERTHREAD=8 \
-x NCCL_SOCKET_NTHREADS=8 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_DEBUG_SUBSYS=NET,GRAPH \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
./build/all_reduce_perf -b 4M -e 4M -f 2 -g 1 -t 1 -n 20
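mpirun starts the remote processes over SSH, so the two instances must be able to reach each other with passwordless SSH, and the compiled nccl-tests binaries must exist at the same path on both hosts. The following is a minimal sketch of that setup, assuming the root user is used on both instances (the mpirun command above uses --allow-run-as-root):
# On host1, create an SSH key pair if one does not already exist and authorize it on host2.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id root@{host2}
# Copy the compiled test binaries to the same path on host2.
scp -r nccl-tests root@{host2}:~/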