To further optimize the network performance of GPU-accelerated servers that use the SHENLONG architecture, Alibaba Cloud provides GPU-accelerated compute-optimized Super Computing Cluster (SCC) instance families, which are named sccgn instance families. sccgn instances provide superior computing power and strong network communication capabilities. This topic describes how to use sccgn instances.
Prerequisites
Auto-install RDMA Software Stack is selected when you select an image to create a GPU-accelerated compute-optimized SCC instance. If GPUDirect RDMA is required for your business, Auto-install GPU Driver is also selected so that the required software stacks and toolkits are installed.
GPUDirect RDMA is a technology introduced with Kepler-class GPUs and Compute Unified Device Architecture (CUDA) 5.0 that enables direct data exchange between GPUs and third-party devices over standard PCI Express features. Examples of third-party devices include network interfaces, video acquisition devices, and storage adapters. For more information, see the NVIDIA documentation.
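If GPUDirect RDMA is required, you can log on to the instance and confirm that the GPU driver, the RDMA stack, and the GPUDirect RDMA peer-memory kernel module are all present. The following is a minimal sketch; the module name (nv_peer_mem or nvidia_peermem) depends on the driver version and is an assumption to verify on your instance:
# Check that the NVIDIA GPU driver is loaded and the GPUs are visible.
nvidia-smi
# Check that the RDMA devices are visible to the verbs stack.
ibv_devinfo
# Check that a GPUDirect RDMA peer-memory module is loaded
# (named nv_peer_mem or nvidia_peermem, depending on the driver version).
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'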
Background information
sccgn instances are equipped with GPUs and high-performance NVIDIA Mellanox ConnectX SmartNICs to deliver superior computing power and strong network communication capabilities. sccgn instances are suitable for scenarios that require high-intensity computing and communication, such as deep learning and high-performance computing.
Functional verification and bandwidth verification
- Functional verification
This verification checks whether the RDMA software stack is properly installed and configured. Run the following command to perform the check. For information about the issues that you may encounter during the check, see the FAQ section.
rdma_qos_check -V
A command output similar to the following one indicates that the RDMA software stack is properly installed:
===========================================================
* rdma_qos_check
-----------------------------------------------------------
* ITEM            DETAIL                              RESULT
===========================================================
* link_up         eth1: yes                           ok
* mlnx_device     eth1: 1                             ok
* drv_ver         eth1: 5.2-2.2.3                     ok
...
* pci             0000:c5:00.1                        ok
* pci             0000:e1:00.0                        ok
* pci             0000:e1:00.1                        ok
===========================================================
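If the cluster contains multiple sccgn instances, you can run the same check on every node over SSH. The following is a minimal sketch, assuming passwordless SSH is configured between the instances and the node IP addresses are listed one per line in a hypothetical hosts.txt file:
#!/bin/sh
# Run the RDMA functional check on every node listed in hosts.txt (one IP per line).
# Assumes passwordless SSH from the current instance to all other instances.
while read -r host; do
    echo "===== ${host} ====="
    ssh "${host}" "rdma_qos_check -V"
done < hosts.txt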
- Bandwidth verification
This verification checks whether the RDMA network bandwidth meets the hardware specifications.
- Run the following command on the server side:
ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0
A command output similar to the following one is returned:
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 20           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x11ca PSN 0x6302b0 OUT 0x10 RKey 0x17fddc VAddr 0x007f88e1e5d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
 local address: LID 0000 QPN 0x11cb PSN 0x99aeda OUT 0x10 RKey 0x17fddc VAddr 0x007f88e265d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
 local address: LID 0000 QPN 0x11cc PSN 0xf0d01c OUT 0x10 RKey 0x17fddc VAddr 0x007f88e2e5d000
 ...
 remote address: LID 0000 QPN 0x11dd PSN 0x8efe92 OUT 0x10 RKey 0x17fddc VAddr 0x007f672004b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
 8388608    20000          165.65             165.63                0.002468
---------------------------------------------------------------------------------------
- Run the following command on the client side. Replace <server_ip> with the IP address of the server instance:
ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0 <server_ip>
A command output similar to the following one is returned:
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 20           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x11ca PSN 0x787f05 OUT 0x10 RKey 0x17fddc VAddr 0x007f671684b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
 local address: LID 0000 QPN 0x11cb PSN 0x467042 OUT 0x10 RKey 0x17fddc VAddr 0x007f671704b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
 local address: LID 0000 QPN 0x11cc PSN 0xac262e OUT 0x10 RKey 0x17fddc VAddr 0x007f671784b000
 ...
 remote address: LID 0000 QPN 0x11dd PSN 0xeb1c3f OUT 0x10 RKey 0x17fddc VAddr 0x007f88eb65d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
Conflicting CPU frequency values detected: 800.000000 != 3177.498000. CPU Frequency is not max.
 2          20000          0.058511           0.058226              3.639132
Conflicting CPU frequency values detected: 799.996000 != 3384.422000. CPU Frequency is not max.
 ...
Conflicting CPU frequency values detected: 800.000000 != 3166.731000. CPU Frequency is not max.
 4194304    20000          165.55             165.55                0.004934
Conflicting CPU frequency values detected: 800.000000 != 2967.226000. CPU Frequency is not max.
 8388608    20000          165.65             165.63                0.002468
---------------------------------------------------------------------------------------
- Run the following command to monitor the bandwidth of each port on the NICs in the ECS console:
rdma_monitor -s -t -G
A command output similar to the following one is returned:
------ 2022-2-18 09:48:59 CST
tx_rate: 81.874 (40.923/40.951)         rx_rate: 0.092 (0.055/0.037)
tx_pause: 0 (0/0)                       rx_pause: 0 (0/0)
tx_pause_duration: 0 (0/0)              rx_pause_duration: 0 (0/0)
np_cnp_sent: 0                          rp_cnp_handled: 4632
num_of_qp: 22                           np_ecn_marked: 0
rp_cnp_ignored: 0                       out_of_buffer: 0
out_of_seq: 0                           packet_seq_err: 0
tx_rate_prio0: 0.000 (0.000/0.000)      rx_rate_prio0: 0.000 (0.000/0.000)
tcp_segs_retrans: 0                     tcp_retrans_rate: 0
cpu_usage: 0.35%                        free_mem: 1049633300 kB
------
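The client-side ib_read_bw output above contains "Conflicting CPU frequency values detected ... CPU Frequency is not max" warnings, which indicate that CPU frequency scaling may affect the timing accuracy of the test. The following is a minimal sketch that pins the CPU frequency governor to performance before you rerun the test; it assumes the cpupower tool is installed and that you have root privileges:
# Set the CPU frequency governor to performance on all cores (requires root).
cpupower frequency-set -g performance
# Alternatively, write the governor directly through sysfs.
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "${gov}"
done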
Use cases of nccl-tests
The following example shows how to use RDMA to accelerate your application and how to test and verify the performance of an instance over the RDMA network. In the example, nccl-tests is used. For more information about nccl-tests, visit nccl-tests.
#!/bin/sh
# Use instances that run Alibaba Cloud Linux 2 operating systems.
# Install openmpi and a compiler.
wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/openmpi-4.0.3-1.x86_64.rpm
rpm -ivh --force openmpi-4.0.3-1.x86_64.rpm --nodeps
yum install -y gcc-c++
# Add the following environment variables to ~/.bashrc.
export PATH=/usr/local/cuda-11.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:/usr/local/lib/openmpi:/usr/local/cuda-11.0/lib64:$LD_LIBRARY_PATH
# Download and compile the test code.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests/
make MPI=1 CUDA_HOME=/usr/local/cuda
# Replace host1 and host2 with the IP addresses of the instances.
mpirun --allow-run-as-root -np 16 -npernode 8 -H host1,host2 \
--bind-to none \
-mca btl_tcp_if_include bond0 \
-x PATH \
-x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-x NCCL_SOCKET_IFNAME=bond0 \
-x NCCL_IB_HCA=mlx5 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_DEBUG=INFO \
-x NCCL_NSOCKS_PERTHREAD=8 \
-x NCCL_SOCKET_NTHREADS=8 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_DEBUG_SUBSYS=NET,GRAPH \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
./build/all_reduce_perf -b 4M -e 4M -f 2 -g 1 -t 1 -n 20
A command output similar to the following one is returned:
# Instance output
# nThread 1 nGpus 1 minBytes 4194304 maxBytes 4194304 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 57655 on iZ2ze58t*****3vnehjdZ device 0 [0x54] NVIDIA A100-SXM-80GB
# Rank 1 Pid 57656 on iZ2ze58t*****3vnehjdZ device 1 [0x5a] NVIDIA A100-SXM-80GB
# Rank 2 Pid 57657 on iZ2ze58t*****3vnehjdZ device 2 [0x6b] NVIDIA A100-SXM-80GB
# Rank 3 Pid 57658 on iZ2ze58t*****3vnehjdZ device 3 [0x70] NVIDIA A100-SXM-80GB
# Rank 4 Pid 57659 on iZ2ze58t*****3vnehjdZ device 4 [0xbe] NVIDIA A100-SXM-80GB
# Rank 5 Pid 57660 on iZ2ze58t*****3vnehjdZ device 5 [0xc3] NVIDIA A100-SXM-80GB
# Rank 6 Pid 57661 on iZ2ze58t*****3vnehjdZ device 6 [0xda] NVIDIA A100-SXM-80GB
# Rank 7 Pid 57662 on iZ2ze58t*****3vnehjdZ device 7 [0xe0] NVIDIA A100-SXM-80GB
# Rank 8 Pid 58927 on iZ2ze58t*****3vnehjeZ device 0 [0x54] NVIDIA A100-SXM-80GB
# Rank 9 Pid 58928 on iZ2ze58t*****3vnehjeZ device 1 [0x5a] NVIDIA A100-SXM-80GB
# Rank 10 Pid 58929 on iZ2ze58t*****3vnehjeZ device 2 [0x6b] NVIDIA A100-SXM-80GB
# Rank 11 Pid 58930 on iZ2ze58t*****3vnehjeZ device 3 [0x70] NVIDIA A100-SXM-80GB
# Rank 12 Pid 58931 on iZ2ze58t*****3vnehjeZ device 4 [0xbe] NVIDIA A100-SXM-80GB
# Rank 13 Pid 58932 on iZ2ze58t*****3vnehjeZ device 5 [0xc3] NVIDIA A100-SXM-80GB
# Rank 14 Pid 58933 on iZ2ze58t*****3vnehjeZ device 6 [0xda] NVIDIA A100-SXM-80GB
# Rank 15 Pid 58934 on iZ2ze58t*****3vnehjeZ device 7 [0xe0] NVIDIA A100-SXM-80GB
iZ2ze6t9*****ssopZ:57655:57655 [0] NCCL INFO NCCL_SOCKET_IFNAME set to bond0
...
iZ2ze58t*****3vnehjeZ:58929:59248 [2] NCCL INFO NET/IB: Dev 1 Port 1 qpn 4573 mtu 3 GID 3 (0/22D00C8FFFF0000)
iZ2ze58t*****3vnehjdZ:57657:58004 [2] NCCL INFO NET/IB: Dev 1 Port 1 qpn 4573 mtu 3 GID 3 (0/22E00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO Channel 04 : 0[54000] -> 8[54000] [receive] via NET/IB/0/GDRDMA
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 54000 / HCA 0 (distance 4 <= 4), read 1
iZ2ze58t*****3vnehjeZ:58931:59227 [4] NCCL INFO NET/IB: Dev 2 Port 1 qpn 4573 mtu 3 GID 3 (0/62D00C8FFFF0000)
iZ2ze58t*****3vnehjdZ:57659:58012 [4] NCCL INFO NET/IB: Dev 2 Port 1 qpn 4573 mtu 3 GID 3 (0/62E00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58933:59183 [6] NCCL INFO NET/IB: Dev 3 Port 1 qpn 4573 mtu 3 GID 3 (0/A2D00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO Channel 00 : 8[54000] -> 0[54000] [send] via NET/IB/0/GDRDMA
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 54000 / HCA 0 (distance 4 <= 4), read 1
iZ2ze58t*****3vnehjdZ:57661:58000 [6] NCCL INFO NET/IB: Dev 3 Port 1 qpn 4573 mtu 3 GID 3 (0/A2E00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO Channel 04 : 8[54000] -> 0[54000] [send] via NET/IB/0/GDRDMA
iZ2ze58t*****3vnehjdZ:57655:57848 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 4660 mtu 3 GID 3 (0/E2E00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 4660 mtu 3 GID 3 (0/E2D00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 4661 mtu 3 GID 3 (0/E2D00C8FFFF0000)
iZ2ze58t*****3vnehjdZ:57655:57848 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 4661 mtu 3 GID 3 (0/E2E00C8FFFF0000)
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
4194304 1048576 float sum 241.5 17.37 32.56 4e-07 235.2 17.84 33.44 4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 33.002
#
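In the output, algbw is the data size divided by the operation time, and busbw adjusts algbw for the all-reduce communication pattern so that it can be compared with the hardware peak bandwidth. The example above tests only a single 4 MB message size (-b 4M -e 4M). To see how bandwidth scales with message size, you can sweep a range of sizes. The following is a minimal sketch that reuses the options from the example above; host1 and host2 are placeholders to be replaced with the instance IP addresses:
# Sweep message sizes from 8 bytes to 8 GB, doubling the size at each step.
# The mpirun options are the same as in the example above; only the
# all_reduce_perf arguments change.
mpirun --allow-run-as-root -np 16 -npernode 8 -H host1,host2 \
    --bind-to none \
    -mca btl_tcp_if_include bond0 \
    -x PATH \
    -x LD_LIBRARY_PATH \
    -x NCCL_SOCKET_IFNAME=bond0 \
    -x NCCL_IB_HCA=mlx5 \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_GID_INDEX=3 \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -t 1 -n 20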
FAQ
- Problem 1
When you run the rdma_qos_check -V verification command, the drv_fw_ver eth1: 5.2-2.2.3/22.29.1016 fail error message is returned.
Solution:
The error message indicates that the Mellanox NIC firmware is not updated (to check the firmware version that the NIC currently reports, see the sketch after this FAQ list). You can perform the following operations:
- If the instance runs an Alibaba Cloud Linux 2 or CentOS 8.3 operating system, run the /usr/share/nic-drivers-mellanox-rdma/sources/alifwmanager-22292302 --force --yes command to update the NIC firmware of the instance.
- If the instance runs a Debian-based operating system, download the firmware update program (download URL ) and then run the ./alifwmanager-22292302 --force --yes command to update the NIC firmware of the instance.
- Problem 2
When you run the rdma_qos_check -V verification command, the * roce_ver : 0 fail error message is returned.
Solution:
The error message indicates that kernel modules such as configfs and rdma_cm are missing. You can run the modprobe mlx5_ib && modprobe configfs && modprobe rdma_cm command to load the required kernel modules.
- Problem 3
When you run the systemctl start networking command on an instance that runs a Debian operating system to start the network service, the system prompts that the bond interfaces cannot be found.
Solution:
The error may occur because the mlx5_ib kernel module is not loaded. You can run the modprobe mlx5_ib command to load this kernel module.
- Problem 4
When you run the rdma_qos_check -V verification command or the ib_read_bw bandwidth verification command, the ERROR: RoCE tos isn't correct on mlx5_bond_3 error message is returned.
Solution:
You can run the rdma_qos_init command to initialize the network.
- Problem 5
After you restart an instance that runs an Alibaba Cloud Linux 2 operating system, the cm_tos mlx5_bond_1: 0 fail error message is returned when you run the rdma_qos_check -V verification command.
Solution:
You can run the rdma_qos_init command to initialize the network.
- Problem 6
After you restart an instance that runs a CentOS 8.3 operating system, the trust_mod eth1: pcp fail error message is returned when you run the rdma_qos_check -V verification command.
Solution:
You can run the rdma_qos_init command to initialize the network.
- Problem 7
The IP address of the RDMA network interface bond* cannot be obtained.
Solution:
You can run the ifdown bond* and ifup bond* commands to obtain the IP address of the bond interface. Note: Replace * with the serial number of the corresponding network interface.
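For Problem 1, you can check the firmware version that the NIC currently reports before and after the firmware update. The following is a minimal sketch that uses ethtool; the interface name eth1 is an assumption and should be replaced with an RDMA-capable interface on your instance:
# Show the driver version and the firmware version of the NIC behind eth1.
ethtool -i eth1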