Alibaba Cloud provides the sccgn instance family, a GPU-accelerated compute-optimized Super Computing Cluster (SCC) instance family that improves the network performance of GPU servers on the SHENLONG architecture. This topic describes how to use sccgn instances and verify their performance.
Usage notes
sccgn instances combine GPUs with high-performance Mellanox network interface controllers (NICs). They are suited to compute-intensive, communication-heavy workloads such as deep learning and high-performance computing (HPC). Note the following when you use sccgn instances:
- If you only need the remote direct memory access (RDMA) feature, select Install RDMA software stack when you choose an image to create an sccgn instance.
- If your business requires the GPUDirect RDMA feature, select Install GPU driver when you create an sccgn instance to install the required software stacks and toolkits.
Note: GPUDirect RDMA is a technology introduced by NVIDIA for Kepler-class GPUs and CUDA 5.0. It uses standard features of the PCIe data bus to provide a direct data path between a GPU and third-party peer devices. Examples of peer devices include other GPUs, network interfaces, video capture devices, and storage adapters. For more information, see NVIDIA documentation.
- If the NIC driver installed on your instance is an open source OpenFabrics Enterprise Distribution (OFED) version (download URL), install the NIC driver before you install the GPU driver and CUDA.
Note: CUDA 11.4 and later, together with the R470 and later driver branches, include the nvidia_peermem module, so you do not need to install the nv_peer_mem module separately. For more information, see nv_peer_memory.
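To confirm which peer-memory module is actually loaded, you can filter the kernel module list. The following is a minimal sketch, not an official check: the helper name peermem_module is hypothetical, and it assumes lsmod-style input with the module name in the first column.

```shell
#!/bin/sh
# Print which GPUDirect RDMA peer-memory module appears in lsmod-style input
# read from stdin: nvidia_peermem (shipped with newer drivers) or nv_peer_mem
# (the separately installed module). Prints "none" if neither is present.
# peermem_module is a hypothetical helper name used only for illustration.
peermem_module() {
    awk '
        $1 == "nvidia_peermem" || $1 == "nv_peer_mem" { print $1; found = 1 }
        END { if (!found) print "none" }
    '
}

# Typical use on an instance (requires lsmod):
#   lsmod | peermem_module
```

If the result is none on an instance that needs GPUDirect RDMA, the module may be installed but not loaded.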
Functional and bandwidth verification
Functional verification
This section describes how to verify that the RDMA software stack is installed and configured correctly on an sccgn instance.
Run the following command to check the installation of the RDMA software stack.
If you encounter issues during the check, see the FAQ section.
rdma_qos_check -V
If output similar to the following is returned, the RDMA software stack is installed correctly.
===========================================================
* rdma_qos_check
-----------------------------------------------------------
* ITEM          DETAIL                              RESULT
===========================================================
* link_up       eth1: yes                           ok
* mlnx_device   eth1: 1                             ok
* drv_ver       eth1: 5.2-2.2.3                     ok
...
* pci           0000:c5:00.1                        ok
* pci           0000:e1:00.0                        ok
* pci           0000:e1:00.1                        ok
===========================================================
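When you run this check on many instances, it can help to scan the report automatically instead of reading it by eye. The following sketch assumes that every item row starts with "*" and ends with the RESULT value, and that any value other than "ok" indicates a problem. The helper name qos_failures is hypothetical.

```shell
#!/bin/sh
# Read rdma_qos_check-style output on stdin and print every item row whose
# last field is not "ok". The exit status is 0 only when all rows are ok.
# Assumption: item rows start with "*" and end with the RESULT value.
# qos_failures is a hypothetical helper name used only for illustration.
qos_failures() {
    awk '
        # Skip the title row ("* rdma_qos_check") and the column header row.
        $1 == "*" && NF >= 3 && $2 != "ITEM" {
            if ($NF != "ok") { print; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Typical use on an instance:
#   rdma_qos_check -V | qos_failures && echo "RDMA stack looks healthy"
```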
Bandwidth verification
This procedure describes how to check whether the RDMA network bandwidth meets the hardware requirements.
- Server-side command
ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0
The following is an example of the output:
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 20           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x11ca PSN 0x6302b0 OUT 0x10 RKey 0x17fddc VAddr 0x007f88e1e5d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
 local address: LID 0000 QPN 0x11cb PSN 0x99aeda OUT 0x10 RKey 0x17fddc VAddr 0x007f88e265d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
 local address: LID 0000 QPN 0x11cc PSN 0xf0d01c OUT 0x10 RKey 0x17fddc VAddr 0x007f88e2e5d000
 ...
 remote address: LID 0000 QPN 0x11dd PSN 0x8efe92 OUT 0x10 RKey 0x17fddc VAddr 0x007f672004b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 8388608    20000            165.65             165.63             0.002468
---------------------------------------------------------------------------------------
- Client-side command
ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0 {server_ip}
The following is an example of the output:
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 20           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x11ca PSN 0x787f05 OUT 0x10 RKey 0x17fddc VAddr 0x007f671684b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
 local address: LID 0000 QPN 0x11cb PSN 0x467042 OUT 0x10 RKey 0x17fddc VAddr 0x007f671704b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
 local address: LID 0000 QPN 0x11cc PSN 0xac262e OUT 0x10 RKey 0x17fddc VAddr 0x007f671784b000
 ...
 remote address: LID 0000 QPN 0x11dd PSN 0xeb1c3f OUT 0x10 RKey 0x17fddc VAddr 0x007f88eb65d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 800.000000 != 3177.498000. CPU Frequency is not max.
 2          20000            0.058511           0.058226           3.639132
Conflicting CPU frequency values detected: 799.996000 != 3384.422000. CPU Frequency is not max.
 ...
Conflicting CPU frequency values detected: 800.000000 != 3166.731000. CPU Frequency is not max.
 4194304    20000            165.55             165.55             0.004934
Conflicting CPU frequency values detected: 800.000000 != 2967.226000. CPU Frequency is not max.
 8388608    20000            165.65             165.63             0.002468
---------------------------------------------------------------------------------------
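To compare a run against a target, you can pull the average bandwidth of the largest message size out of the report. The following sketch assumes that result rows are the lines with exactly five numeric fields, as in the example output. The helper names last_avg_gbps and meets_threshold are hypothetical, and the threshold of 160 in the usage comment is a placeholder rather than a documented requirement for sccgn instances.

```shell
#!/bin/sh
# Print the "BW average[Gb/sec]" value of the last result row in
# ib_read_bw-style output read from stdin. Result rows are the lines with
# five fields: #bytes, #iterations, BW peak, BW average, MsgRate.
# Both function names are hypothetical helpers used only for illustration.
last_avg_gbps() {
    awk '
        NF == 5 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ { avg = $4 }
        END { if (avg != "") print avg }
    '
}

# Succeed when the measured value ($1) is at least the threshold ($2).
meets_threshold() {
    awk -v got="$1" -v min="$2" 'BEGIN { exit !(got + 0 >= min + 0) }'
}

# Typical use on the client (160 is a placeholder threshold):
#   avg=$(ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0 {server_ip} | last_avg_gbps)
#   meets_threshold "$avg" 160 && echo "bandwidth OK: $avg Gb/sec"
```

The "Conflicting CPU frequency values detected" warnings are skipped because they do not match the five-field numeric pattern.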
While the commands are running, you can run the rdma_monitor -s -t -G command to monitor the actual bandwidth of each port on the NIC from the ECS console. The following is an example of the output:
------
2022-2-18 09:48:59 CST
tx_rate: 81.874 (40.923/40.951)
rx_rate: 0.092 (0.055/0.037)
tx_pause: 0 (0/0)
rx_pause: 0 (0/0)
tx_pause_duration: 0 (0/0)
rx_pause_duration: 0 (0/0)
np_cnp_sent: 0
rp_cnp_handled: 4632
num_of_qp: 22
np_ecn_marked: 0
rp_cnp_ignored: 0
out_of_buffer: 0
out_of_seq: 0
packet_seq_err: 0
tx_rate_prio0: 0.000 (0.000/0.000)
rx_rate_prio0: 0.000 (0.000/0.000)
tcp_segs_retrans: 0
tcp_retrans_rate: 0
cpu_usage: 0.35%
free_mem: 1049633300 kB
------
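In the sample above, the two values in parentheses add up to the total (40.923 + 40.951 = 81.874), which suggests that they are the per-port rates. Under that assumption, the following sketch splits one counter into its total and per-port values so that you can check whether traffic is balanced across the ports. The helper name port_rates is hypothetical.

```shell
#!/bin/sh
# Read rdma_monitor-style output on stdin and, for the counter named in $1
# (for example tx_rate), print: total, first port value, second port value.
# Assumption: the line looks like "tx_rate: 81.874 (40.923/40.951)".
# port_rates is a hypothetical helper name used only for illustration.
port_rates() {
    awk -v key="$1:" '
        $1 == key && NF >= 3 {
            ports = $3
            gsub(/[()]/, "", ports)      # "(40.923/40.951)" -> "40.923/40.951"
            split(ports, p, "/")
            print $2, p[1], p[2]
        }
    '
}

# Typical use while a bandwidth test is running:
#   rdma_monitor -s -t -G | port_rates tx_rate
```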
nccl-tests use case
This section provides an nccl-tests use case for testing and verifying the application-level performance of instances with RDMA networks. The use case demonstrates how to use the RDMA feature of sccgn instances to accelerate your applications. For more information about nccl-tests, see nccl-tests.
The following example shows how to run nccl-tests:
#!/bin/sh
# The operating system used is Alibaba Cloud Linux 2.
# Install Open MPI and a compiler.
yum install -y gcc-c++
wget http://mirrors.cloud.aliyuncs.com/opsx/ecs/linux/binary/rdma/sccgn7ex/Alinux2/openmpi-4.1.3.tar.gz
tar -xzf openmpi-4.1.3.tar.gz
cd openmpi-4.1.3
./configure --prefix=/usr/local/openmpi
make -j && make install
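# Set the environment for the current shell. To make the settings persistent, add the same two export lines to ~/.bashrc.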
export PATH=/usr/local/cuda/bin:/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH
# Download and compile the test code.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests/
make MPI=1 CUDA_HOME=/usr/local/cuda
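# Replace {host1} and {host2} with the IP addresses of your two instances.
# Open MPI expects a comma-separated host list. The ":8" suffix sets the number of slots on each host.
mpirun --allow-run-as-root -np 16 -npernode 8 -H {host1}:8,{host2}:8 \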
--bind-to none \
-mca btl_tcp_if_include bond0 \
-x PATH \
-x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-x NCCL_SOCKET_IFNAME=bond0 \
-x NCCL_IB_HCA=mlx5 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_DEBUG=INFO \
-x NCCL_NSOCKS_PERTHREAD=8 \
-x NCCL_SOCKET_NTHREADS=8 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_DEBUG_SUBSYS=NET,GRAPH \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
./build/all_reduce_perf -b 4M -e 4M -f 2 -g 1 -t 1 -n 20
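all_reduce_perf reports both an algorithm bandwidth (algbw) and a bus bandwidth (busbw). According to the nccl-tests performance notes, for all-reduce the two are related by the factor 2*(n-1)/n, where n is the number of ranks. The following sketch applies that factor. The helper name allreduce_busbw is hypothetical, and the value 10 in the usage comment is an illustrative input rather than a measured result.

```shell
#!/bin/sh
# Convert an all-reduce algorithm bandwidth (algbw) to bus bandwidth (busbw)
# using the factor 2*(n-1)/n from the nccl-tests performance notes.
#   $1 = algbw (any unit, for example GB/s)
#   $2 = number of ranks n (16 in the mpirun example: 2 hosts x 8 GPUs)
# allreduce_busbw is a hypothetical helper name used only for illustration.
allreduce_busbw() {
    awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.3f\n", algbw * 2 * (n - 1) / n }'
}

# Example with an illustrative algbw of 10 and 16 ranks:
#   allreduce_busbw 10 16
```

With 16 ranks the factor is 1.875, so busbw is always higher than algbw for this test. busbw is the figure to compare against the link rate.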