Configure eRDMA on GPU instances with Docker - Elastic GPU Service

Integrating eRDMA into a container (Docker) environment allows applications to bypass the operating system kernel and directly access the host's physical eRDMA devices. This accelerates data transfer and communication, making it ideal for containerized applications that require large-scale data transfer and high-performance networking. This topic describes how to use an eRDMA container image to quickly configure eRDMA on a GPU-accelerated instance.

Note

If your services require large-scale RDMA networking capabilities, you can attach an elastic RDMA interface (ERI) to a supported GPU-accelerated instance type. For more information, see eRDMA overview.

Before you begin

Before you configure the eRDMA container image on a GPU-accelerated instance, look up the image details. You need the list of supported GPU-accelerated instance types to create the instance, and the image address to pull the image.

Log on to the Container Registry console.
In the navigation pane on the left, click Artifact Center.

In the Repository Name search box, enter erdma and select the target image egs/erdma.

The eRDMA container image is updated approximately every three months. The following table provides details about the image.

| Image name | Version information | Image address | Supported instances | Benefits |
| --- | --- | --- | --- | --- |
| eRDMA | Python: 3.10.12; CUDA: 12.4.1; cuDNN: 9.1.0.70; NCCL: 2.21.5; Base image: Ubuntu 22.04 | egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.4.1-cudnn9-ubuntu22.04 | The eRDMA container image supports all 8th-generation GPU-accelerated instances, such as ebmgn8is and gn8is. Note: For more information about instances, see GPU compute-optimized instance families. | Access the Alibaba Cloud eRDMA network directly from a container. Get an out-of-the-box experience with compatible eRDMA, drivers, and CUDA. |
| eRDMA | Python: 3.10.12; CUDA: 12.1.1; cuDNN: 8.9.0.131; NCCL: 2.17.1; Base image: Ubuntu 22.04 | egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04 | Same as above | Same as above |

Procedure

After you install Docker on a GPU-accelerated instance and enable eRDMA in the Docker environment, you can directly access eRDMA devices from within a container. This procedure uses Ubuntu 20.04 as an example.

Create a GPU-accelerated instance and configure eRDMA.
For more information, see Enable eRDMA on a GPU-accelerated instance.
We recommend that you create a GPU-accelerated instance with an eRDMA network card on the ECS console, and select the Install GPU Driver and Install eRDMA Software Stack options.
Note
After the GPU-accelerated instance is created, the system automatically installs the Tesla driver, CUDA, the cuDNN library, and the eRDMA software stack. This is faster than manual installation.
In the Image section, select Ubuntu 20.04 64-bit. The GPU driver installation takes approximately 10 to 20 minutes. The installation extends the instance startup time and includes an automatic restart.
Connect to the GPU-accelerated instance.
For detailed instructions, see Connect to a Linux instance by using Workbench.

Run the following commands to install Docker on the Ubuntu GPU-accelerated instance.

sudo apt-get update
sudo apt-get -y install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

Run the following command to verify the Docker installation.
```
docker -v
```

Run the following commands to install the NVIDIA Container Toolkit package.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Run the following commands to enable Docker to start on boot and then restart the Docker service.
```
sudo systemctl enable docker
sudo systemctl restart docker
```
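Before you pull the image, it can help to confirm that the tools installed so far are on the PATH. The following is a minimal sketch, not part of the official procedure: `missing_commands` is a hypothetical helper name, and the sketch assumes that the NVIDIA Container Toolkit package provides the `nvidia-ctk` binary.

```bash
#!/usr/bin/env bash
# Print each command name from the argument list that is NOT found on the PATH.
# missing_commands is a hypothetical helper used only for this illustration.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Usage on the GPU-accelerated instance (not executed here):
#   missing_commands docker nvidia-ctk
# Empty output means both tools were found.
```

If the helper prints a name, rerun the installation step for that tool before you continue.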

Run the following command to pull the eRDMA container image.

sudo docker pull egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04

Run the following command to run the eRDMA container.

sudo docker run -d -t --network=host --gpus all \
  --privileged \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --name erdma \
  -v /root:/root \
  egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04
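Because the container is started in detached mode (`-d`), the command returns immediately. One way to confirm that the container is actually running is sketched below. `container_running` is a hypothetical helper name; it relies on the standard `docker inspect -f` format option and assumes that the current user can run `docker` (for example, as root).

```bash
#!/usr/bin/env bash
# Return success (exit status 0) only if the named container exists and is running.
# container_running is a hypothetical helper used only for this illustration.
container_running() {
  local name="$1" state
  # docker inspect fails if the container does not exist.
  state="$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" || return 1
  [ "$state" = "true" ]
}

# Usage on the host (not executed here):
#   container_running erdma && echo "erdma container is running"
```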

Verify the configuration

This example uses two GPU-accelerated instances, host1 and host2. Docker is installed and an eRDMA container is running on each instance.

In the container on each host, check whether the eRDMA network adapters are functioning properly.

Run the following command to enter the container environment.
```
sudo docker exec -it erdma bash
```

Run the following command to check the eRDMA network devices inside the container.

ibv_devinfo

The output shows the state of both eRDMA network devices as PORT_ACTIVE, indicating they are working correctly.

root@xxx:~/# ibv_devinfo
hca_id: erdma_0
	transport:			eRDMA (0)
	fw_ver:				0.2.0
	node_guid:			0216:3eff:fe2c:6aad
	sys_image_guid:			0216:3eff:fe2c:6aad
	vendor_id:			0x1ded
	vendor_part_id:			4223
	hw_ver:				0x0
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		1024 (3)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
hca_id: erdma_1
	transport:			eRDMA (0)
	fw_ver:				0.2.0
	node_guid:			0216:3eff:fe16:58b6
	sys_image_guid:			0216:3eff:fe16:58b6
	vendor_id:			0x1ded
	vendor_part_id:			4223
	hw_ver:				0x0
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		1024 (3)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
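To run the same check from a script, you can count the ports that report PORT_ACTIVE. The following is a minimal sketch that assumes the output format shown above. `count_active_ports` and `check_erdma_ports` are hypothetical helper names, and the expected port count is a parameter that you choose.

```bash
#!/usr/bin/env bash
# Count the ports reported as PORT_ACTIVE in ibv_devinfo-style text read from stdin.
count_active_ports() {
  # grep -c exits non-zero when nothing matches, so keep the exit status at 0.
  grep -c 'state:[[:space:]]*PORT_ACTIVE' || true
}

# Succeed only if at least $1 ports are active; read ibv_devinfo output from stdin.
check_erdma_ports() {
  local expected="$1" active
  active="$(count_active_ports)"
  if [ "$active" -ge "$expected" ]; then
    echo "OK: $active active eRDMA port(s)"
  else
    echo "FAIL: $active active eRDMA port(s), expected $expected" >&2
    return 1
  fi
}

# Usage inside the container (not executed here):
#   ibv_devinfo | check_erdma_ports 2
```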

In the containers on host1 and host2, build and run nccl-tests.

Run the following command to download the nccl-tests code.
```
git clone https://github.com/NVIDIA/nccl-tests.git
```

Run the following commands to compile nccl-tests.

apt update
apt install openmpi-bin libopenmpi-dev -y
cd nccl-tests && make MPI=1 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/cuda MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi

Set up passwordless SSH between the containers on host1 and host2, with the SSH server listening on port 12345.
After you configure the SSH connection, you can test the passwordless connection between the two containers by running the ssh -p 12345 ${host2} command from inside the container. In the following commands, ${host2} stands for the IP address or hostname of host2.
1. In the container on host1, run the following commands to generate an SSH key and copy the public key to the container on host2.
```
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ${host2}
```
2. In the container on host2, run the following commands to install the SSH service and start the SSH server on port 12345.
```
apt-get update && apt-get install ssh -y
mkdir -p /run/sshd
/usr/sbin/sshd -p 12345
```
3. In the container on host1, run the following command to test the password-free connection to the container on host2.
```
ssh root@${host2} -p 12345
```
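mpirun starts the remote processes over SSH without a terminal, so the connection must work without any password prompt. The following is a minimal sketch of a non-interactive check. `passwordless_ssh_ok` is a hypothetical helper name; the `BatchMode` and `ConnectTimeout` options are standard OpenSSH client options, and the 5-second timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Succeed only if root@<host> accepts key-based login on the given port.
# BatchMode=yes disables password prompts, so a missing key causes a failure
# instead of an interactive prompt.
# passwordless_ssh_ok is a hypothetical helper used only for this illustration.
passwordless_ssh_ok() {
  local host="$1" port="${2:-12345}"
  ssh -o BatchMode=yes -o ConnectTimeout=5 -p "$port" "root@${host}" true
}

# Usage in the container on host1 (not executed here):
#   passwordless_ssh_ok "${host2}" && echo "passwordless SSH works"
```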

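In the container on host1, run the all_reduce_perf test. The example starts 16 processes, 8 on each host with one GPU per process (-g 1). Replace the two IP addresses after -H with the addresses of host1 and host2.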

mpirun --allow-run-as-root -np 16 -npernode 8 -H 172.16.15.237:8,172.16.15.235:8 \
 --bind-to none -mca btl_tcp_if_include eth0 \
 -x NCCL_SOCKET_IFNAME=eth0 \
 -x NCCL_IB_DISABLE=0 \
 -x NCCL_IB_GID_INDEX=1 \
 -x NCCL_NET_GDR_LEVEL=5 \
 -x NCCL_DEBUG=INFO \
 -x NCCL_ALGO=Ring -x NCCL_P2P_LEVEL=3 \
 -x LD_LIBRARY_PATH -x PATH \
 -mca plm_rsh_args "-p 12345" \
 /workspace/nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 1 -n 20

The output is similar to the following:

iZ2zei2cgn3427b89ubkfvZ:2732:2765 [7] NCCL INFO comm 0x562feba84040 rank 7 nranks 16 cudaDev 7 busId f1000 commId 0x89cad9815cd2a14b - Init COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2728:2770 [5] NCCL INFO comm 0x55d55c96d340 rank 5 nranks 16 cudaDev 5 busId ea000 commId 0x89cad9815cd2a14b - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  1073741824     268435456     float     sum      -1   158061    6.79   12.74      0   156821    6.85   12.84      0
iZ2zeiwklcnbixy8g0r8grZ:4068:4068 [2] NCCL INFO comm 0x563b4653f3c0 rank 10 nranks 16 cudaDev 2 busId 71000 - Destroy COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2724:2724 [2] NCCL INFO comm 0x55c35266a000 rank 2 nranks 16 cudaDev 2 busId 71000 - Destroy COMPLETE
iZ2zeiwklcnbixy8g0r8grZ:4071:4071 [5] NCCL INFO comm 0x563a86fc7210 rank 13 nranks 16 cudaDev 5 busId ea000 - Destroy COMPLETE
iZ2zeiwklcnbixy8g0r8grZ:4067:4067 [1] NCCL INFO comm 0x559741cc9290 rank 9 nranks 16 cudaDev 1 busId 6a000 - Destroy COMPLETE
iZ2zeiwklcnbixy8g0r8grZ:4075:4075 [7] NCCL INFO comm 0x55d60e86e170 rank 15 nranks 16 cudaDev 7 busId f1000 - Destroy COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2723:2723 [1] NCCL INFO comm 0x5596ca34a0b0 rank 1 nranks 16 cudaDev 1 busId 6a000 - Destroy COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2728:2728 [5] NCCL INFO comm 0x55d55c96d340 rank 5 nranks 16 cudaDev 5 busId ea000 - Destroy COMPLETE
iZ2zeiwklcnbixy8g0r8grZ:4072:4072 [6] NCCL INFO comm 0x5609dddc4170 rank 14 nranks 16 cudaDev 6 busId f0000 - Destroy COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2725:2725 [3] NCCL INFO comm 0x564411727220 rank 3 nranks 16 cudaDev 3 busId 72000 - Destroy COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2726:2726 [4] NCCL INFO comm 0x557b9ed258a0 rank 4 nranks 16 cudaDev 4 busId e9000 - Destroy COMPLETE
iZ2zeiwklcnbixy8g0r8grZ:4069:4069 [3] NCCL INFO comm 0x55b879b75000 rank 11 nranks 16 cudaDev 3 busId 72000 - Destroy COMPLETE
iZ2zeiwklcnbixy8g0r8grZ:4070:4070 [4] NCCL INFO comm 0x557580c1a5f0 rank 12 nranks 16 cudaDev 4 busId e9000 - Destroy COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2730:2730 [6] NCCL INFO comm 0x562b87361c0 rank 6 nranks 16 cudaDev 6 busId f0000 - Destroy COMPLETE
iZ2zeiwklcnbixy8g0r8grZ:4066:4066 [0] NCCL INFO comm 0x558bf49799a0 rank 8 nranks 16 cudaDev 0 busId 69000 - Destroy COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2722:2722 [0] NCCL INFO comm 0x5589a9d7a980 rank 0 nranks 16 cudaDev 0 busId 69000 - Destroy COMPLETE
iZ2zei2cgn3427b89ubkfvZ:2732:2732 [7] NCCL INFO comm 0x562feba84040 rank 7 nranks 16 cudaDev 7 busId f1000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 12.7876
#
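The busbw column can be cross-checked against the size and time columns. For all-reduce, nccl-tests defines algbw as size divided by time and busbw as algbw × 2 × (n − 1) / n, where n is the number of ranks. The following is a minimal sketch of that arithmetic plus a helper that extracts the average from the output. `allreduce_busbw` and `avg_busbw` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Compute the all-reduce bus bandwidth in GB/s from the message size in bytes,
# the time in microseconds, and the number of ranks:
#   algbw = size / time,  busbw = algbw * 2 * (n - 1) / n
allreduce_busbw() {
  local size_bytes="$1" time_us="$2" nranks="$3"
  awk -v s="$size_bytes" -v t="$time_us" -v n="$nranks" \
    'BEGIN { algbw = s / t / 1000; printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

# Extract the "Avg bus bandwidth" value from nccl-tests output read from stdin.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Out-of-place row from the sample output: 1 GiB in 158061 us across 16 ranks.
allreduce_busbw 1073741824 158061 16   # prints 12.74
```

The result matches the out-of-place busbw value in the sample output. To compare a run against a threshold that you choose, save the mpirun output to a file and pipe the file through `avg_busbw`.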

On the host (outside the container), run the following command to monitor traffic on the eRDMA network.

eadm stat -d erdma_0 -l

While the test is running, nonzero rx and tx rates in the output indicate that traffic is flowing over the eRDMA network.

root@xxxZ:~# eadm stat -d erdma_0 -l
Monitoring erdma_0...    (press CTRL-C to stop)
 14:53:00  rx:      5.50 GiB/s 4537261 p/s      tx:      5.49 GiB/s 4538210 p/s
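If you want to log the rates rather than watch them, the rx and tx values can be extracted from each status line. The following is a minimal sketch that assumes the line layout shown in the sample output above; other eadm versions may print a different layout. `erdma_rates` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print "<rx> <tx>" for every status line on stdin that contains both an
# "rx:" and a "tx:" field. The values keep the unit that eadm reports.
# erdma_rates is a hypothetical helper used only for this illustration.
erdma_rates() {
  awk '{
    rx = ""; tx = ""
    for (i = 1; i < NF; i++) {
      if ($i == "rx:") rx = $(i + 1)
      if ($i == "tx:") tx = $(i + 1)
    }
    if (rx != "" && tx != "") print rx, tx
  }'
}

# Usage on the host (not executed here):
#   eadm stat -d erdma_0 -l | erdma_rates
```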
