Elastic GPU Service: Improve network performance with an eRDMA image

Last Updated: Jun 25, 2026

Integrating eRDMA into a container (Docker) environment allows applications to bypass the operating system kernel and directly access the host's physical eRDMA devices. This accelerates data transfer and communication, making it ideal for containerized applications that require large-scale data transfer and high-performance networking. This topic describes how to use an eRDMA container image to quickly configure eRDMA on a GPU-accelerated instance.

Note

If your services require large-scale RDMA networking capabilities, you can attach an elastic RDMA interface (ERI) to a supported GPU-accelerated instance type. For more information, see eRDMA overview.

Before you begin

Before you configure the eRDMA container image on a GPU-accelerated instance, look up the image details. You need the supported GPU-accelerated instance types before you create an instance, and the image address before you pull the image.

  1. Log on to the Container Registry console.

  2. In the navigation pane on the left, click Artifact Center.

  3. In the Repository Name search box, enter erdma and select the target image egs/erdma.

    The eRDMA container image is updated approximately every three months. The following table provides details about the image.

    Image name: eRDMA
    Version information:
      • Python: 3.10.12
      • CUDA: 12.4.1
      • cuDNN: 9.1.0.70
      • NCCL: 2.21.5
      • Base image: Ubuntu 22.04
    Image address: egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.4.1-cudnn9-ubuntu22.04
    Supported instances: The eRDMA container image supports all 8th-generation GPU-accelerated instances, such as ebmgn8is and gn8is. For more information about instances, see GPU compute-optimized instance families.
    Benefits:
      • Access the Alibaba Cloud eRDMA network directly from a container.
      • Get an out-of-the-box experience with compatible eRDMA, drivers, and CUDA.

    Image name: eRDMA
    Version information:
      • Python: 3.10.12
      • CUDA: 12.1.1
      • cuDNN: 8.9.0.131
      • NCCL: 2.17.1
      • Base image: Ubuntu 22.04
    Image address: egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04

Procedure

After you install Docker on a GPU-accelerated instance and enable eRDMA in the Docker environment, you can directly access eRDMA devices from within a container. This procedure uses Ubuntu 20.04 as an example.

  1. Create a GPU-accelerated instance and configure eRDMA.

    For more information, see Enable eRDMA on a GPU-accelerated instance.

    We recommend that you create a GPU-accelerated instance with an eRDMA network card on the ECS console, and select the Install GPU Driver and Install eRDMA Software Stack options.

    Note

    After the GPU-accelerated instance is created, the system automatically installs the Tesla driver, CUDA, the cuDNN library, and the eRDMA software stack. This is faster than manual installation.

    In the Image section, select Ubuntu 20.04 64-bit. The GPU driver installation takes approximately 10 to 20 minutes. The installation extends the instance startup time and includes an automatic restart.

  2. Connect to the GPU-accelerated instance.

    For detailed instructions, see Connect to a Linux instance by using Workbench.

  3. Run the following commands to install Docker on the Ubuntu GPU-accelerated instance.

    sudo apt-get update
    sudo apt-get -y install ca-certificates curl
    sudo install -m 0755 -d /etc/apt/keyrings
    sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
    sudo chmod a+r /etc/apt/keyrings/docker.asc
    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu \
      $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    sudo apt-get update
    sudo apt-get install -y docker-ce docker-ce-cli containerd.io
  4. Run the following command to verify the Docker installation.

    docker -v
  5. Run the following commands to install the NVIDIA Container Toolkit package.

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
  6. Run the following commands to enable Docker to start on boot and then restart the Docker service.

    sudo systemctl enable docker
    sudo systemctl restart docker
  7. Run the following command to pull the eRDMA container image.

    sudo docker pull egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04
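  8. Run the following command to start the eRDMA container. The --network=host and --privileged options give the container access to the host network stack and host devices, --ulimit memlock=-1 removes the locked-memory limit, and -v /root:/root mounts the host's /root directory into the container.

    sudo docker run -d -t --network=host --gpus all \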
      --privileged \
      --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
      --name erdma \
      -v /root:/root \
      egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04
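
Before you pull and start the container, it can help to confirm that the tools installed in the preceding steps are on the PATH. The following is a minimal sketch and not part of the official procedure. The helper name missing_cmds is hypothetical, and the list of commands to check (docker, nvidia-ctk from the NVIDIA Container Toolkit, and eadm from the eRDMA software stack) is an assumption about what the installation provides on the host.

```shell
# Hypothetical helper: print each command name that cannot be found on PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Assumed tool names; adjust the list to match your installation.
missing=$(missing_cmds docker nvidia-ctk eadm)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Not found on PATH:" $missing
else
  echo "All required commands are available."
fi
```

If a command is reported as not found, repeat the corresponding installation step before you continue.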

Verify the configuration

This example uses two GPU-accelerated instances, referred to as host1 and host2. Docker is installed and an eRDMA container is running on each instance.

  1. In the containers on host1 and host2, respectively, check whether the eRDMA network adapter is functioning properly.

    1. Run the following command to enter the container environment.

      sudo docker exec -it erdma bash
    2. Run the following command to check the eRDMA network devices inside the container.

      ibv_devinfo

      The output shows the state of both eRDMA network devices as PORT_ACTIVE, indicating they are working correctly.

      root@xxx:~/# ibv_devinfo
      hca_id: erdma_0
      	transport:			eRDMA (0)
      	fw_ver:				0.2.0
      	node_guid:			0216:3eff:fe2c:6aad
      	sys_image_guid:			0216:3eff:fe2c:6aad
      	vendor_id:			0x1ded
      	vendor_part_id:			4223
      	hw_ver:				0x0
      	phys_port_cnt:			1
      		port:	1
      			state:			PORT_ACTIVE (4)
      			max_mtu:		1024 (3)
      			active_mtu:		1024 (3)
      			sm_lid:			0
      			port_lid:		0
      			port_lmc:		0x00
      			link_layer:		Ethernet
      hca_id: erdma_1
      	transport:			eRDMA (0)
      	fw_ver:				0.2.0
      	node_guid:			0216:3eff:fe16:58b6
      	sys_image_guid:			0216:3eff:fe16:58b6
      	vendor_id:			0x1ded
      	vendor_part_id:			4223
      	hw_ver:				0x0
      	phys_port_cnt:			1
      		port:	1
      			state:			PORT_ACTIVE (4)
      			max_mtu:		1024 (3)
      			active_mtu:		1024 (3)
      			sm_lid:			0
      			port_lid:		0
      			port_lmc:		0x00
      			link_layer:		Ethernet
  2. In the containers, run nccl-tests across host1 and host2.

    1. Run the following command to download the nccl-tests code.

      git clone https://github.com/NVIDIA/nccl-tests.git
    2. Run the following commands to compile nccl-tests.

      apt update
      apt install openmpi-bin libopenmpi-dev -y
      cd nccl-tests && make MPI=1 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/cuda MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
    3. Establish a password-free connection between host1 and host2, and configure SSH to connect through port 12345.

      After you configure the SSH connection, you can test the password-free connection between the two containers by running the ssh -p 12345 ${host2} command from inside the container.

      1. In the container on host1, run the following command to generate an SSH key and copy the public key to the container on host2.

        ssh-keygen
        ssh-copy-id -i ~/.ssh/id_rsa.pub ${host2}
      2. In the container on host2, run the following command to install the SSH service and set the listening port for the SSH server to 12345.

        apt-get update && apt-get install ssh -y
        mkdir -p /run/sshd
        /usr/sbin/sshd -p 12345
      3. In the container on host1, run the following command to test the password-free connection to the container on host2.

        ssh root@${host2} -p 12345
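    4. In the container on host1, run the all_reduce_perf test. In this example, 172.16.15.237 and 172.16.15.235 are the private IP addresses of the two instances, and each instance runs 8 processes (one per GPU). Replace the addresses with your own.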

      mpirun --allow-run-as-root -np 16 -npernode 8 -H 172.16.15.237:8,172.16.15.235:8 \
       --bind-to none -mca btl_tcp_if_include eth0 \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_IB_DISABLE=0 \
       -x NCCL_IB_GID_INDEX=1 \
       -x NCCL_NET_GDR_LEVEL=5 \
       -x NCCL_DEBUG=INFO \
       -x NCCL_ALGO=Ring -x NCCL_P2P_LEVEL=3 \
       -x LD_LIBRARY_PATH -x PATH \
       -mca plm_rsh_args "-p 12345" \
       /workspace/nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 1 -n 20

      The output is similar to the following:

      iZ2zei2cgn3427b89ubkfvZ:2732:2765 [7] NCCL INFO comm 0x562feba84040 rank 7 nranks 16 cudaDev 7 busId f1000 commId 0x89cad9815cd2a14b - Init COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2728:2770 [5] NCCL INFO comm 0x55d55c96d340 rank 5 nranks 16 cudaDev 5 busId ea000 commId 0x89cad9815cd2a14b - Init COMPLETE
      #
      #                                                              out-of-place                       in-place
      #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
      #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1073741824     268435456     float     sum      -1   158061    6.79   12.74      0   156821    6.85   12.84      0
      iZ2zeiwklcnbixy8g0r8grZ:4068:4068 [2] NCCL INFO comm 0x563b4653f3c0 rank 10 nranks 16 cudaDev 2 busId 71000 - Destroy COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2724:2724 [2] NCCL INFO comm 0x55c35266a000 rank 2 nranks 16 cudaDev 2 busId 71000 - Destroy COMPLETE
      iZ2zeiwklcnbixy8g0r8grZ:4071:4071 [5] NCCL INFO comm 0x563a86fc7210 rank 13 nranks 16 cudaDev 5 busId ea000 - Destroy COMPLETE
      iZ2zeiwklcnbixy8g0r8grZ:4067:4067 [1] NCCL INFO comm 0x559741cc9290 rank 9 nranks 16 cudaDev 1 busId 6a000 - Destroy COMPLETE
      iZ2zeiwklcnbixy8g0r8grZ:4075:4075 [7] NCCL INFO comm 0x55d60e86e170 rank 15 nranks 16 cudaDev 7 busId f1000 - Destroy COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2723:2723 [1] NCCL INFO comm 0x5596ca34a0b0 rank 1 nranks 16 cudaDev 1 busId 6a000 - Destroy COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2728:2728 [5] NCCL INFO comm 0x55d55c96d340 rank 5 nranks 16 cudaDev 5 busId ea000 - Destroy COMPLETE
      iZ2zeiwklcnbixy8g0r8grZ:4072:4072 [6] NCCL INFO comm 0x5609dddc4170 rank 14 nranks 16 cudaDev 6 busId f0000 - Destroy COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2725:2725 [3] NCCL INFO comm 0x564411727220 rank 3 nranks 16 cudaDev 3 busId 72000 - Destroy COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2726:2726 [4] NCCL INFO comm 0x557b9ed258a0 rank 4 nranks 16 cudaDev 4 busId e9000 - Destroy COMPLETE
      iZ2zeiwklcnbixy8g0r8grZ:4069:4069 [3] NCCL INFO comm 0x55b879b75000 rank 11 nranks 16 cudaDev 3 busId 72000 - Destroy COMPLETE
      iZ2zeiwklcnbixy8g0r8grZ:4070:4070 [4] NCCL INFO comm 0x557580c1a5f0 rank 12 nranks 16 cudaDev 4 busId e9000 - Destroy COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2730:2730 [6] NCCL INFO comm 0x562b87361c0 rank 6 nranks 16 cudaDev 6 busId f0000 - Destroy COMPLETE
      iZ2zeiwklcnbixy8g0r8grZ:4066:4066 [0] NCCL INFO comm 0x558bf49799a0 rank 8 nranks 16 cudaDev 0 busId 69000 - Destroy COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2722:2722 [0] NCCL INFO comm 0x5589a9d7a980 rank 0 nranks 16 cudaDev 0 busId 69000 - Destroy COMPLETE
      iZ2zei2cgn3427b89ubkfvZ:2732:2732 [7] NCCL INFO comm 0x562feba84040 rank 7 nranks 16 cudaDev 7 busId f1000 - Destroy COMPLETE
      # Out of bounds values : 0 OK
      # Avg bus bandwidth    : 12.7876
      #
  3. On the host (outside the container), run the following command to monitor traffic on the eRDMA network.

    eadm stat -d erdma_0 -l

    Nonzero rx and tx rates in the output indicate that traffic is flowing over the eRDMA network.

    root@xxxZ:~# eadm stat -d erdma_0 -l
    Monitoring erdma_0...    (press CTRL-C to stop)
     14:53:00  rx:      5.50 GiB/s 4537261 p/s      tx:      5.49 GiB/s 4538210 p/s
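
The checks in the steps above can also be scripted rather than read by eye. The following is a minimal sketch that assumes the ibv_devinfo and all_reduce_perf output formats shown in this topic. The function names port_states, all_ports_active, and avg_busbw are hypothetical helpers, not part of any eRDMA or NCCL tool.

```shell
# Read ibv_devinfo output on stdin and print "<device> <port state>" per port.
port_states() {
  awk '$1 == "hca_id:" { dev = $2 }
       $1 == "state:"  { print dev, $2 }'
}

# Succeed only if at least one port is listed and every port is PORT_ACTIVE.
all_ports_active() {
  awk '$1 == "state:" { n++; if ($2 != "PORT_ACTIVE") bad++ }
       END { exit ((n > 0 && bad == 0) ? 0 : 1) }'
}

# Read nccl-tests output on stdin and print the average bus bandwidth value.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}
```

For example, inside the container you could run ibv_devinfo | port_states to list each device with its port state, or ibv_devinfo | all_ports_active && echo "eRDMA ports OK" as a pass/fail check. If you save the all_reduce_perf output to a file (nccl.log is a hypothetical name), avg_busbw < nccl.log prints the value from the "Avg bus bandwidth" line, which is 12.7876 for the sample output above.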

Related topics

  • To enable RDMA-accelerated interconnectivity between instances in the same virtual private cloud (VPC), you can configure eRDMA on GPU-accelerated instances. For instructions, see Enable eRDMA on GPU-accelerated instances.

  • If your use case involves large-scale data transfers and high-performance networking, you can manually configure a Docker environment on a GPU-accelerated instance and enable eRDMA to improve efficiency. For instructions, see Enable eRDMA in a Docker container.