
Platform For AI: eRDMA: Use high-performance networks for distributed training

Last Updated: Oct 29, 2025

Elastic Remote Direct Memory Access (eRDMA) is a cloud-based, elastic RDMA network from Alibaba Cloud. eRDMA is supported on specific GPU instance types that are available as PAI general computing resources. To use this feature, submit a Deep Learning Containers (DLC) job on one of these GPU instance types with an image that meets the requirements described in this topic. The system then automatically mounts the eRDMA network interface cards in the container, which accelerates distributed training.
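
For example, after the job starts, you can open a terminal in the container and confirm that the eRDMA network interface cards were mounted. This is a minimal check, assuming that the ibverbs-utils package (part of the eRDMA library installation described later in this topic) is installed in the image:

# List the RDMA devices visible in the container. On an eRDMA-enabled
# instance type, one device is shown for each mounted eRDMA network interface card.
ibv_devices

# Print details for each device to confirm that the erdma provider is in use.
ibv_devinfo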

Limits

  • This topic applies only to training jobs that are submitted using subscription-based general computing resources.

  • This feature requires NCCL version 2.19 or later.

  • The following table lists the GPU instance types on the PAI-DLC platform that support eRDMA and the number of eRDMA network interface cards available for each type.

    GPU instance type         Number of eRDMA network interface cards
    ecs.ebmgn7v.32xlarge      2
    ecs.ebmgn8v.48xlarge      2
    ecs.ebmgn8is.32xlarge     2
    ecs.ebmgn8i.32xlarge      4
    ecs.gn8is.2xlarge         1
    ecs.gn8is.4xlarge         1
    ecs.gn8is-2x.8xlarge      1
    ecs.gn8is-4x.16xlarge     1


Preset environment variables

PAI automatically enables the eRDMA feature and sets the default NCCL environment variables for instance types that support eRDMA. You can adjust these variables based on the training framework, communication framework, and model characteristics. For optimal performance, use the default variables provided by the platform.

Public environment variables

Environment variable    Value
PYTHONUNBUFFERED        1
TZ                      Set based on the region where the current job runs. The value is usually Asia/Shanghai.

eRDMA high-performance network variables

Note

A hyphen (-) indicates that the environment variable is not applicable in this environment and is not set by default.

Environment variable          Value
NCCL_DEBUG                    INFO
NCCL_SOCKET_IFNAME            eth0
NCCL_IB_TC                    -
NCCL_IB_SL                    -
NCCL_IB_GID_INDEX             1
NCCL_IB_HCA                   erdma
NCCL_IB_TIMEOUT               -
NCCL_IB_QPS_PER_CONNECTION    8
NCCL_MIN_NCHANNELS            16
NCCL_NET_PLUGIN               none
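
If you need to deviate from these defaults, one option is to override individual variables at the beginning of the job's startup command, before the training process is launched. A minimal sketch; the values and the train.py entry point are illustrative only:

# Illustrative overrides only; keep the platform defaults unless you have a
# specific reason to change them.
export NCCL_DEBUG=WARN                # reduce NCCL log verbosity
export NCCL_IB_QPS_PER_CONNECTION=4   # use fewer queue pairs per connection
python train.py                       # your training entry point (hypothetical name)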

Configure a custom image

When you submit training jobs that use general computing resources that support eRDMA, you can build and use a custom image. The custom image must meet the following requirements:

Environment requirements

  • CUDA 12.1 or later

  • NCCL 2.19 or later

  • Python 3
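
Before you build on a base image, you can confirm that it meets these requirements. A minimal sketch, assuming that the CUDA toolkit is on the PATH; the last check is relevant only if PyTorch is part of the image:

# CUDA toolkit version (requires the toolkit, not only the GPU driver).
nvcc --version

# Python version.
python3 --version

# NCCL version as seen by PyTorch, if PyTorch is installed.
python3 -c "import torch; print(torch.cuda.nccl.version())"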

Install the eRDMA library

The installation process for the eRDMA library varies based on the Linux distribution of the image. The following example demonstrates how to install the eRDMA library on Ubuntu 22.04:

# Add the PGP signature.
wget -qO - http://mirrors.cloud.aliyuncs.com/erdma/GPGKEY | sudo gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg

# Add the apt source.
mkdir -p /etc/apt/sources.list.d
echo "deb [ ] http://mirrors.cloud.aliyuncs.com/erdma/apt/ubuntu jammy/erdma main" | sudo tee /etc/apt/sources.list.d/erdma.list

# Update and install the eRDMA user-mode driver package.
sudo apt update
sudo apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1

For more information about the installation process on other distributions, see Use eRDMA in Docker containers.
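
After the installation, you can verify inside the image that the eRDMA user-mode packages are present. A minimal check that uses the package names from the apt command above:

# Confirm that the user-mode driver packages are installed.
dpkg -l libibverbs1 ibverbs-providers ibverbs-utils librdmacm1

# The ibverbs-providers package from the eRDMA repository should ship the erdma provider library.
dpkg -L ibverbs-providers | grep -i erdma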

Example Dockerfile

# Replace ${user_docker_image_url} with your existing Docker image.
FROM ${user_docker_image_url}

# If an RDMA library is already installed in the image, uninstall it first.
# Skip this step if these packages and the apt source file are not present in your base image.
RUN rm -f /etc/apt/sources.list.d/mellanox_mlnx_ofed.list && \
    apt remove -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1

RUN wget -qO - http://mirrors.aliyun.com/erdma/GPGKEY | gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg && \
    echo "deb [ ] http://mirrors.aliyun.com/erdma/apt/ubuntu jammy/erdma main" | tee /etc/apt/sources.list.d/erdma.list && \
    apt update && apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1
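
After you write the Dockerfile, build the image and push it to a registry that the DLC job can pull from, such as Alibaba Cloud Container Registry. A minimal sketch; the image name, namespace, and region are placeholders that you must replace with your own values:

# Build the custom image from the Dockerfile above.
docker build -t erdma-train:v1 .

# Tag and push the image to your own Container Registry repository (placeholder path).
docker tag erdma-train:v1 registry.<RegionID>.aliyuncs.com/<your-namespace>/erdma-train:v1
docker push registry.<RegionID>.aliyuncs.com/<your-namespace>/erdma-train:v1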

Use MPIJob to run an NCCL Test

To submit a training job that uses the MPIJob framework, configure the following key parameters. For more information about other parameters, see Submit an MPIJob training job.

Environment Information

  • Node Image: On the Image Address tab, enter the prepared custom image.

    Alternatively, you can use the NCCL test image provided by PAI-DLC, which has the eRDMA dependencies pre-installed: dsw-registry-vpc.<RegionID>.cr.aliyuncs.com/pai/nccl-tests:12.2.2-cudnn8-devel-ubuntu22.04-nccl2.19.3-1-85f9143. Replace <RegionID> with the region ID. For example, the region ID for China (Ulanqab) is cn-wulanchabu. For more information about region IDs, see Regions and zones.

  • Startup Command:

    # The -np 16 and -npernode 8 flags specify that two nodes are used, with eight GPUs per node, for a total of 16 GPUs.
    mpirun --allow-run-as-root -np 16 -npernode 8 --bind-to none -mca btl_tcp_if_include eth0 -x UCX_TLS=tcp -x UCX_NET_DEVICES=eth0 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=1 -x NCCL_IB_QPS_PER_CONNECTION=8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_MIN_NCHANNELS=16 -x NCCL_ALGO=Ring -x PATH /opt/nccl-tests/build/all_reduce_perf -b 32K -e 4G -f 2 -g 1 -t 1 -n 20

Resource Information

  • Resource Source: Select Resource Quota.

  • Resource Quota: Select a created general computing resource quota, such as one with the ecs.ebmgn8v.48xlarge instance type. For more information about how to create a resource quota, see General computing resource quotas.

  • Framework: Select MPIJob.

  • Job Resource: Configure the following parameters. These values match the -np 16 and -npernode 8 flags in the startup command (see the note after this list).

      • Number of nodes: 2

      • GPU (cards): 8

      • CPU (cores): 64

      • Memory (GiB): 256

      • Shared memory (GiB): 256
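
Note: The Job Resource values and the mpirun flags must stay consistent. -npernode equals the number of GPUs per node, and -np equals the number of nodes multiplied by the GPUs per node (2 × 8 = 16 in this example). If you change the topology, adjust both flags; for example, with 4 nodes of 8 GPUs each:

# 4 nodes x 8 GPUs per node = 32 ranks in total. All other flags are unchanged.
mpirun --allow-run-as-root -np 32 -npernode 8 --bind-to none -mca btl_tcp_if_include eth0 -x UCX_TLS=tcp -x UCX_NET_DEVICES=eth0 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=1 -x NCCL_IB_QPS_PER_CONNECTION=8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_MIN_NCHANNELS=16 -x NCCL_ALGO=Ring -x PATH /opt/nccl-tests/build/all_reduce_perf -b 32K -e 4G -f 2 -g 1 -t 1 -n 20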

The following figure shows sample results of an NCCL Test for eRDMA network bandwidth.