
Platform for AI: RDMA: High-performance networks for distributed training

Last Updated: Feb 27, 2026

In parallel computing for foundation models, you can optimize performance by reducing communication traffic, overlapping computation with communication, and improving communication efficiency. This topic describes how to configure high-performance RDMA networks for distributed training.

Limits

This topic applies only to training jobs that use Lingjun resources.

Configure high-performance network variables

Lingjun resources in Platform for AI (PAI) use RDMA networks, and PAI provides default NVIDIA Collective Communications Library (NCCL) environment variable settings tuned for them. We recommend using PAI's default variables for optimal performance. You can also customize variables based on your training framework, communication needs, and model characteristics, as the sketch below shows.
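
If you customize variables, a common pattern is to export them in the job's startup command before launching the trainer. The following is a minimal sketch that assumes a PyTorch job started with torchrun; the script name and variable values are illustrative, not PAI defaults.

# Override selected NCCL settings for this job, then start training.
# train.py is a hypothetical entry script; the values are examples only.
export NCCL_DEBUG=INFO                 # verbose NCCL logs for troubleshooting
export NCCL_IB_QPS_PER_CONNECTION=8    # more queue pairs per RDMA connection
torchrun --nproc_per_node=8 train.py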

Default variables in PAI

The following are the default NCCL environment variable settings in PAI for different Lingjun specifications.

Applicable Lingjun specifications:

  • ml.gu7xf.c96m1600.8-gu108
  • ml.gu7xf.8xlarge-gu108
  • ml.gu7ef.c96m1600.8-gu100
  • ml.gu8xf.8xlarge-gu108

Default NCCL environment variables:

export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_NET_PLUGIN=none
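
To confirm which values are in effect, you can list the NCCL-related variables inside the job container:

# Print all NCCL environment variables visible to the training process.
env | grep '^NCCL_'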

For more information, see the "NCCL environment variables" section of this topic.

NCCL environment variables

The following list describes key NCCL environment variables. For additional variables, see the NCCL documentation.

  • NCCL_IB_TC: Specifies the traffic classification rules that match the Alibaba Cloud network mapping. A missing or invalid value can degrade network performance.

  • NCCL_IB_GID_INDEX: Specifies the optimal global ID (GID) index. A missing or invalid value causes NCCL errors.

  • NCCL_SOCKET_IFNAME: Specifies the network interface that NCCL uses to establish connections. The recommended value varies by Lingjun specification. A missing or invalid value can cause connection failures.

  • NCCL_DEBUG: Specifies the NCCL debug information level. Set this variable to INFO to produce detailed NCCL logs for troubleshooting performance issues, as the verification sketch after this list shows.

  • NCCL_IB_HCA: Specifies the InfiniBand devices to use for RDMA communication. The device count and naming rules vary by compute node. A missing or invalid value can degrade network performance.

  • NCCL_IB_TIMEOUT: Specifies the timeout for establishing RDMA connections. Increase this value to improve the fault tolerance of training jobs. A missing or invalid value can cause training interruptions.

  • NCCL_IB_QPS_PER_CONNECTION: Specifies the number of queue pairs (QPs) per RDMA connection. Increasing this value can improve network throughput.
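
With NCCL_DEBUG=INFO set, you can run a small collective to confirm that NCCL selects the RDMA devices. The following is a minimal sketch that assumes PyTorch and torchrun are installed in the image; the test script is illustrative and not part of PAI.

# Write a tiny all_reduce test, then run it on two local GPUs with verbose NCCL logs.
cat > /tmp/allreduce_test.py <<'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")            # torchrun supplies rank and world size
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
t = torch.ones(1024, device="cuda")
dist.all_reduce(t)                                 # sums the tensor across all ranks
print(f"rank {dist.get_rank()}: all_reduce value = {t[0].item()}")
dist.destroy_process_group()
EOF

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 /tmp/allreduce_test.py

In the INFO output, log lines that mention NET/IB and the mlx5 devices indicate that the RDMA path is in use.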

Configure an image

You can use official images provided by Deep Learning Containers (DLC) or custom images to submit training jobs with Lingjun resources.

Official image

When you submit a training job, select one of the official images provided by DLC.

Custom image

When creating and using a custom image, note the following requirements:

Environment requirements

  • CUDA 11.2 or later.

  • NCCL 2.12.10 or later.

  • Python 3.
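
You can verify these requirements inside the image with a quick check. The PyTorch calls below are an assumption for illustration, not a PAI requirement:

# Check the Python, CUDA, and NCCL versions available in the image.
python3 --version
python3 -c "import torch; print('CUDA:', torch.version.cuda)"           # requires PyTorch
python3 -c "import torch; print('NCCL:', torch.cuda.nccl.version())"    # requires PyTorch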

Install the RDMA library

To use a custom image, manually install the RDMA library in the Dockerfile. Sample code:

# Install the system packages that the RDMA user-space libraries depend on.
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
        libnl-3-dev libnl-route-3-dev libnl-3-200 libnl-route-3-200 \
        iproute2 udev dmidecode ethtool && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Download and install the Mellanox RDMA core libraries, then remove the installer.
RUN cd /tmp/ && \
    wget http://pythonrun.oss-cn-zhangjiakou.aliyuncs.com/rdma/nic-libs-mellanox-rdma-5.2-2/nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    tar xzvf nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    cd nic-lib-rdma-core-installer-ubuntu && \
    echo Y | /bin/bash install.sh && \
    cd .. && \
    rm -rf nic-lib-rdma-core-installer-ubuntu && \
    rm -f nic-lib-rdma-core-installer-ubuntu.tar.gz
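
After you add these instructions, build and push the image as usual. The image name and registry below are placeholders:

# Build the custom image and push it to your container registry.
docker build -t registry.example.com/my-team/rdma-train:v1 .
docker push registry.example.com/my-team/rdma-train:v1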

References

For instructions on submitting training jobs with Lingjun resources, see Create a training job.