
Platform for AI: RDMA: High-performance networks for distributed training

Last Updated: Apr 10, 2025

In parallel computing for foundation models, you can optimize computing performance by reducing communication traffic, overlapping computation with communication, and improving communication efficiency. This topic describes how to configure high-performance networks for distributed training.

Limits

This topic applies only to training jobs that use Lingjun resources.

Configure high-performance network variables

Lingjun resources in Platform for AI (PAI) use Remote Direct Memory Access (RDMA) networks and adopt the optimal settings for NVIDIA Collective Communications Library (NCCL) environment variables. We recommend that you use the default variables in PAI to achieve optimal performance. You can also configure variables based on your training framework, communication framework, and model features.

Default variables in PAI

The default NCCL environment variables in PAI depend on the Lingjun specification. For the following specifications:

  • ml.gu7xf.c96m1600.8-gu108

  • ml.gu7xf.8xlarge-gu108

  • ml.gu7ef.c96m1600.8-gu100

  • ml.gu8xf.8xlarge-gu108

PAI sets these default NCCL environment variables:

export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_NET_PLUGIN=none

For more information, see the "NCCL environment variables" section of this topic.
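If you need a different value for a specific job, you can export the variable in the job's startup command before launching your training script. The following is a minimal sketch, assuming that variables you export in the startup command take precedence over the platform defaults; the torchrun launcher and train.py script are placeholders for your own entry point.

# Override a default before launching the training job; PAI-provided
# defaults still apply to any variable you do not set yourself.
export NCCL_DEBUG=INFO
export NCCL_IB_QPS_PER_CONNECTION=16
torchrun --nproc_per_node=8 train.py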

NCCL environment variables

The following list describes the key NCCL environment variables. For information about other environment variables, see the NCCL documentation.

  • NCCL_IB_TC: The traffic classification rules that match the network mapping rules used by Alibaba Cloud. If you do not configure this variable or if you specify an invalid value, network performance may be adversely affected.

  • NCCL_IB_GID_INDEX: The global ID (GID) index to use for RDMA communication. If you do not configure this variable or if you specify an invalid value, NCCL encounters an error.

  • NCCL_SOCKET_IFNAME: The network interface that NCCL uses to establish a connection. The recommended value varies based on the Lingjun specification. If you do not configure this variable or if you specify an invalid value, NCCL may fail to establish a connection.

  • NCCL_DEBUG: The level of NCCL debug information. We recommend that you set this variable to INFO to obtain more NCCL-related logs, which helps troubleshoot performance issues.

  • NCCL_IB_HCA: The InfiniBand devices that can be used for RDMA communication. The number and naming rules of InfiniBand devices vary based on the computing node. If you do not configure this variable or if you specify an invalid value, network performance may be adversely affected.

  • NCCL_IB_TIMEOUT: The timeout period for RDMA connections. You can increase the value of this variable to improve the fault tolerance of training jobs. If you do not configure this variable or if you specify an invalid value, training jobs may be interrupted.

  • NCCL_IB_QPS_PER_CONNECTION: The number of queue pairs per connection. Increasing the value of this variable can effectively improve network throughput.
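To confirm that these variables take effect and that the RDMA transport is actually selected, you can enable debug logging and inspect the NCCL initialization output. The following is a minimal sketch; train.py is a placeholder for your entry point, and the log markers shown (such as "Using network IB") are typical of NCCL INFO output but may vary across NCCL versions.

# Capture NCCL initialization logs and check which transport was
# selected; a "NET/Socket" line here would indicate a fallback to
# TCP and degraded performance.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
python train.py 2>&1 | tee nccl.log
grep -E "Using network|NET/IB" nccl.log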

Configure an image

You can submit training jobs that use Lingjun resources with official images provided by Deep Learning Containers (DLC), or with custom images.

Official image

[Screenshot: official images available for Lingjun resources in the DLC console]

Custom image

You can create and use a custom image. Take note of the following items:

Environment requirements

  • Compute Unified Device Architecture (CUDA) 11.2 or later is used.

  • NCCL 2.12.10 or later is used.

  • Python 3 is used.
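Before building the image, you can verify these requirements inside the base image. The following is a minimal sketch; the torch check assumes PyTorch is installed in the image and is only one way to read the NCCL version that the framework was built with.

# Verify toolchain versions inside the base image.
nvcc --version          # CUDA release must be 11.2 or later
python3 --version       # Python 3 is required
python3 -c "import torch; print(torch.cuda.nccl.version())"   # expect NCCL 2.12.10 or later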

Install the RDMA library

To use a custom image, you must manually install the RDMA library in the Dockerfile of the custom image. Sample code:

# Install the OS packages required by the RDMA user-space libraries.
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends libnl-3-dev libnl-route-3-dev libnl-3-200 libnl-route-3-200 iproute2 udev dmidecode ethtool && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Download and install the rdma-core installer, then remove the
# installer files to keep the image small.
RUN cd /tmp/ && \
    wget http://pythonrun.oss-cn-zhangjiakou.aliyuncs.com/rdma/nic-libs-mellanox-rdma-5.2-2/nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    tar xzvf nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    cd nic-lib-rdma-core-installer-ubuntu && \
    echo Y | /bin/bash install.sh && \
    cd .. && \
    rm -rf nic-lib-rdma-core-installer-ubuntu && \
    rm -f nic-lib-rdma-core-installer-ubuntu.tar.gz
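After adding these steps to the Dockerfile, you can build the image and check that the RDMA user-space tools were installed. The following is a minimal sketch; the image tag is a placeholder, and ibv_devices (from rdma-core) can only enumerate devices when the container has access to the host's RDMA-capable NICs.

# Build the custom image.
docker build -t my-rdma-training-image:latest .

# Confirm that the rdma-core binaries exist in the image; device
# enumeration requires running on a host with RDMA-capable NICs.
docker run --rm my-rdma-training-image:latest ibv_devices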

References

For information about how to submit a training job that uses Lingjun resources, see Submit training jobs.