In parallel computing for foundation models, performance is optimized by reducing communication traffic, overlapping computation with communication, and improving communication efficiency. This topic describes how to configure high-performance RDMA networks for distributed training.
Limits
This topic applies only to training jobs that use Lingjun resources.
Configure high-performance network variables
Lingjun resources in Platform for AI (PAI) use RDMA networks, and PAI preconfigures optimal NVIDIA Collective Communications Library (NCCL) environment variable settings for them. We recommend that you use PAI's default variables for optimal performance. You can also customize the variables based on your training framework, communication requirements, and model characteristics.
Default variables in PAI
The following table describes default NCCL variable settings in PAI for different Lingjun specifications.
| Lingjun specification | NCCL environment variable |
| --- | --- |
For more information, see the "NCCL environment variables" section of this topic.
NCCL environment variables
The following table describes key NCCL environment variables. For additional variables, see the NCCL documentation.
| Key NCCL environment variable | Description |
| --- | --- |
| NCCL_IB_TC | Traffic classification rules that match the Alibaba Cloud network mapping. A missing or invalid value may adversely affect network performance. |
| NCCL_IB_GID_INDEX | The optimal global ID index. A missing or invalid value causes NCCL errors. |
| NCCL_SOCKET_IFNAME | The network interface that NCCL uses to establish connections. The recommended value varies by Lingjun specification. A missing or invalid value may cause connection failures. |
| NCCL_DEBUG | The NCCL debug information level. Set this variable to INFO to obtain detailed NCCL logs for troubleshooting performance issues. |
| NCCL_IB_HCA | The InfiniBand devices used for RDMA communication. The device count and naming rules vary by compute node. A missing or invalid value may adversely affect network performance. |
| NCCL_IB_TIMEOUT | The timeout period for establishing RDMA connections. Increase this value to improve the fault tolerance of training jobs. A missing or invalid value may cause training interruptions. |
| NCCL_IB_QPS_PER_CONNECTION | The number of queue pairs per connection. Increase this value to improve network throughput. |
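If you customize any of these variables, you can export them before the training entrypoint starts. The values below are illustrative placeholders for a sketch, not PAI's tuned per-specification defaults:

```shell
# Illustrative overrides only — PAI's per-specification defaults are usually best.
export NCCL_DEBUG=INFO                # print detailed NCCL logs for troubleshooting
export NCCL_IB_TIMEOUT=22             # raise the RDMA connection timeout for fault tolerance
export NCCL_IB_QPS_PER_CONNECTION=4   # more queue pairs per connection for higher throughput
# ...then launch the training entrypoint as usual, for example:
# python train.py
```

Because NCCL reads these variables at initialization time, they must be set in the environment of the training process itself, not only in an interactive shell.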
Configure an image
You can use official images provided by DLC or custom images to submit training jobs with Lingjun resources.
Official image

Custom image
When creating and using a custom image, note the following requirements:
Environment requirements
- CUDA 11.2 or later.
- NCCL 2.12.10 or later.
- Python 3.
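The version requirements above can be checked mechanically. The helper below is a generic sketch (not a PAI utility) that compares dotted version strings such as those reported by CUDA and NCCL:

```python
def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if a dotted version string is at least the given minimum."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    # Tuple comparison handles multi-digit components correctly ("2.12.10" > "2.9.1").
    return parse(version) >= parse(minimum)

# Thresholds taken from the requirements above.
print(meets_minimum("2.12.10", "2.12.10"))  # NCCL: True, exactly at the minimum
print(meets_minimum("11.1", "11.2"))        # CUDA: False, 11.1 is too old
```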
Install the RDMA library
To use a custom image, manually install the RDMA library in the Dockerfile. Sample code:
```dockerfile
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
        libnl-3-dev libnl-route-3-dev libnl-3-200 libnl-route-3-200 iproute2 udev dmidecode ethtool && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN cd /tmp/ && \
    wget http://pythonrun.oss-cn-zhangjiakou.aliyuncs.com/rdma/nic-libs-mellanox-rdma-5.2-2/nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    tar xzvf nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    cd nic-lib-rdma-core-installer-ubuntu && \
    echo Y | /bin/bash install.sh && \
    cd .. && \
    rm -rf nic-lib-rdma-core-installer-ubuntu && \
    rm -f nic-lib-rdma-core-installer-ubuntu.tar.gz
```
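After building the image, you can sanity-check that the user-space RDMA libraries were installed. This is a generic sketch using standard tools (`ibv_devinfo` ships with rdma-core); listing actual devices requires running on a node with RDMA NICs:

```shell
# Check that the core RDMA user-space libraries are registered in the image.
ldconfig -p 2>/dev/null | grep -E 'libibverbs|librdmacm' || echo "RDMA libraries not found"
# On a node with RDMA NICs, list the visible InfiniBand/RoCE devices:
# ibv_devinfo
```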
References
For instructions on submitting training jobs with Lingjun resources, see Create a training job.