In parallel training of foundation models, you can optimize computing performance by reducing communication traffic, overlapping computation with communication, and improving communication efficiency. This topic describes how to configure a high-performance network.
Limits
This topic applies only to training jobs that use Lingjun resources.
Configure high-performance network variables
Lingjun resources in Platform for AI (PAI) use Remote Direct Memory Access (RDMA) networks and adopt the optimal settings for NVIDIA Collective Communications Library (NCCL) environment variables. We recommend that you use the default variables in PAI to achieve optimal performance. You can also configure variables based on your training framework, communication framework, and model features.
Default variables in PAI
The following table describes the settings of the default NCCL variables in PAI based on different Lingjun specifications.
Lingjun specification | NCCL environment variable |
For more information, see the "NCCL environment variables" section of this topic.
NCCL environment variables
The following table describes the key NCCL environment variables. For information about other environment variables, see the NCCL documentation. For an example of how to set these variables in a training job, see the example that follows the table.
Key NCCL environment variable | Description |
NCCL_IB_TC | The traffic classification rules that match the network mapping rules used by Alibaba Cloud. If you do not configure this variable or if you specify an invalid value, network performance may be adversely affected. |
NCCL_IB_GID_INDEX | The optimal global ID index. If you do not configure this variable or if you specify an invalid value, NCCL encounters an error. |
NCCL_SOCKET_IFNAME | The network interface that NCCL uses to establish a connection. The recommended value of this variable varies based on the Lingjun specification. If you do not configure this variable or if you specify an invalid value, NCCL may fail to establish a connection. |
NCCL_DEBUG | The level of the NCCL debug information. We recommend that you set this variable to INFO to obtain more NCCL-related logs. This helps troubleshoot performance issues. |
NCCL_IB_HCA | The InfiniBand devices that can be used for RDMA communication. The number and naming rules of InfiniBand devices vary based on the computing node. If you do not specify this variable or specify an invalid value, the network performance may be adversely affected. |
NCCL_IB_TIMEOUT | The timeout duration for establishing RDMA connections. You can increase the value of this variable to improve the fault tolerance for training jobs. If you do not specify this variable or specify an invalid value, training jobs may be interrupted. |
NCCL_IB_QPS_PER_CONNECTION | The number of queue pairs on each connection. You can increase the value of this variable to effectively improve network throughput. |
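In most cases, the default settings in PAI are sufficient. If you need to override them, you can set the variables in the startup command of your training job before the training process starts. The following is a minimal sketch; the values shown, such as the interface name eth0, the HCA list, and train.py, are placeholders that depend on your Lingjun specification and training job, not recommended settings.
# Print detailed NCCL logs to help troubleshoot performance issues.
export NCCL_DEBUG=INFO
# Placeholder values. Replace them with the values that match your Lingjun specification.
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=mlx5_1,mlx5_2
# Increase the RDMA connection timeout to improve fault tolerance.
export NCCL_IB_TIMEOUT=22
# Start the training script after the variables are set.
python train.py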
Configure an image
You can submit training jobs that use Lingjun resources by using official images provided by DLC or by using custom images.
Official image

Custom image
You can create and use a custom image. Take note of the following items:
Environment requirements
Compute Unified Device Architecture (CUDA) 11.2 or later is used.
NCCL 2.12.10 or later is used.
Python 3 is used.
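You can check whether an image meets these requirements by running commands similar to the following inside a container. This is a minimal sketch; the last command assumes that PyTorch is installed in the image and uses it only to read the bundled NCCL version.
# Check the Python version.
python3 --version
# Check the CUDA version.
nvcc --version
# Check the NCCL version. This assumes that PyTorch is installed in the image.
python3 -c "import torch; print(torch.cuda.nccl.version())"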
Install the RDMA library
To use a custom image, you must manually install the RDMA library in the Dockerfile of the custom image. Sample code:
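# Install the system packages that the RDMA library installer depends on.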
RUN apt-get update && \
apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends libnl-3-dev libnl-route-3-dev libnl-3-200 libnl-route-3-200 iproute2 udev dmidecode ethtool && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
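# Download and install the Mellanox rdma-core library, then remove the installer files.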
RUN cd /tmp/ && \
wget http://pythonrun.oss-cn-zhangjiakou.aliyuncs.com/rdma/nic-libs-mellanox-rdma-5.2-2/nic-lib-rdma-core-installer-ubuntu.tar.gz && \
tar xzvf nic-lib-rdma-core-installer-ubuntu.tar.gz && \
cd nic-lib-rdma-core-installer-ubuntu && \
echo Y | /bin/bash install.sh && \
cd .. && \
rm -rf nic-lib-rdma-core-installer-ubuntu && \
rm -f nic-lib-rdma-core-installer-ubuntu.tar.gz
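After you prepare the Dockerfile, you can build and push the custom image. The following is a minimal sketch; the image name my-registry/my-rdma-image:v1 is a placeholder for your own image repository.
# Build the custom image from the Dockerfile in the current directory.
docker build -t my-registry/my-rdma-image:v1 .
# Push the image to your image repository so that DLC can pull it.
docker push my-registry/my-rdma-image:v1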
References
For information about how to submit a training job that uses Lingjun resources, see Submit training jobs.