
Platform for AI: RDMA high-performance networks for distributed training

Last Updated: Feb 20, 2024

Optimal use of computing resources is crucial in parallel computing for foundation models. Efficiency gains come from minimizing communication traffic, overlapping computation with communication, and improving communication efficiency. Lingjun AI Computing Service (Lingjun) is a heterogeneous computing service built on hardware-software co-design. Lingjun maximizes the efficiency and utilization of high-performance computing resources and is suitable for large-scale deep learning and advanced intelligent computing tasks. When you submit Deep Learning Container (DLC) jobs in Platform for AI (PAI), you can configure high-performance networks for jobs that use Lingjun resources. This topic describes how to configure these networks.

Limits

This topic applies only to DLC jobs that use serverless Lingjun resources.

Configure high-performance network variables

By default, Lingjun resources in PAI use Remote Direct Memory Access (RDMA) networks and adopt the optimal settings for NVIDIA Collective Communications Library (NCCL) environment variables. You can configure NCCL environment variables based on your training framework, communication framework, and model features. We recommend that you use the default settings in PAI to ensure optimal performance.
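If a specific workload needs different settings, you can override individual NCCL variables in the job's startup command before launching training; variables you do not export keep the PAI defaults. The following sketch is illustrative: the launcher command and the `train.py` entry script are placeholders, not values defined by PAI.

```shell
# Override only the variables you need; the PAI defaults cover the rest.
export NCCL_DEBUG=INFO       # verbose NCCL logs for troubleshooting
export NCCL_IB_TIMEOUT=23    # more tolerant RDMA timeout than the default of 22
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_IB_TIMEOUT=$NCCL_IB_TIMEOUT"

# Then launch training as usual, for example (placeholder command):
#   python -m torch.distributed.run --nproc_per_node=8 train.py
```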

Default settings in PAI

The following list describes the default settings of NCCL environment variables in PAI. The settings are identical for all of the following Lingjun specifications:

  • ml.gu7xf.c96m1600.8-gu108

  • ml.gu7xf.8xlarge-gu108

  • ml.gu7ef.c96m1600.8-gu100

  • ml.gu8xf.8xlarge-gu108

export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_NET_PLUGIN=none

For more information, see the "NCCL environment variables" section of this topic.

NCCL environment variables

The following list describes the key NCCL environment variables. For information about other environment variables, see the NCCL documentation.

  • NCCL_IB_TC: The network mapping rules that are used by Alibaba Cloud. If this variable is not set or is set to an invalid value, network performance may be affected.

  • NCCL_IB_GID_INDEX: The Global ID (GID) index that is used by RDMA. If this variable is not set or is set to an invalid value, NCCL encounters an error.

  • NCCL_SOCKET_IFNAME: The IP interface that NCCL uses to establish a connection. The recommended value varies based on the Lingjun specification. If this variable is not set or is set to an invalid value, NCCL may fail to establish a connection.

  • NCCL_DEBUG: The level of debug information that NCCL emits. We recommend that you set this variable to INFO for faster troubleshooting.

  • NCCL_IB_HCA: The InfiniBand devices that are used for RDMA communication. The number and naming conventions of InfiniBand devices vary based on the computing node. If this variable is not set or is set to an invalid value, network performance may be affected.

  • NCCL_IB_TIMEOUT: The duration before a timeout error is triggered for RDMA connections. A longer timeout increases the fault tolerance of training jobs. If this variable is not set or is set to an invalid value, training jobs may be interrupted.

  • NCCL_IB_QPS_PER_CONNECTION: The number of InfiniBand queue pairs that are used for each connection. More queue pairs result in higher network throughput.
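To make the NCCL_IB_TIMEOUT setting concrete: per the NCCL documentation, the value is an exponent, and the underlying InfiniBand retry timeout is approximately 4.096 µs × 2^value. The following sketch computes the approximate timeout in seconds for the PAI default value of 22.

```shell
# NCCL_IB_TIMEOUT is an exponent: timeout ~= 4.096 us * 2^value.
# 4.096 us = 4096 ns, so compute in nanoseconds and convert to seconds.
timeout_exp=22
timeout_sec=$(( 4096 * (1 << timeout_exp) / 1000000000 ))
echo "NCCL_IB_TIMEOUT=$timeout_exp -> ~$timeout_sec s"   # ~17 s
```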

Configure an image

Official image

You can use the official PAI image to submit DLC jobs that use Lingjun resources. For more information, see the "Use a resource quota" section in the Lingjun resource quotas topic.

Custom image

If you want to create and use a custom image for DLC jobs that use Lingjun resources, take note of the following considerations.

Environment requirements

  • Compute Unified Device Architecture (CUDA) 11.2

  • NCCL 2.12.10

  • Python 3
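A quick way to verify these requirements inside a candidate image is to probe each component from the shell. The snippet below is a minimal sketch: the Python check runs anywhere, while the CUDA and NCCL probes are left as comments because the exact commands depend on how the image was built (the `nvcc` and `torch` probes are assumptions, not requirements of PAI).

```shell
# Verify the Python 3 requirement; exits nonzero if it is not met.
python3 -c 'import sys; assert sys.version_info[0] == 3; print("python3 ok")'

# CUDA / NCCL version probes depend on the image contents, for example:
#   nvcc --version                                             # look for "release 11.2"
#   python3 -c "import torch; print(torch.cuda.nccl.version())"
```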

Install the RDMA library

To leverage the high-performance RDMA networks provided by serverless Lingjun resources, install the corresponding RDMA library in the Dockerfile of the custom image.

Sample code:

# Install the system packages that the RDMA user-space libraries depend on.
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends libnl-3-dev libnl-route-3-dev libnl-3-200 libnl-route-3-200 iproute2 udev dmidecode ethtool && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Download and install the RDMA core library, then remove the installer files.
RUN cd /tmp/ && \
    wget http://pythonrun.oss-cn-zhangjiakou.aliyuncs.com/rdma/nic-libs-mellanox-rdma-5.2-2/nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    tar xzvf nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    cd nic-lib-rdma-core-installer-ubuntu && \
    echo Y | /bin/bash install.sh && \
    cd .. && \
    rm -rf nic-lib-rdma-core-installer-ubuntu && \
    rm -f nic-lib-rdma-core-installer-ubuntu.tar.gz
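Once the Dockerfile includes the RUN steps above, build and push the image so that it can be referenced when you submit the DLC job. The image name, tag, and registry URL below are placeholders, not values defined by PAI.

```shell
# Build the custom image (name and tag are illustrative).
docker build -t my-lingjun-train:v1 .

# Push to your container registry so DLC jobs can pull it (placeholder URL):
#   docker tag my-lingjun-train:v1 registry.example.com/ns/my-lingjun-train:v1
#   docker push registry.example.com/ns/my-lingjun-train:v1
```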

References

For information about how to use serverless Lingjun resources to submit a training job, see Submit training jobs.