Platform For AI: Use instance interconnection for distributed training

Last Updated: Nov 05, 2025

Data Science Workshop (DSW) provides a multi-instance interconnection feature that lets you develop and run distributed training jobs across multiple machines and GPUs.

Prerequisites

  • You must have multiple instances that are created from a general computing resource group or a Lingjun resource group and are located in the same VPC.

  • The Internet access gateway for the resource group that contains the instances must be set to Dedicated Gateway.

  • The instances must be in the same cluster. For example, you cannot interconnect Lingjun instances with general computing resource instances.

  • Only some instance types support Remote Direct Memory Access (RDMA) or enhanced RDMA (eRDMA). For more information, see Default variables (pre-configured by the platform) and Limits.

    DSW and DLC provide the same features for RDMA/eRDMA. For more information, see the DLC documentation.

Features

  • DSW provides pre-configured, high-performance network environment variables that are optimized for different resources and network architectures.

  • On nodes that support RDMA, you can use RDMA/eRDMA for interconnection.

  • You can interconnect instances using their instance IDs as DNS domain names.

These features allow you to develop and debug distributed tasks across multiple machines and GPUs.
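Because instance IDs act as DNS domain names, you can check name resolution from a script before you start a job. The following is a minimal sketch, not a platform API: the function name resolve_peers and its resolver parameter are hypothetical and introduced only for illustration. It uses only the Python standard library.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve_peers(
    instance_ids: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each instance ID to its resolved IPv4 address, or None on failure.

    `resolve_peers` and `resolver` are hypothetical names for this sketch.
    The resolver is injectable so the logic can be exercised without a network.
    """
    results: Dict[str, Optional[str]] = {}
    for instance_id in instance_ids:
        try:
            results[instance_id] = resolver(instance_id)
        except OSError:
            # socket.gaierror (name resolution failure) is a subclass of OSError.
            results[instance_id] = None
    return results
```

Call resolve_peers with the instance IDs of your peer instances. An entry of None suggests that the name could not be resolved from the current instance, which may point to a prerequisite that is not met.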

Procedure

  1. Use the DSW instance cloning feature to start the required number of instances with identical environments.

  2. (Optional) Install the RDMA/eRDMA library on the instances.

    1. For Lingjun resources, use an image that contains RDMA. For more information, see Configure an image.

    2. For general computing resources, follow the instructions in Install the eRDMA library.

  3. From one instance, ping another instance by its instance ID to test network connectivity. For example: ping dsw-l28wnjdlyzj*********.

  4. Configure and debug the distributed task using your chosen distributed framework.
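As an illustration of the last step, the following sketch builds a launch command for each instance, assuming that PyTorch is the chosen framework and that jobs are started with torchrun. This choice is an assumption, not a platform requirement. The function name torchrun_command, the script name train.py, and port 29500 are hypothetical placeholders. The first instance ID in the list is used as the rendezvous address, relying on instance IDs being resolvable as DNS domain names.

```python
import shlex
from typing import List


def torchrun_command(
    instance_ids: List[str],
    node_rank: int,
    script: str = "train.py",  # hypothetical training script name
    nproc_per_node: int = 1,   # number of worker processes (usually GPUs) per instance
    master_port: int = 29500,  # assumed free port on the first instance
) -> str:
    """Return the torchrun command to run on the instance with the given rank."""
    if not 0 <= node_rank < len(instance_ids):
        raise ValueError("node_rank must be an index into instance_ids")
    args = [
        "torchrun",
        f"--nnodes={len(instance_ids)}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        # The first instance acts as the rendezvous host, addressed by its ID.
        f"--master_addr={instance_ids[0]}",
        f"--master_port={master_port}",
        script,
    ]
    return shlex.join(args)
```

Generate one command per instance, with node_rank set to that instance's position in the list, and run each command on the matching instance. Other frameworks use different launchers, so adapt the arguments accordingly.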