Platform for AI: Use instance interconnection for distributed training

Last Updated: Mar 11, 2026

This topic describes how to perform distributed training across multiple DSW instances by using instance interconnection.

Prerequisites

  • Create multiple instances from a general computing resource group or Lingjun resource group in the same VPC.

  • Set the Internet access gateway for the resource group to Private Gateway.

  • Place all instances in the same cluster. Instances from Lingjun resource groups cannot interconnect with instances from general computing resource groups.

  • Some instance types support Remote Direct Memory Access (RDMA) or enhanced RDMA (eRDMA). For more information, see Default variables (pre-configured by the platform) and Limitations.

    DSW and DLC provide the same RDMA/eRDMA features. For more information, see the DLC documentation.

Key features

Instance interconnection lets you develop and debug distributed tasks that span multiple machines and GPUs directly from DSW instances.

Steps

  1. Use DSW instance cloning to start the required number of instances with identical environments.

  2. (Optional) Install the RDMA/eRDMA library.

    1. For Lingjun resources, use an image that contains RDMA. For more information, see Configure an image.

    2. For general computing resources, follow the instructions in Install eRDMA library.

  3. Test network connectivity by running ping with the target instance ID. For example: ping dsw-l28wnjdlyzj*********.

  4. Configure and debug distributed tasks using your chosen distributed framework.
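The ping test in step 3 verifies basic reachability between instances. Before launching a job, it can also help to confirm that the rendezvous port you plan to use accepts TCP connections. Below is a minimal sketch using only the Python standard library; the helper name, port number, and placeholder instance ID are illustrative assumptions, not part of the platform:

```python
import socket

def check_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the master instance's rendezvous port before launching.
# The instance ID below is a placeholder; interconnected DSW instances
# resolve each other's instance IDs as hostnames.
# check_reachable("dsw-master-instance-id", 23456)
```

A closed or unreachable port returns False instead of raising, so the check can be run in a loop while the master process starts up.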
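For step 4, one common choice of framework is PyTorch launched with torchrun, using the master instance's ID as the rendezvous address. The sketch below only assembles the launch command for each node; the script name, port, and GPU count are placeholder assumptions, not platform defaults:

```python
def build_torchrun_cmd(node_rank: int, nnodes: int, master_addr: str,
                       script: str = "train.py",
                       nproc_per_node: int = 8, port: int = 23456) -> list[str]:
    """Assemble the torchrun argv for one node of a multi-node job.

    On DSW, the master instance's ID can serve as master_addr because
    interconnected instances resolve each other's IDs as hostnames
    (the same names used in the ping test above).
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={port}",
        script,
    ]

# Run the same command on every instance, varying only node_rank.
# "dsw-master-example" is a placeholder instance ID.
print(" ".join(build_torchrun_cmd(node_rank=1, nnodes=2,
                                  master_addr="dsw-master-example")))
```

Each instance runs an identical command except for its own node_rank, which is the usual pattern for static multi-node torchrun launches.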