Platform For AI:Quick diagnostics

Last Updated: Jan 27, 2026

The quick diagnostics feature of PAI Lingjun Intelligent Computing Service checks the network and hardware status of Lingjun nodes and runs network tests with multiple communication libraries and communication models. This topic describes how to use the feature.

Self-service diagnostics

Network diagnostics

Network diagnostics includes the Static Configuration Check and Dynamic Runtime Check for Instances in Billing State. This feature diagnoses the network status of Lingjun nodes and displays the results in a visual format.

  • Static Configuration Check performs static configuration diagnostics on Lingjun nodes. These diagnostics include system software, network, and GPU checks.

    1. Log on to the Lingjun console.

    2. In the left-side navigation pane, choose Quick Diagnostics > Self-service Diagnostics.

    3. Click the Network Diagnostics tab.

    4. Click Static Configuration Check.

    5. In the Diagnostic Information area, select the target Cluster Name and Node ID from the drop-down lists.

    6. Click Start Diagnostics.

  • Dynamic Runtime Check performs dynamic runtime diagnostics on Lingjun nodes. These diagnostics include TCP connectivity, TCP latency, and Remote Direct Memory Access (RDMA) connectivity checks.

    1. In the left-side navigation pane, choose Quick Diagnostics > Self-service Diagnostics.

    2. Click the Network Diagnostics tab.

    3. Click Dynamic Runtime Check.

    4. In the Diagnostic Information area, select the target Cluster Name and Node ID from the drop-down lists.

    5. Click Start Diagnostics.
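The TCP connectivity and latency checks can be approximated by hand when you want a quick sanity check outside the console. The following Python sketch is not the console's implementation: it assumes that timing a TCP handshake is an acceptable stand-in for latency, and the function names `tcp_connect_latency` and `check_nodes` are hypothetical. It does not cover RDMA connectivity.

```python
import socket
import time


def tcp_connect_latency(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None if the connection fails.

    Assumption: handshake time is used here as a rough latency figure.
    """
    start = time.perf_counter()
    try:
        # create_connection resolves the address and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


def check_nodes(hosts, port, timeout=2.0):
    """Map each host to its connect latency in milliseconds (None means unreachable)."""
    return {host: tcp_connect_latency(host, port, timeout) for host in hosts}
```

Call `check_nodes` with the IP addresses of the nodes to probe and a port on which those nodes accept TCP connections. A `None` value indicates that the connection could not be established within the timeout.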

Server diagnostics

Server diagnostics checks the hardware status of Lingjun nodes and displays the results in a visual format.

  1. In the left-side navigation pane, choose Quick Diagnostics > Self-service Diagnostics.

  2. Click the Server Diagnostics tab.

  3. Click System Hardware Diagnostics.

  4. In the Diagnostic Information area, select the target Cluster Name and Node ID from the drop-down lists.

  5. Click Start Diagnostics.

Network tests

Network tests consist of general network tests and communication library tests. General network tests include latency and traffic tests. Communication library tests are performed using the Alibaba Cloud Collective Communication Library (ACCL), NVIDIA Collective Communications Library (NCCL), and multiple communication models.

General network tests

  1. In the left-side navigation pane, choose Quick Diagnostics > Network Test.

  2. Click the General Network Test tab.

  3. In the Test Information area, select a Network Protocol and Test Type, and then configure the corresponding parameters.

    RDMA protocol traffic test

    Configuration parameter: Description

    Traffic Model:
      • MtoN model: Tests the one-way connectivity from Clients nodes to Servers nodes. This includes scenarios such as a single Clients node to a single Servers node, and multiple Clients nodes to multiple Servers nodes.
      • Fullmesh model: Tests the connectivity between every pair of target Lingjun nodes.

    Test Duration: Select a fixed duration from the drop-down list. Unit: seconds.

    QP: The number of test process streams. This affects the test bandwidth.

    GDR: If you enable GDR, the Lingjun network interface controllers (NICs) are attached to the corresponding GPUs for the traffic test.

    Cluster Name: The name of the cluster where the target Lingjun nodes are located.

    Clients: If you select the MtoN model, select the Clients nodes.

    Servers: If you select the MtoN model, select the Servers nodes.

    Select Target Nodes: If you select the Fullmesh model, select the target nodes.

    Node Port: The starting port used for the test.

    RDMA protocol latency test

    Configuration parameter: Description

    Cluster Name: The name of the cluster where the target Lingjun nodes are located.

    Test Nodes: The latency test checks the network latency between every pair of the selected Test Nodes.

    Node Port: The starting port used for the test.

  4. Click Start Diagnostics.
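To see how many node pairs each traffic model covers, the two models can be sketched as pair enumerations. The following Python sketch rests on two assumptions that this topic does not confirm: the MtoN model pairs every Clients node with every Servers node, and the Fullmesh model tests each unordered pair once. The function names `mton_pairs` and `fullmesh_pairs` are hypothetical.

```python
from itertools import combinations


def mton_pairs(clients, servers):
    """One-way (client, server) pairs.

    Assumption: every Clients node is tested against every Servers node.
    """
    return [(client, server) for client in clients for server in servers]


def fullmesh_pairs(nodes):
    """Every pair of target nodes.

    Assumption: each unordered pair is counted once.
    """
    return list(combinations(nodes, 2))
```

Under these assumptions, M Clients nodes and N Servers nodes produce M × N one-way pairs, and a Fullmesh test over K nodes covers K × (K − 1) / 2 pairs, so the pair count grows quadratically with the number of selected nodes.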

Communication library test

  1. In the left-side navigation pane, choose Quick Diagnostics > Network Test.

  2. Click the Communication Library Test tab.

  3. Configure the Test Information.

    Configuration parameter: Description

    Communication Library Category: Only ACCL and NCCL are supported.

    Communication Model:
      • ALLReduce: Combines data from all processes with a reduction operation, such as sum, and distributes the result to all processes.
      • ALLGather: Gathers data from all processes so that each process holds the complete set of data.
      • ALLGatherA: Adds data type parameters to ALLGather. It can transfer various data types, including large data types and custom data types.
      • ALLToALL: Distributes data from each process to all other processes. Each process receives data from all other processes.
      • ALLToALLA: Adds data type and buffer parameters to ALLToALL. It can be used to exchange data of different sizes and types.
      • Broadcast: Distributes data from one process to all other processes.

    Number of GPUs: A value from 1 to 8.

    Cluster Name: The name of the cluster where the target Lingjun nodes are located.

    Select Target Nodes: Specify the IP address of each target node. You do not need to designate a Lingjun node to start the test.

    Node Port: The starting port used for the test.

  4. Click Start Test.
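The data movement behind the main communication models can be illustrated with a small single-process simulation. The following Python sketch does not call ACCL or NCCL. It models each process as a list in `buffers` and follows the conventional semantics of these collectives. The function names are hypothetical, and the ALLGatherA and ALLToALLA variants are not covered.

```python
def all_reduce(buffers, op=sum):
    """Every rank receives the element-wise reduction of all ranks' buffers."""
    reduced = [op(values) for values in zip(*buffers)]
    return [list(reduced) for _ in buffers]


def all_gather(buffers):
    """Every rank receives the concatenation of all ranks' buffers, in rank order."""
    gathered = [item for buf in buffers for item in buf]
    return [list(gathered) for _ in buffers]


def all_to_all(buffers):
    """Rank i sends its j-th element to rank j.

    Each buffer must hold one element per rank; the result is the transpose.
    """
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]


def broadcast(buffers, root=0):
    """Every rank receives a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]
```

For example, with two ranks that hold `[1, 2]` and `[3, 4]`, `all_reduce` leaves `[4, 6]` on both ranks, while `all_to_all` leaves `[1, 3]` on rank 0 and `[2, 4]` on rank 1.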

View reports

The diagnostic history lists the reports for Self-service Diagnostics and Network Tests. Each report includes the Report ID and Cluster Name. Click a tab to view the reports for that diagnostic type. In the Operations column of a report, click one of the following options:

  • View Report: View the results and details of the diagnosis.

  • Diagnose Again: Run the diagnosis again.