The quick diagnostics feature of PAI Lingjun Intelligent Computing Service lets you check the network and hardware status of Lingjun nodes and perform network tests based on multiple communication libraries and communication models. This topic describes the quick diagnostics feature.
Self-service diagnostics
Network diagnostics
Network diagnostics includes the Static Configuration Check and Dynamic Runtime Check for Instances in Billing State. This feature diagnoses the network status of Lingjun nodes and displays the results in a visual format.
Static Configuration Check performs static configuration diagnostics on Lingjun nodes. These diagnostics include system software, network, and GPU checks.
Log on to the Lingjun console.
In the navigation pane on the left, choose Quick Diagnostics > Self-service Diagnostics.
Click the Network Diagnostics tab.
Click Static Configuration Check.
In the Diagnostic Information area, select the target Cluster Name and Node ID from the drop-down lists.
Click Start Diagnostics.
Dynamic Runtime Check performs dynamic runtime diagnostics on Lingjun nodes. These diagnostics include TCP connectivity, TCP latency, and Remote Direct Memory Access (RDMA) connectivity checks.
In the navigation pane on the left, choose Quick Diagnostics > Self-service Diagnostics.
Click the Network Diagnostics tab.
Click Dynamic Runtime Check.
In the Diagnostic Information area, select the target Cluster Name and Node ID from the drop-down lists.
Click Start Diagnostics.
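To spot-check TCP connectivity and latency by hand from a node, a rough approximation is to time a TCP handshake. The following is a minimal sketch, not the implementation that the console uses. The function name tcp_connect_latency and the target address and port are assumptions made for illustration.

```python
import socket
import time


def tcp_connect_latency(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None if unreachable."""
    start = time.perf_counter()
    try:
        # create_connection performs the full TCP handshake before returning.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


if __name__ == "__main__":
    # Hypothetical target; replace with an address reachable from your node.
    for host, port in [("127.0.0.1", 22)]:
        latency = tcp_connect_latency(host, port)
        status = "unreachable" if latency is None else f"{latency:.2f} ms"
        print(f"{host}:{port} -> {status}")
```

A result of None means that the connection could not be established within the timeout. The measured time covers only the handshake, so it is a coarse indicator and does not replace the RDMA connectivity check.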
Server diagnostics
Server diagnostics checks the hardware status of Lingjun nodes and displays the results in a visual format.
In the navigation pane on the left, choose Quick Diagnostics > Self-service Diagnostics.
Click the Server Diagnostics tab.
Click System Hardware Diagnostics.
In the Diagnostic Information area, select the target Cluster Name and Node ID from the drop-down lists.
Click Start Diagnostics.
Network tests
Network tests consist of general network tests and communication library tests. General network tests include latency and traffic tests. Communication library tests are performed using the Alibaba Cloud Collective Communication Library (ACCL), NVIDIA Collective Communications Library (NCCL), and multiple communication models.
General network tests
In the navigation pane on the left, choose Quick Diagnostics > Network Test.
Click the General Network Test tab.
In the Test Information area, select a Network Protocol and Test Type, and then configure the corresponding parameters.
RDMA protocol traffic test
Configuration parameters and their descriptions:
Traffic Model: Select one of the following models.
  MtoN model: Tests the one-way connectivity from the Clients nodes to the Servers nodes. This covers scenarios such as a single client node to a single server node, and multiple client nodes to multiple server nodes.
  Fullmesh model: Tests the connectivity between every pair of target Lingjun nodes.
Test Duration: Select a fixed duration from the drop-down list. Unit: seconds.
QP: The number of queue pairs (QPs), that is, the number of test process streams. This value affects the test bandwidth.
GDR: If you enable GPUDirect RDMA (GDR), the Lingjun network interface controllers (NICs) are attached to the corresponding GPUs for the traffic test.
Cluster Name: The name of the cluster where the target Lingjun nodes are located.
Clients: If you select the MtoN model, select the client nodes.
Servers: If you select the MtoN model, select the server nodes.
Select Target Nodes: If you select the Fullmesh model, select the target nodes.
Node Port: The starting port used for the test.
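The difference between the two traffic models can be illustrated by listing the node pairs that each one covers. The following is a minimal sketch under two assumptions: every client node is paired with every server node in the MtoN model, and the Fullmesh model treats pairs as unordered. The function names and node names are hypothetical, and the service may schedule the pairs differently.

```python
from itertools import combinations, product


def mton_pairs(clients, servers):
    """One-way (client, server) pairs; a node is never paired with itself."""
    return [(c, s) for c, s in product(clients, servers) if c != s]


def fullmesh_pairs(nodes):
    """Every unordered pair of distinct target nodes."""
    return list(combinations(nodes, 2))


if __name__ == "__main__":
    # Hypothetical node names used only for illustration.
    print(mton_pairs(["node-a", "node-b"], ["node-c"]))
    print(fullmesh_pairs(["node-a", "node-b", "node-c"]))
```

Under these assumptions, a Fullmesh test over n nodes covers n(n-1)/2 pairs, so the number of tested pairs grows quadratically with the number of selected nodes.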
RDMA protocol latency test
Configuration parameters and their descriptions:
Cluster Name: The name of the cluster where the target Lingjun nodes are located.
Test Nodes: The latency test checks the network latency between every pair of the selected Test Nodes.
Node Port: The starting port used for the test.
Click Start Diagnostics.
Communication library test
In the navigation pane on the left, choose Quick Diagnostics > Network Test.
Click the Communication Library Test tab.
Configure the parameters in the Test Information area.
Configuration parameters and their descriptions:
Communication Library Category: Select ACCL or NCCL. Only these two libraries are currently supported.
Communication Model: Select one of the following models.
  ALLReduce: Aggregates data from multiple processes, reduces the data to a single value through an operation, and distributes the result to all processes.
  ALLGather: Gathers data from all processes into one structure so that each process can access the data.
  ALLGatherA: Adds data type parameters to ALLGather. It can transfer various data types, including large data types and custom data types.
  ALLToALL: Distributes data from each process to other processes. Each process receives data from all other processes.
  ALLToALLA: Adds data type and buffer parameters to ALLToALL. It can be used to exchange data of different sizes and types.
  Broadcast: Distributes data from one process to all other processes.
Number of GPUs: Valid values: 1 to 8.
Cluster Name: The name of the cluster where the target Lingjun nodes are located.
Select Target Nodes: Specify the IP address of each target node. You do not need to designate a Lingjun node to start the test.
Node Port: The starting port used for the test.
Click Start Test.
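The data movement behind the main communication models can be illustrated without any GPU or communication library. The following is a minimal single-process sketch in which each inner list stands for the buffer of one process. It only models the semantics described in the table and does not call ACCL or NCCL. All function names are hypothetical.

```python
def all_reduce(buffers, op=sum):
    """Every process ends up with the element-wise reduction of all buffers."""
    reduced = [op(values) for values in zip(*buffers)]
    return [list(reduced) for _ in buffers]


def all_gather(buffers):
    """Every process ends up with the concatenation of all buffers."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def all_to_all(buffers):
    """Process i sends its j-th element to process j (a transpose)."""
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]


def broadcast(buffers, root=0):
    """Every process ends up with a copy of the root process's buffer."""
    return [list(buffers[root]) for _ in buffers]


if __name__ == "__main__":
    data = [[1, 2], [3, 4]]  # two processes, two elements each
    print(all_reduce(data))
    print(all_gather(data))
    print(all_to_all(data))
    print(broadcast(data, root=1))
```

In a real test, the same patterns run across GPUs and nodes, so the measured bandwidth depends on the selected model, the number of GPUs, and the network between the target nodes.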
View reports
The diagnostic history displays reports for Self-service Diagnostics and Network Tests. Each report includes the Report ID and Cluster Name. Click the tabs to view reports for different diagnostic types. In the Operations column of a report, click one of the following options:
View Report: View the results and details of the diagnosis.
Diagnose Again: Run the diagnosis again.