Cloud Native Application Performance Optimizer (CNP) is an all-in-one platform for evaluating, analyzing, and optimizing the performance of cloud-native applications. CNP automatically and efficiently evaluates the training performance of Lingjun clusters and provides performance optimization suggestions. This topic describes how to use CNP to perform a performance evaluation.
Go to the CNP platform
Log on to the Intelligent Computing Lingjun console.
In the left-side navigation pane, choose Performance Evaluation > CNP Performance Evaluation.
On the CNP platform, start a performance evaluation and view the evaluation results.
In the lower-left corner of any CNP page, click Return to the Lingjun console to go back to the Intelligent Computing Lingjun console.
Start a performance evaluation
Step 1: Select a cluster
Click Start Evaluation on the Welcome page or click Initiate Evaluation on the Performance Evaluation page to go to the Select Cluster step.
Select the cluster whose performance you want to evaluate.
Authorize CNP to access Deep Learning Containers (DLC). After you configure the required parameters, click Authorize and Test Connectivity. If the connectivity test passes, a message indicates that the connection is successful. Otherwise, the cause of the connection failure is returned. The following table describes common failure causes and their solutions. A rough network reachability check that you can run from your own environment is sketched after the table.
| Failure cause | Recommended solution |
| --- | --- |
| The connection times out. | Enable a whitelist for access to CNP and try again. |
| The specified information is invalid. | Check the values that you specified for the AccessID, AccessKey, Workspace, and Endpoint parameters. Correct the invalid values and try again. |
| The D3001 error code is returned: a Security Token Service (STS) token failed to be obtained. | |
| The D3002 error code is returned: a service-linked role for CNP failed to be created. | |
| The D3003 error code is returned: an Application Real-Time Monitoring Service (ARMS) instance failed to be created. | |
| The D3004 error code is returned: ARMS is not activated. | Activate ARMS. |
| The D3005 error code is returned: ARMS information failed to be obtained. | |
| The D3006 error code is returned: the current account does not have permissions to create a service-linked role for CNP. | Grant the account the permissions to create a service-linked role for CNP. |
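If you want to rule out basic network issues from your own environment before you retry the test, a minimal reachability probe such as the following sketch can help distinguish a timeout (often a whitelist issue) from other failures. This is not the CNP console's own check, and it does not validate your AccessID/AccessKey pair or workspace; the endpoint value is a placeholder for the DLC endpoint that you configured in this step.

```python
# Rough reachability probe for the configured DLC endpoint; not the CNP console's check.
# It only tests whether the endpoint is reachable over HTTPS; it does not validate the
# AccessID/AccessKey pair or the workspace.
import socket
from urllib.parse import urlparse

def probe_endpoint(endpoint, timeout=5):
    host = urlparse(endpoint).hostname or endpoint   # accept either a URL or a bare host name
    try:
        with socket.create_connection((host, 443), timeout=timeout):
            return "reachable"
    except socket.timeout:
        return "timed out: check whether the whitelist for CNP access is enabled"
    except OSError as exc:
        return f"unreachable: {exc}"

# Example (replace the placeholder with your DLC endpoint):
# print(probe_endpoint("<your DLC endpoint>"))
```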
After the connectivity test is passed, click Next Step to go to the Select Test Plan step.
Step 2: Select a test plan
Select a test plan template
The system provides two built-in test plan templates. Select the one that matches your business scenario.
| Plan | Test content | Scale of the cluster to be tested |
| --- | --- | --- |
| Test plan for LLM-based scenarios | | |
| Test plan for image recognition scenarios | | |
Create a custom test plan
If the test plan templates provided by the system cannot meet your test requirements, you can create a custom test plan.
Single GPU test: The number of nodes can be customized. By default, MatMul is used as the test case. A minimal sketch of a MatMul-based TFLOPS measurement appears after this list.
Single machine test: The number of nodes can be customized. By default, Bert-base is used in this test case.
AI model test: The AI models and the number of GPUs to be evaluated for a cluster can be customized.
The following AI models are supported: LLaMA-7B, Stable Diffusion, Swin Transformer, Bert-base, and UNet.
By default, the basic parameter settings are used. You can view the parameter settings on the test details page.
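For reference, the following is a minimal sketch of what a MatMul-based single GPU TFLOPS measurement could look like. It uses PyTorch and CUDA events; the matrix size, data type, and iteration count are assumptions for illustration and do not reflect CNP's actual test case or its default parameter settings.

```python
# Minimal MatMul TFLOPS probe (illustrative only; not CNP's test case).
import torch

def matmul_tflops(size=8192, iters=20, device="cuda"):
    a = torch.randn(size, size, device=device, dtype=torch.float16)
    b = torch.randn(size, size, device=device, dtype=torch.float16)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000   # elapsed_time returns milliseconds
    flops = 2 * size ** 3 * iters              # ~2 * N^3 FLOPs per N x N matmul
    return flops / seconds / 1e12              # TFLOPS

if __name__ == "__main__":
    print(f"Measured {matmul_tflops():.1f} TFLOPS")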
Estimate the evaluation duration
After you select a test plan, the evaluation duration is automatically estimated based on the test content in the plan and the maximum scale of the cluster that you selected in Step 1. If the currently available scale of your cluster is smaller than its maximum scale, the actual evaluation takes longer than the estimate.
Start the evaluation
After Step 1 and Step 2 are complete, click Start Evaluation to start the evaluation and wait for the evaluation results.
View the evaluation progress and results
After the test plan is created, you can view the status and progress of the test plan in real time on the Plans tab. Find the test plan and click Details in the Actions column to go to the details page of the test plan and view the progress of each test.
Single GPU test
The test is passed.
If no suspected faulty GPUs or warning GPUs are detected among the tested GPUs, the single GPU test is passed.
Note:
Suspected faulty GPU: The task that tests the GPU fails, and the GPU may be faulty.
Warning GPU: The TFLOPS of the GPU falls outside the normal threshold range in more than 5% of the iterations.
Logic for calculating the normal threshold range: In each iteration, take the median TFLOPS of all GPUs as the baseline. Compare the 3% band around the baseline (103% and 97% of the baseline) with the 4σ band (four standard deviations), and use the wider band as the upper and lower bounds of the normal threshold range.
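One way to read the threshold rule above is sketched below in Python: per iteration, the band is the wider of ±3% of the median and ±4 standard deviations, and a GPU becomes a warning GPU if it falls outside the band in more than 5% of iterations. The input layout and function names are assumptions, not CNP's implementation; the same logic applies to per-node throughput in the single machine test described later.

```python
# Illustrative sketch of the warning rule above; not CNP source code.
# Assumed input layout: tflops[i][g] is the TFLOPS of GPU g in iteration i.
from statistics import median, pstdev

def normal_range(values, rel_band=0.03, n_sigma=4):
    """Per-iteration normal range: the median across GPUs is the baseline,
    and the band is the wider of +/-3% of the baseline and +/-4 sigma."""
    baseline = median(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    half_width = max(rel_band * baseline, n_sigma * sigma)
    return baseline - half_width, baseline + half_width

def warning_indexes(tflops, max_violation_ratio=0.05):
    """Flag a GPU as a warning GPU if it falls outside the normal range
    in more than 5% of iterations."""
    violations = [0] * len(tflops[0])
    for iteration in tflops:
        low, high = normal_range(iteration)
        for idx, value in enumerate(iteration):
            if not (low <= value <= high):
                violations[idx] += 1
    return [idx for idx, count in enumerate(violations)
            if count / len(tflops) > max_violation_ratio]
```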
Test results are abnormal.
If at least one suspected faulty GPU or warning GPU is detected among the tested GPUs, the single GPU test returns abnormal results.
In the evaluation task list, you can click the plus (+) icon to view the details of suspected faulty GPUs or warning GPUs. You can report the detected abnormal nodes to the O&M team for further troubleshooting. Click Evaluation Details in the Actions column to view the results of the evaluation task.
Single machine test
The test is passed.
If no suspected faulty nodes or warning nodes are detected among the tested nodes, the single machine test is passed.
Note:
Suspected faulty node: The DLC job on the node fails, and the node may be faulty.
Warning node: The throughput of the node falls outside the normal threshold range in more than 5% of the iterations.
Logic for calculating the normal threshold range: In each iteration, take the median throughput of all nodes as the baseline. Compare the 3% band around the baseline (103% and 97% of the baseline) with the 4σ band (four standard deviations), and use the wider band as the upper and lower bounds of the normal threshold range.
Test results are abnormal.
If at least one suspected faulty node or warning node is detected among the tested nodes, the single machine test returns abnormal results.
In the evaluation task list, you can click the plus (+) icon to view the details of suspected faulty nodes or warning nodes. You can report the detected abnormal nodes to the O&M team for further troubleshooting. Click Evaluation Details in the Actions column to view the results of the evaluation task.
AI model test
Test progress
Pending: All tasks are ready to be run.
Completed: All tasks have finished running; each task succeeded, failed, or was stopped.
Stopped: All tasks are stopped.
Running: Some tasks have finished, and the remaining tasks are waiting to run or are running.
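The status definitions above can be summarized with a small sketch. The task state names ("Pending", "Succeeded", and so on) are assumptions for illustration, not part of CNP's API.

```python
# Illustrative mapping from task states to the plan-level test progress above.
TERMINAL_STATES = {"Succeeded", "Failed", "Stopped"}

def test_progress(task_states):
    if all(state == "Pending" for state in task_states):
        return "Pending"      # every task is still waiting to run
    if all(state == "Stopped" for state in task_states):
        return "Stopped"      # every task was stopped
    if all(state in TERMINAL_STATES for state in task_states):
        return "Completed"    # every task succeeded, failed, or was stopped
    return "Running"          # some tasks finished, others pending or running
```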
Evaluation tasks
You can view all tasks that are included in the AI model test of the current test plan. To stop a task that is in progress, click Stop. You can delete any task.
Warning: The data of deleted or failed tasks is not collected on the performance dashboard. Proceed with caution when you delete tasks.
View the evaluation results on the performance dashboard
Go to the performance dashboard of a test plan
If the test plan is in the Completed state, you can click Performance Report in the Actions column to view the results of the test plan on the performance dashboard. The performance dashboard displays the evaluation tasks that are successfully run in the AI model test of the test plan.
Content displayed on the performance dashboard
Scalability of test models
For each model, the performance dashboard displays the throughput trend across the numbers of GPUs tested in the current test plan, which indicates how well the model scales in the cluster. Results are not compared across different models.
Formula: Scalability score = log₂(Model throughput / Model throughput at the smallest GPU scale)
In the following example, the GPT3-175B model is used for illustration only, and the numbers are mock data. A sketch that recomputes the scores follows the table.
| Number of GPUs | Throughput | Scalability score | Theoretical scalability score |
| --- | --- | --- | --- |
| 64 | 10 | log₂(10/10) | log₂1 |
| 128 | 18 | log₂(18/10) | log₂2 |
| 256 | 35 | log₂(35/10) | log₂4 |
| 512 | 69 | log₂(69/10) | log₂8 |
| 1,024 | 137 | log₂(137/10) | log₂16 |
Note: The performance scalability is better if the scalability score is closer to the theoretical scalability score.
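The following Python sketch recomputes the scalability scores and theoretical scores from the mock numbers in the table above; you can reuse the same arithmetic with your own throughput measurements.

```python
# Recomputes the scalability scores from the mock GPT3-175B numbers in the table above.
import math

throughput = {64: 10, 128: 18, 256: 35, 512: 69, 1024: 137}  # GPUs -> throughput (mock data)
base_gpus = min(throughput)                 # the smallest GPU scale is the baseline
base_throughput = throughput[base_gpus]

for gpus in sorted(throughput):
    score = math.log2(throughput[gpus] / base_throughput)   # measured scalability score
    theoretical = math.log2(gpus / base_gpus)                # score under ideal linear scaling
    print(f"{gpus:>5} GPUs: score={score:.2f}, theoretical={theoretical:.2f}")
```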
Evaluation result details
In the evaluation result details, you can view metrics such as throughput, Model FLOPs Utilization (MFU), and iteration latency for each model across the numbers of GPUs tested in the current test plan. The y-axis indicates the number of GPUs, and the x-axis indicates the metric value.