Platform for AI: Use CNP to perform a performance evaluation

Last Updated: Jan 10, 2025

Cloud Native Application Performance Optimizer (CNP) is an all-in-one platform for evaluating, analyzing, and optimizing the performance of cloud-native applications. CNP automatically and efficiently evaluates the training performance of Lingjun clusters and provides suggestions on performance optimization. This topic describes how to use CNP to perform a performance evaluation.

Go to the CNP platform

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Performance Evaluation > CNP Performance Evaluation.

  3. On the CNP platform, start a performance evaluation and view the evaluation results.

  4. In the lower-left corner of the page, click Return to the Lingjun console to return to the Intelligent Computing Lingjun console.

Start a performance evaluation

Step 1: Select a cluster

Click Start Evaluation on the Welcome page or click Initiate Evaluation on the Performance Evaluation page to go to the Select Cluster step.

  • Select the cluster whose performance you want to evaluate.

  • Authorize CNP to access Deep Learning Containers (DLC). After you configure the required parameters, click Click Authorize and Test Connectivity. If the connectivity test passes, a message indicates that the connection is successful. Otherwise, the cause of the connection failure is returned. The following table describes common failure causes and their solutions:

    | Failure cause | Recommended solution |
    | --- | --- |
    | The connection times out. | Enable a whitelist for access to CNP and try again. |
    | The specified information is invalid. | Check the values that you specified for the AccessID, AccessKey, Workspace, and Endpoint parameters. Modify the invalid values and try again. |
    | The D3001 error code is returned, which indicates that a Security Token Service (STS) token failed to be obtained. | |
    | The D3002 error code is returned, which indicates that a service-linked role for CNP failed to be created. | |
    | The D3003 error code is returned, which indicates that an Application Real-Time Monitoring Service (ARMS) instance failed to be created. | |
    | The D3004 error code is returned, which indicates that ARMS is not activated. | Activate ARMS. |
    | The D3005 error code is returned, which indicates that ARMS information failed to be obtained. | |
    | The D3006 error code is returned, which indicates that the current account does not have permissions to create a service-linked role for CNP. | Grant the account the permissions to create a service-linked role for CNP. |

After the connectivity test passes, click Next Step to go to the Select Test Plan step.
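The two most common failures in the preceding table, connection timeouts and invalid credentials, can be pre-checked from your own environment before you start the authorization. The following is a minimal, hypothetical sketch of such a client-side pre-check. It is not part of CNP or the DLC SDK, and it only approximates the check that CNP performs: it verifies that the four parameters are filled in and that the endpoint is reachable within a timeout.

```python
import socket
from urllib.parse import urlparse

def precheck_dlc_connection(access_id: str, access_key: str,
                            workspace: str, endpoint: str,
                            timeout_seconds: float = 5.0) -> None:
    """Hypothetical pre-check mirroring the two most common failure causes:
    missing or invalid parameter values, and connection timeouts (for example,
    when the endpoint is unreachable because no whitelist is configured)."""
    # "The specified information is invalid": make sure nothing is empty.
    params = {"AccessID": access_id, "AccessKey": access_key,
              "Workspace": workspace, "Endpoint": endpoint}
    missing = [name for name, value in params.items() if not value.strip()]
    if missing:
        raise ValueError(f"Empty parameter(s): {', '.join(missing)}")

    # "The connection times out": check basic network reachability of the endpoint.
    host = urlparse(endpoint).hostname or endpoint
    try:
        with socket.create_connection((host, 443), timeout=timeout_seconds):
            pass
    except OSError as exc:
        raise TimeoutError(
            f"Cannot reach {host}:443 within {timeout_seconds}s; "
            "check the whitelist for access to CNP and try again."
        ) from exc
```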

Step 2: Select a test plan

Select a test plan template

The system provides two built-in test plan templates. You can select one of the templates based on your business scenario.

Each template specifies the test content and the scale of the cluster to be tested.

Test plan for LLM-based scenarios

  Test content:

  • Single GPU test: MatMul (matrix operator)

  • Single machine test: Bert-base

  • AI model test: LLaMA-7B

  Scale of the cluster to be tested:

  • Single GPU test: By default, the test is performed based on the maximum scale of your cluster.

  • Single machine test: By default, the test is performed based on the maximum scale of your cluster.

  • AI model test: By default, the system creates evaluation tasks for 8 GPUs, 16 GPUs, 32 GPUs, 64 GPUs, 128 GPUs, 256 GPUs, and 512 GPUs based on the maximum scale of your cluster. If the maximum scale of your cluster is 100 GPUs, the system creates evaluation tasks only for 8 GPUs, 16 GPUs, 32 GPUs, and 64 GPUs. The scale-selection rule is illustrated in the sketch after these templates.

Test plan for image recognition scenarios

  Test content:

  • Single GPU test: MatMul (matrix operator)

  • Single machine test: Bert-base

  • AI model test: Swin Transformer and Stable Diffusion

  Scale of the cluster to be tested:

  • Single GPU test: By default, the test is performed based on the maximum scale of your cluster.

  • Single machine test: By default, the test is performed based on the maximum scale of your cluster.

  • AI model test: By default, the system creates evaluation tasks for 8 GPUs, 16 GPUs, 32 GPUs, and 64 GPUs based on the maximum scale of your cluster. If the maximum scale of your cluster is 16 GPUs, the system creates evaluation tasks only for 8 GPUs and 16 GPUs.
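The scale selection for the AI model test follows a simple doubling rule: evaluation tasks are created for powers of two starting at 8 GPUs, up to the smaller of the template's largest scale and the maximum scale of your cluster. The following is a minimal sketch of that rule, assuming only the behavior described above; the function name is illustrative and not part of CNP.

```python
def planned_ai_model_scales(cluster_max_gpus: int, template_max_gpus: int = 512) -> list[int]:
    """Return the GPU counts for which AI model evaluation tasks are created.

    cluster_max_gpus: maximum scale of the selected cluster.
    template_max_gpus: largest scale in the template (512 for the LLM-based
    template, 64 for the image recognition template).
    """
    scales = []
    gpus = 8
    while gpus <= min(cluster_max_gpus, template_max_gpus):
        scales.append(gpus)
        gpus *= 2
    return scales

# Example from the LLM-based template: a cluster with a maximum scale of
# 100 GPUs gets evaluation tasks for 8, 16, 32, and 64 GPUs only.
print(planned_ai_model_scales(100))        # [8, 16, 32, 64]
print(planned_ai_model_scales(600))        # [8, 16, 32, 64, 128, 256, 512]
print(planned_ai_model_scales(16, 64))     # image recognition template: [8, 16]
```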

Create a custom test plan

If the test plan templates provided by the system cannot meet your test requirements, you can create a custom test plan.

  1. Single GPU test: The number of nodes can be customized. By default, MatMul is used in this test case.

  2. Single machine test: The number of nodes can be customized. By default, Bert-base is used in this test case.

  3. AI model test: The AI models and the number of GPUs to be evaluated for a cluster can be customized.

Note
  • The following AI models are supported: LLaMA-7B, Stable Diffusion, Swin Transformer, Bert-base, and UNet.

  • By default, the basic parameter settings are used. You can view the parameter settings on the test details page.

Estimate the evaluation duration

After you select a test plan, the evaluation duration is automatically estimated based on the test content in the plan and the maximum scale of the cluster that you selected in Step 1. If the available scale of your cluster is smaller than its maximum scale, the actual evaluation duration is longer than the estimate.

Start the evaluation

After Step 1 and Step 2 are complete, click Start Evaluation to start the evaluation and wait for the evaluation results.

View the evaluation progress and results

After the test plan is created, you can view the status and progress of the test plan in real time on the Plans tab. Find the test plan and click Details in the Actions column to go to the details page of the test plan and view the progress of each test.

Single GPU test

  • The test is passed.

    If no suspected faulty GPUs or warning GPUs are detected among the tested GPUs, the single GPU test is passed.

    Note
    • Suspected faulty GPU: The task that tests the GPU fails, and the GPU may be faulty.

    • Warning GPU: The TFLOPS of the GPU falls outside the normal threshold range in more than 5% of the iterations.

    • Logic for calculating the normal threshold range: Take the median TFLOPS of all GPUs in each iteration as the baseline, and compare the 97% to 103% band around the baseline with the band defined by four standard deviations (4σ) above and below the baseline. The wider band is used as the normal threshold range. (See the sketch after this list.)

  • Test results are abnormal.

    If at least one suspected faulty GPU or warning GPU is detected among the tested GPUs, the single GPU test returns abnormal results.

    In the evaluation task list, you can click the plus (+) icon to view the details of suspected faulty GPUs or warning GPUs. You can report the detected abnormal GPUs to the O&M team for further troubleshooting. Click Evaluation Details in the Actions column to view the results of the evaluation task.
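The following is a minimal sketch of the warning check described in the note above, assuming the threshold interpretation given there (per iteration, the wider of the ±3% band and the ±4σ band around the median). The function and data layout are illustrative and not CNP code. The single machine test that follows applies the same logic to per-node throughput instead of per-GPU TFLOPS.

```python
import statistics

def warning_devices(metric_by_iteration, warn_ratio=0.05):
    """Flag devices whose metric falls outside the normal threshold range
    in more than warn_ratio of the iterations.

    metric_by_iteration: one dict per iteration, mapping a device ID (a GPU
    in the single GPU test, a node in the single machine test) to the metric
    measured in that iteration (TFLOPS or throughput).
    """
    out_of_range = {}                               # device ID -> out-of-range count
    n_iterations = len(metric_by_iteration)

    for iteration in metric_by_iteration:
        values = list(iteration.values())
        baseline = statistics.median(values)        # median across all devices
        sigma = statistics.pstdev(values)           # standard deviation across devices
        # Normal threshold range: the wider of the +-3% band and the +-4 sigma band.
        half_width = max(0.03 * baseline, 4 * sigma)
        low, high = baseline - half_width, baseline + half_width
        for device_id, value in iteration.items():
            if not (low <= value <= high):
                out_of_range[device_id] = out_of_range.get(device_id, 0) + 1

    return [device_id for device_id, count in out_of_range.items()
            if count > warn_ratio * n_iterations]
```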

Single machine test

  • The test is passed.

    If no suspected faulty nodes or warning nodes are detected among the tested nodes, the single machine test is passed.

    Note
    • Suspected faulty node: The DLC job on the node fails, and the node may be faulty.

    • Warning node: The throughput of the node falls outside the normal threshold range in more than 5% of the iterations.

    • Logic for calculating the normal threshold range: Take the median throughput of all nodes in each iteration as the baseline, and compare the 97% to 103% band around the baseline with the band defined by four standard deviations (4σ) above and below the baseline. The wider band is used as the normal threshold range. This is the same logic as in the single GPU test, applied to node throughput (see the sketch above).

  • Test results are abnormal.

    If at least one suspected faulty node or warning node is detected among the tested nodes, the single machine test returns abnormal results.

    In the evaluation task list, you can click the plus (+) icon to view the details of suspected faulty nodes or warning nodes. You can report the detected abnormal nodes to the O&M team for further troubleshooting. Click Evaluation Details in the Actions column to view the results of the evaluation task.

AI model test

  • Test progress (the test status is derived from the statuses of its tasks; see the sketch after this list)

    Pending: All tasks are ready to be run.

    Completed: All tasks are successfully run, failed, or stopped.

    Stopped: All tasks are stopped.

    Running: Some tasks are complete, and some tasks are ready to be run or being run.

  • Evaluation tasks

    You can view all tasks that are included in the AI model test of the current test plan. If you want to stop a task that is in progress, you can click Stop. All tasks can be deleted.

    Warning

    The data of deleted or failed tasks is not collected on the performance dashboard. Proceed with caution when you delete tasks.
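The test progress states listed above can be read as a simple aggregation over the task statuses. The following is a minimal sketch of that reading, under the assumption that each task is in one of the states Ready, Running, Succeeded, Failed, or Stopped; these names are illustrative and not CNP's internal values.

```python
def test_progress(task_statuses: list[str]) -> str:
    """Derive the AI model test progress from its task statuses.

    Assumed task statuses: "Ready", "Running", "Succeeded", "Failed", "Stopped".
    """
    if not task_statuses or all(status == "Ready" for status in task_statuses):
        return "Pending"                  # all tasks are ready to be run
    if all(status == "Stopped" for status in task_statuses):
        return "Stopped"                  # all tasks are stopped
    if all(status in ("Succeeded", "Failed", "Stopped") for status in task_statuses):
        return "Completed"                # every task finished: succeeded, failed, or stopped
    return "Running"                      # some tasks finished, others ready or running

print(test_progress(["Succeeded", "Failed", "Stopped"]))   # Completed
print(test_progress(["Succeeded", "Ready", "Running"]))    # Running
```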

View the evaluation results on the performance dashboard

Go to the performance dashboard of a test plan

If the test plan is in the Completed state, you can click Performance Report in the Actions column to view the results of the test plan on the performance dashboard. The performance dashboard displays the evaluation tasks that are successfully run in the AI model test of the test plan.

Content displayed on the performance dashboard

Scalability of test models

For each model, the performance dashboard displays the throughput trend across the numbers of GPUs tested in the current test plan, which indicates how well the model scales in the cluster. Results are not compared between different models.

Formula: Scalability score = log₂(Model throughput/Throughput of the model at the lowest tested specification)

Note

In the following example, the GPT3-175B model is used for illustration only, and the values are mock data.

| Number of GPUs | Throughput | Scalability score | Theoretical scalability score |
| --- | --- | --- | --- |
| 64 | 10 | log₂(10/10) | log₂1 |
| 128 | 18 | log₂(18/10) | log₂2 |
| 256 | 35 | log₂(35/10) | log₂4 |
| 512 | 69 | log₂(69/10) | log₂8 |
| 1,024 | 137 | log₂(137/10) | log₂16 |

Note: The closer the scalability score is to the theoretical scalability score, the better the performance scalability.
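A minimal sketch of how the scores in the preceding table are computed from the mock throughput values, following the formula above; the variable names are illustrative.

```python
import math

# Mock data from the preceding table: number of GPUs -> measured throughput.
throughput = {64: 10, 128: 18, 256: 35, 512: 69, 1024: 137}

base_gpus = min(throughput)                  # lowest tested specification (64 GPUs)
base_throughput = throughput[base_gpus]      # throughput at that specification (10)

for gpus in sorted(throughput):
    score = math.log2(throughput[gpus] / base_throughput)   # measured scalability score
    theoretical = math.log2(gpus / base_gpus)                # score under ideal linear scaling
    print(f"{gpus:>5} GPUs: score={score:.2f}, theoretical={theoretical:.2f}")
```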

Evaluation result details

In the evaluation result details, you can view metrics such as throughput, MFU, and iteration latency for each model at each number of GPUs tested in the current test plan. The y-axis indicates the number of GPUs, and the x-axis indicates the metric value.