PerfTracker is an online tool for performance analysis and diagnostics of large-scale model training. It performs high-precision, online monitoring of full-stack software and hardware. If a job experiences performance issues, PerfTracker collects the execution records for all CUDA kernel functions and Python functions from each worker, along with hardware monitoring records. It then generates an analysis report to automatically diagnose the causes of performance loss. This includes locating slow nodes, identifying bottleneck functions or functions with abnormally long execution times, and detecting hang issues. This topic describes how to use PerfTracker.
Limitation
Currently, PerfTracker supports only jobs that use the PyTorch framework.
Feature overview
Key features
Collects job execution records online. When a job has performance issues, PerfTracker collects the execution records of all functions for each worker, including CUDA kernel functions and Python functions. It also collects high-precision monitoring data for hardware, such as GPUs, NVLink, PCIe, and DRAM, at 100-microsecond granularity.
Performs function-level performance analysis. PerfTracker processes high-precision hardware and software monitoring data to generate performance reports for each function. It automatically diagnoses the causes of performance loss, such as slow nodes and bottleneck functions or functions with abnormally long execution times. This provides a basis for in-depth manual analysis and guides performance optimization efforts.
Solutions
Collects fine-grained function execution records for online training jobs. This online, high-granularity data collection improves the real-time performance and accuracy of diagnostics compared to offline reproduction and profiling.
Uses efficient performance analysis algorithms based on function execution records from multiple nodes. PerfTracker automates tasks that typically require manual performance analysis expertise, which enables efficient performance analysis and problem identification.
How it works
PerfTracker consists of a Collector and PerfDisplay. The Collector runs inside the user's job container and is independent of the training job process. PerfDisplay provides a visualization page that you can open on your local machine. The following figure shows the architecture of PerfTracker.

PerfTracker Collector: Performs high-precision online monitoring of full-stack information. It uses the Torch profiler API and nsys to collect raw monitoring data. A PerfTracker Collector can collect the following types of data:
Execution records of all functions during job runtime. This includes CUDA kernel functions such as compute and communication kernels, functions that launch kernels to the GPU, GPU memory operations, and Python functions. This data is used for code-level performance analysis and provides a complete record of program behavior.
Monitoring data for hardware components, such as GPUs, NVLink, PCIe, and DRAM, at 100-microsecond precision.
The following figure shows examples of the collected data.
CUDA kernel functions and GPU memory operations

Python functions and GPU kernel launch operations

Hardware monitoring data

PerfDisplay: Summarizes and analyzes the collected data to generate performance analysis reports and visual outputs.
How to use
Preparations
Prepare the training code. Import the PerfTracker module and mark the steps.
Import the PerfTracker module at the beginning of the training code. The following is sample code:
try:
    from c4d_perftracker_collector.PerfTracker import PerfTracker
    my_tracer = PerfTracker()
except:
    my_tracer = None
Mark the steps in the training code.
To use PerfTracker, you must mark steps in the training code. Each time my_tracer.step() is executed, PerfTracker records a step. This allows the background process to control the number of iterations for which data is collected.
while iteration < args.train_iters:
    ...  # Training code
    if my_tracer is not None:
        my_tracer.step()  # Mark a step
The following is a simple training.py code example that includes the import and step marking:
import torch
import time
import torch.distributed as dist
import argparse

try:
    from c4d_perftracker_collector.PerfTracker import PerfTracker
    my_tracer = PerfTracker()
except:
    my_tracer = None

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# Check if CUDA is available.
if torch.cuda.is_available():
    print("CUDA is available!")
    device = torch.device('cuda')  # Use the default CUDA device.
else:
    print("CUDA is not available.")
    device = torch.device('cpu')  # If CUDA is not available, use the CPU.

def matmul():
    matrix_a = torch.randn(1000, 1000)
    matrix_b = torch.randn(1000, 1000)
    # Move the matrices to the CUDA device.
    matrix_a = matrix_a.to(device)
    matrix_b = matrix_b.to(device)
    # Perform matrix multiplication.
    result = torch.matmul(matrix_a, matrix_b)
    result_cpu = result.to('cpu')
    print(result_cpu)
    del matrix_a, matrix_b, result
    torch.cuda.empty_cache()

for i in range(1000):
    matmul()
    time.sleep(dist.get_rank())  # Higher ranks sleep longer, which simulates slow workers.
    print("Epoch:", i)
    if my_tracer is not None:
        my_tracer.step()  # Mark a step
    dist.barrier()
Upload the prepared training code file (training.py) to your storage directory. This example uses an Object Storage Service (OSS) bucket.
Create a training job
When you create a training job, add a startup command to download and install PerfTracker. The other configurations are the same as for a regular job. After you complete the configuration, click OK to create the job. The following is an example:
Parameter
Description
Environment Information
Image config
Select PyTorch 2.0 or later. This example uses easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04.
Mount storage
Click OSS. Select the OSS directory where the training code file is stored. Set Mount Path to /mnt/data/.
Startup Command
# Download the PerfTracker installation package.
wget -t 5 -w 2 -P /mnt/data https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/c4d_perftracker_collector-1.4.0-py3-none-any.whl
# Install PerfTracker.
pip3 install /mnt/data/c4d_perftracker_collector-1.4.0-py3-none-any.whl
# Run the training code. For example, training.py.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=4 /mnt/data/training.py
/mnt/data/ is the dataset mount path. Replace it with your directory.
Note: For large-scale tasks, downloading the installation package at startup may be slow because of high concurrency. Download the PerfTracker installation package to your storage directory from the command line before you start the task.
Resource Information
Framework
Select PyTorch.
Job Resource
Select a resource specification. This example uses ecs.gn6e-c12g1.12xlarge.
Analysis mode
While the job is running, click the name of the target job. On the Overview tab, go to the Instance section and click Access Container in the Actions column for the master instance.

Run one of the following commands to save the analysis results. You can use these results to diagnose the cause of any performance issues.
Method 1: Save only the analysis results, without the raw trace.
c4d_perftracker --trigger-on --auto-analyze --output-dir /mnt/data/
Method 2: If you have enough storage space, such as on CPFS or OSS, you can use the following command. This command saves both the analysis results and the raw trace, which allows for manual confirmation after the issue is diagnosed. Note that a worker's trace file is typically several hundred megabytes. You can delete it manually after diagnosis.
c4d_perftracker --trigger-on --auto-analyze --output-dir /mnt/data/ --save-raw-trace all
The /mnt/data/ parameter specifies the directory where the raw trace is saved. You can set this to your dataset mount directory. The raw trace is then saved to the dataset.
After the analysis results are saved, you can view the analysis report in PerfDisplay. For more information, see the next section. PerfTracker also provides a wide range of parameters to help locate the root cause of performance issues. For more information, see the appendix.
View analysis results
Generate and save the analysis results as described in the Analysis mode step. After the command runs successfully, the system generates a <timestamp>/PerfDisplay folder in the current directory.
Copy the PerfDisplay folder to the data source's mount directory, /mnt/data. Then, download the PerfDisplay directory to your local machine. For instructions, see ossutil command-line interface 2.0. You can compress the folder before you download it.
tar -cvf trace.tar PerfDisplay/
mv trace.tar /mnt/data
In a local terminal, go to the PerfDisplay folder and run the sudo python3 app.py command (sudo may not be required on Linux). Then, open http://127.0.0.1:5000/ in a browser to view the task performance report on the visualization page.
PerfTracker displays performance reports for all functions that affect job performance and indicates whether performance anomalies exist. Functions are displayed by category. The web page also provides a series of options with detailed prompts and examples. The following sections describe the contents of the report.
GPU compute functions
GPU Compute:
[2025-03-04 06:04:00,046 PerfTracker] (compute_functions.py 131) INFO: {
"min/median/max GPU utilization (in [0,1])": [
0.27586059769318555,
0.28605496203987174,
0.2945494558115959
],
"workers with abnormal GPU utilization": {},
"major_kernel_executions": {
"void multi_tensor_apply_kernel<TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float>(long, int volatile*, TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float)320_1_1|512_1_1": {
"median cost per execution (ms)": 403.7,
"bottleneck ratio (in [0,1])": 0.01608086667957405
},
"sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize160x128x32_stage4_warpsize2x2x1_tensor16x8x16_kernel7_16_1|128_1_1": {
"median cost per execution (ms)": 130.0,
"bottleneck ratio (in [0,1])": 0.015779752711771233
},
"ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_32_1|128_1_1": {
"median cost per execution (ms)": 132.60000000000002,
"bottleneck ratio (in [0,1])": 0.013880912782219888
},
"void (anonymous namespace)::indexing_backward_kernel<c10::Half, 4>(long const*, long const*, c10::Half const*, c10::Half*, long, long, long, long, bool)256_16_1|32_4_1": {
"median cost per execution (ms)": 1202.25,
"bottleneck ratio (in [0,1])": 0.012148757934008617
},
"ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_24_1|128_1_1": {
"median cost per execution (ms)": 105.6,
"bottleneck ratio (in [0,1])": 0.005656117080836238
}
},
"workers with potential GPU issues": [],
"detailed report": {}
}
Report details:
"min/median/max GPU utilization (in [0,1])": For this job, the GPU utilization across all workers has a maximum of 29.4%, a minimum of 27.5%, and a median of 28.6%.
"workers with abnormal GPU utilization": This field is empty, which means no workers have GPU utilization values that are significant outliers. If this field were not empty, it would list the outlier worker IDs and their GPU utilization.
"major_kernel_executions": Lists the execution details of the GPU kernels with the longest total running times. This includes the median cost per execution and the bottleneck ratio, which is the percentage of the total end-to-end running time.
"workers with potential GPU issues": Lists the IDs of workers where GPU kernel functions run slowly. If this field is empty, all workers are performing normally.
"detailed report": Appears when "workers with potential GPU issues" is not empty. It lists which specific worker ran which kernel function slower than normal workers and by how much.
GPU memory operation functions
GPU memory operations:
[2025-03-04 06:04:00,048 PerfTracker] (gpu_mem.py 37) INFO: {
"Memcpy DtoD (Device -> Device)": {
"avg bottleneck ratio (in [0,1])": 0.010486858246092,
"abnormal_workers": {
"job_x08j11173.cloud.sqa.na131_2_122482.json": 0.010614755325049817,
"job_x08j11173.cloud.sqa.na131_8_122483.json": 0.0105935370201344,
"job_x08j11173.cloud.sqa.na131_1_122484.json": 0.010571838319204434,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.010551186610995748,
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.010408514784026183,
"job_x08j11173.cloud.sqa.na131_5_122489.json": 0.010394903160689894,
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.010387693451926115,
"job_x08j11173.cloud.sqa.na131_9_122488.json": 0.010372437296709398
}
}
}
Report details:
"avg bottleneck ratio (in [0,1])": During the monitoring period, the average bottleneck ratio for Memcpy DtoD operations in this job was 1.048%.
"abnormal_workers": The Memcpy DtoD function had an abnormally long execution time on eight workers. For GPU memory operation functions, a bottleneck ratio greater than 0.01 (1%) is considered abnormal. The bottleneck ratio is the running time, excluding the time that overlaps with computation.
Collective communication
Communication:
{
"nvlink ring send": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
"different from other workers": 0,
"features": {
"bottleneck ratio (in [0,1])": 0.2743985289797289,
"avg throughput (%)": 73.75921390374332,
"throughput std (%)": 11.384679144385027
}
},
"abnormal_workers": []
}
},
"nvlink ring recv": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_3_66933.json",
"different from other workers": 2,
"features": {
"bottleneck ratio (in [0,1])": 0.27346865947352955,
"avg throughput (%)": 72.70337362637363,
"throughput std (%)": 12.658093406593407
}
},
"abnormal_workers": []
}
},
"pcie sendrecv send": {
"ncclDevKernel_SendRecv(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
"different from other workers": 3,
"features": {
"bottleneck ratio (in [0,1])": 0.07248997985478658,
"avg throughput (%)": 46.667,
"throughput std (%)": 14.636000000000001
}
},
"abnormal_workers": []
}
},
"pcie sendrecv recv": {
"ncclDevKernel_SendRecv(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_7_66936.json",
"different from other workers": 1,
"features": {
"bottleneck ratio (in [0,1])": 0.0643436909425455,
"avg throughput (%)": 54.833333333333336,
"throughput std (%)": 14.166666666666666
}
},
"abnormal_workers": []
}
},
"pcie ring send": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
"different from other workers": 0,
"features": {
"bottleneck ratio (in [0,1])": 0.2743985289797289,
"avg throughput (%)": 41.36698734177215,
"throughput std (%)": 14.653768987341774
}
},
"abnormal_workers": []
}
},
"pcie ring recv": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
"different from other workers": 0,
"features": {
"bottleneck ratio (in [0,1])": 0.2743985289797289,
"avg throughput (%)": 41.5311475409836,
"throughput std (%)": 15.282721311475411
}
},
"abnormal_workers": []
}
}
}
This report categorizes collective communication functions by communication type and provides a performance analysis for each type.
"example_of_normal_worker": Lists the normal performance parameters for the function's execution. This includes the bottleneck ratio (the percentage of end-to-end time, excluding time that overlaps with computation), average throughput, and throughput standard deviation.
"abnormal_workers": If not empty, this field lists all workers with abnormal communication function performance and their corresponding performance metrics.
CUDA runtime
CUDA Runtime:
[2025-03-04 06:04:00,047 PerfTracker] (cuda_runtimes.py 43) INFO: {
"cudaLaunchKernel": {
"avg bottleneck ratio (in [0,1])": 0.039727736621541394,
"avg execution time / monitoring duration (in [0,1])": 0.06956947111288565,
"abnormal_workers": {
"job_x08j11173.cloud.sqa.na131_5_122489.json": 0.05342638907019616,
"job_x08j11173.cloud.sqa.na131_8_122483.json": 0.05125160206973098,
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.04770049253555521,
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.04358845044879828,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.042635952262081556,
"job_x08j11173.cloud.sqa.na131_9_122488.json": 0.0354174573296689,
"job_x08j11173.cloud.sqa.na131_1_122484.json": 0.023585242093250733,
"job_x08j11173.cloud.sqa.na131_2_122482.json": 0.02021630716304934
}
}
}
Report details:
"avg bottleneck ratio (in [0,1])": The average bottleneck ratio for cudaLaunchKernel in this job during the monitoring period was 3.97%. This value excludes time that overlaps with computation.
"avg execution time / monitoring duration (in [0,1])": The average execution time of cudaLaunchKernel was 6.95% of the total monitoring duration. This value includes time that overlaps with computation. The difference between the two values, roughly 3% in this example, is cudaLaunchKernel time that overlaps with GPU computation.
"abnormal_workers": The cudaLaunchKernel function had an abnormally long execution time on eight workers. For CUDA runtime functions, a bottleneck ratio greater than 0.01 (1%) is considered abnormal. The bottleneck ratio is the running time, excluding the time that overlaps with computation.
Python functions
Python functions:
[2025-03-04 06:04:00,048 PerfTracker] (python_functions.py 43) INFO: {
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: backward_step|megatron/core/pipeline_parallel/schedules.py: custom_backward|<built-in method run_backward of torch._C._EngineBase object at 0x>": {
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.16970858578301054,
"job_x08j11173.cloud.sqa.na131_5_122489.json": 0.16821543761561655,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.16787961852913025,
"job_x08j11173.cloud.sqa.na131_8_122483.json": 0.16769273336153187,
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.14482595694389258,
"job_x08j11173.cloud.sqa.na131_9_122488.json": 0.10359829140378449,
"job_x08j11173.cloud.sqa.na131_1_122484.json": 0.06543764774209325,
"job_x08j11173.cloud.sqa.na131_2_122482.json": 0.06217541348063737
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
"job_x08j11173.cloud.sqa.na131_9_122488.json": 0.02471835416438489,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.02022024568555683,
"job_x08j11173.cloud.sqa.na131_2_122482.json": 0.015394834126935101,
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.011625367332189284
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step": {
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.012272193902698852
},
"autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward|LinearWithGradAccumulationAndAsyncCommunicationBackward|torch/autograd/function.py: apply|torch/cuda/amp/autocast_mode.py: decorate_bwd|megatron/core/tensor_parallel/layers.py: backward|<built-in method matmul of Tensor object at 0x>|aten::matmul|aten::mm": {
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.014066713574814782,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.013168949365116213,
"job_x08j11173.cloud.sqa.na131_8_122483.json": 0.013000378454189552,
"job_x08j11173.cloud.sqa.na131_5_122489.json": 0.012500119397472594,
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.012470581043494208
},
"autograd::engine::evaluate_function: FastLayerNormFNBackward|FastLayerNormFNBackward|torch/autograd/function.py: apply|apex/contrib/layer_norm/layer_norm.py: backward|<built-in method ln_bwd of PyCapsule object at 0x>": {
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.010127612754279463
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving": {
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.01041679269251709
},
"autograd::engine::evaluate_function: torch::autograd::AccumulateGrad": {
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.013633967050768714
}
}
This report lists all Python functions whose execution time accounts for more than 1% of the total time, excluding time that overlaps with GPU computation, communication, or other operations. The functions are grouped by name. For each function, the report lists all workers where the execution time percentage is greater than 1%, along with the corresponding percentage on those workers.
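Because each key in this section is a full call stack with frames separated by "|", the output can be post-processed to show just the innermost function and the worker with the largest share. A minimal sketch, assuming the section has been saved to a file named python_functions.json (a hypothetical name):
import json

# Hypothetical file name: the "Python functions" section of the report saved as JSON.
with open("python_functions.json") as f:
    report = json.load(f)

for call_stack, workers in report.items():
    leaf = call_stack.split("|")[-1]       # Innermost frame of the call stack.
    worst_worker = max(workers, key=workers.get)
    share = workers[worst_worker]
    print(f"{share:.1%} of total time on {worst_worker}: {leaf}")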
Appendix
More PerfTracker parameters
Related references
For more information about the principles of PerfTracker, see PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production.