
Container Service for Kubernetes: Using AI Profiling via command line

Last Updated: Mar 26, 2026

AI Profiling is a non-intrusive performance analysis tool for GPU containers running in Kubernetes. It uses extended Berkeley Packet Filter (eBPF) and dynamic process injection to collect data from running containers without modifying application code or restarting pods. It covers five profiling dimensions: Python process execution, CPU function calls, system calls, CUDA library interactions, and CUDA kernel operations.

This topic describes how to deploy and use AI Profiling from the command line to identify performance bottlenecks in containerized GPU workloads.

Prerequisites

Before you begin, make sure you have:

  • An ACK cluster with Elastic Compute Service (ECS) or Lingjun nodes — profiling jobs only run on these node types

  • The kubectl plugin download link and the profiling image address. To obtain them, submit a ticket to the Container Service team

  • (If using Python profiling) Verified USDT availability in your workload container — see Verify USDT availability

Verify USDT availability

Python profiling depends on User Statically-Defined Tracing (USDT) in the Python interpreter. Run the following command inside your workload container:

python -c "import sysconfig; print(sysconfig.get_config_var('WITH_DTRACE'))"

The output must be 1 for Python profiling to work. If the output is 0 or empty, Python profiling is unavailable in that container.
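You can also run the check from outside the container with kubectl exec, and optionally confirm that the interpreter embeds the SystemTap SDT (stapsdt) notes through which USDT probes are exposed on Linux. The pod and container names below are placeholders, and the optional check requires readelf to be available in the container image:

kubectl exec -ti <pod-name> -c <container-name> -- \
    python -c "import sysconfig; print(sysconfig.get_config_var('WITH_DTRACE'))"

# Optional: list the USDT (stapsdt) notes embedded in the Python interpreter
kubectl exec -ti <pod-name> -c <container-name> -- \
    bash -c 'readelf -n "$(which python)" | grep -i stapsdt'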

Step 1: Install the kubectl plugin

AI Profiling is deployed using the kubectl-prof plugin. This example uses the Linux amd64 binary.

  1. Download and install the plugin:

    wget https://xxxxxxxxxxxxxxxxx.aliyuncs.com/kubectl_prof_linux_amd64
    mv kubectl_prof_linux_amd64 /usr/local/bin/kubectl-prof
    chmod +x /usr/local/bin/kubectl-prof
  2. Verify the installation:

    kubectl prof deploy -h

    Expected output:

    deploy the profiling tool pod
    
    Usage:
      kubectl-profile deploy [flags]
    
    Aliases:
      deploy, run
    
    Flags:
          --container string        Specify the target container name
      -d, --duration uint           Specify the profiling duration in seconds (default 60)
      -h, --help                    help for deploy
          --image string            Specify the profiling tool image
          --kubeconfig string       Specify the kubeconfig file
          --memory-limit string     Specify the memory limit (default "1Gi")
          --memory-request string   Specify the memory request (default "128Mi")
          --namespace string        Specify the target pod namespace (default "default")
          --node string             Specify the node name
          --pod string              Specify the target pod name
          --region-id string        Specify the region-id
          --ttl uint                Specify the ttl (default 60)

Step 2: Deploy a profiling pod

A profiling pod runs on the same node as the target container and collects profiling data without touching the application.

  1. Get the namespace, name, and node of the application pod. This example uses a PyTorch training job:
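
    You can list the pods together with their node placement, for example (output columns abbreviated):

    kubectl get pod -o wide

    Expected output: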

    NAME                              READY   STATUS    RESTARTS   AGE   IP               NODE
    pytorch-train-worker-sample       1/1     Running   0          82s   172.23.224.197   cn-beijing.10.0.17.XXX
  2. Deploy the profiling pod using the information from the previous step:

    kubectl prof deploy \
        --image xxxxxxxxxx \
        --duration 100000 \
        --namespace default \
        --region-id cn-beijing \
        --pod pytorch-train-worker-sample \
        --container pytorch \
        --memory-limit 10G \
        --memory-request 1G

    The flags are as follows:

    • --image: the profiling image address provided by Alibaba Cloud.
    • --duration: the lifetime of the profiling pod in seconds.
    • --namespace: the namespace of the application pod.
    • --region-id: the Alibaba Cloud region ID.
    • --pod: the name of the application pod.
    • --container: the name of the container within the application pod.
    • --memory-limit: the memory limit for the profiling pod.
    • --memory-request: the memory request for the profiling pod.

Step 3: Run a profiling job

In this step, you start a profiling run inside the profiling pod and export the result file, which you then visualize in TensorBoard or Chrome Tracing (Perfetto) in the next step.

Start profiling

  1. Get the profiling pod name:

    kubectl get pod

    Expected output:

    NAME                                     READY   STATUS    RESTARTS   AGE
    ai-profiler-89bf5b305acf2ec-xxxxx        2/2     Running   0          1m
  2. Open a shell in the profiling pod's debugger container:

    kubectl exec -ti ai-profiler-89bf5b305acf2ec-xxxxx -c debugger -- bash
  3. List GPU processes and generate a profiling command template:

    Note: If you plan to use Python-level profiling, set the environment variable shown at the end of the following output before you run the generated profiling command.

    llmtracker generateCommand

    Expected output:

    I0314 11:42:42.389890 2948136 generate.go:51] GPU PIDs in container:
    
    I0314 11:42:42.389997 2948136 generate.go:53] PID: xxxxx, Name: {"pid":xxxxx}
    I0314 11:42:42.390008 2948136 generate.go:69] The profiling command is:
    
    llmtracker profile\
    -p <ProcessID-To-Profiling>\
    -t <Profiling-Type(python,cuda,syscall,cpu or all)>\
    -o /tmp/data.json\
    -v 5\
    --cpu-buffer-size <CPU-Buffer-Size, recommand to 20>\
    --probe-file <Enable-CUDA-Lib-Profile-File>\
    -d <Duration-To-Profiling>\
    --delay <Delay-Time>\
    --enable-cuda-kernel <Enable-CUDA-Kenrel-Profile(true or none)>
    
    I0314 14:37:12.436071 3083714 generate.go:86] Profiling Python Path is: /usr/bin/python3.10. If you want to profiling Python, please ser the environment variable:
    export EBPF_USDT_PYTHON_PATH=/usr/bin/python3.10
  4. Modify the generated command for your scenario. Note the difference between `-t cuda` and `--enable-cuda-kernel`:

    • -t cuda uses eBPF to record CUDA library symbol calls — the call time and parameters of each API function — to analyze the actual call behavior of the process.

    • --enable-cuda-kernel uses process injection to collect execution details of individual CUDA kernel functions, enabling inspection of task flow states on the GPU.

    The following parameters control the profiling scope and duration. For all available parameters, run llmtracker profile -h.

    | Parameter | Default | Description |
    | --- | --- | --- |
    | -p | | PID to profile. Can be specified multiple times to profile multiple processes. |
    | -t | | Profiling type: python, cuda, syscall, cpu, or all. |
    | -o | /tmp/data.json | Output file path. |
    | -v | | Log verbosity level. |
    | --cpu-buffer-size | 20 | CPU buffer size for eBPF data collection. |
    | --probe-file | | JSON configuration file for CUDA library profiling. See CUDA library configuration file for the format, or use the default template. |
    | -d | | Profiling duration in seconds. We recommend keeping this value below 60s, as prolonged profiling durations may generate excessive data, which can lead to increased memory consumption and storage load. |
    | --delay | | Delay before profiling starts, in seconds. Set this to at least 2 if CUDA kernel profiling is enabled. |
    | --enable-cuda-kernel | | Set to true to enable CUDA kernel profiling. |
  5. Run the profiling command. The following example enables all profiling types (including CUDA kernel profiling), uses probe.json for CUDA library configuration, writes output to /tmp/data.json, delays 3 seconds before starting, and profiles for 5 seconds:

    export EBPF_USDT_PYTHON_PATH=/usr/bin/python3.10
    llmtracker profile -p xxxxx -t all -o /tmp/data.json -v 5 --enable-cuda-kernel true --cpu-buffer-size 20 --probe-file probe.json --delay 3 -d 5

Export results

Convert the result file to a standard format for timeline visualization:

llmtracker export -i /tmp/data.json -o /output/out.json --cupti-dir /tmp

Include --cupti-dir /tmp if the result file contains CUDA kernel profiling data.
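Before you copy the exported file out of the pod, you can check that it was generated and inspect its size (long profiling runs can produce large files):

ls -lh /output/out.json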

Step 4: Visualize profiling results

TensorBoard

If your cluster uses Object Storage Service (OSS) or Apsara File Storage NAS for storage, start a TensorBoard pod in the cluster that mounts a PVC containing the profiling result data, then open TensorBoard to view the timeline. See View TensorBoard for details.
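If the TensorBoard pod is not exposed through a Service or Ingress, one simple way to reach it from your workstation is kubectl port-forward. The pod name below is an assumption, and 6006 is TensorBoard's default port:

kubectl port-forward pod/tensorboard 6006:6006
# Then open http://localhost:6006 in a browser.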

The timeline in TensorBoard looks like this:

(Figure: profiling timeline in TensorBoard)

Chrome Tracing (Perfetto)

For local storage, copy the result file to your local machine and open it at https://ui.perfetto.dev/.
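One way to copy the exported file to your local machine is kubectl cp, for example (pod name taken from Step 3, namespace assumed to be default; kubectl cp requires tar inside the container):

kubectl cp default/ai-profiler-89bf5b305acf2ec-xxxxx:/output/out.json ./out.json -c debugger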

The timeline in Chrome Tracing looks like this:

(Figure: profiling timeline in Chrome Tracing)

CUDA library configuration file

The --probe-file parameter accepts a JSON configuration file that defines which library symbols to trace. Use the default template below or build a custom one.

Build a custom configuration

  1. Use ldd to get the link dependencies of your target library and identify which library files to trace.

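    For example, the following command lists the CUDA-related libraries that PyTorch's GPU backend links against. The library path is an assumption and depends on how PyTorch is installed in your container:

    ldd /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so | grep -Ei 'cuda|nccl|cublas'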

  2. Extract symbol names from the target library. The following example uses libnccl.so:

    readelf -Ws libnccl.so.2 | grep pnccl

    Expected output:

    ...
       223: 00000000000557d0   650 FUNC    GLOBAL DEFAULT   11 pncclGroupStart
       224: 0000000000050200   243 FUNC    GLOBAL DEFAULT   11 pncclRedOpDestroy
       225: 0000000000062081   656 FUNC    GLOBAL DEFAULT   11 pncclCommAbort
       227: 000000000006320c   721 FUNC    GLOBAL DEFAULT   11 pncclCommUserRank
       228: 0000000000064ee0    20 FUNC    GLOBAL DEFAULT   11 pncclGetVersion
       231: 0000000000045f60  1778 FUNC    GLOBAL DEFAULT   11 pncclAllGather
       232: 00000000000604f8  1578 FUNC    GLOBAL DEFAULT   11 pncclCommInitAll
       233: 000000000004ff20   728 FUNC    GLOBAL DEFAULT   11 pncclRedOpCreatePreMulSum
       238: 0000000000074520   653 FUNC    GLOBAL DEFAULT   11 pncclCommDeregister
       240: 00000000000474b0    30 FUNC    GLOBAL DEFAULT   11 pncclBcast
       243: 000000000006173d   789 FUNC    GLOBAL DEFAULT   11 pncclCommFinalize
       244: 00000000000483d0  2019 FUNC    GLOBAL DEFAULT   11 pncclSend
    ...
  3. Build the JSON configuration file. The file defines the probe targets: for uprobes, the path of the target library file as seen inside the container and the symbols of the library functions to monitor; for kprobes, the symbols of the kernel functions to monitor.
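    For example, a minimal custom configuration that traces a few of the NCCL symbols found in the previous step could look like the following. The library path is an assumption and must match the actual location inside your container:

    [
        {
            "category": "cuda",
            "uprobes": [
                {
                    "type": "nccl",
                    "libraries": [
                        {
                            "library": "/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2",
                            "symbols": [
                                "pncclAllGather",
                                "pncclSend",
                                "pncclGroupStart"
                            ]
                        }
                    ]
                }
            ]
        }
    ]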

Default template


[
    {
        "category": "cuda",
        "uprobes": [
            {
                "type": "cuda",
                "libraries": [
                    {
                        "library": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
                        "symbols": [
                            "cuStreamSynchronize",
                            "cuMemcpyHtoD_v2",
                            "cuMemcpyDtoH_v2",
                            "cuMemcpyDtoD_v2",
                            "cuMemcpyDtoA_v2",
                            "cuMemcpyAtoD_v2",
                            "cuMemcpyHtoA_v2",
                            "cuMemcpyAtoH_v2",
                            "cuMemcpyAtoA_v2",
                            "cuMemcpyHtoAAsync_v2",
                            "cuMemcpyAtoHAsync_v2",
                            "cuMemcpy2D_v2",
                            "cuMemcpy2DUnaligned_v2",
                            "cuMemcpy3D_v2",
                            "cuMemcpyHtoDAsync_v2",
                            "cuMemcpyDtoHAsync_v2",
                            "cuMemcpyDtoDAsync_v2",
                            "cuMemcpy2DAsync_v2",
                            "cuMemcpy3DAsync_v2"
                        ]
                    },
                    {
                        "library": "/usr/local/lib/python3.10/dist-packages/nvidia/cuda_runtime/lib/libcudart.so.12",
                        "symbols": [
                            "cudaLaunchKernel",
                            "cudaLaunchKernelExC"
                        ]
                    }
                ]
            },
            {
                "type": "cuBLAS",
                "libraries": [
                    {
                        "library": "/usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublasLt.so.12",
                        "symbols": [
                            "cublasLtMatmul",
                            "cublasLtMatrixTransform"
                        ]
                    },
                    {
                        "library": "/usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublas.so.12",
                        "symbols": [
                            "cublasGemmEx",
                            "cublasGemmBatchedEx",
                            "cublasGemmStridedBatchedEx",
                            "cublasGemmGroupedBatchedEx",
                            "cublasSgemmEx",
                            "cublasCgemmEx"
                        ]
                    }
                ]
            },
            {
                "type": "nccl",
                "libraries": [
                    {
                        "library": "/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2",
                        "symbols": [
                            "pncclAllReduce",
                            "pncclAllGather",
                            "pncclReduce",
                            "pncclBroadcast",
                            "pncclBcast",
                            "pncclReduceScatter",
                            "pncclSend",
                            "pncclRecv",
                            "pncclGroupStart",
                            "pncclGroupEnd",
                            "_Z20ncclGroupJobCompleteP12ncclGroupJob",
                            "_Z17ncclGroupJobAbortP12ncclGroupJob",
                            "pncclCommInitRank",
                            "pncclCommInitAll",
                            "pncclCommFinalize",
                            "pncclCommDestroy"
                        ]
                    }
                ]
            },
            {
                "type": "torch",
                "libraries": [
                    {
                        "library": "/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so",
                        "symbols": [
                            "_ZN5torch4cuda4nccl4recvERN2at6TensorEPvN3c104cuda10CUDAStreamEi",
                            "_ZN5torch4cuda4nccl4sendERKN2at6TensorEPvN3c104cuda10CUDAStreamEi",
                            "_ZN4c10d16ProcessGroupNCCL8WorkNCCL17synchronizeStreamEv",
                            "_ZN5torch4cuda4nccl7all2allERSt6vectorIN2at6TensorESaIS4_EES7_PvRN3c104cuda10CUDAStreamE",
                            "_ZN5torch4cuda4nccl7scatterERKSt6vectorIN2at6TensorESaIS4_EERS4_PvRN3c104cuda10CUDAStreamEi"
                        ]
                    }
                ]
            }
        ]
    }
]
