
Container Service for Kubernetes: Using AI Profiling via command line

Last Updated: Dec 01, 2025

The rise of Large Language Models (LLMs) has driven demand for fine-grained performance analysis and optimization in AI training and inference, and organizations that run workloads on GPU-accelerated nodes need online performance analysis of GPU containers. In Kubernetes environments, AI Profiling is a non-intrusive performance analysis tool that leverages extended Berkeley Packet Filter (eBPF) and dynamic process injection. It supports online profiling of containerized GPU tasks across five dimensions: Python process execution, CPU function calls, system calls, CUDA library interactions, and CUDA kernel operations. By analyzing the collected data, you can precisely identify performance bottlenecks in container applications and understand resource utilization to optimize your applications. Because the tool can be dynamically mounted and unloaded, it enables detailed real-time analysis of online applications without code modifications. This topic describes how to use AI Profiling via command line.

Preparations

  • The Python profiling capability in the current AI Profiling version depends on the User Statically-Defined Tracing (USDT) feature of Python interpreters. To use Python profiling, run the following command in your workload container to verify USDT availability:

    python -c "import sysconfig; print(sysconfig.get_config_var('WITH_DTRACE'))"

    If the output is 1, the interpreter was built with USDT support and Python profiling is available. Otherwise, Python profiling cannot be used. An optional readelf-based cross-check is shown after this list.

    Note

    Profiling tasks only support running on ACK clusters with Elastic Compute Service (ECS) or Lingjun nodes.

  • To use this feature, Submit a ticket to contact the Container Service team for the latest kubectl-plugin download link and the latest profiling image address.
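
The following optional cross-check is not part of the official procedure and assumes that binutils (readelf) is available in the workload container. A CPython interpreter built with --with-dtrace embeds its USDT probes as stapsdt ELF notes, which you can list directly:

    # Optional: list the USDT (stapsdt) probe notes embedded in the Python interpreter.
    # The python3 path may differ in your image.
    readelf -n "$(command -v python3)" | grep -A2 stapsdt

If the interpreter was built with USDT support, the output contains stapsdt note entries for probes such as function__entry and function__return; no output means the USDT requirement is not met.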

Procedure

Step 1: Deploy kubectl plugin

AI Profiling is deployed by using a kubectl plugin. The procedure is as follows:

  1. Run the following commands to install the plugin. This example uses the Linux amd64 build.

    wget https://xxxxxxxxxxxxxxxxx.aliyuncs.com/kubectl_prof_linux_amd64
    mv kubectl_prof_linux_amd64 /usr/local/bin/kubectl-prof
    chmod +x /usr/local/bin/kubectl-prof
  2. Run the following command to check whether the plugin is installed successfully:

    kubectl prof deploy -h

    Expected output:

    deploy the profiling tool pod
    
    Usage:
      kubectl-profile deploy [flags]
    
    Aliases:
      deploy, run
    
    Flags:
          --container string        Specify the target container name
      -d, --duration uint           Specify the profiling duration in seconds (default 60)
      -h, --help                    help for deploy
          --image string            Specify the profiling tool image
          --kubeconfig string       Specify the kubeconfig file
          --memory-limit string     Specify the memory limit (default "1Gi")
          --memory-request string   Specify the memory request (default "128Mi")
          --namespace string        Specify the target pod namespace (default "default")
          --node string             Specify the node name
          --pod string              Specify the target pod name
          --region-id string        Specify the region-id
          --ttl uint                Specify the ttl (default 60)

Step 2: Select target application container and create profiling task

  1. Select an application pod and obtain its namespace, name, and node, for example by running kubectl get pod -o wide. This example uses a PyTorch training job.

    NAME                          READY   STATUS    RESTARTS   AGE   IP               NODE
    pytorch-train-worker-sample   1/1     Running   0          82s   172.23.224.197   cn-beijing.10.0.17.XXX
  2. Run the following command to submit a profiling job using the parameters that you obtained, specifying the pod and container to be profiled. The profiling job creates a profiling pod on the node that hosts the target container of the application pod (a quick way to verify this is shown after this step).

    kubectl prof deploy \
        --image xxxxxxxxxx \
        --duration 100000 \
        --namespace default \
        --region-id cn-beijing \
        --pod pytorch-train-worker-sample \
        --container pytorch \
        --memory-limit 10G \
        --memory-request 1G

    The parameters are as follows:

    --image: the profiling tool image address provided by Alibaba Cloud. Replace xxxxxxxxxx with the address that you obtained.
    --duration: the lifetime of the profiling pod environment, in seconds.
    --namespace: the namespace of the application pod.
    --region-id: the Alibaba Cloud region ID of the environment.
    --pod: the name of the application pod.
    --container: the name of the container in the application pod.
    --memory-limit: the memory limit of the profiling pod.
    --memory-request: the memory request of the profiling pod.
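
    To confirm that the profiling pod was scheduled onto the same node as the application pod, compare the NODE columns of both pods. This is an optional check, not part of the official procedure; the ai-profiler- name prefix matches the example output in Step 3, and the pod names should be adjusted to your environment.

    kubectl get pod -o wide | grep -E 'ai-profiler|pytorch-train-worker-sample'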

Step 3: Trigger profiling

  1. Run the following command to view the profiling pod information:

    kubectl get pod

    Expected output:

    NAME                                                   READY   STATUS    RESTARTS   AGE
    ai-profiler-89bf5b305acf2ec-xxxxx                      2/2     Running   0          1m
  2. Run the following command to enter the profiling pod:

    kubectl exec -ti ai-profiler-89bf5b305acf2ec-xxxxx -c debugger -- bash
  3. Run the following command to list all GPU processes and generate a profiling command template:

    llmtracker generateCommand

    Expected output:

    I0314 11:42:42.389890 2948136 generate.go:51] GPU PIDs in container:
    
    I0314 11:42:42.389997 2948136 generate.go:53] PID: xxxxx, Name: {"pid":xxxxx}
    I0314 11:42:42.390008 2948136 generate.go:69] The profiling command is:
    
    llmtracker profile \
    -p <ProcessID-To-Profiling> \
    -t <Profiling-Type(python,cuda,syscall,cpu or all)> \
    -o /tmp/data.json \
    -v 5 \
    --cpu-buffer-size <CPU-Buffer-Size, recommended: 20> \
    --probe-file <Enable-CUDA-Lib-Profile-File> \
    -d <Duration-To-Profiling> \
    --delay <Delay-Time> \
    --enable-cuda-kernel <Enable-CUDA-Kernel-Profile(true or none)>
    
    I0314 14:37:12.436071 3083714 generate.go:86] Profiling Python Path is: /usr/bin/python3.10. If you want to profile Python, please set the environment variable:
    export EBPF_USDT_PYTHON_PATH=/usr/bin/python3.10

    Note

    To enable Python-level profiling, you must first set the environment variable shown in the output before you run the profiling command.

    Parameters and descriptions are as follows:

    -p: Specifies the PID to be profiled. This parameter can be specified multiple times to profile multiple PIDs.

    -t: Specifies the profiling type. Options are python, cuda, syscall, and cpu. Use all to enable all profiling types.

    -o: Specifies the path and name of the profiling output file. Default: /tmp/data.json.

    -v: Specifies the log output level.

    --cpu-buffer-size: Specifies the CPU buffer size for eBPF data collection. Default: 20.

    --probe-file: Specifies the template file required for CUDA Lib profiling. Refer to the writing specifications, or directly use the default template.

    -d: Sets the duration of the profiling task in seconds. We recommend keeping this value below 60 seconds, because long profiling durations may generate excessive data, which increases memory consumption and storage load.

    --delay: Sets the delay, in seconds, before profiling starts. If you enable CUDA Kernel profiling, we recommend setting this value to at least 2.

    --enable-cuda-kernel: Specifies whether to enable CUDA Kernel profiling. Set this parameter to true to enable it.

    Differences between -t cuda and --enable-cuda-kernel:

    • -t cuda uses eBPF to collect CUDA library symbol calls, including the call time and parameters of each API function, to analyze the actual call behavior of the process.

    • --enable-cuda-kernel uses process injection technology to collect specific execution information of CUDA Kernel functions, enabling detailed examination of task flow states on the GPU side.

    For more complete parameter information, run the llmtracker profile -h command.

  4. Use the following example to execute profiling, modifying the generated profiling command as needed:

    Note

    This example enables all profiling items (including CUDA Kernel Profiling), configures CUDA Lib as probe.json, sets the output file path to /tmp/data.json, and adds --delay 3 -d 5 to indicate a 3-second delay before starting and a 5-second profiling duration.

    export EBPF_USDT_PYTHON_PATH=/usr/bin/python3.10
    llmtracker profile -p xxxxx -t all -o /tmp/data.json -v 5 --enable-cuda-kernel true --cpu-buffer-size 20 --probe-file probe.json --delay 3 -d 5
  5. Run the following command to format the result file and export it:

    Note
    • This step converts the result file into a standard format for display in TimeLine.

    • If the result file contains CUDA Kernel profiling data, you must add the parameter --cupti-dir and set it to the fixed path /tmp.

    llmtracker export -i /tmp/data.json -o /output/out.json --cupti-dir /tmp

Step 4: Display profiling results

Using TensorBoard for display and analysis

If you use storage such as OSS or NAS, refer to View TensorBoard for how to view the result data: start a TensorBoard pod in the cluster that mounts a PVC containing the profiling result data, then open TensorBoard to view it.

Using Chrome Tracing for display and analysis

If you use local storage, copy the generated profiling result file to your local machine (for example, with kubectl cp, as sketched below), and then open the file in Chrome Tracing (Perfetto).
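
A minimal sketch of copying the exported file out of the profiling pod, assuming the file was written to /output/out.json in the debugger container as in Step 3 (replace the pod name, namespace, and paths with your own values; kubectl cp requires tar to be available inside the container):

    kubectl cp default/ai-profiler-89bf5b305acf2ec-xxxxx:/output/out.json ./out.json -c debugger

You can then load out.json in chrome://tracing or at https://ui.perfetto.dev.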

Display effects

TensorBoard display

The TimeLine displayed using TensorBoard is as follows:

(Image: TimeLine view in TensorBoard)

Chrome Tracing display

The TimeLine displayed locally using Chrome Tracing is as follows:

(Image: TimeLine view in Chrome Tracing)

AI Profiling appendix

CUDA Lib configuration file

  1. Obtain the recursive dependencies of the target libraries and narrow down the library files to trace. After you confirm the library files that you want to trace, use the ldd command to list their link dependencies, which determines the set of library files from which useful data can be collected (see the example below).

    (Image: example ldd output showing the link dependencies of a target library)
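
    For example, the following sketch inspects the dependencies of the PyTorch CUDA library; the path is the one used in the default template below and may differ in your image:

    # List the shared libraries that libtorch_cuda.so links against. Candidate
    # UProbe targets such as libcudart, libcublas, and libnccl appear in this output.
    ldd /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so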

  2. After determining the target library files, confirm the symbols in each library that you want to add to the template. This step uses libnccl.so as an example. Run the following command to obtain all symbol information in the library:

     readelf -Ws libnccl.so.2 | grep pnccl

    Expected output:

    ...
       223: 00000000000557d0   650 FUNC    GLOBAL DEFAULT   11 pncclGroupStart
       224: 0000000000050200   243 FUNC    GLOBAL DEFAULT   11 pncclRedOpDestroy
       225: 0000000000062081   656 FUNC    GLOBAL DEFAULT   11 pncclCommAbort
       227: 000000000006320c   721 FUNC    GLOBAL DEFAULT   11 pncclCommUserRank
       228: 0000000000064ee0    20 FUNC    GLOBAL DEFAULT   11 pncclGetVersion
       231: 0000000000045f60  1778 FUNC    GLOBAL DEFAULT   11 pncclAllGather
       232: 00000000000604f8  1578 FUNC    GLOBAL DEFAULT   11 pncclCommInitAll
       233: 000000000004ff20   728 FUNC    GLOBAL DEFAULT   11 pncclRedOpCreatePreMulSum
       238: 0000000000074520   653 FUNC    GLOBAL DEFAULT   11 pncclCommDeregister
       240: 00000000000474b0    30 FUNC    GLOBAL DEFAULT   11 pncclBcast
       243: 000000000006173d   789 FUNC    GLOBAL DEFAULT   11 pncclCommFinalize
       244: 00000000000483d0  2019 FUNC    GLOBAL DEFAULT   11 pncclSend
    ...
  3. Assemble the JSON configuration file required for profiling by constructing a JSON file similar to the following format. The configuration file defines the information needed by each probe: for UProbes, the path of the target library file inside the container and the symbols of the library functions to monitor; for KProbes, the symbols of the system functions to monitor. The default reference template is as follows (a quick way to verify that the configured library paths exist in your container is sketched after the template).


    [
        {
            "category": "cuda",
            "uprobes": [
                {
                    "type": "cuda",
                    "libraries": [
                        {
                            "library": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
                            "symbols": [
                                "cuStreamSynchronize",
                                "cuMemcpyHtoD_v2",
                                "cuMemcpyDtoH_v2",
                                "cuMemcpyDtoD_v2",
                                "cuMemcpyDtoA_v2",
                                "cuMemcpyAtoD_v2",
                                "cuMemcpyHtoA_v2",
                                "cuMemcpyAtoH_v2",
                                "cuMemcpyAtoA_v2",
                                "cuMemcpyHtoAAsync_v2",
                                "cuMemcpyAtoHAsync_v2",
                                "cuMemcpy2D_v2",
                                "cuMemcpy2DUnaligned_v2",
                                "cuMemcpy3D_v2",
                                "cuMemcpyHtoDAsync_v2",
                                "cuMemcpyDtoHAsync_v2",
                                "cuMemcpyDtoDAsync_v2",
                                "cuMemcpy2DAsync_v2",
                                "cuMemcpy3DAsync_v2"
                            ]
                        },
                        {
                            "library": "/usr/local/lib/python3.10/dist-packages/nvidia/cuda_runtime/lib/libcudart.so.12",
                            "symbols": [
                                "cudaLaunchKernel",
                                "cudaLaunchKernelExC"
                            ]
                        }
                    ]
                },
                {
                    "type": "cuBLAS",
                    "libraries": [
                        {
                            "library": "/usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublasLt.so.12",
                            "symbols": [
                                "cublasLtMatmul",
                                "cublasLtMatrixTransform"
                            ]
                        },
                        {
                            "library": "/usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublas.so.12",
                            "symbols": [
                                "cublasGemmEx",
                                "cublasGemmBatchedEx",
                                "cublasGemmStridedBatchedEx",
                                "cublasGemmGroupedBatchedEx",
                                "cublasSgemmEx",
                                "cublasCgemmEx"
                            ]
                        }
                    ]
                },
                {
                    "type": "nccl",
                    "libraries": [
                        {
                            "library": "/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2",
                            "symbols": [
                                "pncclAllReduce",
                                "pncclAllGather",
                                "pncclReduce",
                                "pncclBroadcast",
                                "pncclBcast",
                                "pncclReduceScatter",
                                "pncclSend",
                                "pncclRecv",
                                "pncclGroupStart",
                                "pncclGroupEnd",
                                "_Z20ncclGroupJobCompleteP12ncclGroupJob",
                                "_Z17ncclGroupJobAbortP12ncclGroupJob",
                                "pncclCommInitRank",
                                "pncclCommInitAll",
                                "pncclCommFinalize",
                                "pncclCommDestroy"
                            ]
                        }
                    ]
                },
                {
                    "type": "torch",
                    "libraries": [
                        {
                            "library": "/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so",
                            "symbols": [
                                "_ZN5torch4cuda4nccl4recvERN2at6TensorEPvN3c104cuda10CUDAStreamEi",
                                "_ZN5torch4cuda4nccl4sendERKN2at6TensorEPvN3c104cuda10CUDAStreamEi",
                                "_ZN4c10d16ProcessGroupNCCL8WorkNCCL17synchronizeStreamEv",
                                "_ZN5torch4cuda4nccl7all2allERSt6vectorIN2at6TensorESaIS4_EES7_PvRN3c104cuda10CUDAStreamE",
                                "_ZN5torch4cuda4nccl7scatterERKSt6vectorIN2at6TensorESaIS4_EERS4_PvRN3c104cuda10CUDAStreamEi"
                            ]
                        }
                    ]
                }
            ]
        }
    ]
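
    The library paths in this template match a typical python3.10 PyTorch image and may differ in your container. The following sketch (an illustrative addition, not part of the official procedure) checks, from inside the target container, that each configured library exists at the configured path:

    # Check that each library path referenced in the probe configuration exists.
    # Adjust the list to the "library" entries that you actually configured.
    for lib in \
        /usr/lib/x86_64-linux-gnu/libcuda.so.1 \
        /usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublas.so.12 \
        /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
        /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so; do
        [ -e "$lib" ] && echo "found:   $lib" || echo "missing: $lib"
    done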