
Container Service for Kubernetes:Example of AI profiling

Last Updated: Jun 26, 2025

This topic uses a vLLM inference scenario as an example to show how to analyze the AI profiling results of GPU-accelerated pods running in Container Service for Kubernetes (ACK) clusters. It focuses on how to use the visualization page for online profiling results to analyze the execution of Python processes, CPU calls, system calls, Compute Unified Device Architecture (CUDA) library calls, and CUDA kernel functions. This helps you locate performance bottlenecks, identify optimization opportunities, and improve GPU utilization and application efficiency.

Example of vLLM inference

Sample environment

  • Framework: vLLM 0.5.0

  • Model: Qwen2-7B

  • GPU: NVIDIA A10

  • Profiling duration: 5s

  • Enabled profiling items: all

Result analysis

Model loading

  1. View the overall model loading process. The enlarged view shows that model loading is divided into three steps: data reading, data copying, and decoding. In the Python profiling view, you can separate the stages with dividing lines, as shown in the following figure. Because this example captures the first startup, PageCache is not yet populated, so the time spent on data transmission is particularly noticeable.

  2. Check the profiling items for each stage. The enlarged view shows that the main differences are concentrated in the openat, mmap, read, and ioctl system calls, all of which are I/O-related operations. Because safetensors.torch.load_file reads the model with random access, mmap and read calls appear frequently.

  3. The data transmission stage shows that system calls are concentrated in poll, epoll, and futex, and that execution frequently switches among multiple CPUs. On the CUDA side, there is one cuMemcpyHtoD transfer per model safetensors file. Based on the vLLM model loading process, you can therefore identify the model data transmission path: disk or network -> memory (PageCache) -> GPU memory. Adding network and disk I/O data to the figure makes the path easier to observe.

  4. Observe the decoding process and focus on GPU-related library calls. The matrix multiplications executed in each batch and their upstream and downstream calls (such as cuLaunchKernel, cuMemcpyDtoD, and cublasGemmEx) appear at regular intervals and can be aligned with the corresponding Python decoding method, so the stages they represent can be clearly distinguished.
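The mmap-heavy access pattern described in step 2 can be sketched in pure Python. Everything below is a hypothetical stand-in: the file name, size, and slice offsets imitate a safetensors shard and its header-recorded tensor offsets, not the real format.

```python
import mmap
import os

# Sketch of why a safetensors-style load shows frequent mmap/read
# syscalls: the shard is memory-mapped once, then tensor blobs are
# sliced out at arbitrary offsets recorded in the header, i.e. random
# access rather than one sequential read.
path = "shard.bin"  # hypothetical stand-in for a .safetensors shard
with open(path, "wb") as f:
    f.write(os.urandom(1 << 16))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # A real loader gets these offsets from the safetensors header;
    # here they are arbitrary. Each slice touches a different page.
    tensor_a = mm[1024:1024 + 512]
    tensor_b = mm[40960:40960 + 512]
    mm.close()
os.remove(path)
```

Tracing such a loader with strace would show the openat/mmap/read pattern observed in the profiling view.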

Model inference

Note

This example describes how to discover and diagnose a problem by using a manually simulated NCCL hang.

  1. Since an NCCL hang usually manifests as the process being stuck in kernel mode or blocked on NCCL-level communication or I/O, you can reproduce the scenario by simulating an interruption of NCCL communication between processes. Run the following commands to suspend and resume a process, simulating NCCL interruption and recovery:

    kill -STOP <PID>   # suspend the process to simulate the NCCL hang
    kill -CONT <PID>   # resume the process to simulate recovery
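    To see why this simulation inflates latency, the following sketch suspends a child process mid-run with SIGSTOP and measures the added wall-clock time. The CPU-bound worker and the 1-second durations are illustrative stand-ins for a real inference process, and the snippet assumes a POSIX system:

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical CPU-bound worker: burns ~1 s of CPU time. process_time()
# does not advance while the process is stopped, so the suspension adds
# directly to wall-clock latency, as with the simulated NCCL hang.
child_code = (
    "import time\n"
    "t = time.process_time()\n"
    "while time.process_time() - t < 1.0:\n"
    "    pass\n"
)
worker = subprocess.Popen([sys.executable, "-c", child_code])
start = time.time()
time.sleep(0.2)
os.kill(worker.pid, signal.SIGSTOP)  # simulate the hang
time.sleep(1.0)
os.kill(worker.pid, signal.SIGCONT)  # recover
worker.wait()
elapsed = time.time() - start
# elapsed is roughly the 1 s of CPU work plus the 1 s suspension.
```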

    Normal inference duration without interruption:

    (figure)

    Inference duration with a 5-second interruption:

    (figure)

  2. During the active interruption, you can observe from the preceding steps that each inference in this example takes about 15 seconds. With a simulated 5-second interruption, the total duration of the inference request after recovery becomes about 20 seconds, which matches the expected outcome.

    Observation shows that each vLLM inference consists of multiple inference computation parts, as shown in the red section in the following figure. The general process is cuda memory copy H2D -> nccl broadcast -> cublas compute -> nccl send and recv -> cuda memory copy D2H. Therefore, the possible NCCL interruption points are the broadcast and the send and recv stages.
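    The per-step call sequence above can be written down as a small table, with the NCCL stages marked as the plausible hang points. The names mirror the trace labels; this is a bookkeeping sketch, not an executable vLLM API:

```python
# Each entry is (call name as seen in the trace, is an NCCL stage).
PIPELINE = [
    ("cuMemcpyHtoD", False),
    ("ncclBroadcast", True),      # possible hang point
    ("cublasGemmEx", False),
    ("ncclSend/ncclRecv", True),  # possible hang point
    ("cuMemcpyDtoH", False),
]

# Only the NCCL stages can block on a suspended peer.
hang_candidates = [name for name, is_nccl in PIPELINE if is_nccl]
```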

  3. The interruption point determines which NCCL action is blocked. Enable 60 seconds of GPU-only profiling and convert the profiling data to the Chrome Tracing format. The following list describes two cases in which NCCL is blocked:

    1. nccl send and recv

      The visualization results are as follows:

      (figure)

      Toward the end of an inference, there is usually one pnccl Send in one replica and a corresponding pnccl Recv in the other replica. When the NCCL hang simulation is enabled, the profiling view shows the following:

      (figure)

      After one replica issues an ncclSend call, no corresponding ncclRecv call comes from the other replica, and GPU-related method calls enter an idle state for a period of time. The other replica executes the corresponding ncclRecv only after the suspended process is resumed. Even without knowledge of the underlying logic, it can be inferred that this behavior is caused by an NCCL hang.

    2. nccl broadcast

      The following figure shows the execution of the hang operation before nccl broadcast:

      (figure)

      Zoom in on the method executed before the hang.

      (figure)

      You can see the execution of cuMemcpyHtoD. After it completes, the process hangs. Then, move to the position after the hang recovery.

      (figure)

      Zoom in on the corresponding position. The first operation executed after the hang recovery is the NCCL broadcast, followed by the subsequent standard actions. Based on this behavior, you can conclude that the problem is caused by the NCCL hang.
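The send/recv diagnosis above, a long GPU-idle gap right after an ncclSend with no matching ncclRecv, can be sketched against the Chrome Tracing event format. All event names, pids, timestamps, and the trace.json file name below are fabricated for illustration; a real converted trace would be scanned the same way:

```python
import json

# Minimal Chrome Tracing file built from "complete" events
# ("ph": "X"; ts and dur are in microseconds).
events = [
    {"name": "cublasGemmEx", "ph": "X", "ts": 0, "dur": 200, "pid": 1, "tid": 1},
    {"name": "ncclSend", "ph": "X", "ts": 300, "dur": 100, "pid": 1, "tid": 1},
    # ~5 s (5,000,000 us) of nothing: the suspended peer posts ncclRecv late.
    {"name": "ncclRecv", "ph": "X", "ts": 5_000_400, "dur": 100, "pid": 2, "tid": 1},
]
with open("trace.json", "w") as f:
    json.dump({"traceEvents": events}, f)  # openable in chrome://tracing or Perfetto

def idle_gap_after(events, name, threshold_us=1_000_000):
    """Return the idle gap (us) following the event `name` if it exceeds
    threshold_us; otherwise return 0."""
    ordered = sorted(events, key=lambda e: e["ts"])
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt["ts"] - (prev["ts"] + prev["dur"])
        if prev["name"] == name and gap > threshold_us:
            return gap
    return 0

gap = idle_gap_after(events, "ncclSend")  # 5_000_000 in this fabricated trace
```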