Container Compute Service: Disable specific metric collection for GPU pods

Last Updated: Sep 22, 2025

The default GPU monitoring enabled by Alibaba Cloud Container Compute Service (ACS) conflicts with performance profilers such as NVIDIA Nsight. This happens because certain metrics require exclusive access to the GPU's profiling counters and can only be collected by a single process. On affected GPU types including T4, A10, L20 (GN8IS), and P16EN, this conflict may prevent profilers from collecting data and can generate CUPTI or DCGM errors. The solution is to temporarily disable the collection of these conflicting metrics while profiling.

Procedure

Check metric collection status

  1. Connect to the container of the GPU pod through a terminal.

    1. Log on to the ACS console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Workloads > Pods.

    3. In the Actions column of the target GPU pod, click Terminal to connect to the container.

  2. Check the metric collection status. A scripted version of this check is shown after the expected output below.

    curl -X POST http://localhost:9501/profmetric/status

    Expected output:

    • Normal collection: Metrics are being collected. Pause Available shows In 0 seconds, which means collection can be paused immediately.

      Status: Collecting
      Pause Available: In 0 seconds
    • Paused collection: The Resume Countdown shows the time until collection automatically resumes.

      Status: Paused
      Resume Countdown: In 600 seconds
    • Normal collection (in cooldown period): The Pause Available value indicates that collection has recently resumed, and a cooldown period must pass before it can be paused again.

      Status: Collecting
      Pause Available: In 60 seconds
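
A minimal sketch of this check as a shell script, assuming the status output text matches the examples above (the grep patterns are assumptions; adjust them if the wording differs in your environment):

    #!/bin/sh
    # Query the status endpoint and report whether collection can be paused now.
    STATUS=$(curl -s -X POST http://localhost:9501/profmetric/status)
    echo "$STATUS"
    if echo "$STATUS" | grep -q "Status: Paused"; then
        echo "Metric collection is already paused."
    elif echo "$STATUS" | grep -q "Pause Available: In 0 seconds"; then
        echo "Metric collection can be paused now."
    else
        echo "Cooldown in progress; wait before pausing."
    fi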

Pause metric collection

Use default parameters

  1. Connect to the container of the GPU pod.

  2. Pause metric collection for the default period of 600 seconds.

    curl -X POST http://localhost:9501/profmetric/pause

    Expected output:

    Successfully pause metrics collection for 600 seconds

Specify a pause duration

  1. Connect to the container of the GPU pod.

  2. Run the following command. Use the -d flag to pass the pause duration in seconds in the form time=<seconds>. The default and maximum value is 600 seconds. A sketch that derives this value from a planned profiling run follows the expected output.

    curl -X POST http://localhost:9501/profmetric/pause -d "time=200"

    Expected output:

    Successfully pause metrics collection for 200 seconds
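
A minimal sketch that derives the pause duration from a planned profiling run and caps it at the 600-second maximum (PROFILE_SECONDS is a hypothetical value for your own workload):

    #!/bin/sh
    # Pause collection slightly longer than the expected profiling run.
    PROFILE_SECONDS=150              # hypothetical length of the profiling run
    PAUSE_SECONDS=$((PROFILE_SECONDS + 30))
    if [ "$PAUSE_SECONDS" -gt 600 ]; then
        PAUSE_SECONDS=600            # 600 seconds is the maximum allowed
    fi
    curl -X POST http://localhost:9501/profmetric/pause -d "time=${PAUSE_SECONDS}"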

Resume metric collection

  1. Connect to the container of the GPU pod.

  2. Resume metric collection.

    curl -X POST http://localhost:9501/profmetric/resume

    Expected output:

    Successfully resumed metrics collection

Apply in production

  1. Choose an appropriate time window: Temporarily disabling metric collection can lead to missed or false-positive alerts. We recommend performing profiling during off-peak hours to minimize the impact on your monitoring and alerting systems.

  2. Implement a recovery mechanism: In your profiling scripts, include a mechanism (for example, a trap in a shell script) that executes the curl -X POST http://localhost:9501/profmetric/resume command even if the script exits unexpectedly. This guarantees that metric collection is always restored; a sketch of this pattern follows this list.
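
A minimal sketch of this pattern, assuming a shell-based profiling workflow (the nsys command line is a placeholder for your own profiling run):

    #!/bin/sh
    # Pause collection, run the profiler, and always resume collection on exit,
    # even if the profiling command fails or the script is interrupted.
    resume_metrics() {
        curl -X POST http://localhost:9501/profmetric/resume
    }
    trap resume_metrics EXIT

    curl -X POST http://localhost:9501/profmetric/pause -d "time=600"
    nsys profile -o /tmp/profile_report python train.py   # placeholder profiling command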

Affected metrics and models

To use profilers with the GPU models listed below, you must temporarily disable the following metrics to avoid conflicts.

| Metric name | Description | Affected GPU models |
| --- | --- | --- |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | The percentage of time over a sampling period that the Graphics or Compute engine is active. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | The utilization of the double-precision (FP64) pipe. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | The utilization of the Fused Multiply-Add (FMA) pipe for single-precision (FP32) and integer operations. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | The utilization of the half-precision (FP16) pipe. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_SM_ACTIVE | The percentage of time over a sampling period that at least one warp is active on a Streaming Multiprocessor (SM). | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_SM_OCCUPANCY | The ratio of resident warps to the maximum supported warps on an SM over a sampling period. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | The utilization of the Tensor Core (HMMA/IMMA) pipe. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_DRAM_ACTIVE | The percentage of time the device memory is busy sending or receiving data. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_NVLINK_TX_BYTES | The total number of bytes transmitted over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_NVLINK_RX_BYTES | The total number of bytes received over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_CUSTOM_PROF_TENS_TFPS_USED | The utilization of the GPU's Tensor Cores. | T4, A10, L20 (GN8IS), P16EN |

FAQ

Can I use this method to pause metric collection for GPU models other than those listed above?

Yes, this method can be used on other GPU models. However, it is generally unnecessary, as most other GPU models do not experience conflicts between DCGM metric collection and NVIDIA profilers.

Is there a cooldown period for toggling metrics collection?

Yes, there is. After resuming metric collection, you must wait for a 60-second cooldown period before you can pause it again. If you attempt to pause collection during this period, you will receive an error message similar to the following:

Operation is rejected. Reason: gpu metrics collection switch is just modified, please retry after about 54 seconds

You must wait for the cooldown period to end before you try again.
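
A minimal sketch that handles this rejection automatically, assuming the response text matches the example above (the parsing is an assumption; adjust it if the wording differs):

    #!/bin/sh
    # Try to pause collection; if rejected because of the cooldown, wait the
    # suggested number of seconds and retry once.
    RESPONSE=$(curl -s -X POST http://localhost:9501/profmetric/pause)
    echo "$RESPONSE"
    if echo "$RESPONSE" | grep -q "Operation is rejected"; then
        WAIT=$(echo "$RESPONSE" | grep -oE '[0-9]+' | tail -n 1)
        WAIT=${WAIT:-60}             # fall back to the full 60-second cooldown
        echo "Cooldown in progress; retrying in ${WAIT} seconds..."
        sleep "$WAIT"
        curl -X POST http://localhost:9501/profmetric/pause
    fi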