
Container Compute Service:Disable GPU pod metric collection to prevent conflicts with profilers

Last Updated:Feb 27, 2026

Alibaba Cloud Container Compute Service (ACS) collects GPU profiling metrics by default using NVIDIA Data Center GPU Manager (DCGM). These metrics rely on GPU hardware performance counters, which can be accessed by only one process at a time. When a profiler such as NVIDIA Nsight Systems or Nsight Compute runs alongside DCGM, both compete for the same hardware counters, causing CUPTI or DCGM errors. To avoid this conflict, temporarily pause the collection of profiling metrics while profiling your workloads.

Affected GPU models: T4, A10, L20 (GN8IS), and P16EN.

Connect to the GPU pod terminal

All operations in this document require a terminal connection to the GPU pod container.

  1. Log on to the ACS console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the target cluster and click its name. In the left-side navigation pane, choose Workloads > Pods.

  3. In the Actions column of the target GPU pod, click Terminal to connect to the container.

Check metric collection status

In the GPU pod terminal, run the following command:

curl -X POST http://localhost:9501/profmetric/status

The response shows one of the following states:

| State | Example output | Meaning |
| --- | --- | --- |
| Normal collection | Status: Collecting / Pause Available: In 0 seconds | Metrics are actively collected; pause is available immediately. |
| Paused | Status: Paused / Resume Countdown: In 600 seconds | Collection is paused; the countdown shows the time remaining until collection resumes automatically. |
| Cooldown | Status: Collecting / Pause Available: In 60 seconds | Collection recently resumed; the countdown shows the time remaining before pause becomes available again. |
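If you script against the status endpoint, the response can be parsed with standard shell tools. The sketch below uses a hard-coded sample response (mirroring the Cooldown state above) instead of a live call, so the parsing logic is shown in isolation:

```shell
# Sketch: parse a status response to decide whether a pause is allowed.
# The sample text below is a hard-coded example, not a live curl call.
status_response='Status: Collecting
Pause Available: In 60 seconds'

# Extract the state and the pause-availability countdown (in seconds).
state=$(printf '%s\n' "$status_response" | sed -n 's/^Status: //p')
wait_s=$(printf '%s\n' "$status_response" | sed -n 's/^Pause Available: In \([0-9]*\) seconds$/\1/p')

if [ "$state" = "Collecting" ] && [ "${wait_s:-0}" -eq 0 ]; then
  echo "pause available now"
else
  echo "wait ${wait_s:-unknown}s before pausing (state: $state)"
fi
```

In a real script, replace the hard-coded sample with the output of curl -s -X POST http://localhost:9501/profmetric/status.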

Pause metric collection

Run the pause command in the GPU pod terminal. Collection pauses for a default duration of 600 seconds. To specify a shorter duration, pass a time parameter using curl's -d option. The maximum value is 600 seconds.

Default duration (600 seconds):

curl -X POST http://localhost:9501/profmetric/pause

Expected output:

Successfully pause metrics collection for 600 seconds

Custom duration:

curl -X POST http://localhost:9501/profmetric/pause -d "time=200"

Expected output:

Successfully pause metrics collection for 200 seconds
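Because the API caps the pause duration at 600 seconds, a script can clamp a requested value before calling the endpoint. A minimal sketch (the endpoint call is shown commented out; the requested value is an illustrative example):

```shell
# Sketch: clamp a requested pause duration to the documented 600-second maximum.
requested=900   # illustrative value; take this from your script's arguments
max=600

if [ "$requested" -gt "$max" ]; then
  duration=$max
else
  duration=$requested
fi

echo "pausing for ${duration} seconds"
# curl -X POST http://localhost:9501/profmetric/pause -d "time=${duration}"
```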

Resume metric collection

To resume metric collection before the pause timer expires, run the following command in the GPU pod terminal:

curl -X POST http://localhost:9501/profmetric/resume

Expected output:

Successfully resumed metrics collection

API reference

| Endpoint | Method | Description |
| --- | --- | --- |
| http://localhost:9501/profmetric/status | POST | Check the current collection status |
| http://localhost:9501/profmetric/pause | POST | Pause collection (default: 600 seconds) |
| http://localhost:9501/profmetric/pause -d "time=<seconds>" | POST | Pause collection for a custom duration (max: 600 seconds) |
| http://localhost:9501/profmetric/resume | POST | Resume collection manually |

Production considerations

Schedule profiling during off-peak hours

Temporarily disabling metric collection may lead to missed or false-positive alerts. Schedule profiling during off-peak hours to minimize the impact on your monitoring and alerting systems.

Implement a recovery mechanism

In your profiling scripts, include a mechanism (such as a trap in a shell script) that runs the curl -X POST http://localhost:9501/profmetric/resume command on any unexpected exit. This ensures that metric collection is always restored.
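As a sketch, a shell trap can wrap the pause, the profiling run, and a guaranteed resume. The profiling command itself is a hypothetical placeholder; substitute your actual workload:

```shell
#!/usr/bin/env sh
# Sketch: pause DCGM profiling metrics, run a profiler, and guarantee resume.
set -eu

ENDPOINT="http://localhost:9501/profmetric"

resume_metrics() {
  # Best-effort resume; --max-time keeps the trap from hanging on exit,
  # and "|| true" keeps a failed curl from masking the script's exit status.
  curl -s --max-time 5 -X POST "${ENDPOINT}/resume" || true
  echo "metrics collection resume requested"
}

# Run resume_metrics on any exit: normal completion, error, or interruption.
trap resume_metrics EXIT

# Pause collection for the profiling window (max 600 seconds).
curl -s --max-time 5 -X POST "${ENDPOINT}/pause" -d "time=300" || true

# Hypothetical profiling step; replace with your actual command, e.g.:
# nsys profile -o report ./my_cuda_app
echo "profiling step would run here"
```

The EXIT trap fires whether the profiler succeeds, fails, or is interrupted, so collection is never left paused longer than necessary.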

Affected metrics

The following DCGM profiling metrics conflict with NVIDIA profilers on the listed GPU models. When you call the pause endpoint, collection of all these metrics is paused.

| Metric name | Description | Affected GPU models |
| --- | --- | --- |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | The percentage of time over a sampling period that the Graphics or Compute engine is active. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | The utilization of the double-precision (FP64) pipe. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | The utilization of the Fused Multiply-Add (FMA) pipe for single-precision (FP32) and integer operations. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | The utilization of the half-precision (FP16) pipe. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_SM_ACTIVE | The percentage of time over a sampling period that at least one warp is active on a Streaming Multiprocessor (SM). | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_SM_OCCUPANCY | The ratio of resident warps to the maximum supported warps on an SM over a sampling period. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | The utilization of the Tensor Core (HMMA/IMMA) pipe. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_DRAM_ACTIVE | The percentage of time the device memory is busy sending or receiving data. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_NVLINK_TX_BYTES | The total number of bytes transmitted over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_NVLINK_RX_BYTES | The total number of bytes received over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_CUSTOM_PROF_TENS_TFPS_USED | The utilization of the GPU's Tensor Cores. | T4, A10, L20 (GN8IS), P16EN |

FAQ

Can I pause metric collection for GPU models not listed above?

Yes. However, doing so is generally unnecessary, because most other GPU models do not experience conflicts between DCGM metric collection and NVIDIA profilers.

Is there a cooldown period after resuming?

Yes. After metric collection resumes, a 60-second cooldown period must pass before you can pause again. If you attempt to pause during the cooldown, the API returns an error:

Operation is rejected. Reason: gpu metrics collection switch is just modified, please retry after about 54 seconds
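Since the error message states the remaining cooldown, a retry script can extract that value instead of guessing. A minimal sketch, using a hard-coded copy of the sample error above rather than a live call:

```shell
# Sketch: extract the suggested retry delay from the cooldown error message.
# The message below is a hard-coded sample, not a live API response.
err='Operation is rejected. Reason: gpu metrics collection switch is just modified, please retry after about 54 seconds'

# Pull out the number of seconds; fall back to 60 if the message changes shape.
retry_s=$(printf '%s' "$err" | sed -n 's/.*retry after about \([0-9]*\) seconds.*/\1/p')

echo "retry the pause in ${retry_s:-60} seconds"
# sleep "${retry_s:-60}"
# curl -X POST http://localhost:9501/profmetric/pause
```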