Alibaba Cloud Container Compute Service (ACS) collects GPU profiling metrics by default using NVIDIA Data Center GPU Manager (DCGM). These metrics rely on GPU hardware performance counters, which only allow access by a single process at a time. When a profiler such as NVIDIA Nsight Systems or Nsight Compute runs alongside DCGM, both compete for the same hardware counters, causing CUPTI or DCGM errors. To avoid this conflict, temporarily pause the collection of profiling metrics while profiling your workloads.
Affected GPU models: T4, A10, L20 (GN8IS), and P16EN.
Connect to the GPU pod terminal
All operations in this document require a terminal connection to the GPU pod container.
Log on to the ACS console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the target cluster and click its name. In the left-side navigation pane, choose Workloads > Pods.
In the Actions column of the target GPU pod, click Terminal to connect to the container.
Check metric collection status
In the GPU pod terminal, run the following command:
curl -X POST http://localhost:9501/profmetric/status
The response shows one of the following states:
| State | Example output | Meaning |
|---|---|---|
| Normal collection | Status: Collecting | Metrics are actively collected. |
| | Pause Available: In 0 seconds | Pause is available immediately. |
| Paused | Status: Paused | Collection is paused. |
| | Resume Countdown: In 600 seconds | Time remaining until collection resumes automatically. |
| Cooldown | Status: Collecting | Collection recently resumed. |
| | Pause Available: In 60 seconds | Time remaining before pause becomes available again. |
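The states above can be told apart in a script by reading the two fields of the response. The following is a minimal sketch, assuming the response is plain text that contains the strings shown in the table; the function names `status_state` and `pause_wait_seconds` are hypothetical helpers, not part of ACS.

```shell
# Sketch: extract fields from the status response text read on stdin.
# Assumes the response contains "Status: <State>" and, when collecting,
# "Pause Available: In <N> seconds", as shown in the table above.

status_state() {
  # Prints the state word, for example Collecting or Paused.
  grep -oE 'Status: [A-Za-z]+' | sed 's/^Status: //' | head -n 1
}

pause_wait_seconds() {
  # Prints the number of seconds until pause is available.
  # Prints nothing if the response has no "Pause Available" field.
  grep -oE 'Pause Available: In [0-9]+ seconds' | grep -oE '[0-9]+' | head -n 1
}
```

For example, `curl -s -X POST http://localhost:9501/profmetric/status | pause_wait_seconds` would print the remaining cooldown, under the format assumption above.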
Pause metric collection
Run the pause command in the GPU pod terminal. By default, collection pauses for 600 seconds, which is also the maximum. To pause for a shorter duration, pass time=<seconds> as request data with curl's -d option.
Default duration (600 seconds):
curl -X POST http://localhost:9501/profmetric/pause
Expected output:
Successfully pause metrics collection for 600 seconds
Custom duration:
curl -X POST http://localhost:9501/profmetric/pause -d "time=200"
Expected output:
Successfully pause metrics collection for 200 seconds
Resume metric collection
To resume metric collection before the pause timer expires, run the following command in the GPU pod terminal:
curl -X POST http://localhost:9501/profmetric/resume
Expected output:
Successfully resumed metrics collection
API reference
| Endpoint | Method | Description |
|---|---|---|
| http://localhost:9501/profmetric/status | POST | Check current collection status |
| http://localhost:9501/profmetric/pause | POST | Pause collection (default: 600 seconds) |
| http://localhost:9501/profmetric/pause -d "time=<seconds>" | POST | Pause collection for a custom duration (max: 600 seconds) |
| http://localhost:9501/profmetric/resume | POST | Resume collection manually |
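The endpoints in this table can be wrapped in one small shell function to avoid retyping URLs. This is a sketch only: the function name `profmetric` and the `PROFMETRIC_URL` variable are hypothetical, and the lower bound of 1 second is an assumption, since the documentation states only the 600-second maximum.

```shell
# Sketch: a thin wrapper around the three profmetric endpoints.
# PROFMETRIC_URL and profmetric() are illustrative names, not ACS tooling.
PROFMETRIC_URL="${PROFMETRIC_URL:-http://localhost:9501/profmetric}"

profmetric() {
  local action="$1" seconds="${2:-}"
  case "$action" in
    status|resume)
      curl -sS -X POST "${PROFMETRIC_URL}/${action}"
      ;;
    pause)
      case "$seconds" in
        '')
          # No duration given: the service applies its 600-second default.
          curl -sS -X POST "${PROFMETRIC_URL}/pause"
          ;;
        *[!0-9]*)
          echo "pause duration must be an integer from 1 to 600" >&2
          return 2
          ;;
        *)
          # Assumed valid range: 1 to 600 (only the maximum is documented).
          if [ "$seconds" -ge 1 ] && [ "$seconds" -le 600 ]; then
            curl -sS -X POST "${PROFMETRIC_URL}/pause" -d "time=${seconds}"
          else
            echo "pause duration must be an integer from 1 to 600" >&2
            return 2
          fi
          ;;
      esac
      ;;
    *)
      echo "usage: profmetric status|pause [seconds]|resume" >&2
      return 2
      ;;
  esac
}
```

With the function loaded in the pod terminal, `profmetric pause 200` sends the same request as the custom-duration command shown earlier, and `profmetric pause 900` is rejected locally before any request is sent.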
Production considerations
Schedule profiling during off-peak hours
Temporarily disabling metric collection may lead to missed or false-positive alerts. Schedule profiling during off-peak hours to minimize the impact on your monitoring and alerting systems.
Implement a recovery mechanism
In your profiling scripts, include a recovery mechanism, such as a trap in a shell script, that runs curl -X POST http://localhost:9501/profmetric/resume if the script exits unexpectedly. This restores metric collection as soon as profiling stops instead of leaving it paused until the pause timer expires.
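One way to structure such a recovery mechanism is sketched below for bash. The function names `resume_metrics` and `run_with_metrics_paused` and the `PROFMETRIC_URL` variable are hypothetical. The sketch assumes that curl's exit status reflects only whether the request was sent, so a pause rejected during the cooldown period is not detected here.

```shell
#!/usr/bin/env bash
# Sketch: pause profiling metric collection around a command and resume it
# afterwards, including when the shell exits before the command returns.
PROFMETRIC_URL="${PROFMETRIC_URL:-http://localhost:9501/profmetric}"

resume_metrics() {
  # Best effort: a failed resume must not mask the profiler's exit status.
  curl -sS -X POST "${PROFMETRIC_URL}/resume" || true
}

run_with_metrics_paused() {
  local seconds="$1"
  shift
  # Stop if the pause request cannot be sent at all.
  curl -sS -X POST "${PROFMETRIC_URL}/pause" -d "time=${seconds}" || return 1
  # Replaces any existing EXIT trap; fires if the shell exits early.
  trap resume_metrics EXIT
  "$@"
  local status=$?
  # Normal path: clear the trap, resume once, and pass the status through.
  trap - EXIT
  resume_metrics
  return "$status"
}
```

A call such as `run_with_metrics_paused 300 nsys profile -o report ./my_app` (where `./my_app` stands for your workload) keeps collection paused only while the profiler runs. A trap cannot run if the shell is killed with SIGKILL; in that case collection resumes when the pause timer expires.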
Affected metrics
The following DCGM profiling metrics conflict with NVIDIA profilers on the listed GPU models. When you call the pause endpoint, collection of all these metrics is paused.
| Metric name | Description | Affected GPU models |
|---|---|---|
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | The percentage of time over a sampling period that the Graphics or Compute engine is active. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | The utilization of the double-precision (FP64) pipe. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | The utilization of the Fused Multiply-Add (FMA) pipe for single-precision (FP32) and integer operations. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | The utilization of the half-precision (FP16) pipe. | T4, A10, L20 (GN8IS) |
| DCGM_FI_PROF_SM_ACTIVE | The percentage of time over a sampling period that at least one warp is active on a Streaming Multiprocessor (SM). | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_SM_OCCUPANCY | The ratio of resident warps to the maximum supported warps on an SM over a sampling period. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | The utilization of the Tensor Core (HMMA/IMMA) pipe. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_DRAM_ACTIVE | The percentage of time the device memory is busy sending or receiving data. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_NVLINK_TX_BYTES | The total number of bytes transmitted over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_FI_PROF_NVLINK_RX_BYTES | The total number of bytes received over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
| DCGM_CUSTOM_PROF_TENS_TFPS_USED | The utilization of the GPU's Tensor Cores. | T4, A10, L20 (GN8IS), P16EN |
FAQ
Can I pause metric collection for GPU models not listed above?
Yes. However, doing so is generally unnecessary, because most other GPU models do not experience conflicts between DCGM metric collection and NVIDIA profilers.
Is there a cooldown period after resuming?
Yes. After metric collection resumes, a 60-second cooldown period must pass before you can pause again. If you attempt to pause during the cooldown, the API returns an error:
Operation is rejected. Reason: gpu metrics collection switch is just modified, please retry after about 54 seconds
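A script can read the suggested wait time from this error message and retry once. The sketch below assumes the rejection text keeps the "retry after about N seconds" wording shown above and that it is returned in the response body; `retry_after_seconds`, `pause_with_retry`, and `PROFMETRIC_URL` are hypothetical names.

```shell
# Sketch: retry a pause request once if it is rejected during the cooldown.
PROFMETRIC_URL="${PROFMETRIC_URL:-http://localhost:9501/profmetric}"

retry_after_seconds() {
  # Reads a response on stdin and prints N from "retry after about N seconds".
  # Prints nothing if the response is not a cooldown rejection.
  grep -oE 'retry after about [0-9]+ seconds' | grep -oE '[0-9]+' | head -n 1
}

pause_with_retry() {
  local response delay
  response=$(curl -sS -X POST "${PROFMETRIC_URL}/pause") || return 1
  delay=$(printf '%s\n' "$response" | retry_after_seconds) || true
  if [ -n "$delay" ]; then
    # The message says "about", so wait one extra second before retrying.
    sleep "$((delay + 1))"
    response=$(curl -sS -X POST "${PROFMETRIC_URL}/pause") || return 1
  fi
  # Print the final response so the caller can check it.
  printf '%s\n' "$response"
}
```

The function retries only once, so the caller should still inspect the printed response before starting the profiler.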