The default GPU monitoring enabled by Alibaba Cloud Container Service (ACS) conflicts with performance profilers such as NVIDIA Nsight. This happens because certain metrics require exclusive access to the GPU's profiling counters and can only be collected by one process at a time. On affected GPU types, including T4, A10, L20 (GN8IS), and P16EN, this conflict may prevent profilers from collecting data and can generate CUPTI or DCGM errors. The solution is to temporarily pause the collection of the conflicting metrics while you profile.
Procedure
Check metric collection status
Connect to the container of the GPU pod through a terminal.
Log on to the ACS console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, open the pod list page.
In the Actions column of the target GPU pod, click Terminal to connect to the container.
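Alternatively, if you have kubectl access to the cluster, you can open a shell in the GPU pod from the command line. This is a minimal sketch; the pod and namespace names are placeholders that you must replace with your own values.
# Open an interactive shell in the GPU pod (placeholder names; replace with your own).
kubectl exec -it <your-gpu-pod> -n <your-namespace> -- /bin/bash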
Check the metric collection status.
curl -X POST http://localhost:9501/profmetric/status
Expected output:
Normal collection: Metrics are being collected. The "Pause Available" field shows "In 0 seconds".
Status: Collecting
Pause Available: In 0 seconds
Paused collection: The "Resume Countdown" field shows the time until collection automatically resumes.
Status: Paused
Resume Countdown: In 600 seconds
Normal collection (in cooldown period): The "Pause Available" value indicates that collection has recently resumed, and a cooldown period must pass before it can be paused again.
Status: Collecting
Pause Available: In 60 seconds
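If you script your profiling workflow, you can parse this status output before attempting to pause. The following is a minimal sketch that assumes the output format shown above; it simply waits until the endpoint reports that pausing is available.
# Wait until the status endpoint reports "Pause Available: In 0 seconds".
# Assumes the output format shown above; adjust the pattern if it differs.
while true; do
  status=$(curl -s -X POST http://localhost:9501/profmetric/status)
  if echo "$status" | grep -q "Pause Available: In 0 seconds"; then
    break
  fi
  sleep 10
done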
Pause metric collection
Use default parameters
Pause metric collection for the default period of 600 seconds.
curl -X POST http://localhost:9501/profmetric/pause
Expected output:
Successfully pause metrics collection for 600 seconds
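To confirm that the pause took effect, you can query the status endpoint again right after pausing. A minimal sketch, assuming the status output format shown earlier:
# Pause collection for the default 600 seconds, then confirm the new status.
curl -X POST http://localhost:9501/profmetric/pause
curl -X POST http://localhost:9501/profmetric/status   # should report "Status: Paused"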
Specify a pause duration
Run the following command, using the -d flag to specify a custom duration (time) in seconds. The maximum and default value is 600 seconds.
curl -X POST http://localhost:9501/profmetric/pause -d "time=200"
Expected output:
Successfully pause metrics collection for 200 seconds
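If the pause duration comes from a variable in your script, consider clamping it to 600 seconds before sending the request, because that is the maximum the endpoint accepts. A minimal sketch; PAUSE_SECONDS is a hypothetical variable name:
# Clamp the requested pause duration to the documented 600-second maximum.
PAUSE_SECONDS=900                      # hypothetical requested value
if [ "$PAUSE_SECONDS" -gt 600 ]; then
  PAUSE_SECONDS=600
fi
curl -X POST http://localhost:9501/profmetric/pause -d "time=${PAUSE_SECONDS}"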
Resume metric collection
Resume metric collection.
curl -X POST http://localhost:9501/profmetric/resume
Expected output:
Successfully resumed metrics collection
Apply in production
Choose an appropriate time window: Temporarily disabling metric collection can lead to missed or false-positive alerts. We recommend performing profiling during off-peak hours to minimize the impact on your monitoring and alerting systems.
Implement a recovery mechanism: In your profiling scripts, include a mechanism (for example, a trap in a shell script) that runs the curl -X POST http://localhost:9501/profmetric/resume command if the script exits unexpectedly. This guarantees that metric collection is always restored. A sketch follows below.
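The following is a minimal sketch of such a script. The profiling command (run_profiling.sh) is a hypothetical placeholder for your actual Nsight invocation; the trap ensures that the resume request is sent even if profiling fails or the script is interrupted.
#!/bin/bash
# Pause metric collection, run the profiler, and always resume collection on exit.
set -e

resume_metrics() {
  # Restore metric collection even if the profiling step fails or is interrupted.
  curl -X POST http://localhost:9501/profmetric/resume
}
trap resume_metrics EXIT

# Pause collection for up to 600 seconds (the maximum).
curl -X POST http://localhost:9501/profmetric/pause -d "time=600"

# Placeholder for your profiling command (for example, an Nsight Systems run).
./run_profiling.sh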
Affected metrics and models
To use profilers with the GPU models listed below, you must temporarily disable the following metrics to avoid conflicts.
Metric name | Description | Affected GPU models |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | The percentage of time over a sampling period that the Graphics or Compute engine is active. | T4, A10, L20 (GN8IS) |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | The utilization of the double-precision (FP64) pipe. | T4, A10, L20 (GN8IS) |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | The utilization of the Fused Multiply-Add (FMA) pipe for single-precision (FP32) and integer operations. | T4, A10, L20 (GN8IS) |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | The utilization of the half-precision (FP16) pipe. | T4, A10, L20 (GN8IS) |
DCGM_FI_PROF_SM_ACTIVE | The percentage of time over a sampling period that at least one warp is active on a Streaming Multiprocessor (SM). | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_SM_OCCUPANCY | The ratio of resident warps to the maximum supported warps on an SM over a sampling period. | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | The utilization of the Tensor Core (HMMA/IMMA) pipe. | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_DRAM_ACTIVE | The percentage of time the device memory is busy sending or receiving data. | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_NVLINK_TX_BYTES | The total number of bytes transmitted over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_NVLINK_RX_BYTES | The total number of bytes received over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
DCGM_CUSTOM_PROF_TENS_TFPS_USED | The utilization of the GPU's Tensor Cores. | T4, A10, L20 (GN8IS), P16EN |
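If you are not sure which GPU model a pod uses, you can check from inside the container before deciding whether to pause collection. A minimal sketch, assuming nvidia-smi is available in the container:
# Print the GPU model so you can compare it against the affected models listed above.
nvidia-smi --query-gpu=name --format=csv,noheader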
FAQ
Can I use this method to pause metric collection for GPU models other than those listed above?
Yes, this method can be used on other GPU models. However, it is generally unnecessary, as most other GPU models do not experience conflicts between DCGM metric collection and NVIDIA profilers.
Is there a cooldown period for toggling metrics collection?
Yes, there is. After resuming metric collection, you must wait for a 60-second cooldown period before you can pause it again. If you attempt to pause collection during this period, you will receive an error message similar to the following:
Operation is rejected. Reason: gpu metrics collection switch is just modified, please retry after about 54 seconds
You must wait for the cooldown period to end before you try again.
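If your automation hits this rejection, a simple approach is to retry after a delay instead of failing outright. A minimal sketch that assumes the response texts shown in this topic; it retries until the pause request succeeds:
# Retry the pause request until it is no longer rejected by the cooldown check.
# Assumes the success message shown earlier; adjust the match if it differs.
until curl -s -X POST http://localhost:9501/profmetric/pause | grep -q "Successfully pause"; do
  echo "Pause rejected (cooldown in effect); retrying in 15 seconds..."
  sleep 15
done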