The default GPU monitoring enabled by Alibaba Cloud Container Service (ACS) conflicts with performance profilers such as NVIDIA Nsight. This happens because certain metrics require exclusive access to the GPU's profiling counters and can only be collected by one process at a time. On affected GPU types, including T4, A10, L20 (GN8IS), and P16EN, this conflict may prevent profilers from collecting data and can generate CUPTI or DCGM errors. The solution is to temporarily pause the collection of the conflicting metrics while you profile.
Procedure
Check metric collection status
Connect to the container of the GPU pod through a terminal.
Log on to the ACS console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, open the pod list page.
In the Actions column of the target GPU pod, click Terminal to connect to the container.
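Alternatively, if you have kubectl access to the cluster, you can open a shell in the GPU pod from the command line. This is a minimal sketch; the pod and namespace names are placeholders that you must replace with your own values.
# Open an interactive shell in the GPU pod (placeholder names; replace with your own).
kubectl exec -it <your-gpu-pod> -n <your-namespace> -- /bin/bash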
Check the metric collection status.
curl -X POST http://localhost:9501/profmetric/status
Expected output:
Normal collection: Metrics are being collected. The "Pause Available" field shows "In 0 seconds".
Status: Collecting
Pause Available: In 0 seconds
Paused collection: The "Resume Countdown" field shows the time until collection automatically resumes.
Status: Paused
Resume Countdown: In 600 seconds
Normal collection (in cooldown period): The "Pause Available" value indicates that collection has recently resumed, and a cooldown period must pass before it can be paused again.
Status: Collecting
Pause Available: In 60 seconds
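If you script your profiling workflow, you can parse this status output before attempting to pause. The following is a minimal sketch that assumes the output format shown above; it simply waits until the endpoint reports that pausing is available.
# Wait until the status endpoint reports "Pause Available: In 0 seconds".
# Assumes the output format shown above; adjust the pattern if it differs.
while true; do
  status=$(curl -s -X POST http://localhost:9501/profmetric/status)
  if echo "$status" | grep -q "Pause Available: In 0 seconds"; then
    break
  fi
  sleep 10
done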
Pause metric collection
Use default parameters
Pause metric collection for the default period of 600 seconds.
curl -X POST http://localhost:9501/profmetric/pause
Expected output:
Successfully pause metrics collection for 600 seconds
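To confirm that the pause took effect, you can query the status endpoint again right after pausing. A minimal sketch, assuming the status output format shown earlier:
# Pause collection for the default 600 seconds, then confirm the new status.
curl -X POST http://localhost:9501/profmetric/pause
curl -X POST http://localhost:9501/profmetric/status   # should report "Status: Paused"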
Specify a pause duration
Run the following command, using the -d flag to specify a custom duration (time) in seconds. The maximum and default value is 600 seconds.
curl -X POST http://localhost:9501/profmetric/pause -d "time=200"
Expected output:
Successfully pause metrics collection for 200 seconds
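If the pause duration comes from a variable in your script, consider clamping it to 600 seconds before sending the request, because that is the maximum the endpoint accepts. A minimal sketch; PAUSE_SECONDS is a hypothetical variable name:
# Clamp the requested pause duration to the documented 600-second maximum.
PAUSE_SECONDS=900                      # hypothetical requested value
if [ "$PAUSE_SECONDS" -gt 600 ]; then
  PAUSE_SECONDS=600
fi
curl -X POST http://localhost:9501/profmetric/pause -d "time=${PAUSE_SECONDS}"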
Resume metric collection
Resume metric collection.
curl -X POST http://localhost:9501/profmetric/resume
Expected output:
Successfully resumed metrics collection
Apply in production
Choose an appropriate time window: Temporarily disabling metric collection can lead to missed or false-positive alerts. We recommend performing profiling during off-peak hours to minimize the impact on your monitoring and alerting systems.
Implement a recovery mechanism: In your profiling scripts, include a mechanism (for example, a trap in a shell script) that runs the curl -X POST http://localhost:9501/profmetric/resume command if the script exits unexpectedly. This guarantees that metric collection is always restored. A sketch follows below.
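The following is a minimal sketch of such a script. The profiling command (run_profiling.sh) is a hypothetical placeholder for your actual Nsight invocation; the trap ensures that the resume request is sent even if profiling fails or the script is interrupted.
#!/bin/bash
# Pause metric collection, run the profiler, and always resume collection on exit.
set -e

resume_metrics() {
  # Restore metric collection even if the profiling step fails or is interrupted.
  curl -X POST http://localhost:9501/profmetric/resume
}
trap resume_metrics EXIT

# Pause collection for up to 600 seconds (the maximum).
curl -X POST http://localhost:9501/profmetric/pause -d "time=600"

# Placeholder for your profiling command (for example, an Nsight Systems run).
./run_profiling.sh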
Affected metrics and models
To use profilers with the GPU models listed below, you must temporarily disable the following metrics to avoid conflicts.
Metric name | Description | Affected GPU models |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | The percentage of time over a sampling period that the Graphics or Compute engine is active. | T4, A10, L20 (GN8IS) |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | The utilization of the double-precision (FP64) pipe. | T4, A10, L20 (GN8IS) |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | The utilization of the Fused Multiply-Add (FMA) pipe for single-precision (FP32) and integer operations. | T4, A10, L20 (GN8IS) |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | The utilization of the half-precision (FP16) pipe. | T4, A10, L20 (GN8IS) |
DCGM_FI_PROF_SM_ACTIVE | The percentage of time over a sampling period that at least one warp is active on a Streaming Multiprocessor (SM). | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_SM_OCCUPANCY | The ratio of resident warps to the maximum supported warps on an SM over a sampling period. | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | The utilization of the Tensor Core (HMMA/IMMA) pipe. | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_DRAM_ACTIVE | The percentage of time the device memory is busy sending or receiving data. | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_NVLINK_TX_BYTES | The total number of bytes transmitted over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
DCGM_FI_PROF_NVLINK_RX_BYTES | The total number of bytes received over NVLink, including both header and payload. | T4, A10, L20 (GN8IS), P16EN |
DCGM_CUSTOM_PROF_TENS_TFPS_USED | The utilization of the GPU's Tensor Cores. | T4, A10, L20 (GN8IS), P16EN |
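If you are not sure which GPU model a pod uses, you can check from inside the container before deciding whether to pause collection. A minimal sketch, assuming nvidia-smi is available in the container:
# Print the GPU model so you can compare it against the affected models listed above.
nvidia-smi --query-gpu=name --format=csv,noheader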
FAQ
Can I use this method to pause metric collection for GPU models other than those listed above?
Yes, this method can be used on other GPU models. However, it is generally unnecessary, as most other GPU models do not experience conflicts between DCGM metric collection and NVIDIA profilers.
Is there a cooldown period for toggling metrics collection?
Yes, there is. After resuming metric collection, you must wait for a 60-second cooldown period before you can pause it again. If you attempt to pause collection during this period, you will receive an error message similar to the following:
Operation is rejected. Reason: gpu metrics collection switch is just modified, please retry after about 54 seconds
You must wait for the cooldown period to end before you try again.
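If your automation hits this rejection, a simple approach is to retry after a delay instead of failing outright. A minimal sketch that assumes the response texts shown in this topic; it retries until the pause request succeeds:
# Retry the pause request until it is no longer rejected by the cooldown check.
# Assumes the success message shown earlier; adjust the match if it differs.
until curl -s -X POST http://localhost:9501/profmetric/pause | grep -q "Successfully pause"; do
  echo "Pause rejected (cooldown in effect); retrying in 15 seconds..."
  sleep 15
done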