ACK Edge clusters let you manage GPU-accelerated nodes across data centers and edge environments. By connecting an ACK Edge cluster to Managed Service for Prometheus, you can monitor GPU nodes at the edge using the same dashboards and metrics pipeline as cloud nodes.
How edge GPU monitoring works
ACK Edge clusters support infrastructure as a service (IaaS) resources—including data center nodes, third-party cloud nodes, and IoT devices—connected over Express Connect circuits or the Internet.
For Internet-connected edge nodes, Managed Service for Prometheus uses Raven to bridge the monitoring path between the cloud and the edge. The following diagram shows how this works:
Managed Service for Prometheus collects metrics by node name instead of node IP address. CoreDNS uses its Hosts plugin to resolve edge node names to the Raven Service.
When Managed Service for Prometheus accesses the Raven Service, it selects a gateway node from the Service backend to communicate with the edge network domain.
The Raven-agent on the gateway node establishes an encrypted channel with the Raven-agent on the on-premises data center gateway node. Both Layer 3 and Layer 7 network communication are supported.
On the on-premises data center gateway node, Raven-agent retrieves monitoring data by accessing the GPU collection port of the target node.
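One way to spot-check the name-resolution step described above is to look up an edge node name through the cluster DNS. The following is a minimal sketch, not an official procedure. The function name `resolve_node_name`, the pod name `dns-check`, and the `busybox:1.36` image are assumptions made for illustration, and the pod must be scheduled on a node that can pull that image.

```shell
# resolve_node_name NODE
# Starts a throwaway pod and resolves NODE through the cluster DNS (CoreDNS).
# Hypothetical helper: the pod name and image are illustrative assumptions.
resolve_node_name() {
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: resolve_node_name NODE" >&2
    return 2
  fi
  # --rm deletes the pod once nslookup exits; -i attaches so the output is shown.
  kubectl run dns-check --rm -i --restart=Never \
    --image=busybox:1.36 -- nslookup "$node"
}
```

For example, `resolve_node_name <edge-node-name>` prints the address that CoreDNS returns for that node name. If the mechanism described above is in effect, that address would be expected to belong to the Raven Service rather than to the edge node itself.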
Metric types
GPU monitoring in ACK Edge provides two categories of metrics:
| Metric type | Description | Billing |
|---|---|---|
| DCGM (Data Center GPU Manager) Exporter metrics | Standard GPU metrics compatible with the DCGM Exporter. See the supported metrics list. | — |
| Custom metrics | Extended metrics for specific scenarios beyond the DCGM Exporter standard set. See the custom metrics list. | Billed. Review Billing overview before enabling. |
To monitor and manage resource usage costs, see View resource usage.
Prerequisites
Before you begin, make sure you have:
An ACK Edge cluster with network connectivity between edge nodes and the cloud over Express Connect circuits or the Internet
kubectl configured to connect to your cluster. See Use kubectl to connect to the ACK cluster.
Monitor edge GPU nodes
Step 1: Enable Managed Service for Prometheus
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find your cluster and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, follow the on-screen instructions to start the installation. The system then installs the required component and checks the dashboards automatically. After the installation is complete, click each tab to view metrics.
Step 2: Add edge GPU-accelerated nodes
For information about how to add edge GPU-accelerated nodes, see Add a GPU-accelerated node.
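After the nodes are added, it can help to confirm that they advertise GPU capacity before deploying a workload. The sketch below lists nodes whose allocatable `nvidia.com/gpu` count is greater than zero. The function name `list_gpu_nodes` is a hypothetical helper, and the sketch assumes the GPU device plugin on each node reports GPUs under the `nvidia.com/gpu` resource name used by the sample Job in this topic.

```shell
# list_gpu_nodes
# Prints "<node-name> <gpu-count>" for every node with allocatable GPUs.
# Hypothetical helper; assumes GPUs are exposed as the nvidia.com/gpu resource.
list_gpu_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    # Nodes without GPUs show "<none>" in the GPU column and are filtered out.
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1, $2 }'
}
```

If a newly added edge GPU node does not appear in the output of `list_gpu_nodes`, the node has probably not registered its GPUs yet.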
Step 3: Deploy a workload to verify GPU metrics
This step deploys a TensorFlow Benchmark workload using exclusive GPU scheduling to confirm that GPU metrics are collected correctly. To run applications with GPU sharing instead, see Work with multi-GPU sharing.
Create a file named tensorflow.yaml with the following content:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-benchmark-exclusive
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-benchmark-exclusive
        spec:
          containers:
          - name: tensorflow-benchmark
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
            command:
            - bash
            - run.sh
            - --num_batches=5000000
            - --batch_size=8
            resources:
              limits:
                nvidia.com/gpu: 1 # Request one GPU.
            workingDir: /root
          restartPolicy: Never

Apply the Job to your cluster:

    kubectl apply -f tensorflow.yaml
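Before opening the dashboards, you may want to confirm that the benchmark pod is running, since GPU metrics only become meaningful once the workload is active. The following sketch wraps two standard kubectl commands. The function name `benchmark_status` is a hypothetical helper, and the Job name and label are taken from the manifest in this step.

```shell
# benchmark_status
# Shows where the benchmark pod is scheduled and the tail of its log output.
# Hypothetical helper; names match the tensorflow-benchmark-exclusive Job.
benchmark_status() {
  # The NODE column shows which GPU node the pod landed on.
  kubectl get pods -l app=tensorflow-benchmark-exclusive -o wide
  # Print the most recent log lines from the Job's pod.
  kubectl logs job/tensorflow-benchmark-exclusive --tail=20
}
```

Run `benchmark_status` after applying the Job. A pod in the Running state on an edge GPU node suggests that the GPU was allocated and that metrics for that node should start to appear.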
Step 4: View GPU monitoring dashboards
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find your cluster and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, click the GPU Monitoring tab.
Click GPUs - Cluster Dimension to view cluster-level GPU metrics. For details, see View the cluster dashboard.

Click GPUs - Nodes to view node-level GPU metrics. For details, see View the node dashboard.

Step 5: Query GPU metrics in ARMS
Custom metrics used by GPU monitoring are billed. Before enabling custom metrics, review the Billing overview to understand how fees are calculated. Fees vary by cluster size and number of applications. To monitor and manage resource usage, see View resource usage.
To query specific GPU metrics using Metrics Explorer:
Log on to the ARMS console.
In the left-side navigation pane, choose Metric Center > Metrics Explorer.
Select a Prometheus instance from the drop-down list at the top of the page.
In the A section, select a query mode based on your requirements, select the metrics you want to query, and then click Run query.
