Container Service for Kubernetes: Best practices for monitoring GPU resources in ACK Edge clusters

Last Updated: Mar 26, 2026

ACK Edge clusters let you manage GPU-accelerated nodes across data centers and edge environments. By connecting an ACK Edge cluster to Managed Service for Prometheus, you can monitor GPU nodes at the edge using the same dashboards and metrics pipeline as cloud nodes.

How edge GPU monitoring works

ACK Edge clusters support infrastructure as a service (IaaS) resources—including data center nodes, third-party cloud nodes, and IoT devices—connected over Express Connect circuits or the Internet.

For Internet-connected edge nodes, Managed Service for Prometheus uses Raven to bridge the monitoring path between the cloud and the edge. The following diagram shows how this works:

Raven architecture diagram
  1. Managed Service for Prometheus collects metrics by node name instead of node IP address. CoreDNS uses its Hosts plugin to resolve edge node names to the Raven Service.

  2. When Managed Service for Prometheus accesses the Raven Service, it selects a gateway node from the Service backend to communicate with the edge network domain.

  3. The Raven-agent on the gateway node establishes an encrypted channel with the Raven-agent on the on-premises data center gateway node. Both Layer 3 and Layer 7 network communication are supported.

  4. On the on-premises data center gateway node, Raven-agent retrieves monitoring data by accessing the GPU collection port of the target node.
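The name-based lookup in steps 1 and 2 can be pictured with a small sketch. It does not reproduce how CoreDNS or Raven is implemented; it only mimics the effect of a hosts-style table in which every edge node name resolves to the Raven Service. The node names, the Service address, and the port number are hypothetical values chosen for illustration.

```python
# Hypothetical hosts-style table: each edge node name maps to the same
# Raven Service address, similar in effect to what the CoreDNS Hosts plugin provides.
RAVEN_SERVICE_IP = "192.168.0.10"  # assumed address, for illustration only
HOSTS = {
    "edge-gpu-node-1": RAVEN_SERVICE_IP,
    "edge-gpu-node-2": RAVEN_SERVICE_IP,
}
GPU_METRICS_PORT = 9445  # assumed GPU collection port, not taken from this document


def scrape_url(node_name: str) -> str:
    """Build the scrape URL from the node name instead of the node IP address."""
    return f"http://{node_name}:{GPU_METRICS_PORT}/metrics"


def resolve(node_name: str) -> str:
    """Return the address that the collector actually connects to."""
    return HOSTS[node_name]


for name in HOSTS:
    print(scrape_url(name), "->", resolve(name))
```

Because every name resolves to the same Service, the request itself still carries the node name, which is what lets the gateway side forward it to the right target node.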

Metric types

GPU monitoring in ACK Edge provides two categories of metrics:

| Metric type | Description | Billing |
| DCGM (Data Center GPU Manager) Exporter metrics | Standard GPU metrics compatible with the DCGM Exporter. See the supported metrics list. | |
| Custom metrics | Extended metrics for specific scenarios beyond the DCGM Exporter standard set. See the custom metrics list. | Billed. Review Billing overview before enabling. |

To monitor and manage resource usage costs, see View resource usage.
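To get a feel for what DCGM Exporter metrics look like, the following sketch parses a few lines in the Prometheus text exposition format. The sample text is invented for illustration: the metric names DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED are standard DCGM Exporter names, but the label set and values are assumptions, and the parser is deliberately simplified (it ignores timestamps and escaped quotes).

```python
import re

# Invented sample output; real exporters emit more labels and more metrics.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="edge-gpu-node-1"} 87
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="edge-gpu-node-1"} 10240
'''

# metric_name{label="value",...} value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("Hostname"), value)
```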

Prerequisites

Before you begin, make sure you have:

  • An ACK Edge cluster with network connectivity between edge nodes and the cloud over Express Connect circuits or the Internet

  • kubectl configured to connect to your cluster. For details, see Use kubectl to connect to the ACK cluster.

Monitor edge GPU nodes

Step 1: Enable Managed Service for Prometheus

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find your cluster and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, follow the on-screen instructions to install the required component. The system installs the component and checks the dashboards automatically. After the installation is complete, click each tab to view metrics.

Step 2: Add edge GPU-accelerated nodes

For information about how to add edge GPU-accelerated nodes, see Add a GPU-accelerated node.

Step 3: Deploy a workload to verify GPU metrics

This step deploys a TensorFlow Benchmark workload using exclusive GPU scheduling to confirm that GPU metrics are collected correctly. To run applications with GPU sharing instead, see Work with multi-GPU sharing.

  1. Create a file named tensorflow.yaml with the following content:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-benchmark-exclusive
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-benchmark-exclusive
        spec:
          containers:
          - name: tensorflow-benchmark
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
            command:
            - bash
            - run.sh
            - --num_batches=5000000
            - --batch_size=8
            resources:
              limits:
                nvidia.com/gpu: 1 # Request one GPU.
            workingDir: /root
          restartPolicy: Never

  2. Apply the Job to your cluster:

    kubectl apply -f tensorflow.yaml
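Before looking at dashboards, it can help to confirm that the Job actually started. The following sketch is one possible way to do that, assuming kubectl is configured as described in the prerequisites. It reads the Job object as JSON and summarizes its state from the standard Job status fields (conditions and active). The helper names job_state and fetch_job are hypothetical.

```python
import json
import subprocess

JOB_NAME = "tensorflow-benchmark-exclusive"  # metadata.name in tensorflow.yaml


def job_state(job: dict) -> str:
    """Summarize a Job object (as returned by kubectl -o json)."""
    status = job.get("status", {})
    # A finished Job carries a Complete or Failed condition with status "True".
    for cond in status.get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return cond["type"].lower()
    # "active" counts pods that are pending or running.
    if status.get("active", 0) > 0:
        return "active"
    return "pending"


def fetch_job(name: str = JOB_NAME) -> dict:
    """Fetch the Job from the cluster; requires a working kubectl context."""
    result = subprocess.run(
        ["kubectl", "get", "job", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

With a working kubectl context, calling job_state(fetch_job()) should report active while the benchmark pod holds the GPU, which is the period during which GPU metrics for the workload are expected to appear.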

Step 4: View GPU monitoring dashboards

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find your cluster and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, click the GPU Monitoring tab.

Step 5: Query GPU metrics in ARMS

Use Metrics Explorer in the ARMS console to query specific GPU metrics.

Important

Custom metrics used by GPU monitoring are billed. Before enabling custom metrics, review the Billing overview to understand how fees are calculated. Fees vary by cluster size and number of applications. To monitor and manage resource usage, see View resource usage.

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, choose Metric Center > Metrics Explorer.

  3. Select a Prometheus instance from the drop-down list at the top of the page.

  4. In the A section, select the metrics you want to query and click Run query. Select a mode based on your requirements.

    Metrics Explorer
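The same kind of query can also be expressed against a Prometheus-compatible HTTP API. The sketch below only builds the request URL for an instant query on the standard /api/v1/query endpoint. The base URL is a placeholder, not a real endpoint, and the PromQL expression assumes that the collected DCGM metrics carry a Hostname label, which may differ in your environment.

```python
from urllib.parse import urlencode

# Placeholder base URL; replace it with the HTTP API address of your own
# Prometheus instance. Authentication is not covered by this sketch.
BASE_URL = "https://prometheus.example.invalid"

# Average GPU utilization per node; the Hostname label is an assumption.
PROMQL = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL with the expression URL-encoded."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


url = build_instant_query_url(BASE_URL, PROMQL)
print(url)
```

The PROMQL expression on its own can be pasted into the Metrics Explorer query box if you prefer to stay in the console.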

What's next