
Container Service for Kubernetes: Best practices for monitoring GPU resources in ACK Edge clusters

Last Updated: Jun 04, 2025

ACK Edge clusters allow you to manage GPU-accelerated nodes in data centers and at the edge, so you can manage heterogeneous computing power across multiple regions and environments. By connecting an ACK Edge cluster to Managed Service for Prometheus, you can monitor GPU-accelerated nodes in data centers and at the edge in the same way as nodes in the cloud.

Observability principle of edge nodes

ACK Edge clusters allow you to connect infrastructure as a service (IaaS) resources, such as nodes in data centers, on third-party clouds, and on IoT devices, over Express Connect circuits or the Internet. When edge nodes communicate with the cloud over Express Connect circuits, Managed Service for Prometheus can access the edge nodes directly, so observability works as expected. When edge nodes connect over the Internet, Managed Service for Prometheus uses Raven to monitor them. The following figure shows the steps:

(Figure: how Managed Service for Prometheus collects metrics from edge nodes through Raven)
  1. Managed Service for Prometheus collects metrics based on node names instead of node IP addresses. During domain name resolution, the hosts plug-in configured in CoreDNS resolves edge node names to the Raven Service (see the sketch after this list).

  2. When Managed Service for Prometheus accesses the Raven Service, the Service selects a gateway node from its backends to communicate with the edge network domain.

  3. The Raven agent on the cloud-side gateway node establishes an encrypted channel with the Raven agent on the gateway node in the on-premises data center. Both Layer 3 and Layer 7 network communication are supported.

  4. On the gateway node in the on-premises data center network domain, the Raven agent obtains monitoring data by accessing the GPU metrics collection port of the target node.
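
The mapping described in step 1 relies on the standard CoreDNS hosts plug-in. The following Corefile fragment is only a minimal sketch of what such a configuration can look like: the ClusterIP 172.16.0.10 and the node name edge-gpu-node-1 are placeholders, and in an ACK Edge cluster these entries are maintained automatically rather than edited by hand.

    # Hypothetical CoreDNS Corefile fragment (values are placeholders).
    # Edge node names are resolved to the Raven gateway Service so that
    # scrapes from Managed Service for Prometheus reach edge nodes through
    # the cloud-side gateway instead of unreachable edge IP addresses.
    .:53 {
        hosts {
            # Placeholder Raven Service ClusterIP mapped to a placeholder edge node name.
            172.16.0.10 edge-gpu-node-1
            # Names that are not edge nodes fall through to normal resolution.
            fallthrough
        }
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . /etc/resolv.conf
        cache 30
    }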

Monitor edge GPU-accelerated nodes

Step 1: Enable Managed Service for Prometheus

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, follow the on-screen instructions to install the required components and view the relevant dashboards.

    The system automatically installs the components and initializes the dashboards. After the installation is complete, click each tab to view the metrics.
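
    If you prefer to confirm the installation from the command line, you can check that the monitoring pods are running. The arms-prom namespace below is an assumption based on the default installation; adjust it if your components are deployed elsewhere.

    # Verify that the Prometheus monitoring components are running
    # (assumes the default arms-prom namespace).
    kubectl get pods -n arms-prom -o wide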

Step 2: Add edge GPU-accelerated nodes

For more information about how to add edge GPU-accelerated nodes, see Add a GPU-accelerated node.
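
After the node is added, a quick sanity check is to confirm that it reports allocatable GPU resources. The node name edge-gpu-node-1 below is a placeholder for your own node name.

    # Check the number of allocatable GPUs on the edge node (node name is a placeholder).
    kubectl get node edge-gpu-node-1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'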

Step 3: Deploy an application on the connected GPU-accelerated node to verify the correctness of GPU-related metrics

In this example, a TensorFlow benchmark job that uses exclusive GPU scheduling is deployed. You can also run applications that share GPU resources on edge GPU-accelerated nodes. For more information, see Work with multi-GPU sharing.

  1. Use kubectl to connect to the ACK cluster.

  2. Create a job and save it as a tensorflow.yaml file.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-benchmark-exclusive
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-benchmark-exclusive
        spec:
          containers:
          - name: tensorflow-benchmark
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
            command:
            - bash
            - run.sh
            - --num_batches=5000000
            - --batch_size=8
            resources:
              limits:
                nvidia.com/gpu: 1 # Request one GPU.
            workingDir: /root
          restartPolicy: Never
  3. Deploy the job in the cluster.

    kubectl apply -f tensorflow.yaml
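
    After the job is submitted, you can confirm that the pod is scheduled onto the edge GPU-accelerated node and is producing output. These commands rely only on the app label defined in the manifest above.

    # Check that the benchmark pod is running and see which node it was scheduled to.
    kubectl get pods -l app=tensorflow-benchmark-exclusive -o wide

    # Follow the benchmark logs to confirm that the GPU workload is progressing.
    kubectl logs -l app=tensorflow-benchmark-exclusive -f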

Step 4: View the GPU monitoring dashboard

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, click the GPU Monitoring tab.

    • Click the GPUs - Cluster Dimension tab to view the dashboard of the cluster. For more information, see View the cluster dashboard.

    • Click the GPUs - Nodes tab to view the dashboard of GPU-accelerated nodes. For more information about the dashboards, see View the node dashboard.

Step 5: View the monitoring metrics of edge GPU-accelerated nodes

The GPU exporter used by GPU monitoring is compatible with the metrics provided by the Data Center GPU Manager (DCGM) exporter. The GPU exporter also provides custom metrics to meet the requirements of specific scenarios. For more information about the DCGM exporter, see DCGM exporter.

GPU monitoring includes metrics supported by the DCGM exporter and custom metrics. You can perform the following operations to view GPU-related metrics:

Important
  • Fees are charged for custom metrics used by GPU monitoring.

  • Before you enable this feature, we recommend that you read Billing overview to understand the billing rules of custom metrics. The fees may vary based on the cluster size and number of applications. You can follow the steps in View resource usage to monitor and manage resource usage.

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, choose Metric Center > Metrics Explorer.

  3. Select a Prometheus instance from the drop-down list at the top of the page.

  4. In the A section, select metrics and click Run query. Select a mode based on your business requirements.

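    For example, because the GPU exporter is compatible with DCGM exporter metrics, you can query a metric such as DCGM_FI_DEV_GPU_UTIL to check GPU utilization on the edge nodes. The command below is a sketch that queries the Prometheus HTTP API directly: PROMETHEUS_ENDPOINT is a placeholder for your instance's query endpoint, and the Hostname label follows the DCGM exporter's default labeling, which may differ in your environment.

    # Query average GPU utilization per node through the Prometheus HTTP API.
    # PROMETHEUS_ENDPOINT is a placeholder; the metric and label names follow
    # DCGM exporter conventions and may vary by exporter version.
    curl -s "${PROMETHEUS_ENDPOINT}/api/v1/query" \
      --data-urlencode 'query=avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)'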