By Zeyu Zhao (Yisong)
ACK Edge coordinates cloud and edge services for data center and edge scenarios. With ACK Edge, you can manage GPU-accelerated nodes both in data centers and at the edge, which allows you to uniformly manage heterogeneous computing power across multiple regions and environments. Managed Service for Prometheus is a fully managed monitoring service that is compatible with the open source Prometheus ecosystem. It can monitor a wide range of components and provides multiple ready-to-use dashboards.
The integration of ACK Edge and Managed Service for Prometheus delivers an observability experience for GPU-accelerated nodes in data centers and edge environments that is consistent with the experience in the cloud. This topic describes how to use this combination to efficiently monitor GPU-accelerated nodes and shares related best practices.
ACK Edge supports connecting IaaS resources, such as data center nodes, nodes from third-party cloud providers, and IoT devices, to the cloud over leased lines or the Internet. In leased line scenarios, nodes can communicate with the cloud directly, so they can be observed normally. For nodes connected over the Internet, ACK Edge still provides a consistent observability experience. As the following figure shows, the Prometheus Server cannot directly access GPU-accelerated nodes; Raven is used to implement observability for edge nodes over the Internet.
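As a quick sanity check after connecting edge nodes over the Internet, you can verify that the edge nodes and the tunnel components are healthy. This is only a sketch: the namespace and the component name filter below are assumptions and may differ by ACK Edge version.

```shell
# Edge nodes connected through Raven should report a Ready status.
kubectl get nodes -o wide

# Check the Raven tunnel components (namespace and name pattern are
# assumptions; adjust them to match your ACK Edge version).
kubectl -n kube-system get pods | grep -i raven
```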
The system automatically installs the component and initializes the dashboards. After the installation is complete, you can click each tab to view the metrics.
For more information about how to add an edge node, see Add a GPU-accelerated node.
After the edge node is connected, you can run GPU applications on the node to check whether GPU metrics are collected as expected. In this example, a Job is created on each node to run a TensorFlow benchmark. The example uses GPU-exclusive applications, but you can also run GPU-sharing applications on GPU-accelerated edge nodes. For more information, see Configure GPU sharing without GPU memory isolation.
1. Create a Job file.
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark-exclusive
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark-exclusive
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=5000000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1 # Apply for a GPU.
        workingDir: /root
      restartPolicy: Never
2. Create resources.
Use kubectl apply to create resources.
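For example, if the preceding manifest was saved as tensorflow-benchmark-exclusive.yaml (the file name is arbitrary), a typical sequence looks like the following. These commands assume kubectl is configured to access the cluster.

```shell
# Create the Job from the manifest saved in the previous step.
kubectl apply -f tensorflow-benchmark-exclusive.yaml

# Confirm that the pod is scheduled onto a GPU-accelerated edge node.
kubectl get pods -l app=tensorflow-benchmark-exclusive -o wide

# Tail the benchmark output to verify that the GPU is being used.
kubectl logs -l app=tensorflow-benchmark-exclusive --tail=20
```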
Alternatively, in the left-side navigation pane, choose Tasks > Create from YAML, paste the preceding YAML content, and click Create.
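As noted earlier, GPU-sharing applications can also run on GPU-accelerated edge nodes. The following is a hypothetical variant of the resources section of the Job, assuming that GPU sharing is enabled in the cluster and that GPU memory is exposed through the aliyun.com/gpu-mem extended resource (in GiB); verify the resource name against your cluster's GPU-sharing configuration.

```yaml
# Sketch: request 4 GiB of shared GPU memory instead of a whole GPU.
# aliyun.com/gpu-mem requires the cluster's GPU-sharing components.
resources:
  limits:
    aliyun.com/gpu-mem: 4
```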
GPU monitoring 2.0 consists of a cluster dashboard and a node dashboard. Each dashboard provides multiple panels. For more information, see Panels on the dashboards.
1. Log on to the ACK console. In the left-side navigation pane, click Clusters.
2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.
3. On the Prometheus Monitoring page, click the GPU Monitoring tab. Then, click the GPUs - Cluster Dimension tab.
4. On the Prometheus Monitoring page, click the GPU Monitoring tab. Then, click the GPUs - Node Dimension tab and select the node that you want to view from the GPUNode drop-down list.
The GPU exporter used by GPU monitoring 2.0 is compatible with the metrics provided by the DCGM exporter. The GPU exporter also provides custom metrics to meet the requirements of specific scenarios. For more information about the DCGM exporter, see DCGM exporter.
For more information about the supported GPU metrics, see Introduction to metrics. You can view these metrics on the dashboards by performing the preceding steps.
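Because the GPU exporter is compatible with DCGM exporter metrics, you can also query them directly with PromQL. The metric names below are standard DCGM exporter names; the NodeName aggregation label is an assumption, so check the actual labels attached to your metrics.

```promql
# Average GPU utilization (%) per node
avg by (NodeName) (DCGM_FI_DEV_GPU_UTIL)

# GPU framebuffer memory used, in MiB
DCGM_FI_DEV_FB_USED
```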