By Zeyu Zhao (Yisong)
ACK Edge coordinates cloud and edge services for data center and edge scenarios. With ACK Edge, you can manage GPU-accelerated nodes both in data centers and at the edge, which allows you to uniformly manage heterogeneous computing power across multiple regions and environments. Managed Service for Prometheus is a fully managed monitoring service that is compatible with the open source Prometheus ecosystem. It can monitor a wide range of components and provides multiple ready-to-use dashboards.
The integration of ACK Edge and Managed Service for Prometheus delivers an observability experience for GPU-accelerated nodes in data centers and edge environments that is consistent with the experience in the cloud. This topic describes how to use this combination to efficiently monitor GPU-accelerated nodes and shares related best practices.
ACK Edge supports connecting IaaS resources, such as data center nodes, nodes from third-party cloud providers, and IoT devices, to the cloud over leased lines or the Internet. In leased line scenarios, nodes can communicate with the cloud directly, so they can be observed normally. For nodes connected over the Internet, ACK Edge still provides a consistent observability experience. As the following figure shows, the Prometheus Server cannot directly access GPU-accelerated nodes; Raven is used to implement observability for edge nodes over the Internet.
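As a quick sanity check after connecting edge nodes over the Internet, you can verify that the edge nodes and the tunnel components are healthy. This is only a sketch: the namespace and the component name filter below are assumptions and may differ by ACK Edge version.

```shell
# Edge nodes connected through Raven should report a Ready status.
kubectl get nodes -o wide

# Check the Raven tunnel components (namespace and name pattern are
# assumptions; adjust them to match your ACK Edge version).
kubectl -n kube-system get pods | grep -i raven
```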
The system automatically installs the component and initializes the dashboards. After the installation is complete, you can click each tab to view the metrics.
For more information about how to add an edge node, see Add a GPU-accelerated node.
After the edge node is connected, you can run GPU applications on the node to check whether GPU metrics are collected as expected. In this example, a Job is created on each node to run a TensorFlow benchmark. The example uses GPU-exclusive applications, but you can also run GPU-sharing applications on GPU-accelerated edge nodes. For more information, see Configure GPU sharing without GPU memory isolation.
1. Create a Job file.
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark-exclusive
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark-exclusive
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=5000000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1 # Apply for a GPU.
        workingDir: /root
      restartPolicy: Never
2. Create resources.
Use kubectl apply to create resources.
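For example, if the preceding manifest was saved as tensorflow-benchmark-exclusive.yaml (the file name is arbitrary), a typical sequence looks like the following. These commands assume kubectl is configured to access the cluster.

```shell
# Create the Job from the manifest saved in the previous step.
kubectl apply -f tensorflow-benchmark-exclusive.yaml

# Confirm that the pod is scheduled onto a GPU-accelerated edge node.
kubectl get pods -l app=tensorflow-benchmark-exclusive -o wide

# Tail the benchmark output to verify that the GPU is being used.
kubectl logs -l app=tensorflow-benchmark-exclusive --tail=20
```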
Alternatively, in the left-side navigation pane, choose Tasks > Create from YAML, paste the preceding YAML content, and click Create.
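As noted earlier, GPU-sharing applications can also run on GPU-accelerated edge nodes. The following is a hypothetical variant of the resources section of the Job, assuming that GPU sharing is enabled in the cluster and that GPU memory is exposed through the aliyun.com/gpu-mem extended resource (in GiB); verify the resource name against your cluster's GPU-sharing configuration.

```yaml
# Sketch: request 4 GiB of shared GPU memory instead of a whole GPU.
# aliyun.com/gpu-mem requires the cluster's GPU-sharing components.
resources:
  limits:
    aliyun.com/gpu-mem: 4
```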
GPU monitoring 2.0 consists of a cluster dashboard and a node dashboard. Each dashboard provides multiple panels. For more information, see Panels on the dashboards.
1. Log on to the ACK console. In the left-side navigation pane, click Clusters.
2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.
3. On the Prometheus Monitoring page, click the GPU Monitoring tab. Then, click the GPUs - Cluster Dimension tab.
4. On the Prometheus Monitoring page, click the GPU Monitoring tab. Then, click the GPUs - Node Dimension tab and select the node that you want to view from the GPUNode drop-down list.
The GPU exporter used by GPU monitoring 2.0 is compatible with the metrics provided by the DCGM exporter. The GPU exporter also provides custom metrics to meet the requirements of specific scenarios. For more information about the DCGM exporter, see DCGM exporter.
For more information about the supported GPU metrics, see Introduction to metrics. You can view these metrics on the dashboards by performing the preceding steps.
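Because the GPU exporter is compatible with DCGM exporter metrics, you can also query them directly with PromQL. The metric names below are standard DCGM exporter names; the NodeName aggregation label is an assumption, so check the actual labels attached to your metrics.

```promql
# Average GPU utilization (%) per node
avg by (NodeName) (DCGM_FI_DEV_GPU_UTIL)

# GPU framebuffer memory used, in MiB
DCGM_FI_DEV_FB_USED
```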