After a serverless Kubernetes (ASK) cluster is connected to Application Real-Time Monitoring Service (ARMS) Prometheus Monitoring, you can use the dashboards predefined in ARMS to monitor performance metrics of the GPU-accelerated elastic container instances in the cluster. This topic describes how to use ARMS Prometheus Monitoring to monitor a GPU-accelerated elastic container instance.
Prerequisites
An ASK cluster is created and connected to ARMS Prometheus Monitoring. For more information, see Connect ASK clusters to ARMS Prometheus Monitoring.
Procedure
Log on to the Container Service console.
Create a GPU-accelerated elastic container instance.
YAML example:
apiVersion: v1
kind: Pod
metadata:
  name: cg-gpu-0
  annotations:
    # Specify a GPU-accelerated instance type.
    k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge"
spec:
  containers:
  - image: nginx
    name: cg
    resources:
      limits:
        cpu: 500m
        # Specify the number of GPUs allocated to the container.
        nvidia.com/gpu: '1'
    command: ["bash","-c","sleep 100000"]
  dnsPolicy: ClusterFirst
  restartPolicy: Always
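The following commands show one way to create the pod from the preceding manifest and check its status with kubectl. This sketch assumes that kubectl is configured to access the ASK cluster and that the manifest is saved as a local file; the file name cg-gpu-0.yaml is only an example.
# Create the pod from the manifest (the file name is an example).
kubectl apply -f cg-gpu-0.yaml
# Confirm that the pod is running; inspect the events if it is not.
kubectl get pod cg-gpu-0
kubectl describe pod cg-gpu-0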
View GPU metrics.
Find the cluster that contains the GPU-accelerated elastic container instance you created and click the cluster name.
On the Cluster Information page, click Prometheus Monitoring in the upper-right corner.
On the GPU APP or GPU Node tab, view monitoring data.
After an ASK cluster is connected to ARMS Prometheus Monitoring, you can monitor the GPU-accelerated elastic container instances in the cluster without installing additional plug-ins. ARMS Prometheus Monitoring provides ready-to-use predefined dashboards by default.
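As an optional sanity check before you open the dashboards, you can confirm that the allocated GPU is visible inside the container. This sketch assumes that the NVIDIA driver utilities (such as nvidia-smi) are available in the container of the GPU-accelerated elastic container instance; adjust the pod name if you used a different one.
# List the GPU that is allocated to the container (assumes nvidia-smi is available in the container).
kubectl exec cg-gpu-0 -- nvidia-smi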
GPU APP
In the GPU APP dashboard, you can view monitoring data about the GPUs used by a single pod, as shown in the following figure.
GPU Node
In the GPU Node dashboard, you can view monitoring data about all GPUs on a node, as shown in the following figure.