Use Managed Service for Prometheus to monitor GPU metrics

After you enable Managed Service for Prometheus for a Kubernetes cluster, you can use predefined dashboards to monitor the performance metrics of GPU-accelerated elastic container instances in the cluster. This topic describes how to use Managed Service for Prometheus to monitor a GPU-accelerated elastic container instance.

Prerequisites

A Container Service for Kubernetes (ACK) Serverless cluster is created and Managed Service for Prometheus is enabled for the cluster. For more information, see Enable Managed Service for Prometheus.

Procedure

Log on to the ACK console.

Create a GPU-accelerated elastic container instance.

The following sample YAML file creates a Deployment that runs one pod on a GPU-accelerated elastic container instance:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
        alibabacloud.com/eci: "true"
      annotations:
        k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge"   # Specify a GPU-accelerated instance type.
    spec:
      containers:
      - name: bert-container
        image: registry.cn-beijing.aliyuncs.com/eci_open/nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1   # Specify the number of GPUs to allocate to this container.
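
The sample above sets three GPU-related fields in the pod template: the alibabacloud.com/eci label, the k8s.aliyun.com/eci-use-specs annotation, and the nvidia.com/gpu limit. The following Python sketch checks a manifest for those three fields before you submit it. The helper name find_gpu_manifest_problems and the dict representation of the Deployment are illustrative assumptions, not part of any Alibaba Cloud SDK. The check only mirrors the sample, and whether each field is strictly required depends on your cluster type.

```python
from typing import Any

# Field names copied from the sample Deployment above.
ECI_LABEL = "alibabacloud.com/eci"
ECI_SPEC_ANNOTATION = "k8s.aliyun.com/eci-use-specs"
GPU_RESOURCE = "nvidia.com/gpu"


def find_gpu_manifest_problems(deployment: dict[str, Any]) -> list[str]:
    """Hypothetical helper: list the sample's GPU fields that are missing.

    An empty list means all three fields used in the sample are present.
    """
    problems: list[str] = []
    template = deployment.get("spec", {}).get("template", {})
    metadata = template.get("metadata", {})

    if metadata.get("labels", {}).get(ECI_LABEL) != "true":
        problems.append(f'label {ECI_LABEL} is not "true"')
    if not metadata.get("annotations", {}).get(ECI_SPEC_ANNOTATION):
        problems.append(f"annotation {ECI_SPEC_ANNOTATION} is missing")

    # At least one container should set a positive GPU limit.
    containers = template.get("spec", {}).get("containers", [])
    gpu_counts = [
        int(c.get("resources", {}).get("limits", {}).get(GPU_RESOURCE, 0))
        for c in containers
    ]
    if not any(count > 0 for count in gpu_counts):
        problems.append(f"no container sets a {GPU_RESOURCE} limit")
    return problems


# The sample Deployment, reduced to the fields that the check reads.
sample = {
    "spec": {
        "template": {
            "metadata": {
                "labels": {"app": "test", ECI_LABEL: "true"},
                "annotations": {ECI_SPEC_ANNOTATION: "ecs.gn6i-c4g1.xlarge"},
            },
            "spec": {
                "containers": [
                    {
                        "name": "bert-container",
                        "resources": {"limits": {GPU_RESOURCE: 1}},
                    }
                ]
            },
        }
    }
}

print(find_gpu_manifest_problems(sample))
```

If the list is not empty, each entry names a field that differs from the sample.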

View GPU metrics.
1. On the Overview tab of the Cluster Information page, click Prometheus Monitoring in the upper-right corner.
2. On the Prometheus Monitoring page, click the GPU Monitoring tab to view monitoring data.
  After Managed Service for Prometheus is enabled for the ACK Serverless cluster, you can monitor GPU-accelerated elastic container instances in the cluster without deploying additional plug-ins. By default, Managed Service for Prometheus provides ready-to-use monitoring dashboards. For more information, see Panels and Introduction to metrics.