KServe provides a set of default Prometheus metrics to help you monitor the performance and health of your model services. This topic uses a Scikit-learn model as an example to demonstrate how to configure Prometheus monitoring for the KServe framework.
Prerequisites
Arena client version 0.9.15 or later is installed. For more information, see Configure the Arena client.
The ack-kserve component is installed. For more information, see Install the ack-kserve component.
The Alibaba Cloud Prometheus monitoring component is enabled. For more information, see Enable Alibaba Cloud Prometheus monitoring.
Step 1: Deploy a KServe application
Run the following command to deploy a KServe application for Scikit-learn.
```shell
arena serve kserve \
    --name=sklearn-iris \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
    --cpu=1 \
    --memory=200Mi \
    --enable-prometheus=true \
    --metrics-port=8080 \
    "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"
```

Expected output:
```shell
service/sklearn-iris-metric-svc created                              # A service named sklearn-iris-metric-svc is created.
inferenceservice.serving.kserve.io/sklearn-iris created              # The KServe InferenceService resource sklearn-iris is created.
servicemonitor.monitoring.coreos.com/sklearn-iris-svcmonitor created # A ServiceMonitor resource is created to integrate with Prometheus and collect the monitoring data exposed by the sklearn-iris-metric-svc service.
INFO[0004] The Job sklearn-iris has been submitted successfully      # The job is submitted to the cluster.
INFO[0004] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status
```

The output indicates that Arena has successfully started a deployment for a KServe service that serves a scikit-learn model, with Prometheus monitoring integrated.
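Arena generates the ServiceMonitor for you, so you do not write it by hand. For orientation, the following is an illustrative sketch of what such a resource typically looks like; the exact labels, port name, and selector that Arena generates may differ.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sklearn-iris-svcmonitor   # name taken from the arena output above
  namespace: default
spec:
  endpoints:
    - port: metrics               # scrapes the metrics port (8080) configured above
      path: /metrics
  selector:
    matchLabels:
      app: sklearn-iris           # must match the labels on sklearn-iris-metric-svc
```

Prometheus Operator discovers this resource and begins scraping the matched service automatically.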
Run the following command to create the `./iris-input.json` file with the following JSON content. This file is used for inference input requests.

```shell
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```

Run the following command to retrieve the IP address of the NGINX Ingress gateway and the hostname from the InferenceService URL.
```shell
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

Run the following command to use the Hey stress testing tool to send requests to the service repeatedly and generate monitoring data.
Note: For more information about the Hey stress testing tool, see Hey.
```shell
hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict
```

Expected output:
The output summarizes the system's performance during the test. It includes key metrics such as processing speed, data throughput, and response latency. This information helps you evaluate the system's efficiency and stability.
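The hey command above exercises the KServe v1 inference protocol, which serves predictions at `/v1/models/<name>:predict`. As a minimal sketch, a single request can be reproduced in Python; the IP address and hostname below are placeholders for the values retrieved in the previous step, and the actual network call is commented out because it requires a live cluster.

```python
import json

# Same inference payload as iris-input.json: two iris samples,
# each with four features.
payload = {
    "instances": [
        [6.8, 2.8, 4.8, 1.4],
        [6.0, 3.4, 4.5, 1.6],
    ]
}

# Placeholders for $NGINX_INGRESS_IP and $SERVICE_HOSTNAME from the previous step.
ingress_ip = "192.0.2.10"
service_hostname = "sklearn-iris.default.example.com"

# KServe v1 protocol: POST /v1/models/<model_name>:predict
url = f"http://{ingress_ip}:80/v1/models/sklearn-iris:predict"
headers = {"Host": service_hostname, "Content-Type": "application/json"}

print(url)
print(json.dumps(payload))

# Against a live cluster (requires the `requests` package):
# import requests
# resp = requests.post(url, headers=headers, data=json.dumps(payload))
# print(resp.json())
```

Note that the `Host` header plays the same role as hey's `-host` flag: it lets the NGINX Ingress route the request to the InferenceService.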
(Optional) Manually retrieve application metrics to confirm that they are exposed correctly.
The following steps describe how to collect monitoring metrics from a pod related to `sklearn-iris` in an ACK cluster and view the data on your local host. You do not need to log on to the pod or expose the pod's port to an external network.

Run the following commands to forward port 8080 of the pod whose name contains `sklearn-iris` to port 8080 of your local host. The pod name is stored in the `$POD_NAME` variable. Requests sent to port 8080 of the local host are transparently forwarded to port 8080 of the pod.

```shell
# Get the pod name.
POD_NAME=$(kubectl get po | grep sklearn-iris | awk -F ' ' '{print $1}')
# Forward port 8080 of the pod to the local host.
kubectl port-forward pod/$POD_NAME 8080:8080
```

Expected output:
```shell
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```

The output shows that connections to the local host over both IPv4 and IPv6 are forwarded to port 8080 of the pod.
In a browser, enter the following URL to access port 8080 of the pod and view the metrics.

```
http://localhost:8080/metrics
```

Expected output:
The output displays various performance and status metrics from the application in the pod. This confirms that the request was successfully forwarded to the application service in the pod.
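The `/metrics` endpoint returns data in the Prometheus text exposition format: one sample per line, with the metric name, a set of labels in braces, and a value. As a minimal sketch of how to inspect it programmatically, the following parser extracts the `request_predict_seconds_bucket` samples; the sample text below is illustrative, not actual output from the service.

```python
# Illustrative excerpt in the Prometheus text exposition format; real data
# would come from http://localhost:8080/metrics.
sample_metrics = """\
# HELP request_predict_seconds predict request latency
# TYPE request_predict_seconds histogram
request_predict_seconds_bucket{model_name="sklearn-iris",le="0.005"} 12
request_predict_seconds_bucket{model_name="sklearn-iris",le="0.01"} 27
request_predict_seconds_bucket{model_name="sklearn-iris",le="+Inf"} 30
request_predict_seconds_count{model_name="sklearn-iris"} 30
"""

def extract_metric(text, name):
    """Return (labels, value) pairs for every sample of the given metric."""
    samples = []
    for line in text.splitlines():
        # Skip comments and samples of other metrics.
        if line.startswith("#") or not line.startswith(name + "{"):
            continue
        labels_part, value = line.rsplit(" ", 1)
        labels = labels_part[len(name) + 1:-1]  # strip 'name{' and '}'
        samples.append((labels, float(value)))
    return samples

buckets = extract_metric(sample_metrics, "request_predict_seconds_bucket")
for labels, value in buckets:
    print(labels, value)
```

Each `_bucket` sample is a cumulative counter of requests whose latency was at or below the `le` bound, which is what Prometheus histogram queries operate on.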
Step 2: Query KServe application metrics
Log on to the ARMS console.
In the navigation pane on the left, click Integration Management, and then click Query Dashboards.
On the Dashboard List page, click the Kubernetes Pod dashboard to go to the Grafana page.
In the navigation pane on the left, click Explore. Enter the search statement `request_predict_seconds_bucket` to query the application metric values.

Note: Data collection has a delay of approximately 5 minutes.
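Beyond viewing the raw bucket counters, you can estimate latency percentiles from them; in PromQL this is typically written as `histogram_quantile(0.9, sum(rate(request_predict_seconds_bucket[5m])) by (le))`. The following Python sketch, with made-up bucket counts, illustrates the linear interpolation such a quantile estimate performs; it is an approximation of the idea, not the Prometheus implementation itself.

```python
# Cumulative histogram buckets: (le upper bound in seconds, cumulative count).
# Counts are made up for illustration.
buckets = [
    (0.005, 12),
    (0.01, 27),
    (0.025, 29),
    (float("inf"), 30),
]

def histogram_quantile(q, buckets):
    """Estimate the q-th quantile by interpolating inside the bucket
    that contains it, approximating PromQL's histogram_quantile()."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # fall back to the last finite bound
            # Linear interpolation between the bucket's bounds.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

p90 = histogram_quantile(0.9, buckets)
print(f"estimated P90 latency: {p90:.4f}s")  # 0.0100s for these counts
```

With 27 of 30 requests at or below 10 ms, the estimated P90 lands exactly on the 0.01 bucket bound, which matches the intuition that 90% of requests completed within 10 ms.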

FAQ
How do I confirm that data for the request_predict_seconds_bucket metric is collected successfully?
Solution:
Log on to the ARMS console.
In the navigation pane on the left, click Integration Management. On the Integrated Environments page, click the Container Service tab. Click the name of the target container environment, and then click the Self-Monitoring tab.
In the navigation pane on the left, click Targets. If `default/sklearn-iris-svcmonitor/0 (1/1 up)` is displayed, the metric data is being collected successfully.
References
For more information about the default metrics provided by the KServe framework, see the KServe community document KServe Prometheus Metrics.