Container Service for Kubernetes: Configure Prometheus monitoring for KServe to monitor the performance and health of model services

Last Updated: Dec 01, 2025

KServe provides a set of default Prometheus metrics to help you monitor the performance and health of your model services. This topic uses a scikit-learn model service named sklearn-iris deployed with Arena as an example to demonstrate how to configure Prometheus monitoring for the KServe framework.

Prerequisites

Step 1: Deploy a KServe application

  1. Run the following command to use Arena to deploy a KServe inference service that serves a scikit-learn model. The --enable-prometheus=true and --metrics-port=8080 parameters enable Prometheus metric collection on port 8080.

    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --enable-prometheus=true \
        --metrics-port=8080 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    Expected output:

    service/sklearn-iris-metric-svc created # A service named sklearn-iris-metric-svc is created.
    inferenceservice.serving.kserve.io/sklearn-iris created # The KServe InferenceService resource sklearn-iris is created.
    servicemonitor.monitoring.coreos.com/sklearn-iris-svcmonitor created # A ServiceMonitor resource is created to integrate with the Prometheus monitoring system and collect monitoring data exposed by the sklearn-iris-metric-svc service.
    INFO[0004] The Job sklearn-iris has been submitted successfully # The job is submitted to the cluster.
    INFO[0004] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status

    The output indicates that Arena has successfully started a deployment for a KServe service that uses a scikit-learn model and has integrated Prometheus monitoring.
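
    (Optional) Before you proceed, you can verify the deployment from the cluster side. The first command below is the one suggested in the expected output, and the second checks the InferenceService resource that the deployment created.

    # Check the status of the KServe service deployed by Arena.
    arena serve get sklearn-iris --type kserve -n default
    # Confirm that the InferenceService resource is ready.
    kubectl get inferenceservice sklearn-iris -n default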

  2. Run the following command to create the ./iris-input.json file with the following JSON content. This file is used for inference input requests.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
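
    (Optional) If python3 is available on your machine, you can confirm that the file is valid JSON before you use it:

    # Pretty-print the file; an error indicates invalid JSON.
    python3 -m json.tool ./iris-input.json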
  3. Run the following commands to retrieve the IP address of the NGINX Ingress gateway and the hostname from the InferenceService URL.

    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
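
    (Optional) Before you start the stress test, you can send a single prediction request to confirm that the service is reachable through the gateway. The request below uses the same path, Host header, and input file as the hey command in the next step; adjust the port if your Ingress listens on a different one.

    # Send one test request through the NGINX Ingress gateway.
    curl -H "Host: ${SERVICE_HOSTNAME}" \
         -H "Content-Type: application/json" \
         -d @./iris-input.json \
         http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict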
  4. Run the following command to use the Hey stress testing tool to send concurrent requests to the service for two minutes and generate monitoring data.

    Note

    For more information about the Hey stress testing tool, see Hey.

    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict

    Expected output:


    Summary:
      Total:	120.0296 secs
      Slowest:	0.1608 secs
      Fastest:	0.0213 secs
      Average:	0.0275 secs
      Requests/sec:	727.3875
      
      Total data:	1833468 bytes
      Size/request:	21 bytes
    
    Response time histogram:
      0.021 [1]	|
      0.035 [85717]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.049 [1272]	|■
      0.063 [144]	|
      0.077 [96]	|
      0.091 [44]	|
      0.105 [7]	|
      0.119 [0]	|
      0.133 [0]	|
      0.147 [11]	|
      0.161 [16]	|
    
    
    Latency distribution:
      10% in 0.0248 secs
      25% in 0.0257 secs
      50% in 0.0270 secs
      75% in 0.0285 secs
      90% in 0.0300 secs
      95% in 0.0315 secs
      99% in 0.0381 secs
    
    Details (average, fastest, slowest):
      DNS+dialup:	0.0000 secs, 0.0213 secs, 0.1608 secs
      DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
      req write:	0.0000 secs, 0.0000 secs, 0.0225 secs
      resp wait:	0.0273 secs, 0.0212 secs, 0.1607 secs
      resp read:	0.0001 secs, 0.0000 secs, 0.0558 secs
    
    Status code distribution:
      [200]	87308 responses

    The output summarizes the performance of the service during the two-minute test: about 727 requests per second, 87,308 successful (HTTP 200) responses, and the response latency distribution. Use this information to evaluate the efficiency and stability of the service.

  5. (Optional) Manually retrieve application metrics to confirm that they are exposed correctly.

    The following steps describe how to view the monitoring metrics exposed by the sklearn-iris pod in an ACK cluster from your local host. You do not need to log on to the pod or expose the pod's port to an external network.

    1. Run the following commands to forward port 8080 of the pod whose name contains `sklearn-iris` to port 8080 of your local host. Requests sent to port 8080 of the local host are transparently forwarded to port 8080 of the pod.

      # Get the pod name.
      POD_NAME=$(kubectl get po | grep sklearn-iris | awk '{print $1}')
      # Forward port 8080 of the pod to the local host using port-forward.
      kubectl port-forward pod/$POD_NAME 8080:8080

      Expected output:

      Forwarding from 127.0.0.1:8080 -> 8080
      Forwarding from [::1]:8080 -> 8080

      The output shows that connections to the local host through both IPv4 and IPv6 are correctly forwarded to port 8080 of the pod.

    2. In a browser, enter the following URL to access port 8080 of the pod and view the metrics.

      http://localhost:8080/metrics

      Expected output:


      # HELP python_gc_objects_collected_total Objects collected during gc
      # TYPE python_gc_objects_collected_total counter
      python_gc_objects_collected_total{generation="0"} 10298.0
      python_gc_objects_collected_total{generation="1"} 1826.0
      python_gc_objects_collected_total{generation="2"} 0.0
      # HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
      # TYPE python_gc_objects_uncollectable_total counter
      python_gc_objects_uncollectable_total{generation="0"} 0.0
      python_gc_objects_uncollectable_total{generation="1"} 0.0
      python_gc_objects_uncollectable_total{generation="2"} 0.0
      # HELP python_gc_collections_total Number of times this generation was collected
      # TYPE python_gc_collections_total counter
      python_gc_collections_total{generation="0"} 660.0
      python_gc_collections_total{generation="1"} 60.0
      python_gc_collections_total{generation="2"} 5.0
      # HELP python_info Python platform information
      # TYPE python_info gauge
      python_info{implementation="CPython",major="3",minor="9",patchlevel="18",version="3.9.18"} 1.0
      # HELP process_virtual_memory_bytes Virtual memory size in bytes.
      # TYPE process_virtual_memory_bytes gauge
      process_virtual_memory_bytes 1.406291968e+09
      # HELP process_resident_memory_bytes Resident memory size in bytes.
      # TYPE process_resident_memory_bytes gauge
      process_resident_memory_bytes 2.73207296e+08
      # HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
      # TYPE process_start_time_seconds gauge
      process_start_time_seconds 1.71533439115e+09
      # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
      # TYPE process_cpu_seconds_total counter
      process_cpu_seconds_total 228.18
      # HELP process_open_fds Number of open file descriptors.
      # TYPE process_open_fds gauge
      process_open_fds 16.0
      # HELP process_max_fds Maximum number of open file descriptors.
      # TYPE process_max_fds gauge
      process_max_fds 1.048576e+06
      # HELP request_preprocess_seconds pre-process request latency
      # TYPE request_preprocess_seconds histogram
      request_preprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_count{model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_sum{model_name="sklearn-iris"} 1.7146860011853278
      # HELP request_preprocess_seconds_created pre-process request latency
      # TYPE request_preprocess_seconds_created gauge
      request_preprocess_seconds_created{model_name="sklearn-iris"} 1.7153354578475933e+09
      # HELP request_postprocess_seconds post-process request latency
      # TYPE request_postprocess_seconds histogram
      request_postprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_count{model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_sum{model_name="sklearn-iris"} 1.625360683305189
      # HELP request_postprocess_seconds_created post-process request latency
      # TYPE request_postprocess_seconds_created gauge
      request_postprocess_seconds_created{model_name="sklearn-iris"} 1.7153354578482144e+09
      # HELP request_predict_seconds predict request latency
      # TYPE request_predict_seconds histogram
      request_predict_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259708.0
      request_predict_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259708.0
      request_predict_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_count{model_name="sklearn-iris"} 259709.0
      request_predict_seconds_sum{model_name="sklearn-iris"} 47.95311741752084
      # HELP request_predict_seconds_created predict request latency
      # TYPE request_predict_seconds_created gauge
      request_predict_seconds_created{model_name="sklearn-iris"} 1.7153354578476949e+09
      # HELP request_explain_seconds explain request latency
      # TYPE request_explain_seconds histogram

      The output includes Python runtime and process metrics as well as the KServe request metrics, such as request_preprocess_seconds, request_predict_seconds, and request_postprocess_seconds. This confirms that the pod exposes its metrics correctly through the forwarded port.
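
      If you prefer the command line to a browser, the following sketch filters the KServe request metrics from the same endpoint while the port-forward session is running.

      # Fetch the metrics through the forwarded port and keep only the KServe request latency series.
      curl -s http://localhost:8080/metrics | grep -E '^request_(preprocess|predict|postprocess|explain)_seconds'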

Step 2: Query KServe application metrics

  1. Log on to the ARMS console.

  2. In the navigation pane on the left, click Integration Management, and then click Query Dashboards.

  3. On the Dashboard List page, click the Kubernetes Pod dashboard to go to the Grafana page.

  4. In the navigation pane on the left, click Explore. Enter the query statement request_predict_seconds_bucket to view the application metric values. A sample aggregation query is shown after the following note.

    Note

    Data collection has a delay of approximately 5 minutes.

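    For example, to turn the raw histogram buckets into a latency percentile, you can run an aggregation query such as the following in the Explore view. The 5-minute range and the 90th percentile are illustrative values; adjust them as needed.

    histogram_quantile(0.9, sum(rate(request_predict_seconds_bucket[5m])) by (le, model_name))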

FAQ

How do I confirm that data for the request_predict_seconds_bucket metric is collected successfully?

Solution:

  1. Log on to the ARMS console.

  2. In the navigation pane on the left, click Integration Management. On the Integrated Environments page, click the Container Service tab. Click the name of the target container environment, and then click the Self-Monitoring tab.

  3. In the navigation pane on the left, click Targets. If `default/sklearn-iris-svcmonitor/0 (1/1 up)` is displayed, the metric data is being collected successfully.
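
You can also check from the cluster side that the ServiceMonitor and its backing metrics service exist and have endpoints. The resource names below come from the deployment output in Step 1.

    # Confirm that the ServiceMonitor created by Arena exists.
    kubectl get servicemonitor sklearn-iris-svcmonitor -n default
    # Confirm that the metrics service has at least one endpoint.
    kubectl get endpoints sklearn-iris-metric-svc -n default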

References

For more information about the default metrics provided by the KServe framework, see the KServe community document KServe Prometheus Metrics.