
Container Service for Kubernetes: Configure Prometheus monitoring for KServe to monitor model service performance and health

Last Updated: Mar 26, 2026

KServe exposes a set of default Prometheus metrics for monitoring model service performance and health. This topic walks through deploying a scikit-learn InferenceService with Prometheus monitoring enabled, generating inference traffic, and querying the collected metrics in ARMS.

Prerequisites

Before you begin, ensure that KServe is deployed in your cluster, the Arena client is installed, the NGINX Ingress controller is available as the gateway, and Managed Service for Prometheus is enabled for the cluster.

Step 1: Deploy a KServe application

  1. Deploy a KServe application for scikit-learn:

    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --enable-prometheus=true \
        --metrics-port=8080 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    The --enable-prometheus=true flag creates the following resources:

    Resource                   Type                      Description
    sklearn-iris-metric-svc    Kubernetes Service        Exposes the metrics endpoint on port 8080
    sklearn-iris               KServe InferenceService   The model serving resource
    sklearn-iris-svcmonitor    ServiceMonitor            Integrates with Alibaba Cloud Prometheus to scrape metrics from sklearn-iris-metric-svc

    Expected output:

    service/sklearn-iris-metric-svc created
    inferenceservice.serving.kserve.io/sklearn-iris created
    servicemonitor.monitoring.coreos.com/sklearn-iris-svcmonitor created
    INFO[0004] The Job sklearn-iris has been submitted successfully
    INFO[0004] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status
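
    To confirm that these resources exist, you can also query them directly with kubectl. This is an optional sanity check, not part of the procedure; it assumes the resources were created in the default namespace and that the Prometheus Operator CRDs (monitoring.coreos.com) are installed in the cluster:

    # List the resources created by the arena command
    kubectl get inferenceservice sklearn-iris
    kubectl get service sklearn-iris-metric-svc
    kubectl get servicemonitor sklearn-iris-svcmonitor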
  2. Create the ./iris-input.json file with the following content. This file is used as the inference request payload.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
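
    Optionally, you can confirm that the file is valid JSON before using it as a request payload (assuming the jq tool is installed; any JSON validator works):

    jq . ./iris-input.json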
  3. Retrieve the NGINX Ingress gateway IP address and the InferenceService hostname:

    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
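
    Before starting the load test, you can optionally send a single request to confirm that the service responds. This check is not part of the original procedure; it reuses the gateway IP, hostname, and payload from the previous steps:

    # Send one prediction request through the NGINX Ingress gateway
    curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
        -d @./iris-input.json \
        http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict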
  4. Use the Hey stress testing tool to generate inference traffic:

    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict

    Expected output:


    Summary:
      Total:        120.0296 secs
      Slowest:      0.1608 secs
      Fastest:      0.0213 secs
      Average:      0.0275 secs
      Requests/sec: 727.3875
    
      Total data:   1833468 bytes
      Size/request: 21 bytes
    
    Response time histogram:
      0.021 [1]     |
      0.035 [85717] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.049 [1272]  |■
      0.063 [144]   |
      0.077 [96]    |
      0.091 [44]    |
      0.105 [7]     |
      0.119 [0]     |
      0.133 [0]     |
      0.147 [11]    |
      0.161 [16]    |
    
    Latency distribution:
      10% in 0.0248 secs
      25% in 0.0257 secs
      50% in 0.0270 secs
      75% in 0.0285 secs
      90% in 0.0300 secs
      95% in 0.0315 secs
      99% in 0.0381 secs
    
    Details (average, fastest, slowest):
      DNS+dialup:  0.0000 secs, 0.0213 secs, 0.1608 secs
      DNS-lookup:  0.0000 secs, 0.0000 secs, 0.0000 secs
      req write:   0.0000 secs, 0.0000 secs, 0.0225 secs
      resp wait:   0.0273 secs, 0.0212 secs, 0.1607 secs
      resp read:   0.0001 secs, 0.0000 secs, 0.0558 secs
    
    Status code distribution:
      [200] 87308 responses
  5. (Optional) Verify that metrics are exposed on the pod before querying ARMS. The pod exposes metrics on port 8080, but the port is not accessible from outside the cluster. Use port forwarding to access it locally:

    # Get the pod name
    POD_NAME=$(kubectl get po | grep sklearn-iris | awk '{print $1}')
    # Forward port 8080 of the pod to localhost
    kubectl port-forward pod/$POD_NAME 8080:8080

    Expected output:

    Forwarding from 127.0.0.1:8080 -> 8080
    Forwarding from [::1]:8080 -> 8080

    In a browser, open http://localhost:8080/metrics to view the raw metrics:

    # HELP python_gc_objects_collected_total Objects collected during gc
    # TYPE python_gc_objects_collected_total counter
    python_gc_objects_collected_total{generation="0"} 10298.0
    python_gc_objects_collected_total{generation="1"} 1826.0
    python_gc_objects_collected_total{generation="2"} 0.0
    # HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
    # TYPE python_gc_objects_uncollectable_total counter
    python_gc_objects_uncollectable_total{generation="0"} 0.0
    python_gc_objects_uncollectable_total{generation="1"} 0.0
    python_gc_objects_uncollectable_total{generation="2"} 0.0
    # HELP python_gc_collections_total Number of times this generation was collected
    # TYPE python_gc_collections_total counter
    python_gc_collections_total{generation="0"} 660.0
    python_gc_collections_total{generation="1"} 60.0
    python_gc_collections_total{generation="2"} 5.0
    # HELP python_info Python platform information
    # TYPE python_info gauge
    python_info{implementation="CPython",major="3",minor="9",patchlevel="18",version="3.9.18"} 1.0
    # HELP process_virtual_memory_bytes Virtual memory size in bytes.
    # TYPE process_virtual_memory_bytes gauge
    process_virtual_memory_bytes 1.406291968e+09
    # HELP process_resident_memory_bytes Resident memory size in bytes.
    # TYPE process_resident_memory_bytes gauge
    process_resident_memory_bytes 2.73207296e+08
    # HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
    # TYPE process_start_time_seconds gauge
    process_start_time_seconds 1.71533439115e+09
    # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
    # TYPE process_cpu_seconds_total counter
    process_cpu_seconds_total 228.18
    # HELP process_open_fds Number of open file descriptors.
    # TYPE process_open_fds gauge
    process_open_fds 16.0
    # HELP process_max_fds Maximum number of open file descriptors.
    # TYPE process_max_fds gauge
    process_max_fds 1.048576e+06
    # HELP request_preprocess_seconds pre-process request latency
    # TYPE request_preprocess_seconds histogram
    request_preprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
    ...
    # HELP request_predict_seconds predict request latency
    # TYPE request_predict_seconds histogram
    request_predict_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259708.0
    ...
    # HELP request_explain_seconds explain request latency
    # TYPE request_explain_seconds histogram

    KServe exposes the following metrics:

    Metric                        Type       Labels      Description
    request_preprocess_seconds    Histogram  model_name  Preprocessing latency per request
    request_predict_seconds      Histogram  model_name  Prediction latency per request
    request_postprocess_seconds  Histogram  model_name  Postprocessing latency per request
    request_explain_seconds      Histogram  model_name  Explain request latency

    The model_name label lets you filter and aggregate metrics by model when multiple models run in the same cluster.
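
    Alternatively, while the port-forward session is running, you can query the endpoint from another terminal. This optional check assumes curl is available on your machine:

    # Show only the KServe request-latency metric series
    curl -s http://localhost:8080/metrics | grep "^request_"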

Step 2: Query KServe application metrics

  1. Log on to the ARMS console.

  2. In the left navigation pane, click Integration Management, and then click Query Dashboards.

  3. On the Dashboard List page, click the Kubernetes Pod dashboard to open the Grafana page.

  4. In the left navigation pane, click Explore. Enter the following query to view the application metric values:

    request_predict_seconds_bucket

    Data collection has a delay of approximately 5 minutes after traffic is generated, so the metric may not appear immediately.

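
    To aggregate the raw histogram series, queries such as the following can be used. These are sketches, not part of the original procedure; they assume standard PromQL support in the Explore view and that the request_predict_seconds_count series is exposed alongside the _bucket series (the usual Prometheus histogram convention):

    # Approximate 99th-percentile prediction latency over the last 5 minutes
    histogram_quantile(0.99, sum(rate(request_predict_seconds_bucket{model_name="sklearn-iris"}[5m])) by (le))

    # Prediction requests per second
    sum(rate(request_predict_seconds_count{model_name="sklearn-iris"}[5m]))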

FAQ

Question

How do I confirm that the request_predict_seconds_bucket metric is being collected?

Solution

Check the scrape target status in ARMS:

  1. Log on to the ARMS console.

  2. In the left navigation pane, click Integration Management. On the Integrated Environments page, click the Container Service tab, and then click the name of your cluster. Click the Self-Monitoring tab.

  3. In the left navigation pane, click Targets. If default/sklearn-iris-svcmonitor/0 (1/1 up) is listed, metric collection is working correctly.
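
You can also verify the scrape configuration from the command line. This optional check is not part of the console procedure above; it assumes the resources are in the default namespace as created in Step 1:

    # Confirm that the ServiceMonitor created by Arena exists
    kubectl get servicemonitor sklearn-iris-svcmonitor -n default -o yaml

    # Confirm that the metrics Service has at least one ready endpoint on port 8080
    kubectl get endpoints sklearn-iris-metric-svc -n default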
