Container Service for Kubernetes:Configure Managed Service for Prometheus for a service deployed by using KServe to monitor the performance and health of the service

Last Updated: Nov 18, 2024

KServe provides a set of default Prometheus metrics to help you monitor the performance and health of services. This topic describes how to configure Managed Service for Prometheus for a service deployed by using KServe. In this example, a scikit-learn-based iris classification service (sklearn-iris) is used.

Prerequisites

Before you start, make sure that the following requirements are met: an ACK cluster is created, KServe and the Arena client are installed, an NGINX Ingress controller is deployed in the cluster, and Managed Service for Prometheus is enabled for the cluster.

Step 1: Deploy an application by using KServe

  1. Run the following command to deploy a scikit-learn-based application by using KServe:

    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --enable-prometheus=true \
        --metrics-port=8080 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    Expected output:

    service/sklearn-iris-metric-svc created # A service named sklearn-iris-metric-svc is created. 
    inferenceservice.serving.kserve.io/sklearn-iris created # An inference service named sklearn-iris is created by using KServe. 
    servicemonitor.monitoring.coreos.com/sklearn-iris-svcmonitor created # A ServiceMonitor is created to integrate Managed Service for Prometheus and collect the monitoring data of the sklearn-iris-metric-svc service. 
    INFO[0004] The Job sklearn-iris has been submitted successfully # The job is submitted to the cluster. 
    INFO[0004] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status

    The preceding output indicates that the Arena client has deployed the scikit-learn-based service by using KServe and integrated the service with Managed Service for Prometheus.
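
    If you want to confirm that the inference service is ready before you send requests, you can check its status. The following commands are a minimal sketch based on the preceding output; the exact output depends on your cluster:

    # Check the status of the KServe job that is submitted by the Arena client. 
    arena serve get sklearn-iris --type kserve -n default
    # Check the status of the inference service. The READY column shows whether the service is available. 
    kubectl get inferenceservice sklearn-iris -n default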

  2. Run the following command to create the ./iris-input.json file, which contains the input data for the inference requests:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
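
    Optionally, you can check that the file contains valid JSON before you send requests. The following command is a minimal sketch and assumes that Python is installed on the machine from which you run the commands:

    # Validate the request file. The command prints the formatted JSON if the file is valid. 
    python -m json.tool ./iris-input.json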
  3. Run the following commands to obtain the IP address of the NGINX Ingress gateway and the hostname from the URL of the inference service. These values are required to access the service from outside the cluster:

    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
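
    Before you start the stress test, you can send a single request to verify that the service is reachable. The following command is a minimal sketch that uses curl and the same URL format as the stress test command in the next step; it assumes that curl is installed on your machine:

    # Send one inference request through the NGINX Ingress gateway. 
    # The Host header must be set to the hostname of the inference service. 
    curl -s -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
        -d @./iris-input.json \
        http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict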
  4. Run the following command to use the stress testing tool hey to send requests to the service for 2 minutes and generate monitoring data:

    Note

    For more information about hey, see hey.

    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict

    Expected output:

    Summary:
      Total:	120.0296 secs
      Slowest:	0.1608 secs
      Fastest:	0.0213 secs
      Average:	0.0275 secs
      Requests/sec:	727.3875
      
      Total data:	1833468 bytes
      Size/request:	21 bytes
    
    Response time histogram:
      0.021 [1]	|
      0.035 [85717]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.049 [1272]	|■
      0.063 [144]	|
      0.077 [96]	|
      0.091 [44]	|
      0.105 [7]	|
      0.119 [0]	|
      0.133 [0]	|
      0.147 [11]	|
      0.161 [16]	|
    
    
    Latency distribution:
      10% in 0.0248 secs
      25% in 0.0257 secs
      50% in 0.0270 secs
      75% in 0.0285 secs
      90% in 0.0300 secs
      95% in 0.0315 secs
      99% in 0.0381 secs
    
    Details (average, fastest, slowest):
      DNS+dialup:	0.0000 secs, 0.0213 secs, 0.1608 secs
      DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
      req write:	0.0000 secs, 0.0000 secs, 0.0225 secs
      resp wait:	0.0273 secs, 0.0212 secs, 0.1607 secs
      resp read:	0.0001 secs, 0.0000 secs, 0.0558 secs
    
    Status code distribution:
      [200]	87308 responses

    The preceding output summarizes the performance of the service during the stress test based on key metrics such as the request rate, data throughput, and response latency. You can use these metrics to evaluate the efficiency and stability of the service.

  5. Optional. Manually collect the metrics of the application to make sure that the metrics are properly exposed.

    The following example shows how to collect monitoring metrics from the pod whose name contains sklearn-iris in the ACK cluster and view the data locally, without logging on to the pod or exposing the port of the pod to the Internet.

    1. Run the following commands to map port 8080 of the pod whose name contains sklearn-iris to port 8080 of your local host. The pod name is stored in the $POD_NAME variable. This way, requests sent to port 8080 of the local host are transparently forwarded to port 8080 of the pod.

      # Obtain the name of the pod whose name contains sklearn-iris and store it in the POD_NAME variable. 
      POD_NAME=$(kubectl get pods | grep sklearn-iris | awk '{print $1}')
      # Map port 8080 of the pod to port 8080 of the local host. 
      kubectl port-forward pod/$POD_NAME 8080:8080

      Expected output:

      Forwarding from 127.0.0.1:8080 -> 8080
      Forwarding from [::1]:8080 -> 8080

      The preceding output shows that requests sent to port 8080 of the local host are forwarded to port 8080 of the pod as expected regardless of whether you connect to the local host by using an IPv4 address or an IPv6 address.
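
      The kubectl port-forward command blocks the current terminal until you stop it. If you want to continue to run commands in the same terminal, the following commands are a minimal sketch of how to run the port forwarding in the background:

      # Run the port forwarding in the background and record its process ID. 
      kubectl port-forward pod/$POD_NAME 8080:8080 >/dev/null 2>&1 &
      PF_PID=$!
      # ...query http://localhost:8080/metrics as described in the next substep... 
      # Stop the port forwarding when you no longer need it. 
      kill $PF_PID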

    2. Enter the following URL in a browser to access port 8080 of the pod and view the metrics.

      http://localhost:8080/metrics

      Expected output:

      # HELP python_gc_objects_collected_total Objects collected during gc
      # TYPE python_gc_objects_collected_total counter
      python_gc_objects_collected_total{generation="0"} 10298.0
      python_gc_objects_collected_total{generation="1"} 1826.0
      python_gc_objects_collected_total{generation="2"} 0.0
      # HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
      # TYPE python_gc_objects_uncollectable_total counter
      python_gc_objects_uncollectable_total{generation="0"} 0.0
      python_gc_objects_uncollectable_total{generation="1"} 0.0
      python_gc_objects_uncollectable_total{generation="2"} 0.0
      # HELP python_gc_collections_total Number of times this generation was collected
      # TYPE python_gc_collections_total counter
      python_gc_collections_total{generation="0"} 660.0
      python_gc_collections_total{generation="1"} 60.0
      python_gc_collections_total{generation="2"} 5.0
      # HELP python_info Python platform information
      # TYPE python_info gauge
      python_info{implementation="CPython",major="3",minor="9",patchlevel="18",version="3.9.18"} 1.0
      # HELP process_virtual_memory_bytes Virtual memory size in bytes.
      # TYPE process_virtual_memory_bytes gauge
      process_virtual_memory_bytes 1.406291968e+09
      # HELP process_resident_memory_bytes Resident memory size in bytes.
      # TYPE process_resident_memory_bytes gauge
      process_resident_memory_bytes 2.73207296e+08
      # HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
      # TYPE process_start_time_seconds gauge
      process_start_time_seconds 1.71533439115e+09
      # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
      # TYPE process_cpu_seconds_total counter
      process_cpu_seconds_total 228.18
      # HELP process_open_fds Number of open file descriptors.
      # TYPE process_open_fds gauge
      process_open_fds 16.0
      # HELP process_max_fds Maximum number of open file descriptors.
      # TYPE process_max_fds gauge
      process_max_fds 1.048576e+06
      # HELP request_preprocess_seconds pre-process request latency
      # TYPE request_preprocess_seconds histogram
      request_preprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_count{model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_sum{model_name="sklearn-iris"} 1.7146860011853278
      # HELP request_preprocess_seconds_created pre-process request latency
      # TYPE request_preprocess_seconds_created gauge
      request_preprocess_seconds_created{model_name="sklearn-iris"} 1.7153354578475933e+09
      # HELP request_postprocess_seconds post-process request latency
      # TYPE request_postprocess_seconds histogram
      request_postprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_count{model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_sum{model_name="sklearn-iris"} 1.625360683305189
      # HELP request_postprocess_seconds_created post-process request latency
      # TYPE request_postprocess_seconds_created gauge
      request_postprocess_seconds_created{model_name="sklearn-iris"} 1.7153354578482144e+09
      # HELP request_predict_seconds predict request latency
      # TYPE request_predict_seconds histogram
      request_predict_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259708.0
      request_predict_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259708.0
      request_predict_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_count{model_name="sklearn-iris"} 259709.0
      request_predict_seconds_sum{model_name="sklearn-iris"} 47.95311741752084
      # HELP request_predict_seconds_created predict request latency
      # TYPE request_predict_seconds_created gauge
      request_predict_seconds_created{model_name="sklearn-iris"} 1.7153354578476949e+09
      # HELP request_explain_seconds explain request latency
      # TYPE request_explain_seconds histogram

      The request sent to port 8080 of the local host is forwarded to the application in the pod, and the preceding output shows the metrics that you can use to evaluate the performance and status of the application.
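
      If you prefer the command line to a browser, you can also query the metrics endpoint with curl while the port forwarding is active. The following commands are a minimal sketch; the metric values vary with your workload:

      # Show only the KServe prediction latency metrics. 
      curl -s http://localhost:8080/metrics | grep '^request_predict_seconds'
      # Compute the average prediction latency in seconds from the histogram sum and count. 
      curl -s http://localhost:8080/metrics | awk '
          /^request_predict_seconds_sum/   { sum = $2 }
          /^request_predict_seconds_count/ { count = $2 }
          END { if (count > 0) printf "average predict latency: %.6f seconds\n", sum / count }'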

Step 2: Query the metrics of the application deployed by using KServe

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, click Integration Management.

  3. In the top navigation bar, select the region in which the ACK cluster resides. On the Integration Management page, click the Query Dashboards tab.

  4. In the dashboards list, click the Kubernetes Pod dashboard to go to the Grafana page.

  5. In the left-side navigation pane of the Grafana page, click Explore. On the Explore page, enter request_predict_seconds_bucket in the query field to query the values of this application metric.

    Note

    Data is collected with a delay of 5 minutes.

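    In addition to viewing the raw request_predict_seconds_bucket series, you can enter PromQL functions on the Explore page to derive latency percentiles from the histogram, such as histogram_quantile(0.99, sum(rate(request_predict_seconds_bucket[5m])) by (le)) for the 99th percentile of the prediction latency. If your Managed Service for Prometheus instance exposes an HTTP API endpoint, the following commands are a minimal sketch of how to run the same query from a shell; the endpoint URL is a placeholder that you must replace with the value of your own instance, and authentication requirements depend on your instance settings:

    # Hypothetical HTTP API endpoint of your Prometheus instance. Replace it with your own value. 
    PROM_ENDPOINT="https://<your-prometheus-http-api-endpoint>"
    # Query the 99th percentile of the prediction latency over the previous 5 minutes. 
    curl -s -G "${PROM_ENDPOINT}/api/v1/query" \
        --data-urlencode 'query=histogram_quantile(0.99, sum(rate(request_predict_seconds_bucket[5m])) by (le))'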

FAQ

Issue

How do I determine whether data of the request_predict_seconds_bucket metric is collected? What do I do if the metric data fails to be collected?

Solution

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, click Integration Management.

  3. In the top navigation bar, select the region in which the ACK cluster resides. On the Integration Management page, click the Integrated Environments tab, click the Container Service tab, and then click the name of the environment to go to its details page. On the details page, click the Self-Monitoring tab.

  4. In the left-side pane of the Self-Monitoring tab, click the Targets tab. If default/sklearn-iris-svcmonitor/0 (1/1 up) is displayed, the metric data is collected.

    If the metric data fails to be collected, submit a ticket to request technical support.
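
    Before you submit a ticket, you can also check whether the ServiceMonitor and the metrics Service that were created in Step 1 still exist in the cluster. The following commands are a minimal sketch based on the resource names in the preceding steps:

    # Check whether the ServiceMonitor that scrapes the KServe metrics exists. 
    kubectl get servicemonitor sklearn-iris-svcmonitor -n default
    # Check whether the Service that exposes the metrics port exists. 
    kubectl get svc sklearn-iris-metric-svc -n default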

References

For information about the default metrics provided by KServe, see Prometheus Metrics.