Container Service for Kubernetes: Configure Prometheus monitoring for KServe to monitor the performance and health of model services

Updated: Aug 26, 2025

KServe provides a set of default Prometheus metrics to help you monitor the performance and health of your model services. This topic uses a scikit-learn model (sklearn-iris) as an example to describe how to configure Prometheus monitoring for KServe.

Prerequisites

Step 1: Deploy a KServe application

  1. Run the following command to deploy a scikit-learn KServe application.

    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --enable-prometheus=true \
        --metrics-port=8080 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    Expected output:

    service/sklearn-iris-metric-svc created # The Service named sklearn-iris-metric-svc is created.
    inferenceservice.serving.kserve.io/sklearn-iris created # The KServe InferenceService resource sklearn-iris is created.
    servicemonitor.monitoring.coreos.com/sklearn-iris-svcmonitor created # The ServiceMonitor resource is created to integrate with Prometheus and collect the metrics exposed by the sklearn-iris-metric-svc Service.
    INFO[0004] The Job sklearn-iris has been submitted successfully # The job is submitted to the cluster.
    INFO[0004] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status

    The output indicates that Arena has successfully started the deployment of a KServe service that serves a scikit-learn model, with Prometheus monitoring integrated. To see how metrics collection is wired up, you can inspect the generated ServiceMonitor, as shown below.
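
    (Optional) As a quick sanity check, the following commands (a minimal sketch assuming the default namespace used above) display the generated ServiceMonitor and the metrics Service it selects. Verify that the endpoint port in the ServiceMonitor matches the port you specified with --metrics-port.

    # Inspect the ServiceMonitor that Arena created for metrics scraping.
    kubectl get servicemonitor sklearn-iris-svcmonitor -n default -o yaml
    # Inspect the metrics Service that exposes the container's port 8080.
    kubectl get svc sklearn-iris-metric-svc -n default -o yaml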

  2. Run the following command to write the JSON below to the ./iris-input.json file, preparing the inference request input. The payload follows the KServe v1 inference protocol: each inner array in instances holds the four feature values of one iris sample (sepal length, sepal width, petal length, and petal width). You can optionally validate the file afterwards, as shown below.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
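
    (Optional) Before sending requests, you can confirm that the file contains valid JSON. This is a minimal sketch; it assumes python3 is available on the client machine.

    # Pretty-print the file; an error here indicates malformed JSON.
    python3 -m json.tool ./iris-input.json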
  3. Run the following commands to retrieve the IP address of the Nginx Ingress gateway and the hostname portion of the externally accessible URL of the InferenceService from the cluster. You can then send a single test request before load testing, as shown below.

    NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
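
    (Optional) To verify that the service is reachable before load testing, send a single request. This is a minimal sketch using the variables set above; a successful call returns a JSON body with a "predictions" field.

    curl -H "Host: ${SERVICE_HOSTNAME}" \
         -H "Content-Type: application/json" \
         -d @./iris-input.json \
         http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict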
  4. Run the following command to use the load testing tool Hey to send requests to the service repeatedly and generate monitoring data.

    Note

    For a detailed introduction to the Hey load testing tool, see Hey.

    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict

    Expected output:


    Summary:
      Total:	120.0296 secs
      Slowest:	0.1608 secs
      Fastest:	0.0213 secs
      Average:	0.0275 secs
      Requests/sec:	727.3875
      
      Total data:	1833468 bytes
      Size/request:	21 bytes
    
    Response time histogram:
      0.021 [1]	|
      0.035 [85717]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.049 [1272]	|■
      0.063 [144]	|
      0.077 [96]	|
      0.091 [44]	|
      0.105 [7]	|
      0.119 [0]	|
      0.133 [0]	|
      0.147 [11]	|
      0.161 [16]	|
    
    
    Latency distribution:
      10% in 0.0248 secs
      25% in 0.0257 secs
      50% in 0.0270 secs
      75% in 0.0285 secs
      90% in 0.0300 secs
      95% in 0.0315 secs
      99% in 0.0381 secs
    
    Details (average, fastest, slowest):
      DNS+dialup:	0.0000 secs, 0.0213 secs, 0.1608 secs
      DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
      req write:	0.0000 secs, 0.0000 secs, 0.0225 secs
      resp wait:	0.0273 secs, 0.0212 secs, 0.1607 secs
      resp read:	0.0001 secs, 0.0000 secs, 0.0558 secs
    
    Status code distribution:
      [200]	87308 responses

    The output summarizes the performance of the service during the test, including throughput, data volume, and response latency, which helps you evaluate the efficiency and stability of the service. As a consistency check, Requests/sec multiplied by the total duration roughly equals the number of responses (727.39 × 120.03 ≈ 87,308), matching the status code distribution.

  5. (Optional) Manually fetch the application metrics to confirm that they are exposed as expected.

    The following example shows how to collect metrics from the sklearn-iris Pod in the ACK cluster and view the data locally, without logging in to the Pod or exposing its port to the external network.

    1. Run the following commands to forward port 8080 of the Pod whose name contains sklearn-iris (referenced by the $POD_NAME variable) to port 8080 of your local host. All requests sent to local port 8080 are then transparently forwarded to port 8080 of the Pod.

      # Get the name of the Pod.
      POD_NAME=`kubectl get po | grep sklearn-iris | awk -F ' ' '{print $1}'`
      # Use port-forward to forward port 8080 of the Pod to the local host.
      kubectl port-forward pod/$POD_NAME 8080:8080

      Expected output:

      Forwarding from 127.0.0.1:8080 -> 8080
      Forwarding from [::1]:8080 -> 8080

      The output indicates that local connection attempts over both IPv4 and IPv6 are correctly forwarded to port 8080 of the Pod. Note that port-forward occupies the foreground terminal; see the sketch below for running it in the background.
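
      Because kubectl port-forward blocks the terminal, you can run it in the background while you query the metrics endpoint in the next sub-step. This is a minimal sketch; stop the background job when you are done.

      # Run the port-forward in the background and record its PID.
      kubectl port-forward pod/$POD_NAME 8080:8080 &
      PF_PID=$!
      # ...query http://localhost:8080/metrics as described below...
      # Stop the port-forward when finished.
      kill $PF_PID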

    2. Enter the following URL in a browser to access port 8080 of the Pod and view the metrics.

      http://localhost:8080/metrics

      Expected output:


      # HELP python_gc_objects_collected_total Objects collected during gc
      # TYPE python_gc_objects_collected_total counter
      python_gc_objects_collected_total{generation="0"} 10298.0
      python_gc_objects_collected_total{generation="1"} 1826.0
      python_gc_objects_collected_total{generation="2"} 0.0
      # HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
      # TYPE python_gc_objects_uncollectable_total counter
      python_gc_objects_uncollectable_total{generation="0"} 0.0
      python_gc_objects_uncollectable_total{generation="1"} 0.0
      python_gc_objects_uncollectable_total{generation="2"} 0.0
      # HELP python_gc_collections_total Number of times this generation was collected
      # TYPE python_gc_collections_total counter
      python_gc_collections_total{generation="0"} 660.0
      python_gc_collections_total{generation="1"} 60.0
      python_gc_collections_total{generation="2"} 5.0
      # HELP python_info Python platform information
      # TYPE python_info gauge
      python_info{implementation="CPython",major="3",minor="9",patchlevel="18",version="3.9.18"} 1.0
      # HELP process_virtual_memory_bytes Virtual memory size in bytes.
      # TYPE process_virtual_memory_bytes gauge
      process_virtual_memory_bytes 1.406291968e+09
      # HELP process_resident_memory_bytes Resident memory size in bytes.
      # TYPE process_resident_memory_bytes gauge
      process_resident_memory_bytes 2.73207296e+08
      # HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
      # TYPE process_start_time_seconds gauge
      process_start_time_seconds 1.71533439115e+09
      # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
      # TYPE process_cpu_seconds_total counter
      process_cpu_seconds_total 228.18
      # HELP process_open_fds Number of open file descriptors.
      # TYPE process_open_fds gauge
      process_open_fds 16.0
      # HELP process_max_fds Maximum number of open file descriptors.
      # TYPE process_max_fds gauge
      process_max_fds 1.048576e+06
      # HELP request_preprocess_seconds pre-process request latency
      # TYPE request_preprocess_seconds histogram
      request_preprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_count{model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_sum{model_name="sklearn-iris"} 1.7146860011853278
      # HELP request_preprocess_seconds_created pre-process request latency
      # TYPE request_preprocess_seconds_created gauge
      request_preprocess_seconds_created{model_name="sklearn-iris"} 1.7153354578475933e+09
      # HELP request_postprocess_seconds post-process request latency
      # TYPE request_postprocess_seconds histogram
      request_postprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_count{model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_sum{model_name="sklearn-iris"} 1.625360683305189
      # HELP request_postprocess_seconds_created post-process request latency
      # TYPE request_postprocess_seconds_created gauge
      request_postprocess_seconds_created{model_name="sklearn-iris"} 1.7153354578482144e+09
      # HELP request_predict_seconds predict request latency
      # TYPE request_predict_seconds histogram
      request_predict_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259708.0
      request_predict_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259708.0
      request_predict_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_count{model_name="sklearn-iris"} 259709.0
      request_predict_seconds_sum{model_name="sklearn-iris"} 47.95311741752084
      # HELP request_predict_seconds_created predict request latency
      # TYPE request_predict_seconds_created gauge
      request_predict_seconds_created{model_name="sklearn-iris"} 1.7153354578476949e+09
      # HELP request_explain_seconds explain request latency
      # TYPE request_explain_seconds histogram

      The output shows the performance and status metrics exposed by the application in the Pod, which confirms that the request is forwarded to the application service inside the Pod. You can also fetch the same endpoint from the command line, as shown below.
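
      If you prefer the command line to a browser, you can fetch and filter the same endpoint with curl. This is a minimal sketch that assumes the port-forward from the previous sub-step is still running.

      # Fetch the metrics endpoint and keep only the predict-latency histogram series.
      curl -s http://localhost:8080/metrics | grep '^request_predict_seconds'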

Step 2: Query KServe application metrics

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, click Integration Management, and then click Dashboards.

  3. On the dashboard list page, click the Kubernetes Pod dashboard to go to the Grafana page.

  4. In the left-side navigation pane, click Explore, and enter the query statement request_predict_seconds_bucket to query the application metric values. A sample query that derives latency percentiles from these buckets is shown after the note below.

    Note

    Metric collection has a delay of about 5 minutes.

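    Beyond inspecting the raw histogram buckets, you can derive latency percentiles from them. The following PromQL is a minimal sketch that estimates the p95 predict latency per model over a 5-minute window; adjust the window and quantile as needed.

    histogram_quantile(0.95, sum(rate(request_predict_seconds_bucket[5m])) by (le, model_name))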

FAQ

Issue

How do I confirm that the request_predict_seconds_bucket metric data has been collected?

Solution

  1. Log on to the ARMS console.

  2. On the Container Environments tab of the Integrated Environments page, click the name of the target container environment, and then click the Self-Monitoring tab.

  3. In the left-side navigation pane, click Targets. If default/sklearn-iris-svcmonitor/0 (1/1 up) is displayed, the metric data has been collected. You can additionally confirm ingestion with a query, as shown below.
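
As an additional check, you can run a count query on the Grafana Explore page described in Step 2. This is a minimal sketch; a non-zero, increasing result indicates that samples are being ingested.

    sum(request_predict_seconds_count) by (model_name)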

References

For information about the default metrics provided by KServe, see KServe Prometheus Metrics in the community documentation.