
Alibaba Cloud Service Mesh:Observe LLM traffic in Service Mesh (ASM)

Last Updated: Mar 11, 2026

When multiple workloads call LLM providers, tracking token consumption and model usage per request becomes difficult without infrastructure-level telemetry. Service Mesh (ASM) captures LLM-specific metadata -- model name, input tokens, and output tokens -- directly in the sidecar proxy, so you can monitor costs, debug requests, and analyze model performance without modifying application code.

ASM provides three levels of LLM observability, each building on the previous:

| Capability | What it tracks | Use case |
| --- | --- | --- |
| Access logs | Per-request model name, input tokens, output tokens | Debug individual requests, audit per-request costs |
| Token consumption metrics | Aggregated token counts per workload and model | Monitor token usage in real time, set alerting thresholds |
| Custom metric dimensions | LLM model as a dimension on native Istio metrics (istio_requests_total) | Analyze success rates and latency by model |

Prerequisites

Before you begin, make sure that you have:

  • A Service Mesh (ASM) instance

  • A Container Service for Kubernetes (ACK) cluster added to the mesh

  • Completion of at least Step 1 and Step 2 in Use ASM to route LLM traffic

Note

The examples below build on all steps from the traffic routing guide. If you completed only Step 1 and Step 2, use the test commands from Step 2 to generate traffic for the verification steps.

Add LLM fields to access logs

Most LLM providers charge by token usage. By adding LLM-specific fields to sidecar access logs, you get per-request visibility into which model handled each request and how many tokens it consumed -- enabling direct cost tracking from infrastructure logs.

For background on access log customization, see Custom data plane access logs.

Configure log fields

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Observability Management Center > Observability Settings.

  3. In the Log Settings section, add the following three fields:

    | Field name | FILTER_STATE expression | Description |
    | --- | --- | --- |
    | request_model | FILTER_STATE(wasm.asm.llmproxy.request_model:PLAIN) | Model used for the request (for example, qwen-turbo or qwen1.5-72b-chat) |
    | request_prompt_tokens | FILTER_STATE(wasm.asm.llmproxy.request_prompt_tokens:PLAIN) | Number of input tokens |
    | request_completion_tokens | FILTER_STATE(wasm.asm.llmproxy.request_completion_tokens:PLAIN) | Number of output tokens |
Verify access logs

  1. Send two test requests using the kubeconfig file of the ACK cluster. Run each command separately:

       kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
       --header 'Content-Type: application/json' \
       --data '{
           "messages": [
               {"role": "user", "content": "Please introduce yourself."}
           ]
       }'

       kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
       --header 'Content-Type: application/json' \
       --header 'user-type: subscriber' \
       --data '{
           "messages": [
               {"role": "user", "content": "Please introduce yourself."}
           ]
       }'
  2. View the two most recent access log entries:

       kubectl logs deployments/sleep -c istio-proxy | tail -2

     Expected output:
       {"bytes_received":"85","bytes_sent":"617","downstream_local_address":"47.93.xxx.xx:80","downstream_remote_address":"192.168.34.235:39066","duration":"7640","istio_policy_status":"-","method":"POST","path":"/compatible-mode/v1/chat/completions","protocol":"HTTP/1.1","request_id":"d0e17f66-f300-411a-8c32-xxxxxxxxxxxxx","requested_server_name":"-","response_code":"200","response_flags":"-","route_name":"-","start_time":"2024-07-12T03:20:03.993Z","trace_id":"-","upstream_cluster":"outbound|80||dashscope.aliyuncs.com","upstream_host":"47.93.xxx.xx:443","upstream_local_address":"192.168.34.235:38476","upstream_service_time":"7639","upstream_response_time":"7639","upstream_transport_failure_reason":"-","user_agent":"curl/8.8.0","x_forwarded_for":"-","authority_for":"dashscope.aliyuncs.com","request_model":"qwen1.5-72b-chat","request_prompt_tokens":"3","request_completion_tokens":"55"}
       {"bytes_received":"85","bytes_sent":"809","downstream_local_address":"47.93.xxx.xx:80","downstream_remote_address":"192.168.34.235:41090","duration":"2759","istio_policy_status":"-","method":"POST","path":"/compatible-mode/v1/chat/completions","protocol":"HTTP/1.1","request_id":"d89faada-6af3-4ac3-b4fd-xxxxxxxxxxxxx","requested_server_name":"-","response_code":"200","response_flags":"-","route_name":"vip-route","start_time":"2024-07-12T03:20:30.854Z","trace_id":"-","upstream_cluster":"outbound|80||dashscope.aliyuncs.com","upstream_host":"47.93.xxx.xx:443","upstream_local_address":"192.168.34.235:38476","upstream_service_time":"2759","upstream_response_time":"2759","upstream_transport_failure_reason":"-","user_agent":"curl/8.8.0","x_forwarded_for":"-","authority_for":"dashscope.aliyuncs.com","request_model":"qwen-turbo","request_prompt_tokens":"11","request_completion_tokens":"90"}
  3. The following formatted excerpt highlights the LLM-specific fields from the log output. Each entry shows the LLM provider (authority_for), the model that handled the request, and the number of tokens it consumed:

       {
           "duration": "7640",
           "response_code": "200",
           "authority_for": "dashscope.aliyuncs.com",
           "request_model": "qwen1.5-72b-chat",
           "request_prompt_tokens": "3",
           "request_completion_tokens": "55"
       }
       {
           "duration": "2759",
           "response_code": "200",
           "authority_for": "dashscope.aliyuncs.com",
           "request_model": "qwen-turbo",
           "request_prompt_tokens": "11",
           "request_completion_tokens": "90"
       }
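Because the token counts land in the logs as JSON string fields, a downstream cost estimate needs only a JSON parse and a price table. The following is a minimal sketch; the per-1,000-token prices are hypothetical placeholders, so substitute your provider's actual rates:

```python
import json

# Sample access-log entry with the LLM fields configured above
# (the sidecar logs token counts as strings, not numbers).
entry = json.loads(
    '{"authority_for": "dashscope.aliyuncs.com",'
    ' "request_model": "qwen-turbo",'
    ' "request_prompt_tokens": "11",'
    ' "request_completion_tokens": "90"}'
)

# Hypothetical per-1,000-token prices for illustration only.
PRICES = {"qwen-turbo": {"prompt": 0.0003, "completion": 0.0006}}

price = PRICES[entry["request_model"]]
cost = (int(entry["request_prompt_tokens"]) / 1000 * price["prompt"]
        + int(entry["request_completion_tokens"]) / 1000 * price["completion"])
print(round(cost, 7))
```

The same per-line calculation can run as a log-processing job over the full sidecar log stream, or be expressed as an SQL aggregation once the logs are in SLS.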

Forward logs to Simple Log Service (SLS)

ASM integrates with Simple Log Service (SLS) for centralized log collection. After you enable log collection, you can:

  • Search and filter logs by model name, token count, or response code

  • Create alerting rules -- for example, alert when a single request exceeds a token threshold

  • Build dashboards for LLM usage analytics

For setup instructions, see Enable data plane log collection.
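As an illustration of the kind of analysis this enables, assuming the three log fields are indexed under the names configured above, a query in SLS's search-then-SQL syntax might total output tokens per model (a sketch, not verified against any specific Logstore):

```sql
* |
SELECT
  request_model,
  SUM(CAST(request_completion_tokens AS bigint)) AS total_completion_tokens
GROUP BY request_model
ORDER BY total_completion_tokens DESC
```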

Export token consumption as Prometheus metrics

Access logs capture per-request detail. For aggregated, real-time monitoring, configure the sidecar proxy to export token consumption as Prometheus metrics.

ASM exposes two LLM-specific metrics:

| Metric | Description |
| --- | --- |
| asm_llm_proxy_prompt_tokens | Number of input tokens |
| asm_llm_proxy_completion_tokens | Number of output tokens |

These metrics include four default dimensions:

| Dimension | Description |
| --- | --- |
| llmproxy_source_workload | Workload that initiated the request |
| llmproxy_source_workload_namespace | Namespace of the source workload |
| llmproxy_destination_service | Destination LLM service |
| llmproxy_model | Model used for the request |

Configure the sidecar to emit metrics

This example uses the sleep Deployment in the default namespace.

  1. Create a file named asm-llm-proxy-bootstrap-config.yaml with the following content:

       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: asm-llm-proxy-bootstrap-config
       data:
         custom_bootstrap.json: |
           "stats_config": {
             "stats_tags":[
               {
               "tag_name": "llmproxy_source_workload",
               "regex": "(\\|llmproxy_source_workload=([^|]*))"
               },
               {
                 "tag_name": "llmproxy_source_workload_namespace",
                 "regex": "(\\|llmproxy_source_workload_namespace=([^|]*))"
               },
               {
                 "tag_name": "llmproxy_destination_service",
                 "regex": "(\\|llmproxy_destination_service=([^|]*))"
               },
               {
                 "tag_name": "llmproxy_model",
                 "regex": "(\\|llmproxy_model=([^|]*))"
               }
             ]
           }
  2. Apply the ConfigMap:

       kubectl apply -f asm-llm-proxy-bootstrap-config.yaml
  3. Add the bootstrap override annotation to the Deployment. This tells the sidecar to load the custom stats configuration:

       kubectl patch deployment sleep -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/bootstrapOverride":"asm-llm-proxy-bootstrap-config"}}}}}'
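The stats_tags regexes above follow Envoy's tag-extraction convention: the full match (capture group 1) is stripped from the internal stat name, and the inner capture group (group 2) becomes the tag value. A small sketch of that mechanism, using a hypothetical embedded stat name for illustration:

```python
import re

# Hypothetical internal stat name with embedded tags.
stat = ("asm_llm_proxy_prompt_tokens"
        "|llmproxy_source_workload=sleep"
        "|llmproxy_model=qwen-turbo")

# Same pattern as the llmproxy_model entry in the ConfigMap.
pattern = r"(\|llmproxy_model=([^|]*))"
m = re.search(pattern, stat)

tag_value = m.group(2)                    # becomes the llmproxy_model label
stripped = stat.replace(m.group(1), "")   # tag text removed from the stat name
print(tag_value)
print(stripped)
```

Each of the four entries in the ConfigMap strips one embedded tag the same way, which is why the Prometheus output below shows clean metric names with separate labels.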

Verify token metrics

  1. Send test requests using the commands from the previous section.

  2. Query the sidecar's Prometheus endpoint. In the expected output that follows the command, each metric line shows the token count broken down by source workload, destination service, and model:

       kubectl exec deployments/sleep -it -c istio-proxy -- curl localhost:15090/stats/prometheus | grep llmproxy
       asm_llm_proxy_completion_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen1.5-72b-chat"} 72
       asm_llm_proxy_completion_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-turbo"} 85
       asm_llm_proxy_prompt_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen1.5-72b-chat"} 3
       asm_llm_proxy_prompt_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-turbo"} 11
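Because these counters accumulate token totals, standard Prometheus rate aggregation applies once the metrics are collected. A sketch of possible queries, using the metric and label names shown above (the 5-minute window is an arbitrary choice):

```promql
# Tokens consumed per minute, per model (input plus output):
sum by (llmproxy_model) (
  rate(asm_llm_proxy_prompt_tokens[5m])
  + rate(asm_llm_proxy_completion_tokens[5m])
) * 60

# Output tokens per source workload:
sum by (llmproxy_source_workload) (rate(asm_llm_proxy_completion_tokens[5m]))
```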

Forward metrics to Managed Service for Prometheus

ASM integrates with Application Real-Time Monitoring Service (ARMS) for Prometheus-based metric collection. After you configure collection rules, you can build Grafana dashboards and set up alerting rules based on these LLM metrics.

For setup instructions, see Collect metrics to Managed Service for Prometheus.

Add LLM dimensions to native Istio metrics

ASM natively provides Istio standard metrics such as istio_requests_total, which track HTTP and TCP traffic with dimensions like source workload, destination service, and response code, and it offers a Prometheus dashboard built on these metrics and dimensions. By default, the metrics do not include LLM-specific information.

To enable per-model analysis on native metrics, add a custom model dimension that extracts the model name from LLM requests.

Configure the model dimension

This example adds the model dimension to the REQUEST_COUNT metric.

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Observability Management Center > Observability Settings.

  3. Select REQUEST_COUNT and click Edit Dimension. On the Custom Dimension tab, enter the following values:

    • Dimension Name: model

    • Value: filter_state["wasm.asm.llmproxy.request_model"]

Verify the custom dimension

  1. Send test requests using the commands from the access log section.

  2. Query the sidecar's Prometheus endpoint. In the expected output, the model dimension appears in istio_requests_total, enabling per-model queries on native Istio metrics:

       kubectl exec deployments/sleep -it -c istio-proxy -- curl localhost:15090/stats/prometheus | grep istio_requests_total
       istio_requests_total{reporter="source",source_workload="sleep",source_canonical_service="sleep",source_canonical_revision="latest",source_workload_namespace="default",source_principal="unknown",source_app="sleep",source_version="",source_cluster="cce8d2c1d1e8d4abc8d5c180d160669cc",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="dashscope.aliyuncs.com",destination_canonical_service="unknown",destination_canonical_revision="latest",destination_service_name="dashscope.aliyuncs.com",destination_service_namespace="unknown",destination_cluster="unknown",request_protocol="http",response_code="200",grpc_response_status="",response_flags="-",connection_security_policy="unknown",model="qwen1.5-72b-chat"} 1
       istio_requests_total{reporter="source",source_workload="sleep",source_canonical_service="sleep",source_canonical_revision="latest",source_workload_namespace="default",source_principal="unknown",source_app="sleep",source_version="",source_cluster="cce8d2c1d1e8d4abc8d5c180d160669cc",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="dashscope.aliyuncs.com",destination_canonical_service="unknown",destination_canonical_revision="latest",destination_service_name="dashscope.aliyuncs.com",destination_service_namespace="unknown",destination_cluster="unknown",request_protocol="http",response_code="200",grpc_response_status="",response_flags="-",connection_security_policy="unknown",model="qwen-turbo"} 1

Example analysis queries

With the model dimension on istio_requests_total, set up analysis rules in Application Real-Time Monitoring Service (ARMS). For example:

  • Success rate by model: Compare response_code="200" counts against total counts, grouped by model.

  • Latency by model or provider: Add the same model dimension to latency metrics to track average response times per model.
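With the model dimension in place, the bullet points above translate into PromQL along these lines (a sketch; adjust the window and label filters to your environment):

```promql
# Success rate by model over the last 5 minutes:
sum by (model) (rate(istio_requests_total{response_code="200"}[5m]))
/
sum by (model) (rate(istio_requests_total[5m]))
```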

What's next