
Alibaba Cloud Service Mesh:Observe LLM traffic in Service Mesh (ASM)

Last Updated: Mar 11, 2026

When multiple workloads call LLM providers, tracking token consumption and model usage per request becomes difficult without infrastructure-level telemetry. Service Mesh (ASM) captures LLM-specific metadata -- model name, input tokens, and output tokens -- directly in the sidecar proxy, so you can monitor costs, debug requests, and analyze model performance without modifying application code.

ASM provides three levels of LLM observability, each building on the previous:

| Capability | What it tracks | Use case |
| --- | --- | --- |
| Access logs | Per-request model name, input tokens, output tokens | Debug individual requests, audit per-request costs |
| Token consumption metrics | Aggregated token counts per workload and model | Monitor token usage in real time, set alerting thresholds |
| Custom metric dimensions | LLM model as a dimension on native Istio metrics (istio_requests_total) | Analyze success rates and latency by model |

Prerequisites

Before you begin, make sure that you have:

  • A Service Mesh (ASM) instance

  • A Container Service for Kubernetes (ACK) cluster added to the mesh

  • Completion of at least Step 1 and Step 2 in Use ASM to route LLM traffic

Note

The examples below build on all steps from the traffic routing guide. If you completed only Step 1 and Step 2, use the test commands from Step 2 to generate traffic for the verification steps.

Add LLM fields to access logs

Most LLM providers charge by token usage. By adding LLM-specific fields to sidecar access logs, you get per-request visibility into which model handled each request and how many tokens it consumed -- enabling direct cost tracking from infrastructure logs.

For background on access log customization, see Custom data plane access logs.

Configure log fields

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Observability Management Center > Observability Settings.

  3. In the Log Settings section, add the following three fields:

    | Field name | FILTER_STATE expression | Description |
    | --- | --- | --- |
    | request_model | FILTER_STATE(wasm.asm.llmproxy.request_model:PLAIN) | Model used for the request (for example, qwen-turbo or qwen1.5-72b-chat) |
    | request_prompt_tokens | FILTER_STATE(wasm.asm.llmproxy.request_prompt_tokens:PLAIN) | Number of input tokens |
    | request_completion_tokens | FILTER_STATE(wasm.asm.llmproxy.request_completion_tokens:PLAIN) | Number of output tokens |
Verify access logs

  1. Send two test requests using the kubeconfig file of the ACK cluster. Run each command separately:

       kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
       --header 'Content-Type: application/json' \
       --data '{
           "messages": [
               {"role": "user", "content": "Please introduce yourself."}
           ]
       }'

       kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
       --header 'Content-Type: application/json' \
       --header 'user-type: subscriber' \
       --data '{
           "messages": [
               {"role": "user", "content": "Please introduce yourself."}
           ]
       }'
  2. View the two most recent access log entries:

       kubectl logs deployments/sleep -c istio-proxy | tail -2

     Expected output:
       {"bytes_received":"85","bytes_sent":"617","downstream_local_address":"47.93.xxx.xx:80","downstream_remote_address":"192.168.34.235:39066","duration":"7640","istio_policy_status":"-","method":"POST","path":"/compatible-mode/v1/chat/completions","protocol":"HTTP/1.1","request_id":"d0e17f66-f300-411a-8c32-xxxxxxxxxxxxx","requested_server_name":"-","response_code":"200","response_flags":"-","route_name":"-","start_time":"2024-07-12T03:20:03.993Z","trace_id":"-","upstream_cluster":"outbound|80||dashscope.aliyuncs.com","upstream_host":"47.93.xxx.xx:443","upstream_local_address":"192.168.34.235:38476","upstream_service_time":"7639","upstream_response_time":"7639","upstream_transport_failure_reason":"-","user_agent":"curl/8.8.0","x_forwarded_for":"-","authority_for":"dashscope.aliyuncs.com","request_model":"qwen1.5-72b-chat","request_prompt_tokens":"3","request_completion_tokens":"55"}
       {"bytes_received":"85","bytes_sent":"809","downstream_local_address":"47.93.xxx.xx:80","downstream_remote_address":"192.168.34.235:41090","duration":"2759","istio_policy_status":"-","method":"POST","path":"/compatible-mode/v1/chat/completions","protocol":"HTTP/1.1","request_id":"d89faada-6af3-4ac3-b4fd-xxxxxxxxxxxxx","requested_server_name":"-","response_code":"200","response_flags":"-","route_name":"vip-route","start_time":"2024-07-12T03:20:30.854Z","trace_id":"-","upstream_cluster":"outbound|80||dashscope.aliyuncs.com","upstream_host":"47.93.xxx.xx:443","upstream_local_address":"192.168.34.235:38476","upstream_service_time":"2759","upstream_response_time":"2759","upstream_transport_failure_reason":"-","user_agent":"curl/8.8.0","x_forwarded_for":"-","authority_for":"dashscope.aliyuncs.com","request_model":"qwen-turbo","request_prompt_tokens":"11","request_completion_tokens":"90"}
  3. The following formatted excerpt highlights the LLM-specific fields from the log output. Each entry shows the LLM provider (authority_for), the model that handled the request, and the number of tokens it consumed:

       {
           "duration": "7640",
           "response_code": "200",
           "authority_for": "dashscope.aliyuncs.com",
           "request_model": "qwen1.5-72b-chat",
           "request_prompt_tokens": "3",
           "request_completion_tokens": "55"
       }
       {
           "duration": "2759",
           "response_code": "200",
           "authority_for": "dashscope.aliyuncs.com",
           "request_model": "qwen-turbo",
           "request_prompt_tokens": "11",
           "request_completion_tokens": "90"
       }
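Because the token counts land in the logs as JSON string fields, a downstream cost estimate needs only a JSON parse and a price table. The following is a minimal sketch; the per-1,000-token prices are hypothetical placeholders, so substitute your provider's actual rates:

```python
import json

# Sample access-log entry with the LLM fields configured above
# (the sidecar logs token counts as strings, not numbers).
entry = json.loads(
    '{"authority_for": "dashscope.aliyuncs.com",'
    ' "request_model": "qwen-turbo",'
    ' "request_prompt_tokens": "11",'
    ' "request_completion_tokens": "90"}'
)

# Hypothetical per-1,000-token prices for illustration only.
PRICES = {"qwen-turbo": {"prompt": 0.0003, "completion": 0.0006}}

price = PRICES[entry["request_model"]]
cost = (int(entry["request_prompt_tokens"]) / 1000 * price["prompt"]
        + int(entry["request_completion_tokens"]) / 1000 * price["completion"])
print(round(cost, 7))
```

The same per-line calculation can run as a log-processing job over the full sidecar log stream, or be expressed as an SQL aggregation once the logs are in SLS.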

Forward logs to Simple Log Service (SLS)

ASM integrates with Simple Log Service (SLS) for centralized log collection. After you enable log collection, you can:

  • Search and filter logs by model name, token count, or response code

  • Create alerting rules -- for example, alert when a single request exceeds a token threshold

  • Build dashboards for LLM usage analytics

For setup instructions, see Enable data plane log collection.
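As an illustration of the kind of analysis this enables, assuming the three log fields are indexed under the names configured above, a query in SLS's search-then-SQL syntax might total output tokens per model (a sketch, not verified against any specific Logstore):

```sql
* |
SELECT
  request_model,
  SUM(CAST(request_completion_tokens AS bigint)) AS total_completion_tokens
GROUP BY request_model
ORDER BY total_completion_tokens DESC
```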

Export token consumption as Prometheus metrics

Access logs capture per-request detail. For aggregated, real-time monitoring, configure the sidecar proxy to export token consumption as Prometheus metrics.

ASM exposes two LLM-specific metrics:

| Metric | Description |
| --- | --- |
| asm_llm_proxy_prompt_tokens | Number of input tokens |
| asm_llm_proxy_completion_tokens | Number of output tokens |

These metrics include four default dimensions:

| Dimension | Description |
| --- | --- |
| llmproxy_source_workload | Workload that initiated the request |
| llmproxy_source_workload_namespace | Namespace of the source workload |
| llmproxy_destination_service | Destination LLM service |
| llmproxy_model | Model used for the request |

Configure the sidecar to emit metrics

This example uses the sleep Deployment in the default namespace.

  1. Create a file named asm-llm-proxy-bootstrap-config.yaml with the following content:

       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: asm-llm-proxy-bootstrap-config
       data:
         custom_bootstrap.json: |
           "stats_config": {
             "stats_tags":[
               {
               "tag_name": "llmproxy_source_workload",
               "regex": "(\\|llmproxy_source_workload=([^|]*))"
               },
               {
                 "tag_name": "llmproxy_source_workload_namespace",
                 "regex": "(\\|llmproxy_source_workload_namespace=([^|]*))"
               },
               {
                 "tag_name": "llmproxy_destination_service",
                 "regex": "(\\|llmproxy_destination_service=([^|]*))"
               },
               {
                 "tag_name": "llmproxy_model",
                 "regex": "(\\|llmproxy_model=([^|]*))"
               }
             ]
           }
  2. Apply the ConfigMap:

       kubectl apply -f asm-llm-proxy-bootstrap-config.yaml
  3. Add the bootstrap override annotation to the Deployment. This tells the sidecar to load the custom stats configuration:

       kubectl patch deployment sleep -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/bootstrapOverride":"asm-llm-proxy-bootstrap-config"}}}}}'
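The stats_tags regexes above follow Envoy's tag-extraction convention: the full match (capture group 1) is stripped from the internal stat name, and the inner capture group (group 2) becomes the tag value. A small sketch of that mechanism, using a hypothetical embedded stat name for illustration:

```python
import re

# Hypothetical internal stat name with embedded tags.
stat = ("asm_llm_proxy_prompt_tokens"
        "|llmproxy_source_workload=sleep"
        "|llmproxy_model=qwen-turbo")

# Same pattern as the llmproxy_model entry in the ConfigMap.
pattern = r"(\|llmproxy_model=([^|]*))"
m = re.search(pattern, stat)

tag_value = m.group(2)                    # becomes the llmproxy_model label
stripped = stat.replace(m.group(1), "")   # tag text removed from the stat name
print(tag_value)
print(stripped)
```

Each of the four entries in the ConfigMap strips one embedded tag the same way, which is why the Prometheus output below shows clean metric names with separate labels.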

Verify token metrics

  1. Send test requests using the commands from the previous section.

  2. Query the sidecar's Prometheus endpoint. In the expected output that follows the command, each metric line shows the token count broken down by source workload, destination service, and model:

       kubectl exec deployments/sleep -it -c istio-proxy -- curl localhost:15090/stats/prometheus | grep llmproxy
       asm_llm_proxy_completion_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen1.5-72b-chat"} 72
       asm_llm_proxy_completion_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-turbo"} 85
       asm_llm_proxy_prompt_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen1.5-72b-chat"} 3
       asm_llm_proxy_prompt_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-turbo"} 11
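Because these counters accumulate token totals, standard Prometheus rate aggregation applies once the metrics are collected. A sketch of possible queries, using the metric and label names shown above (the 5-minute window is an arbitrary choice):

```promql
# Tokens consumed per minute, per model (input plus output):
sum by (llmproxy_model) (
  rate(asm_llm_proxy_prompt_tokens[5m])
  + rate(asm_llm_proxy_completion_tokens[5m])
) * 60

# Output tokens per source workload:
sum by (llmproxy_source_workload) (rate(asm_llm_proxy_completion_tokens[5m]))
```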

Forward metrics to Managed Service for Prometheus

ASM integrates with Application Real-Time Monitoring Service (ARMS) for Prometheus-based metric collection. After you configure collection rules, you can build Grafana dashboards and set up alerting rules based on these LLM metrics.

For setup instructions, see Collect metrics to Managed Service for Prometheus.

Add LLM dimensions to native Istio metrics

ASM natively provides Istio standard metrics such as istio_requests_total, which track HTTP and TCP traffic with dimensions like source workload, destination service, and response code, and it offers a Prometheus dashboard built on these metrics and dimensions. By default, the metrics do not include LLM-specific information.

To enable per-model analysis on native metrics, add a custom model dimension that extracts the model name from LLM requests.

Configure the model dimension

This example adds the model dimension to the REQUEST_COUNT metric.

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Observability Management Center > Observability Settings.

  3. Select REQUEST_COUNT and click Edit Dimension. On the Custom Dimension tab, enter the following values:

    • Dimension Name: model

    • Value: filter_state["wasm.asm.llmproxy.request_model"]

Verify the custom dimension

  1. Send test requests using the commands from the access log section.

  2. Query the sidecar's Prometheus endpoint. In the expected output, the model dimension appears in istio_requests_total, enabling per-model queries on native Istio metrics:

       kubectl exec deployments/sleep -it -c istio-proxy -- curl localhost:15090/stats/prometheus | grep istio_requests_total
       istio_requests_total{reporter="source",source_workload="sleep",source_canonical_service="sleep",source_canonical_revision="latest",source_workload_namespace="default",source_principal="unknown",source_app="sleep",source_version="",source_cluster="cce8d2c1d1e8d4abc8d5c180d160669cc",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="dashscope.aliyuncs.com",destination_canonical_service="unknown",destination_canonical_revision="latest",destination_service_name="dashscope.aliyuncs.com",destination_service_namespace="unknown",destination_cluster="unknown",request_protocol="http",response_code="200",grpc_response_status="",response_flags="-",connection_security_policy="unknown",model="qwen1.5-72b-chat"} 1
       istio_requests_total{reporter="source",source_workload="sleep",source_canonical_service="sleep",source_canonical_revision="latest",source_workload_namespace="default",source_principal="unknown",source_app="sleep",source_version="",source_cluster="cce8d2c1d1e8d4abc8d5c180d160669cc",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="dashscope.aliyuncs.com",destination_canonical_service="unknown",destination_canonical_revision="latest",destination_service_name="dashscope.aliyuncs.com",destination_service_namespace="unknown",destination_cluster="unknown",request_protocol="http",response_code="200",grpc_response_status="",response_flags="-",connection_security_policy="unknown",model="qwen-turbo"} 1

Example analysis queries

With the model dimension on istio_requests_total, set up analysis rules in Application Real-Time Monitoring Service (ARMS). For example:

  • Success rate by model: Compare response_code="200" counts against total counts, grouped by model.

  • Latency by model or provider: Add the same model dimension to latency metrics to track average response times per model.
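With the model dimension in place, the bullet points above translate into PromQL along these lines (a sketch; adjust the window and label filters to your environment):

```promql
# Success rate by model over the last 5 minutes:
sum by (model) (rate(istio_requests_total{response_code="200"}[5m]))
/
sum by (model) (rate(istio_requests_total[5m]))
```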

What's next