Beyond the LLM request routing capabilities discussed in the preceding document, Alibaba Cloud Service Mesh (ASM) has further enhanced its observability features to meet the advanced observability requirements of LLM scenarios. This topic describes how to observe LLM requests by using access logs and monitoring metrics in the ASM console.
To provide diverse options for traffic management, the steps in this topic build on all the steps in Traffic routing: Use ASM to efficiently manage LLM traffic. If you have completed only Step 1 and Step 2 of that topic, you can run the test commands from its Step 2 instead; the way you retrieve the observable data is the same as described in this topic.
Step 1: Observe LLM requests using access logs
Configure access logs
ASM has enhanced its support for LLM request logs. You can configure custom access log fields to view these request logs. For detailed instructions, see Custom data plane access logs.
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, navigate to the log settings.
In the global Log Settings, add the following three fields:
request_model	FILTER_STATE(wasm.asm.llmproxy.request_model:PLAIN)
request_prompt_tokens	FILTER_STATE(wasm.asm.llmproxy.request_prompt_tokens:PLAIN)
request_completion_tokens	FILTER_STATE(wasm.asm.llmproxy.request_completion_tokens:PLAIN)
These fields are as follows:
request_model: the model used for the current LLM request, for instance, qwen-turbo or qwen1.5-72b-chat.
request_prompt_tokens: the number of input tokens for the current request.
request_completion_tokens: the number of output tokens for the current request.
Most large model service providers charge based on token usage. These fields let you see how many tokens each request consumes and which model served it.
Verification
Run the following two commands separately by using the kubeconfig file of the ACK cluster. The second command carries the user-type: subscriber header, which matches the vip-route routing rule configured in the traffic routing topic.
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--header 'user-type: subscriber' \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'
Run the following command to view the access logs.
kubectl logs deployments/sleep -c istio-proxy | tail -2
Expected output:
{"bytes_received":"85","bytes_sent":"617","downstream_local_address":"47.93.xxx.xx:80","downstream_remote_address":"192.168.34.235:39066","duration":"7640","istio_policy_status":"-","method":"POST","path":"/compatible-mode/v1/chat/completions","protocol":"HTTP/1.1","request_id":"d0e17f66-f300-411a-8c32-xxxxxxxxxxxxx","requested_server_name":"-","response_code":"200","response_flags":"-","route_name":"-","start_time":"2024-07-12T03:20:03.993Z","trace_id":"-","upstream_cluster":"outbound|80||dashscope.aliyuncs.com","upstream_host":"47.93.xxx.xx:443","upstream_local_address":"192.168.34.235:38476","upstream_service_time":"7639","upstream_response_time":"7639","upstream_transport_failure_reason":"-","user_agent":"curl/8.8.0","x_forwarded_for":"-","authority_for":"dashscope.aliyuncs.com","request_model":"qwen1.5-72b-chat","request_prompt_tokens":"3","request_completion_tokens":"55"} {"bytes_received":"85","bytes_sent":"809","downstream_local_address":"47.93.xxx.xx:80","downstream_remote_address":"192.168.34.235:41090","duration":"2759","istio_policy_status":"-","method":"POST","path":"/compatible-mode/v1/chat/completions","protocol":"HTTP/1.1","request_id":"d89faada-6af3-4ac3-b4fd-xxxxxxxxxxxxx","requested_server_name":"-","response_code":"200","response_flags":"-","route_name":"vip-route","start_time":"2024-07-12T03:20:30.854Z","trace_id":"-","upstream_cluster":"outbound|80||dashscope.aliyuncs.com","upstream_host":"47.93.xxx.xx:443","upstream_local_address":"192.168.34.235:38476","upstream_service_time":"2759","upstream_response_time":"2759","upstream_transport_failure_reason":"-","user_agent":"curl/8.8.0","x_forwarded_for":"-","authority_for":"dashscope.aliyuncs.com","request_model":"qwen-turbo","request_prompt_tokens":"11","request_completion_tokens":"90"}
The two logs, formatted and trimmed to the key fields, appear as follows.
{ "duration": "7640", "response_code": "200", "authority_for": "dashscope.aliyuncs.com", --The actual large model provider accessed "request_model": "qwen1.5-72b-chat", --The model used by the current request "request_prompt_tokens": "3", --The number of input tokens for the current request "request_completion_tokens": "55" --The number of output tokens for the current request }
{ "duration": "2759", "response_code": "200", "authority_for": "dashscope.aliyuncs.com", --The actual large model provider accessed "request_model": "qwen-turbo", --The model used by the current request "request_prompt_tokens": "11", --The number of input tokens for the current request "request_completion_tokens": "90" --The number of output tokens for the current request }
ASM seamlessly integrates with Alibaba Cloud Simple Log Service (SLS), allowing you to monitor request-level LLM invocations via access logs. Additionally, these logs can be collected and stored directly. After you enable access logs, you can create custom alerting rules and design detailed log dashboards. For more information, see Enable data plane log collection.
Step 2: Add metrics to display the number of tokens consumed by the current workload
While access logs provide detailed records, monitoring metrics offer a broader view of data. ASM's mesh proxy can now output the number of tokens consumed by a workload as monitoring metrics, allowing real-time observation of token usage for the current workload.
ASM introduces two new metrics:
asm_llm_proxy_prompt_tokens: the number of input tokens.
asm_llm_proxy_completion_tokens: the number of output tokens.
These metrics include the following default dimensions:
llmproxy_source_workload: the name of the workload initiating the request.
llmproxy_source_workload_namespace: the namespace in which the workload that initiates the request resides.
llmproxy_destination_service: the destination service.
llmproxy_model: the model used for the current request.
Modify workload configuration to output new metrics
This step uses the sleep deployment in the default namespace as an example.
Create a file named asm-llm-proxy-bootstrap-config.yaml by using the kubeconfig file of the ACK cluster. This bootstrap configuration adds Envoy stats_tags rules that extract the llmproxy_* dimensions embedded in the raw metric names into metric labels.
apiVersion: v1
kind: ConfigMap
metadata:
  name: asm-llm-proxy-bootstrap-config
data:
  custom_bootstrap.json: |
    "stats_config": {
      "stats_tags": [
        {
          "tag_name": "llmproxy_source_workload",
          "regex": "(\\|llmproxy_source_workload=([^|]*))"
        },
        {
          "tag_name": "llmproxy_source_workload_namespace",
          "regex": "(\\|llmproxy_source_workload_namespace=([^|]*))"
        },
        {
          "tag_name": "llmproxy_destination_service",
          "regex": "(\\|llmproxy_destination_service=([^|]*))"
        },
        {
          "tag_name": "llmproxy_model",
          "regex": "(\\|llmproxy_model=([^|]*))"
        }
      ]
    }
Run the following command to create a ConfigMap named asm-llm-proxy-bootstrap-config.
kubectl apply -f asm-llm-proxy-bootstrap-config.yaml
Modify the sleep deployment by adding an annotation to its pod template with the following command.
kubectl patch deployment sleep -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/bootstrapOverride":"asm-llm-proxy-bootstrap-config"}}}}}'
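Because patching the pod template triggers a rolling update, wait for the new pods to be ready and optionally confirm the annotation. The following commands are a minimal sketch:

# Wait for the rolling update triggered by the patch to complete.
kubectl rollout status deployment/sleep
# Print the pod template annotations; sidecar.istio.io/bootstrapOverride should be present.
kubectl get deployment sleep -o jsonpath='{.spec.template.metadata.annotations}'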
Verification
Run the test commands in Step 1 again.
Run the following command to view the Prometheus metrics produced by the sleep's Sidecar.
kubectl exec deployments/sleep -it -c istio-proxy -- curl localhost:15090/stats/prometheus | grep llmproxy
Expected output:
asm_llm_proxy_completion_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen1.5-72b-chat"} 72
asm_llm_proxy_completion_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-turbo"} 85
asm_llm_proxy_prompt_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen1.5-72b-chat"} 3
asm_llm_proxy_prompt_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-turbo"} 11
The output shows the metrics provided by the sidecar and their respective default dimensions.
ASM is integrated with the ARMS service, so these metrics can be collected into Managed Service for Prometheus through configured collection rules. For detailed instructions, see Collect metrics to Managed Service for Prometheus.
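Once the metrics are in a Prometheus backend, you can aggregate them with PromQL. The following sketch queries the Prometheus HTTP API; <prometheus-endpoint> is a placeholder for the endpoint of your Managed Service for Prometheus instance, and the query assumes that the token metrics are exposed as counters:

# Per-model rate of input tokens consumed over the last 5 minutes.
curl -G 'http://<prometheus-endpoint>/api/v1/query' \
  --data-urlencode 'query=sum by (llmproxy_model) (rate(asm_llm_proxy_prompt_tokens[5m]))'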
Step 3: Add LLM-related dimensions to ASM native metrics
ASM natively provides a variety of metrics that describe HTTP and TCP traffic in detail. These metrics come with extensive dimensions, and ASM provides a robust Prometheus dashboard built on these metrics and dimensions.
However, these metrics currently carry no LLM-specific information. To address this, ASM has enhanced its support for LLM requests, allowing you to add LLM-related information to existing metrics by customizing metric dimensions.
Configure custom dimension: model
In this example, add the model dimension to the REQUEST_COUNT metric.
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, navigate to the monitoring metric settings.
Select REQUEST_COUNT and click Edit Dimension. Click the Custom Dimension tab, enter model as the Dimension Name, and enter the following expression as the Value:
filter_state["wasm.asm.llmproxy.request_model"]
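As an alternative sketch, recent Istio versions can add the same dimension declaratively through the Telemetry API, reusing the filter_state expression above. Whether your ASM instance accepts user-defined Telemetry resources is an assumption; the console procedure above is the documented path:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: llm-model-dimension
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_COUNT
      tagOverrides:
        model:
          # Assumed expression; mirrors the Value configured in the console above.
          value: 'filter_state["wasm.asm.llmproxy.request_model"]'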
Verification
Run the two test commands in Step 1 again, separately.
Run the following command to view the Prometheus metrics produced by the sleep's Sidecar.
kubectl exec deployments/sleep -it -c istio-proxy -- curl localhost:15090/stats/prometheus | grep istio_requests_total
Expected output:
istio_requests_total{reporter="source",source_workload="sleep",source_canonical_service="sleep",source_canonical_revision="latest",source_workload_namespace="default",source_principal="unknown",source_app="sleep",source_version="",source_cluster="cce8d2c1d1e8d4abc8d5c180d160669cc",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="dashscope.aliyuncs.com",destination_canonical_service="unknown",destination_canonical_revision="latest",destination_service_name="dashscope.aliyuncs.com",destination_service_namespace="unknown",destination_cluster="unknown",request_protocol="http",response_code="200",grpc_response_status="",response_flags="-",connection_security_policy="unknown",model="qwen1.5-72b-chat"} 1
istio_requests_total{reporter="source",source_workload="sleep",source_canonical_service="sleep",source_canonical_revision="latest",source_workload_namespace="default",source_principal="unknown",source_app="sleep",source_version="",source_cluster="cce8d2c1d1e8d4abc8d5c180d160669cc",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="dashscope.aliyuncs.com",destination_canonical_service="unknown",destination_canonical_revision="latest",destination_service_name="dashscope.aliyuncs.com",destination_service_namespace="unknown",destination_cluster="unknown",request_protocol="http",response_code="200",grpc_response_status="",response_flags="-",connection_security_policy="unknown",model="qwen-turbo"} 1
The model dimension has now been successfully added to the istio_requests_total metric.
With these monitoring metrics, you can set up analysis rules in ARMS for a more granular analysis. For instance:
The success rate of requests to a specific model.
The average response latency for a particular model or service provider.
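For example, a per-model request success rate can be computed with a PromQL query such as the following sketch (<prometheus-endpoint> is again a placeholder for your Managed Service for Prometheus endpoint):

# Fraction of requests per model that returned HTTP 200 over the last 5 minutes.
curl -G 'http://<prometheus-endpoint>/api/v1/query' \
  --data-urlencode 'query=sum by (model) (rate(istio_requests_total{response_code="200"}[5m])) / sum by (model) (rate(istio_requests_total[5m]))'

Computing per-model latency the same way would require adding the model dimension to the REQUEST_DURATION metric as well.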
Conclusion
Building on Traffic routing: Use ASM to efficiently manage LLM traffic, this topic describes how to make both detailed and overarching observations of LLM traffic. By making minor adjustments to the cluster configuration, you can unlock the multi-dimensional observability features inherent to the service mesh. ASM will continue to develop these observability functions, offering increasingly comprehensive and adaptable solutions.