Applications that call Large Language Model (LLM) APIs typically handle provider-specific protocols, credential management, and TLS configuration in application code. When you switch providers or route different user tiers to different models, these changes ripple through your codebase. Alibaba Cloud Service Mesh (ASM) moves this complexity into the mesh infrastructure: configure two Kubernetes custom resources -- LLMProvider and LLMRoute -- and the ASM sidecar handles protocol conversion, API key injection, TLS upgrade, and traffic routing automatically. Your application sends plain HTTP requests with no provider-specific logic.
With LLM traffic management in ASM, you can implement header-based routing, weighted traffic splitting, and LLM-specific observability:
Route by request header: Direct requests to different models based on headers -- for example, route subscribers to a premium model while other users use a standard model.
Split traffic by weight: Distribute requests across multiple LLM providers for gradual migration or A/B comparison.
Monitor LLM traffic: Track LLM-specific metrics through ASM's built-in observability dashboards.
How it works
ASM introduces two Custom Resource Definitions (CRDs) that work together to manage LLM traffic:
| Resource | Role | Applied to |
|---|---|---|
| LLMProvider | Defines a backend LLM service: host, API path, model, and API key | ASM control plane (`--kubeconfig=${PATH_TO_ASM_KUBECONFIG}`) |
| LLMRoute | Controls request distribution across providers, with support for header-based matching and weighted routing | ASM control plane (`--kubeconfig=${PATH_TO_ASM_KUBECONFIG}`) |
Request flow: When a pod sends a plain HTTP request to an LLM provider's hostname, the ASM sidecar intercepts the request and automatically:
Converts it to the OpenAI-compatible chat completion format.
Injects the API key from the LLMProvider configuration.
Upgrades the connection from HTTP to HTTPS.
Forwards the request to the provider's endpoint.
This means your application sends a minimal HTTP POST -- no API path, no credentials, no TLS setup. The sidecar fills in everything from the LLMProvider spec.
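Conceptually, the sidecar's transformation can be sketched in a few lines of Python. This is an illustration only, not ASM code: the `enrich_request` helper is hypothetical, the field names simply mirror the LLMProvider spec shown in this guide, and the exact authorization header scheme is an assumption.

```python
# Conceptual sketch of the request enrichment the ASM sidecar performs.
# Illustration only: not ASM's implementation; shapes mirror the
# LLMProvider spec used in this guide.

def enrich_request(minimal_body: dict, provider: dict) -> dict:
    """Expand a minimal app request into a full provider request."""
    cfg = provider["configs"]["defaultConfig"]["openAIConfig"]
    return {
        # HTTP is upgraded to HTTPS and the provider path is appended.
        "url": f"https://{provider['host']}{provider['path']}",
        # The API key from the LLMProvider spec is injected (the exact
        # header scheme is assumed here).
        "headers": {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {cfg['apiKey']}",
        },
        # The configured model is merged into the request body.
        "body": {"model": cfg["model"], **minimal_body},
    }

provider = {
    "host": "dashscope.aliyuncs.com",
    "path": "/compatible-mode/v1/chat/completions",
    "configs": {"defaultConfig": {"openAIConfig": {
        "model": "qwen1.5-72b-chat",
        "apiKey": "sk-example",
    }}},
}
req = enrich_request(
    {"messages": [{"role": "user", "content": "Please introduce yourself."}]},
    provider,
)
print(req["url"])
```

The application only supplies the `messages` body; everything else comes from the provider configuration, which is why switching providers requires no application change.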
Prerequisites
Before you begin, make sure you have:
An ASM instance (v1.21.6.88 or later) with a cluster added
Sidecar injection policies configured for the target namespace
An Alibaba Cloud Model Studio account with a valid API key -- see Obtain an API key
(Scenario 2 only) A Moonshot AI account with a valid API key -- see the Moonshot AI Open Platform
Set up the test environment
Deploy a test client and configure a basic LLM provider before running either scenario.
Step 1: Deploy the sleep test application
The sleep pod serves as the client for sending test requests to LLM providers through the mesh.
Save the following content as `sleep.yaml`:
Apply the manifest to your ACK cluster:
```shell
kubectl apply -f sleep.yaml
```
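The content of `sleep.yaml` is not reproduced above. A minimal version, adapted from the widely used Istio `sleep` sample (an assumption: substitute the manifest from your ASM documentation if it differs), looks like:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: curlimages/curl
        command: ["/bin/sleep", "infinity"]
```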
Step 2: Configure the Model Studio provider
Create an LLMProvider resource that tells ASM how to reach Alibaba Cloud Model Studio (DashScope).
Save the following content as `LLMProvider.yaml`. Replace `<your-dashscope-api-key>` with your Model Studio API key.

```yaml
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat # Qwen open-source series model
        apiKey: <your-dashscope-api-key>
```

For a full list of available Qwen open-source models, see Text generation - Qwen open-source models.
Apply the manifest to your ASM instance:
```shell
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
```

Verify the setup by sending a test request from the `sleep` pod:

```shell
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'
```

A successful response looks like:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud..."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "model": "qwen1.5-72b-chat",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 130,
    "total_tokens": 142
  }
}
```

The request targets plain `http://dashscope.aliyuncs.com` without specifying a path, model, or API key. The ASM sidecar fills in these fields from the `LLMProvider` configuration, upgrades to HTTPS, and forwards the request to DashScope. Because Model Studio is compatible with the OpenAI protocol, the response follows the standard chat completion format.
Scenario 1: Route requests to different models by header
Route subscriber-tier users to the qwen-turbo model while all other users use the default qwen1.5-72b-chat model. The routing decision is based on the user-type request header.
Create the routing rule
Save the following content as `LLMRoute.yaml`:

```yaml
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: dashscope-route
spec:
  host: dashscope.aliyuncs.com # Must match the LLMProvider host
  rules:
  - name: vip-route
    matches:
    - headers:
        user-type:
          exact: subscriber # Match requests with this header value
    backendRefs:
    - providerHost: dashscope.aliyuncs.com
  - backendRefs:
    - providerHost: dashscope.aliyuncs.com
```

The first rule matches requests that include a `user-type: subscriber` header and routes them through the `vip-route` routing rule. The second rule, which has no match condition, acts as the default catch-all.

Apply the routing rule to your ASM instance:
```shell
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
```
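The first-match semantics of the rule list can be sketched as follows. This is a hypothetical helper for illustration, not ASM's implementation; the rule shapes mirror the LLMRoute spec fields (`matches`, `headers`, `exact`).

```python
# Illustrative first-match rule selection with a catch-all default.
# Not ASM code; data shapes mirror the LLMRoute spec in this guide.

def select_rule(rules: list, headers: dict):
    """Return the name of the first rule whose header matches succeed."""
    for rule in rules:
        matches = rule.get("matches")
        if not matches:
            # A rule with no match condition matches everything.
            return rule.get("name", "default")
        for match in matches:
            wanted = match.get("headers", {})
            if all(headers.get(k) == v["exact"] for k, v in wanted.items()):
                return rule.get("name", "default")
    return None

rules = [
    {"name": "vip-route",
     "matches": [{"headers": {"user-type": {"exact": "subscriber"}}}]},
    {},  # catch-all rule with no match condition
]

print(select_rule(rules, {"user-type": "subscriber"}))  # vip-route
print(select_rule(rules, {}))                           # default
```

Because rules are evaluated in order, the catch-all must come last; placing it first would shadow `vip-route`.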
Assign a model to each route
Update the LLMProvider to specify different models for the default route and the vip-route:
Update `LLMProvider.yaml` with the following content:

```yaml
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat # Default: open-source model
        apiKey: <your-dashscope-api-key>
    routeSpecificConfigs:
      vip-route: # Override for subscriber requests
        openAIConfig:
          model: qwen-turbo # Subscribers use qwen-turbo
          apiKey: <your-dashscope-api-key>
```

Apply the update:
```shell
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
```
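The override behavior of `routeSpecificConfigs` can be sketched as follows. The `resolve_model` helper is hypothetical and for illustration only; the config shapes mirror the LLMProvider spec.

```python
# Illustrative per-route config resolution: routeSpecificConfigs override
# defaultConfig for requests matched to a named route.
# Not ASM code; shapes mirror the LLMProvider spec in this guide.

def resolve_model(configs, route_name=None):
    """Return the model for a matched route, falling back to the default."""
    overrides = configs.get("routeSpecificConfigs", {})
    if route_name in overrides:
        return overrides[route_name]["openAIConfig"]["model"]
    return configs["defaultConfig"]["openAIConfig"]["model"]

configs = {
    "defaultConfig": {"openAIConfig": {"model": "qwen1.5-72b-chat"}},
    "routeSpecificConfigs": {
        "vip-route": {"openAIConfig": {"model": "qwen-turbo"}},
    },
}

print(resolve_model(configs, "vip-route"))  # qwen-turbo
print(resolve_model(configs))               # qwen1.5-72b-chat
```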
Test the routing
Send two requests -- one without the header (default route) and one with the user-type: subscriber header (VIP route):
```shell
# Default route: uses qwen1.5-72b-chat
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'

# Subscriber route: uses qwen-turbo
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header 'user-type: subscriber' \
  --data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'
```

Expected output:

```json
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1c33b950-3220-9bfe-9066-xxxxxxxxxxxx"}
{"choices":[{"message":{"role":"assistant","content":"Hello, I'm Qwen, a large language model from Alibaba Cloud. As an AI assistant, my goal is to help users get accurate and useful information, and to solve their problems and confusions. I can provide knowledge in various fields, engage in conversation, and even create text. Please note that all the content I provide is based on the data I was trained on and may not include the latest events or personal information. If you have any questions, feel free to ask me at any time!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":11,"completion_tokens":85,"total_tokens":96},"created":1720683416,"system_fingerprint":null,"model":"qwen-turbo","id":"chatcmpl-9cbc7c56-06e9-9639-a50d-xxxxxxxxxxxx"}
```

Check the `model` field in each response. The default request returns `"model": "qwen1.5-72b-chat"`, while the subscriber request returns `"model": "qwen-turbo"`.
Scenario 2: Split traffic across providers with weighted routing
Split traffic 50/50 between Alibaba Cloud Model Studio (DashScope) and Moonshot AI. This pattern is useful for gradually migrating between providers or comparing model performance side by side.
Step 1: Configure the Moonshot provider
Save the following content as `LLMProvider-moonshot.yaml`. Replace `<your-moonshot-api-key>` with your Moonshot AI API key.

```yaml
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: moonshot
spec:
  host: api.moonshot.cn # Must be unique across all LLMProviders
  path: /v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: moonshot-v1-8k
        stream: false
        apiKey: <your-moonshot-api-key>
```

Apply the manifest:
```shell
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider-moonshot.yaml
```
Step 2: Create a virtual LLM service
Create a Kubernetes Service as a single entry point for LLM requests. This service has no backing pods -- the ASM sidecar routes all requests to the LLM providers defined in the LLMRoute.
Save the following content as `demo-llm-server.yaml`:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-llm-server
  namespace: default
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: none
  type: ClusterIP
```

Apply the manifest:
```shell
kubectl apply -f demo-llm-server.yaml
```
Step 3: Configure weighted routing
Create an LLMRoute that distributes traffic evenly between DashScope and Moonshot:
Save the following content as `LLMRoute.yaml`:

```yaml
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: demo-llm-server
  namespace: default
spec:
  host: demo-llm-server
  rules:
  - name: migrate-rule
    backendRefs:
    - providerHost: dashscope.aliyuncs.com
      weight: 50
    - providerHost: api.moonshot.cn
      weight: 50
```

Adjust the `weight` values to control the traffic split. The weights are relative: `50/50` splits traffic evenly, while `80/20` sends 80% of requests to DashScope and 20% to Moonshot.

Apply the routing rule:
```shell
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
```
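The relative-weight semantics can be sketched as a simple cumulative-weight draw. This is illustrative only; `pick_backend` is a hypothetical helper, not ASM's load-balancing algorithm.

```python
import random

# Illustrative relative-weight backend selection.
# Not ASM's algorithm; it only demonstrates that weights are
# normalized against their sum, so 50/50 and 1/1 behave the same.

def pick_backend(backends, rng=random):
    total = sum(b["weight"] for b in backends)
    r = rng.uniform(0, total)
    upto = 0
    for b in backends:
        upto += b["weight"]
        if r <= upto:
            return b["providerHost"]
    return backends[-1]["providerHost"]

backends = [
    {"providerHost": "dashscope.aliyuncs.com", "weight": 50},
    {"providerHost": "api.moonshot.cn", "weight": 50},
]

random.seed(0)
counts = {"dashscope.aliyuncs.com": 0, "api.moonshot.cn": 0}
for _ in range(10000):
    counts[pick_backend(backends)] += 1
# With 50/50 weights, each backend receives roughly half of the picks.
```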
Test the weighted routing
Send several requests to the virtual demo-llm-server service and observe the responses:
```shell
kubectl exec deployment/sleep -it -- curl --location 'http://demo-llm-server' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'
```

Run the command multiple times. Some responses come from Moonshot (identified by `"model": "moonshot-v1-8k"` and the Kimi assistant name), while others come from DashScope (identified by `"model": "qwen1.5-72b-chat"` and the Qwen assistant name).
Expected output:

```json
{"id":"cmpl-cafd47b181204cdbb4a4xxxxxxxxxxxx","object":"chat.completion","created":1720687132,"model":"moonshot-v1-8k","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am an AI language model named Kimi. My main function is to help people generate human-like text. I can write articles, answer questions, provide advice, and more. I am trained on a massive amount of text data, so I can generate a wide variety of text. My goal is to help people communicate more effectively and solve problems."},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"completion_tokens":59,"total_tokens":70}}
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720687164,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-2443772b-4e41-9ea8-9bed-xxxxxxxxxxxx"}
```

The responses come from both Moonshot and Alibaba Cloud Model Studio; over a larger number of requests, the observed distribution approaches the configured 50/50 split.
Resource and placeholder reference
Kubernetes resources
| YAML file | Kind | Name | Applied to |
|---|---|---|---|
| sleep.yaml | ServiceAccount, Service, Deployment | sleep | ACK cluster (`kubectl apply`) |
| LLMProvider.yaml | LLMProvider | dashscope-qwen | ASM instance (`--kubeconfig`) |
| LLMProvider-moonshot.yaml | LLMProvider | moonshot | ASM instance (`--kubeconfig`) |
| LLMRoute.yaml | LLMRoute | dashscope-route / demo-llm-server | ASM instance (`--kubeconfig`) |
| demo-llm-server.yaml | Service | demo-llm-server | ACK cluster (`kubectl apply`) |
Placeholders
Replace the following placeholders with your actual values before applying the YAML manifests:
| Placeholder | Description | Example |
|---|---|---|
| `<your-dashscope-api-key>` | API key for Alibaba Cloud Model Studio | sk-xxxxxxxxxxxxx |
| `<your-moonshot-api-key>` | API key for Moonshot AI | sk-xxxxxxxxxxxxx |
| `${PATH_TO_ASM_KUBECONFIG}` | Path to the ASM instance kubeconfig file | ~/.kube/asm-config |