
Alibaba Cloud Service Mesh:Manage LLM traffic with ASM

Last Updated:Mar 11, 2026

Applications that call Large Language Model (LLM) APIs typically handle provider-specific protocols, credential management, and TLS configuration in application code. When you switch providers or route different user tiers to different models, these changes ripple through your codebase. Alibaba Cloud Service Mesh (ASM) moves this complexity into the mesh infrastructure: configure two Kubernetes custom resources -- LLMProvider and LLMRoute -- and the ASM sidecar handles protocol conversion, API key injection, TLS upgrade, and traffic routing automatically. Your application sends plain HTTP requests with no provider-specific logic.

With LLM traffic management in ASM, you can implement header-based canary routing, weighted traffic splitting, and LLM-specific observability:

  • Route by request header: Direct requests to different models based on headers -- for example, route subscribers to a premium model while other users use a standard model.

  • Split traffic by weight: Distribute requests across multiple LLM providers for gradual migration or A/B comparison.

  • Monitor LLM traffic: Track LLM-specific metrics through ASM's built-in observability dashboards.

How it works

ASM introduces two Custom Resource Definitions (CRDs) that work together to manage LLM traffic:

| Resource | Role | Applied to |
| --- | --- | --- |
| LLMProvider | Defines a backend LLM service: host, API path, model, and API key | ASM control plane (--kubeconfig=${PATH_TO_ASM_KUBECONFIG}) |
| LLMRoute | Controls request distribution across providers, with support for header-based matching and weighted routing | ASM control plane (--kubeconfig=${PATH_TO_ASM_KUBECONFIG}) |

Request flow: When a pod sends a plain HTTP request to an LLM provider's hostname, the ASM sidecar intercepts the request and automatically:

  1. Converts it to the OpenAI-compatible chat completion format.

  2. Injects the API key from the LLMProvider configuration.

  3. Upgrades the connection from HTTP to HTTPS.

  4. Forwards the request to the provider's endpoint.

This means your application sends a minimal HTTP POST -- no API path, no credentials, no TLS setup. The sidecar fills in everything from the LLMProvider spec.
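The enrichment steps above can be pictured with a short sketch. The following Python is illustrative only, not ASM's implementation: it simulates how the sidecar completes a bare client request from an LLMProvider spec. The Authorization header name is an assumption based on the OpenAI-compatible protocol, and the spec fields mirror the LLMProvider resource used later in this topic.

```python
# Illustrative sketch only: NOT ASM source code. It simulates the request
# enrichment the sidecar performs, using fields from an LLMProvider spec.
import json

def enrich_request(bare_body, provider):
    """Complete a bare client request using an LLMProvider's defaultConfig."""
    cfg = provider["configs"]["defaultConfig"]["openAIConfig"]
    body = dict(bare_body)
    body.setdefault("model", cfg["model"])  # inject the configured model
    return {
        "scheme": "https",                  # HTTP-to-HTTPS upgrade
        "host": provider["host"],
        "path": provider["path"],           # API path from the spec
        "headers": {
            # Header name is an assumption based on the OpenAI-compatible protocol.
            "Authorization": "Bearer " + cfg["apiKey"],
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }

provider = {
    "host": "dashscope.aliyuncs.com",
    "path": "/compatible-mode/v1/chat/completions",
    "configs": {"defaultConfig": {"openAIConfig": {
        "model": "qwen1.5-72b-chat",
        "apiKey": "<your-dashscope-api-key>",
    }}},
}
bare = {"messages": [{"role": "user", "content": "Please introduce yourself."}]}
print(enrich_request(bare, provider)["path"])  # /compatible-mode/v1/chat/completions
```

The client supplies only the messages; everything else is filled in from mesh configuration, which is why switching providers requires no application change.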

Prerequisites

Before you begin, make sure you have:

  • An ASM instance with an attached Container Service for Kubernetes (ACK) cluster, and kubeconfig access to both. The ASM kubeconfig path is referenced below as ${PATH_TO_ASM_KUBECONFIG}.

  • Automatic sidecar injection enabled for the namespace where you deploy the test client.

  • An Alibaba Cloud Model Studio (DashScope) API key. Scenario 2 additionally requires a Moonshot AI API key.

Set up the test environment

Deploy a test client and configure a basic LLM provider before running either scenario.

Step 1: Deploy the sleep test application

The sleep pod serves as the client for sending test requests to LLM providers through the mesh.

  1. Save the following content as sleep.yaml:


    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
    ---
  2. Apply the manifest to your ACK cluster:

    kubectl apply -f sleep.yaml

Step 2: Configure the Model Studio provider

Create an LLMProvider resource that tells ASM how to reach Alibaba Cloud Model Studio (DashScope).

  1. Save the following content as LLMProvider.yaml. Replace <your-dashscope-api-key> with your Model Studio API key.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Qwen open-source series model
            apiKey: <your-dashscope-api-key>

    For a full list of available Qwen open-source models, see Text generation - Qwen open-source models.

  2. Apply the manifest to your ASM instance:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
  3. Verify the setup by sending a test request from the sleep pod:

    kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
      --header 'Content-Type: application/json' \
      --data '{
        "messages": [
          {"role": "user", "content": "Please introduce yourself."}
        ]
      }'

    A successful response looks like:

    {
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud..."
          },
          "finish_reason": "stop",
          "index": 0
        }
      ],
      "model": "qwen1.5-72b-chat",
      "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 130,
        "total_tokens": 142
      }
    }

    The request targets plain http://dashscope.aliyuncs.com without specifying a path, model, or API key. The ASM sidecar fills in these fields from the LLMProvider configuration, upgrades to HTTPS, and forwards the request to DashScope. Because Model Studio is compatible with the OpenAI protocol, the response follows the standard chat completion format.

Scenario 1: Route requests to different models by header

Route subscriber-tier users to the qwen-turbo model while all other users use the default qwen1.5-72b-chat model. The routing decision is based on the user-type request header.

Create the routing rule

  1. Save the following content as LLMRoute.yaml:

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMRoute
    metadata:
      name: dashscope-route
    spec:
      host: dashscope.aliyuncs.com  # Must match the LLMProvider host
      rules:
      - name: vip-route
        matches:
        - headers:
            user-type:
              exact: subscriber  # Match requests with this header value
        backendRefs:
        - providerHost: dashscope.aliyuncs.com
      - backendRefs:
        - providerHost: dashscope.aliyuncs.com

    The first rule matches requests that carry a user-type: subscriber header and handles them under the rule name vip-route; this rule name is what the LLMProvider uses to apply a route-specific model in the next section. The second rule has no match condition and acts as the default catch-all.

  2. Apply the routing rule to your ASM instance:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
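The matching behavior of this LLMRoute can be sketched in a few lines. This is illustrative Python that mirrors the rule semantics described above (the first matching rule wins, and a rule without a match condition is the catch-all); it is not ASM's actual implementation.

```python
def select_rule(headers, rules):
    """First-match routing: a rule without 'matches' is a catch-all."""
    for rule in rules:
        matches = rule.get("matches")
        if matches is None:  # no match condition: catch-all
            return rule.get("name", "default")
        for match in matches:
            if all(headers.get(h) == cond["exact"]
                   for h, cond in match["headers"].items()):
                return rule.get("name", "default")
    return "default"

# Rules from the LLMRoute above.
rules = [
    {"name": "vip-route",
     "matches": [{"headers": {"user-type": {"exact": "subscriber"}}}]},
    {},  # catch-all, no match condition
]
print(select_rule({"user-type": "subscriber"}, rules))  # vip-route
print(select_rule({}, rules))                           # default
```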

Assign a model to each route

Update the LLMProvider to specify different models for the default route and the vip-route:

  1. Update LLMProvider.yaml with the following content:

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Default: open-source model
            apiKey: <your-dashscope-api-key>
        routeSpecificConfigs:
          vip-route:                  # Override for subscriber requests
            openAIConfig:
              model: qwen-turbo       # Subscribers use qwen-turbo
              apiKey: <your-dashscope-api-key>
  2. Apply the update:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml

Test the routing

Send two requests -- one without the header (default route) and one with the user-type: subscriber header (VIP route):

# Default route: uses qwen1.5-72b-chat
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {"role": "user", "content": "Please introduce yourself."}
    ]
  }'

# Subscriber route: uses qwen-turbo
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header 'user-type: subscriber' \
  --data '{
    "messages": [
      {"role": "user", "content": "Please introduce yourself."}
    ]
  }'

Expected output:

{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1c33b950-3220-9bfe-9066-xxxxxxxxxxxx"}
{"choices":[{"message":{"role":"assistant","content":"Hello, I'm Qwen, a large language model from Alibaba Cloud. As an AI assistant, my goal is to help users get accurate and useful information, and to solve their problems and confusions. I can provide knowledge in various fields, engage in conversation, and even create text. Please note that all the content I provide is based on the data I was trained on and may not include the latest events or personal information. If you have any questions, feel free to ask me at any time!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":11,"completion_tokens":85,"total_tokens":96},"created":1720683416,"system_fingerprint":null,"model":"qwen-turbo","id":"chatcmpl-9cbc7c56-06e9-9639-a50d-xxxxxxxxxxxx"}

Check the model field in each response. The default request returns "model": "qwen1.5-72b-chat", while the subscriber request returns "model": "qwen-turbo".
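To check the routed model programmatically instead of by eye, you can extract the model field from each JSON response. A minimal sketch; the field name follows the OpenAI-compatible responses shown above:

```python
import json

def routed_model(response_body):
    """Return the model that served an OpenAI-compatible chat completion."""
    return json.loads(response_body)["model"]

# Trimmed sample responses in the shape shown above.
default_resp = '{"choices": [], "model": "qwen1.5-72b-chat"}'
vip_resp = '{"choices": [], "model": "qwen-turbo"}'

print(routed_model(default_resp))  # qwen1.5-72b-chat
print(routed_model(vip_resp))      # qwen-turbo
```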

Scenario 2: Split traffic across providers with weighted routing

Split traffic 50/50 between Alibaba Cloud Model Studio (DashScope) and Moonshot AI. This pattern is useful for gradually migrating between providers or comparing model performance side by side.

Step 1: Configure the Moonshot provider

  1. Save the following content as LLMProvider-moonshot.yaml. Replace <your-moonshot-api-key> with your Moonshot AI API key.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:
      name: moonshot
    spec:
      host: api.moonshot.cn  # Must be unique across all LLMProviders
      path: /v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: moonshot-v1-8k
            stream: false
            apiKey: <your-moonshot-api-key>
  2. Apply the manifest:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider-moonshot.yaml

Step 2: Create a virtual LLM service

Create a Kubernetes Service as a single entry point for LLM requests. This service has no backing pods -- the ASM sidecar routes all requests to the LLM providers defined in the LLMRoute.

  1. Save the following content as demo-llm-server.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: demo-llm-server
      namespace: default
    spec:
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 80
      selector:
        app: none
      type: ClusterIP
  2. Apply the manifest:

    kubectl apply -f demo-llm-server.yaml

Step 3: Configure weighted routing

Create an LLMRoute that distributes traffic evenly between DashScope and Moonshot:

  1. Save the following content as LLMRoute.yaml:

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMRoute
    metadata:
      name: demo-llm-server
      namespace: default
    spec:
      host: demo-llm-server
      rules:
      - name: migrate-rule
        backendRefs:
        - providerHost: dashscope.aliyuncs.com
          weight: 50
        - providerHost: api.moonshot.cn
          weight: 50

    Adjust the weight values to control the traffic split. The weights are relative -- 50/50 splits evenly, while 80/20 sends 80% to DashScope and 20% to Moonshot.

  2. Apply the routing rule:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
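The relative-weight behavior can be sketched with a weighted random choice. This is illustrative only; the actual selection is performed by the sidecar proxy, not by application code.

```python
import random
from collections import Counter

def pick_backend(backends):
    """Select a providerHost with probability proportional to its weight."""
    hosts = [b["providerHost"] for b in backends]
    weights = [b["weight"] for b in backends]
    return random.choices(hosts, weights=weights, k=1)[0]

# backendRefs from the LLMRoute above.
backends = [
    {"providerHost": "dashscope.aliyuncs.com", "weight": 50},
    {"providerHost": "api.moonshot.cn", "weight": 50},
]
tally = Counter(pick_backend(backends) for _ in range(10_000))
print(tally)  # counts should be close to 5000 each
```

Changing the weights to 80/20 shifts the probabilities accordingly, which is how a gradual migration is dialed up over time.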

Test the weighted routing

Send several requests to the virtual demo-llm-server service and observe the responses:

kubectl exec deployment/sleep -it -- curl --location 'http://demo-llm-server' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {"role": "user", "content": "Please introduce yourself."}
    ]
  }'

Run the command multiple times. Some responses come from Moonshot (identified by "model": "moonshot-v1-8k" and the Kimi assistant name), while others come from DashScope (identified by "model": "qwen1.5-72b-chat" and the Qwen assistant name).

Expected output:

{"id":"cmpl-cafd47b181204cdbb4a4xxxxxxxxxxxx","object":"chat.completion","created":1720687132,"model":"moonshot-v1-8k","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am an AI language model named Kimi. My main function is to help people generate human-like text. I can write articles, answer questions, provide advice, and more. I am trained on a massive amount of text data, so I can generate a wide variety of text. My goal is to help people communicate more effectively and solve problems."},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"completion_tokens":59,"total_tokens":70}}

{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720687164,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-2443772b-4e41-9ea8-9bed-xxxxxxxxxxxx"}

The output shows that requests are distributed between Moonshot and Alibaba Cloud Model Studio at roughly equal rates, matching the 50/50 weights.

Resource and placeholder reference

Kubernetes resources

| YAML file | Kind | Name | Applied to |
| --- | --- | --- | --- |
| sleep.yaml | ServiceAccount, Service, Deployment | sleep | ACK cluster (kubectl apply) |
| LLMProvider.yaml | LLMProvider | dashscope-qwen | ASM instance (--kubeconfig) |
| LLMProvider-moonshot.yaml | LLMProvider | moonshot | ASM instance (--kubeconfig) |
| LLMRoute.yaml | LLMRoute | dashscope-route / demo-llm-server | ASM instance (--kubeconfig) |
| demo-llm-server.yaml | Service | demo-llm-server | ACK cluster (kubectl apply) |

Placeholders

Replace the following placeholders with your actual values before applying the YAML manifests:

| Placeholder | Description | Example |
| --- | --- | --- |
| <your-dashscope-api-key> | API key for Alibaba Cloud Model Studio | sk-xxxxxxxxxxxxx |
| <your-moonshot-api-key> | API key for Moonshot AI | sk-xxxxxxxxxxxxx |
| ${PATH_TO_ASM_KUBECONFIG} | Path to the ASM instance kubeconfig file | ~/.kube/asm-config |