
Alibaba Cloud Service Mesh:Manage LLM traffic with ASM

Last Updated:Mar 11, 2026

Applications that call Large Language Model (LLM) APIs typically handle provider-specific protocols, credential management, and TLS configuration in application code. When you switch providers or route different user tiers to different models, these changes ripple through your codebase. Alibaba Cloud Service Mesh (ASM) moves this complexity into the mesh infrastructure: configure two Kubernetes custom resources -- LLMProvider and LLMRoute -- and the ASM sidecar handles protocol conversion, API key injection, TLS upgrade, and traffic routing automatically. Your application sends plain HTTP requests with no provider-specific logic.

With LLM traffic management in ASM, you can implement header-based canary routing, weighted traffic splitting, and LLM-specific observability:

  • Route by request header: Direct requests to different models based on headers -- for example, route subscribers to a premium model while other users use a standard model.

  • Split traffic by weight: Distribute requests across multiple LLM providers for gradual migration or A/B comparison.

  • Monitor LLM traffic: Track LLM-specific metrics through ASM's built-in observability dashboards.

How it works

ASM introduces two Custom Resource Definitions (CRDs) that work together to manage LLM traffic:

| Resource | Role | Applied to |
| --- | --- | --- |
| LLMProvider | Defines a backend LLM service: host, API path, model, and API key | ASM control plane (--kubeconfig=${PATH_TO_ASM_KUBECONFIG}) |
| LLMRoute | Controls request distribution across providers, with support for header-based matching and weighted routing | ASM control plane (--kubeconfig=${PATH_TO_ASM_KUBECONFIG}) |

Request flow: When a pod sends a plain HTTP request to an LLM provider's hostname, the ASM sidecar intercepts the request and automatically:

  1. Converts it to the OpenAI-compatible chat completion format.

  2. Injects the API key from the LLMProvider configuration.

  3. Upgrades the connection from HTTP to HTTPS.

  4. Forwards the request to the provider's endpoint.

This means your application sends a minimal HTTP POST -- no API path, no credentials, no TLS setup. The sidecar fills in everything from the LLMProvider spec.
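The enrichment steps above can be pictured with a short sketch. The following Python is illustrative only, not ASM's implementation: it simulates how the sidecar completes a bare client request from an LLMProvider spec. The Authorization header name is an assumption based on the OpenAI-compatible protocol, and the spec fields mirror the LLMProvider resource used later in this topic.

```python
# Illustrative sketch only: NOT ASM source code. It simulates the request
# enrichment the sidecar performs, using fields from an LLMProvider spec.
import json

def enrich_request(bare_body, provider):
    """Complete a bare client request using an LLMProvider's defaultConfig."""
    cfg = provider["configs"]["defaultConfig"]["openAIConfig"]
    body = dict(bare_body)
    body.setdefault("model", cfg["model"])  # inject the configured model
    return {
        "scheme": "https",                  # HTTP-to-HTTPS upgrade
        "host": provider["host"],
        "path": provider["path"],           # API path from the spec
        "headers": {
            # Header name is an assumption based on the OpenAI-compatible protocol.
            "Authorization": "Bearer " + cfg["apiKey"],
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }

provider = {
    "host": "dashscope.aliyuncs.com",
    "path": "/compatible-mode/v1/chat/completions",
    "configs": {"defaultConfig": {"openAIConfig": {
        "model": "qwen1.5-72b-chat",
        "apiKey": "<your-dashscope-api-key>",
    }}},
}
bare = {"messages": [{"role": "user", "content": "Please introduce yourself."}]}
print(enrich_request(bare, provider)["path"])  # /compatible-mode/v1/chat/completions
```

The client supplies only the messages; everything else is filled in from mesh configuration, which is why switching providers requires no application change.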

Prerequisites

Before you begin, make sure you have:

  • An ASM instance with an attached Container Service for Kubernetes (ACK) cluster, and kubeconfig access to both. The ASM kubeconfig path is referenced below as ${PATH_TO_ASM_KUBECONFIG}.

  • Automatic sidecar injection enabled for the namespace where you deploy the test client.

  • An Alibaba Cloud Model Studio (DashScope) API key. Scenario 2 additionally requires a Moonshot AI API key.

Set up the test environment

Deploy a test client and configure a basic LLM provider before running either scenario.

Step 1: Deploy the sleep test application

The sleep pod serves as the client for sending test requests to LLM providers through the mesh.

  1. Save the following content as sleep.yaml:


    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
    ---
  2. Apply the manifest to your ACK cluster:

    kubectl apply -f sleep.yaml

Step 2: Configure the Model Studio provider

Create an LLMProvider resource that tells ASM how to reach Alibaba Cloud Model Studio (DashScope).

  1. Save the following content as LLMProvider.yaml. Replace <your-dashscope-api-key> with your Model Studio API key.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Qwen open-source series model
            apiKey: <your-dashscope-api-key>

    For a full list of available Qwen open-source models, see Text generation - Qwen open-source models.

  2. Apply the manifest to your ASM instance:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
  3. Verify the setup by sending a test request from the sleep pod:

    kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
      --header 'Content-Type: application/json' \
      --data '{
        "messages": [
          {"role": "user", "content": "Please introduce yourself."}
        ]
      }'

    A successful response looks like:

    {
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud..."
          },
          "finish_reason": "stop",
          "index": 0
        }
      ],
      "model": "qwen1.5-72b-chat",
      "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 130,
        "total_tokens": 142
      }
    }

    The request targets plain http://dashscope.aliyuncs.com without specifying a path, model, or API key. The ASM sidecar fills in these fields from the LLMProvider configuration, upgrades to HTTPS, and forwards the request to DashScope. Because Model Studio is compatible with the OpenAI protocol, the response follows the standard chat completion format.

Scenario 1: Route requests to different models by header

Route subscriber-tier users to the qwen-turbo model while all other users use the default qwen1.5-72b-chat model. The routing decision is based on the user-type request header.

Create the routing rule

  1. Save the following content as LLMRoute.yaml:

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMRoute
    metadata:
      name: dashscope-route
    spec:
      host: dashscope.aliyuncs.com  # Must match the LLMProvider host
      rules:
      - name: vip-route
        matches:
        - headers:
            user-type:
              exact: subscriber  # Match requests with this header value
        backendRefs:
        - providerHost: dashscope.aliyuncs.com
      - backendRefs:
        - providerHost: dashscope.aliyuncs.com

    The first rule matches requests that carry a user-type: subscriber header and handles them under the rule name vip-route; this rule name is what the LLMProvider uses to apply a route-specific model in the next section. The second rule has no match condition and acts as the default catch-all.

  2. Apply the routing rule to your ASM instance:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
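The matching behavior of this LLMRoute can be sketched in a few lines. This is illustrative Python that mirrors the rule semantics described above (the first matching rule wins, and a rule without a match condition is the catch-all); it is not ASM's actual implementation.

```python
def select_rule(headers, rules):
    """First-match routing: a rule without 'matches' is a catch-all."""
    for rule in rules:
        matches = rule.get("matches")
        if matches is None:  # no match condition: catch-all
            return rule.get("name", "default")
        for match in matches:
            if all(headers.get(h) == cond["exact"]
                   for h, cond in match["headers"].items()):
                return rule.get("name", "default")
    return "default"

# Rules from the LLMRoute above.
rules = [
    {"name": "vip-route",
     "matches": [{"headers": {"user-type": {"exact": "subscriber"}}}]},
    {},  # catch-all, no match condition
]
print(select_rule({"user-type": "subscriber"}, rules))  # vip-route
print(select_rule({}, rules))                           # default
```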

Assign a model to each route

Update the LLMProvider to specify different models for the default route and the vip-route:

  1. Update LLMProvider.yaml with the following content:

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Default: open-source model
            apiKey: <your-dashscope-api-key>
        routeSpecificConfigs:
          vip-route:                  # Override for subscriber requests
            openAIConfig:
              model: qwen-turbo       # Subscribers use qwen-turbo
              apiKey: <your-dashscope-api-key>
  2. Apply the update:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml

Test the routing

Send two requests -- one without the header (default route) and one with the user-type: subscriber header (VIP route):

# Default route: uses qwen1.5-72b-chat
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {"role": "user", "content": "Please introduce yourself."}
    ]
  }'

# Subscriber route: uses qwen-turbo
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header 'user-type: subscriber' \
  --data '{
    "messages": [
      {"role": "user", "content": "Please introduce yourself."}
    ]
  }'

Expected output:

{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1c33b950-3220-9bfe-9066-xxxxxxxxxxxx"}
{"choices":[{"message":{"role":"assistant","content":"Hello, I'm Qwen, a large language model from Alibaba Cloud. As an AI assistant, my goal is to help users get accurate and useful information, and to solve their problems and confusions. I can provide knowledge in various fields, engage in conversation, and even create text. Please note that all the content I provide is based on the data I was trained on and may not include the latest events or personal information. If you have any questions, feel free to ask me at any time!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":11,"completion_tokens":85,"total_tokens":96},"created":1720683416,"system_fingerprint":null,"model":"qwen-turbo","id":"chatcmpl-9cbc7c56-06e9-9639-a50d-xxxxxxxxxxxx"}

Check the model field in each response. The default request returns "model": "qwen1.5-72b-chat", while the subscriber request returns "model": "qwen-turbo".
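To check the routed model programmatically instead of by eye, you can extract the model field from each JSON response. A minimal sketch; the field name follows the OpenAI-compatible responses shown above:

```python
import json

def routed_model(response_body):
    """Return the model that served an OpenAI-compatible chat completion."""
    return json.loads(response_body)["model"]

# Trimmed sample responses in the shape shown above.
default_resp = '{"choices": [], "model": "qwen1.5-72b-chat"}'
vip_resp = '{"choices": [], "model": "qwen-turbo"}'

print(routed_model(default_resp))  # qwen1.5-72b-chat
print(routed_model(vip_resp))      # qwen-turbo
```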

Scenario 2: Split traffic across providers with weighted routing

Split traffic 50/50 between Alibaba Cloud Model Studio (DashScope) and Moonshot AI. This pattern is useful for gradually migrating between providers or comparing model performance side by side.

Step 1: Configure the Moonshot provider

  1. Save the following content as LLMProvider-moonshot.yaml. Replace <your-moonshot-api-key> with your Moonshot AI API key.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:
      name: moonshot
    spec:
      host: api.moonshot.cn  # Must be unique across all LLMProviders
      path: /v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: moonshot-v1-8k
            stream: false
            apiKey: <your-moonshot-api-key>
  2. Apply the manifest:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider-moonshot.yaml

Step 2: Create a virtual LLM service

Create a Kubernetes Service as a single entry point for LLM requests. This service has no backing pods -- the ASM sidecar routes all requests to the LLM providers defined in the LLMRoute.

  1. Save the following content as demo-llm-server.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: demo-llm-server
      namespace: default
    spec:
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 80
      selector:
        app: none
      type: ClusterIP
  2. Apply the manifest:

    kubectl apply -f demo-llm-server.yaml

Step 3: Configure weighted routing

Create an LLMRoute that distributes traffic evenly between DashScope and Moonshot:

  1. Save the following content as LLMRoute.yaml:

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMRoute
    metadata:
      name: demo-llm-server
      namespace: default
    spec:
      host: demo-llm-server
      rules:
      - name: migrate-rule
        backendRefs:
        - providerHost: dashscope.aliyuncs.com
          weight: 50
        - providerHost: api.moonshot.cn
          weight: 50

    Adjust the weight values to control the traffic split. The weights are relative -- 50/50 splits evenly, while 80/20 sends 80% to DashScope and 20% to Moonshot.

  2. Apply the routing rule:

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
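The relative-weight behavior can be sketched with a weighted random choice. This is illustrative only; the actual selection is performed by the sidecar proxy, not by application code.

```python
import random
from collections import Counter

def pick_backend(backends):
    """Select a providerHost with probability proportional to its weight."""
    hosts = [b["providerHost"] for b in backends]
    weights = [b["weight"] for b in backends]
    return random.choices(hosts, weights=weights, k=1)[0]

# backendRefs from the LLMRoute above.
backends = [
    {"providerHost": "dashscope.aliyuncs.com", "weight": 50},
    {"providerHost": "api.moonshot.cn", "weight": 50},
]
tally = Counter(pick_backend(backends) for _ in range(10_000))
print(tally)  # counts should be close to 5000 each
```

Changing the weights to 80/20 shifts the probabilities accordingly, which is how a gradual migration is dialed up over time.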

Test the weighted routing

Send several requests to the virtual demo-llm-server service and observe the responses:

kubectl exec deployment/sleep -it -- curl --location 'http://demo-llm-server' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {"role": "user", "content": "Please introduce yourself."}
    ]
  }'

Run the command multiple times. Some responses come from Moonshot (identified by "model": "moonshot-v1-8k" and the Kimi assistant name), while others come from DashScope (identified by "model": "qwen1.5-72b-chat" and the Qwen assistant name).

Expected output:

{"id":"cmpl-cafd47b181204cdbb4a4xxxxxxxxxxxx","object":"chat.completion","created":1720687132,"model":"moonshot-v1-8k","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am an AI language model named Kimi. My main function is to help people generate human-like text. I can write articles, answer questions, provide advice, and more. I am trained on a massive amount of text data, so I can generate a wide variety of text. My goal is to help people communicate more effectively and solve problems."},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"completion_tokens":59,"total_tokens":70}}

{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720687164,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-2443772b-4e41-9ea8-9bed-xxxxxxxxxxxx"}

The output shows that requests are distributed between Moonshot and Alibaba Cloud Model Studio at roughly equal rates, matching the 50/50 weights.

Resource and placeholder reference

Kubernetes resources

| YAML file | Kind | Name | Applied to |
| --- | --- | --- | --- |
| sleep.yaml | ServiceAccount, Service, Deployment | sleep | ACK cluster (kubectl apply) |
| LLMProvider.yaml | LLMProvider | dashscope-qwen | ASM instance (--kubeconfig) |
| LLMProvider-moonshot.yaml | LLMProvider | moonshot | ASM instance (--kubeconfig) |
| LLMRoute.yaml | LLMRoute | dashscope-route / demo-llm-server | ASM instance (--kubeconfig) |
| demo-llm-server.yaml | Service | demo-llm-server | ACK cluster (kubectl apply) |

Placeholders

Replace the following placeholders with your actual values before applying the YAML manifests:

| Placeholder | Description | Example |
| --- | --- | --- |
| <your-dashscope-api-key> | API key for Alibaba Cloud Model Studio | sk-xxxxxxxxxxxxx |
| <your-moonshot-api-key> | API key for Moonshot AI | sk-xxxxxxxxxxxxx |
| ${PATH_TO_ASM_KUBECONFIG} | Path to the ASM instance kubeconfig file | ~/.kube/asm-config |