
Alibaba Cloud Service Mesh: Traffic routing: Manage LLM traffic with ASM

Last Updated: Feb 28, 2026

Alibaba Cloud Service Mesh (ASM) extends the standard HTTP protocol to better support Large Language Model (LLM) requests, providing a simple and efficient way to manage LLM access. With ASM, you can implement canary access, weighted routing, and various observability capabilities for LLM traffic. This topic describes how to configure and use LLM traffic routing.

Prerequisites

Setup

Step 1: Create the sleep test application

  1. Create a file named sleep.yaml with the following content.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
    ---
  2. Run the following command to create the sleep application.

    kubectl apply -f sleep.yaml

Step 2: Create an LLMProvider for Model Studio

  1. Create a file named LLMProvider.yaml with the following content.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:  
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Qwen open-source series model
            apiKey: ${YOUR_DASHSCOPE_API_KEY}

    For more information about open-source models, see Text Generation-Qwen-Open-source Version.

  2. Run the following command to create the LLMProvider.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
  3. Run the following command to test the configuration.

    kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --data '{
        "messages": [
            {"role": "user", "content": "Please introduce yourself."}
        ]
    }'

    Expected output:

    {"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}

    Once the LLMProvider is created, you can send plain HTTP requests to dashscope.aliyuncs.com from the sleep pod. The ASM sidecar automatically intercepts the request, converts it to the OpenAI-compatible LLM format, adds the API Key, upgrades the connection to HTTPS, and forwards it to the external LLM provider's server. Alibaba Cloud Model Studio is compatible with the OpenAI LLM protocol.
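To make that transformation concrete, the following sketch approximates the outbound request the sidecar assembles on your behalf. This is only an illustration: the endpoint is the host plus path from the LLMProvider above, the model and API key are injected from its defaultConfig, and the key is assumed to be passed as a Bearer token, which is how OpenAI-compatible endpoints typically expect it. You never issue this request yourself; the mesh does it for you.

```shell
# Approximation of the request the sidecar forwards to the provider:
# model and API key come from the LLMProvider, not from the client,
# and the connection is upgraded to HTTPS.
curl 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
  --header "Authorization: Bearer ${YOUR_DASHSCOPE_API_KEY}" \
  --header 'Content-Type: application/json' \
  --data '{
      "model": "qwen1.5-72b-chat",
      "messages": [
          {"role": "user", "content": "Please introduce yourself."}
      ]
  }'
```

Comparing this with the bare request sent from the sleep pod shows exactly what the mesh adds: the model name, the credential, the path, and transport security.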

Scenarios

Scenario 1: Route users to different models

  1. Create a file named LLMRoute.yaml with the following content.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMRoute
    metadata:  
      name: dashscope-route
    spec:
      host: dashscope.aliyuncs.com # This must be unique among providers.
      rules:
      - name: vip-route
        matches:
        - headers:
            user-type:
              exact: subscriber  # Routing rule for subscribers. A specific configuration will be provided in the provider.
        backendRefs:
        - providerHost: dashscope.aliyuncs.com
      - backendRefs:
        - providerHost: dashscope.aliyuncs.com

    This configuration matches requests that carry the user-type: subscriber header against the vip-route rule; all other requests fall through to the unnamed default rule at the end.

  2. Run the following command to create the LLMRoute.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
  3. Update the LLMProvider.yaml file with the following route-level configuration:

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:  
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Use the open-source model by default.
            apiKey: ${YOUR_DASHSCOPE_API_KEY}
        routeSpecificConfigs:
          vip-route:  # Specific configuration for subscribers.
            openAIConfig:
              model: qwen-turbo  # Subscribers use the qwen-turbo model.
              apiKey: ${YOUR_DASHSCOPE_API_KEY}

    Run the following command to apply the update to the LLMProvider.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
  4. Run the following commands to test the routing.

    kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --data '{
        "messages": [
            {"role": "user", "content": "Please introduce yourself."}
        ]
    }'
    kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --header 'user-type: subscriber' --data '{
        "messages": [
            {"role": "user", "content": "Please introduce yourself."}
        ]
    }'

    Expected output:

    {"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1c33b950-3220-9bfe-9066-xxxxxxxxxxxx"}
    {"choices":[{"message":{"role":"assistant","content":"Hello, I'm Qwen, a large language model from Alibaba Cloud. As an AI assistant, my goal is to help users get accurate and useful information, and to solve their problems and confusions. I can provide knowledge in various fields, engage in conversation, and even create text. Please note that all the content I provide is based on the data I was trained on and may not include the latest events or personal information. If you have any questions, feel free to ask me at any time!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":11,"completion_tokens":85,"total_tokens":96},"created":1720683416,"system_fingerprint":null,"model":"qwen-turbo","id":"chatcmpl-9cbc7c56-06e9-9639-a50d-xxxxxxxxxxxx"}

    The output shows that the first request, sent without the header, was served by the default qwen1.5-72b-chat model, while the request from the subscriber was routed to the qwen-turbo model.
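To confirm which model served a request without reading the whole JSON body, you can extract just the model field from the response. A plain grep is used here so that no extra tooling is assumed inside the sleep image:

```shell
# Send a subscriber request and print only the "model" field of the response.
kubectl exec deployment/sleep -- curl -s --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header 'user-type: subscriber' \
  --data '{"messages": [{"role": "user", "content": "Please introduce yourself."}]}' \
  | grep -o '"model":"[^"]*"'
```

With the route-level configuration above, this should print "model":"qwen-turbo"; dropping the user-type header should print "model":"qwen1.5-72b-chat" instead.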

Scenario 2: Weighted routing between providers

  1. Create a file named LLMProvider-moonshot.yaml with the following content.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:  
      name: moonshot
    spec:
      host: api.moonshot.cn # This must be unique among providers.
      path: /v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: moonshot-v1-8k
            stream: false
            apiKey: ${YOUR_MOONSHOT_API_KEY}
  2. Run the following command to create the LLMProvider for Moonshot.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider-moonshot.yaml
  3. Create a file named demo-llm-server.yaml with the following content.

    apiVersion: v1
    kind: Service
    metadata:
      name: demo-llm-server
      namespace: default
    spec:
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 80
      selector:
        app: none
      type: ClusterIP
  4. Run the following command to create the demo-llm-server service. Because its selector (app: none) matches no pods, this Service has no endpoints; it exists only to give the LLMRoute a resolvable in-cluster host name.

    kubectl apply -f demo-llm-server.yaml
  5. Update the LLMRoute.yaml file with the following content.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMRoute
    metadata:
      name: demo-llm-server
      namespace: default
    spec:
      host: demo-llm-server
      rules:
      - backendRefs:
        - providerHost: dashscope.aliyuncs.com
          weight: 50
        - providerHost: api.moonshot.cn
          weight: 50
        name: migrate-rule
  6. Run the following command to update the LLMRoute routing rule.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
  7. Run the following command multiple times.

    kubectl exec deployment/sleep -it -- curl --location 'http://demo-llm-server' --header 'Content-Type: application/json' --data '{
        "messages": [
            {"role": "user", "content": "Please introduce yourself."}
        ]
    }' 

    Expected output:

    {"id":"cmpl-cafd47b181204cdbb4a4xxxxxxxxxxxx","object":"chat.completion","created":1720687132,"model":"moonshot-v1-8k","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am an AI language model named Kimi. My main function is to help people generate human-like text. I can write articles, answer questions, provide advice, and more. I am trained on a massive amount of text data, so I can generate a wide variety of text. My goal is to help people communicate more effectively and solve problems."},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"completion_tokens":59,"total_tokens":70}}
    
    {"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720687164,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-2443772b-4e41-9ea8-9bed-xxxxxxxxxxxx"}

    The output shows that requests are distributed evenly between Moonshot and Alibaba Cloud Model Studio.
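Rather than comparing individual responses by hand, you can tally the serving model over a batch of requests. The sketch below sends 20 requests and counts the model field in each response; with a 50/50 split you should see roughly even counts for moonshot-v1-8k and qwen1.5-72b-chat, though small samples can deviate noticeably from the configured weights:

```shell
# Fire 20 requests at demo-llm-server and count how many were
# answered by each model. grep extracts the "model" field from each
# JSON response; sort | uniq -c tallies the occurrences.
for i in $(seq 1 20); do
  kubectl exec deployment/sleep -- curl -s --location 'http://demo-llm-server' \
    --header 'Content-Type: application/json' \
    --data '{"messages": [{"role": "user", "content": "Please introduce yourself."}]}'
done | grep -o '"model":"[^"]*"' | sort | uniq -c
```

The same loop is useful after changing the weights (for example, 90/10 during a gradual migration) to verify that traffic shifts as configured.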