Alibaba Cloud Service Mesh: Traffic routing: Use ASM to manage LLM traffic

Last Updated: Feb 10, 2025

Alibaba Cloud Service Mesh (ASM) extends its HTTP support to handle large language model (LLM) request protocols. It supports the protocol standards of common LLM providers, which gives users a simple and efficient way to integrate them. With ASM, users can implement canary access, proportional routing, and various observability features for LLM traffic. This topic describes how to manage LLM traffic in ASM from the perspective of traffic routing.

Prerequisites

Preparations

Step 1: Create a test application named sleep

  1. Create a file named sleep.yaml with the following content.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
  2. Run the following command to create an application named sleep.

    kubectl apply -f sleep.yaml

Step 2: Create an LLMProvider for Alibaba Cloud Model Studio

  1. Create a file named LLMProvider.yaml with the following content.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:  
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Qwen open-source LLM
            apiKey: ${dashscope API_KEY}

    For more open-source models, see Text Generation-Qwen-Open-source Version.

  2. Run the following command to create an LLMProvider.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
  3. Run the following command to test the LLMProvider.

    kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
    --header 'Content-Type: application/json' \
    --data '{
        "messages": [
            {"role": "user", "content": "Please introduce yourself."}
        ]
    }'

    The expected output:

    {"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1735021898,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}

    Once the LLMProvider is created, the pod that runs the sleep application can access dashscope.aliyuncs.com directly over plain HTTP. The ASM sidecar automatically converts the request into a format that conforms to the OpenAI LLM protocol (Alibaba Cloud Model Studio is compatible with the OpenAI LLM protocol), adds the API key to the request, upgrades the connection from HTTP to HTTPS, and then sends the request to the LLM provider's server outside the cluster.
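
The rewrite that the sidecar performs can be pictured with a small Python sketch. This is a conceptual model only, not ASM's implementation: the `PROVIDER` dictionary, the `rewrite_request` function, and the placeholder key are hypothetical names introduced for illustration. The sketch also assumes that the model from the provider configuration is filled in only when the client did not set one, and that the API key travels in an `Authorization: Bearer` header, as is usual for OpenAI-compatible endpoints.

```python
import json

# Hypothetical stand-in for the LLMProvider fields used in this topic.
PROVIDER = {
    "host": "dashscope.aliyuncs.com",
    "path": "/compatible-mode/v1/chat/completions",
    "model": "qwen1.5-72b-chat",
    "api_key": "sk-example",  # placeholder, not a real key
}


def rewrite_request(body: dict, provider: dict) -> dict:
    """Build the outbound request from a minimal client request (conceptual model)."""
    outbound_body = dict(body)
    # Assumption: the configured model is used when the client did not name one.
    outbound_body.setdefault("model", provider["model"])
    return {
        # Plain HTTP from the pod becomes HTTPS plus the provider path.
        "url": f"https://{provider['host']}{provider['path']}",
        "headers": {
            "Content-Type": "application/json",
            # Assumption: the key is sent as a Bearer token.
            "Authorization": f"Bearer {provider['api_key']}",
        },
        "body": json.dumps(outbound_body),
    }


client_body = {"messages": [{"role": "user", "content": "Please introduce yourself."}]}
outbound = rewrite_request(client_body, PROVIDER)
print(outbound["url"])
print(outbound["body"])
```

The point of the sketch is that the application sends only the messages to a plain HTTP host, while the model name, credentials, path, and TLS are supplied from the mesh configuration.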

Procedures for different scenarios

Scenario 1: Create an LLMRoute to use different models for different types of users

  1. Create a file named LLMRoute.yaml with the following content.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMRoute
    metadata:  
      name: dashscope-route
    spec:
      host: dashscope.aliyuncs.com # Different providers cannot have the same host.
      rules:
      - name: vip-route
        matches:
        - headers:
            user-type:
              exact: subscriber  # Dedicated route for subscribed users. A route-specific configuration is added to the provider later.
        backendRefs:
        - providerHost: dashscope.aliyuncs.com
      - backendRefs:
        - providerHost: dashscope.aliyuncs.com

    With this configuration, requests that carry the header user-type: subscriber match the vip-route routing rule. All other requests match the second rule, which has no match conditions.

  2. Run the following command to create the LLMRoute.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
  3. Update the LLMProvider.yaml file with the following content, which adds route-level configurations.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:  
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Open-source models are used by default.
            apiKey: ${dashscope API_KEY}
        routeSpecificConfigs:
          vip-route:  # A dedicated route for subscribed users.
            openAIConfig:
              model: qwen-turbo  # The qwen-turbo model for subscribed users.
              apiKey: ${dashscope API_KEY}

    Run the following command to update the LLMProvider.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
  4. Run the following commands to test the routing. The first request carries no user-type header, and the second carries user-type: subscriber.

    kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
    --header 'Content-Type: application/json' \
    --data '{
        "messages": [
            {"role": "user", "content": "Please introduce yourself."}
        ]
    }'
    kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
    --header 'Content-Type: application/json' \
    --header 'user-type: subscriber' \
    --data '{
        "messages": [
            {"role": "user", "content": "Please introduce yourself."}
        ]
    }'

    The expected output:

    {"choices":[{"message":{"role":"assistant","content":"I am a pre-trained language model developed by Alibaba Cloud. I am Qwen. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1735021898,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-06aed84b6715"}
    {"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1735021898,"system_fingerprint":null,"model":"qwen-turbo","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-06aed84b6715"}

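    The model field in the output shows that the request without the header is served by the default qwen1.5-72b-chat model, and the request with the user-type: subscriber header is served by the qwen-turbo model.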
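
How the route rules and the route-specific provider configuration combine can be sketched in Python. This is an illustrative model under stated assumptions, not ASM's implementation: it assumes that rules are checked in order and the first match wins, and that a rule without match conditions matches every request. `ROUTE_RULES`, `PROVIDER_CONFIGS`, `match_rule`, and `select_model` are hypothetical names that mirror the YAML in this scenario.

```python
# Simplified mirror of LLMRoute.yaml: one header-matched rule, one catch-all.
ROUTE_RULES = [
    {"name": "vip-route", "headers": {"user-type": "subscriber"}},
    {"name": None, "headers": {}},  # unnamed rule without match conditions
]

# Simplified mirror of the configs section in LLMProvider.yaml.
PROVIDER_CONFIGS = {
    "defaultConfig": {"model": "qwen1.5-72b-chat"},
    "routeSpecificConfigs": {"vip-route": {"model": "qwen-turbo"}},
}


def match_rule(headers, rules):
    """Return the first rule whose exact header matches are all satisfied."""
    # HTTP header names are case-insensitive, so compare in lower case.
    normalized = {name.lower(): value for name, value in headers.items()}
    for rule in rules:
        if all(normalized.get(name) == value for name, value in rule["headers"].items()):
            return rule
    return None


def select_model(headers):
    """Pick the model: route-specific config if one exists, else the default."""
    rule = match_rule(headers, ROUTE_RULES)
    route_name = rule["name"] if rule else None
    config = PROVIDER_CONFIGS["routeSpecificConfigs"].get(
        route_name, PROVIDER_CONFIGS["defaultConfig"]
    )
    return config["model"]


print(select_model({"Content-Type": "application/json"}))
print(select_model({"Content-Type": "application/json", "user-type": "subscriber"}))
```

The two calls correspond to the two curl commands in step 4: only the request that carries `user-type: subscriber` resolves to the `vip-route` entry under `routeSpecificConfigs`.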

Scenario 2: Configure LLMProvider and LLMRoute to distribute traffic in proportion

  1. Create a file named LLMProvider-moonshot.yaml with the following content.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:  
      name: moonshot
    spec:
      host: api.moonshot.cn # Different providers cannot have the same host.
      path: /v1/chat/completions
      configs:
        defaultConfig:
          openAIConfig:
            model: moonshot-v1-8k
            stream: false
            apiKey: ${Moonshot API_KEY}
  2. Run the following command to create an LLMProvider for Moonshot.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider-moonshot.yaml
  3. Create a file named demo-llm-server.yaml with the following content.

    apiVersion: v1
    kind: Service
    metadata:
      name: demo-llm-server
      namespace: default
    spec:
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 80
      selector:
        app: none
      type: ClusterIP
  4. Run the following command to create a demo-llm-server service.

    kubectl apply -f demo-llm-server.yaml
  5. Update the LLMRoute.yaml file with the following content.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMRoute
    metadata:
      name: demo-llm-server
      namespace: default
    spec:
      host: demo-llm-server
      rules:
      - backendRefs:
        - providerHost: dashscope.aliyuncs.com
          weight: 50
        - providerHost: api.moonshot.cn
          weight: 50
        name: migrate-rule
  6. Run the following command to update the LLMRoute routing rule.

    kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
  7. Run the following test multiple times.

    kubectl exec deployment/sleep -it -- curl --location 'http://demo-llm-server' \
    --header 'Content-Type: application/json' \
    --data '{
        "messages": [
            {"role": "user", "content": "Please introduce yourself"}
        ]
    }' 

    The expected output:

    {"id":"chatcmpl-676a6599045dxxxxxxxxxxxx","object":"chat.completion","created":1735026073,"model":"moonshot-v1-8k","choices":[{"index":0,"message":{"role":"assistant","content":"Hello there! I'm Kimi, your AI assistant crafted by the innovative minds at Moonshot AI. I'm here to lend a digital hand with your queries, providing safe, helpful, and accurate responses. Whether it's a dash of information or a deep dive into data, I'm your go-to for a chat. Let's make today awesome! How can I assist you?"},"finish_reason":"stop"}],"usage":{"prompt_tokens":78,"completion_tokens":78,"total_tokens":156}}
    
    {"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1735021898,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-06aed84b6715"}

    Across repeated requests, the output shows that approximately 50% of the requests are sent to Moonshot and the other 50% to Alibaba Cloud Model Studio.
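
The effect of the two `weight: 50` entries can be illustrated with a short Python simulation. It is a sketch that assumes each request is assigned independently to a backend with probability proportional to its weight. The mesh's actual selection algorithm is not described in this topic, and `BACKEND_REFS`, `pick_backend`, and `simulate` are hypothetical names.

```python
import random
from collections import Counter

# Mirror of the backendRefs in the migrate-rule routing rule.
BACKEND_REFS = [
    {"providerHost": "dashscope.aliyuncs.com", "weight": 50},
    {"providerHost": "api.moonshot.cn", "weight": 50},
]


def pick_backend(backend_refs, rng):
    """Choose one provider host with probability proportional to its weight."""
    hosts = [ref["providerHost"] for ref in backend_refs]
    weights = [ref["weight"] for ref in backend_refs]
    return rng.choices(hosts, weights=weights, k=1)[0]


def simulate(backend_refs, requests, seed=0):
    """Count how many of the simulated requests land on each provider host."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same counts
    return Counter(pick_backend(backend_refs, rng) for _ in range(requests))


counts = simulate(BACKEND_REFS, requests=10_000)
for host, count in sorted(counts.items()):
    print(f"{host}: {count / 10_000:.1%}")
```

With only a handful of test requests the split can look uneven. The shares approach the configured ratio only as the number of requests grows, which is why the test in step 7 is run multiple times.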