Alibaba Cloud Service Mesh (ASM) enhances the standard HTTP protocol to better support Large Language Model (LLM) requests, providing a simple and efficient way to manage LLM access. You can use ASM to implement canary access, weighted routing, and various observability capabilities for LLM traffic. This topic describes how to configure and use LLM traffic routing.
Prerequisites
-
You have added a cluster to an ASM instance, and the instance version is v1.21.6.88 or later.
-
You have configured sidecar injection policies.
-
You have activated Alibaba Cloud Model Studio and obtained a valid API Key. See Obtain an API Key.
-
You have activated the Moonshot AI API service and obtained a valid API Key. See the Moonshot AI Open Platform.
Setup
Step 1: Create the sleep test application
-
Create a file named sleep.yaml with the following content.
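The file content is not reproduced in this page; a minimal sketch based on the widely used Istio sleep sample follows. The image and command are assumptions; substitute your organization's approved image if needed.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      containers:
      - name: sleep
        image: curlimages/curl  # assumed image; any image with curl works
        command: ["/bin/sleep", "infinity"]
```

The pod only needs to stay running and have curl available, so it can serve as a client for the test requests in the following steps.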
-
Run the following command to create the sleep application.
kubectl apply -f sleep.yaml
Step 2: Create an LLMProvider for Model Studio
-
Create a file named LLMProvider.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat # Qwen open-source series model
        apiKey: ${YOUR_DASHSCOPE_API_KEY}
For more information about open-source models, see Text Generation-Qwen-Open-source Version.
-
Run the following command to create the LLMProvider.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
-
Run the following command to test the configuration.
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --data '{ "messages": [ {"role": "user", "content": "Please introduce yourself."} ] }'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}
Once the LLMProvider is created, you can send plain HTTP requests to dashscope.aliyuncs.com from the sleep pod. The ASM sidecar automatically intercepts the request, converts it to the OpenAI-compatible LLM format, adds the API Key, upgrades the connection to HTTPS, and forwards it to the external LLM provider's server. Alibaba Cloud Model Studio is compatible with the OpenAI LLM protocol.
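The enrichment the sidecar performs on the request body can be pictured with a small local illustration (plain shell; the actual conversion happens transparently inside the sidecar, which also adds an Authorization header built from the configured apiKey):

```shell
# The caller's plain HTTP body carries only "messages".
BODY='{"messages":[{"role":"user","content":"Please introduce yourself."}]}'
MODEL='qwen1.5-72b-chat'

# The mesh merges in the model from the LLMProvider defaultConfig (and adds
# an "Authorization: Bearer <apiKey>" header, not shown) before forwarding
# the request to the provider over HTTPS.
ENRICHED="${BODY%\}},\"model\":\"${MODEL}\"}"
echo "$ENRICHED"
# → {"messages":[{"role":"user","content":"Please introduce yourself."}],"model":"qwen1.5-72b-chat"}
```

This is why the client never needs to know the model name or the API Key: both live in the LLMProvider resource and are injected by the mesh.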
Scenarios
Scenario 1: Route users to different models
-
Create a file named LLMRoute.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: dashscope-route
spec:
  host: dashscope.aliyuncs.com # This must be unique among providers.
  rules:
  - name: vip-route
    matches:
    - headers:
        user-type:
          exact: subscriber # Routing rule for subscribers. A specific configuration will be provided in the provider.
    backendRefs:
    - providerHost: dashscope.aliyuncs.com
  - backendRefs:
    - providerHost: dashscope.aliyuncs.com
This configuration routes requests that contain the user-type: subscriber header to the vip-route routing rule.
-
Run the following command to create the LLMRoute.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
-
Update the LLMProvider.yaml file with the following route-level configuration:
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat # Use the open-source model by default.
        apiKey: ${YOUR_DASHSCOPE_API_KEY}
    routeSpecificConfigs:
      vip-route: # Specific configuration for subscribers.
        openAIConfig:
          model: qwen-turbo # Subscribers use the qwen-turbo model.
          apiKey: ${YOUR_DASHSCOPE_API_KEY}
Run the following command to apply the update to the LLMProvider.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
-
Run the following commands to test the routing.
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --data '{ "messages": [ {"role": "user", "content": "Please introduce yourself."} ] }'
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --header 'user-type: subscriber' --data '{ "messages": [ {"role": "user", "content": "Please introduce yourself."} ] }'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1c33b950-3220-9bfe-9066-xxxxxxxxxxxx"}
{"choices":[{"message":{"role":"assistant","content":"Hello, I'm Qwen, a large language model from Alibaba Cloud. As an AI assistant, my goal is to help users get accurate and useful information, and to solve their problems and confusions. I can provide knowledge in various fields, engage in conversation, and even create text. Please note that all the content I provide is based on the data I was trained on and may not include the latest events or personal information. If you have any questions, feel free to ask me at any time!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":11,"completion_tokens":85,"total_tokens":96},"created":1720683416,"system_fingerprint":null,"model":"qwen-turbo","id":"chatcmpl-9cbc7c56-06e9-9639-a50d-xxxxxxxxxxxx"}
The output shows that the first request was served by the default qwen1.5-72b-chat model, while the request from the subscriber was routed to the qwen-turbo model.
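When comparing responses, the model field is the quickest way to confirm which route was taken. It can be pulled out with plain shell; the response string below is shortened sample data standing in for the curl output above:

```shell
# Shortened sample response; in practice, substitute the live curl output.
RESP='{"choices":[{"finish_reason":"stop","index":0}],"object":"chat.completion","model":"qwen-turbo","id":"chatcmpl-xxxx"}'

# Extract the "model" field without needing extra tooling such as jq.
echo "$RESP" | grep -o '"model":"[^"]*"'
# → "model":"qwen-turbo"
```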
Scenario 2: Weighted routing between providers
-
Create a file named LLMProvider-moonshot.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: moonshot
spec:
  host: api.moonshot.cn # This must be unique among providers.
  path: /v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: moonshot-v1-8k
        stream: false
        apiKey: ${YOUR_MOONSHOT_API_KEY}
-
Run the following command to create the LLMProvider for Moonshot.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider-moonshot.yaml
-
Create a file named demo-llm-server.yaml with the following content.
apiVersion: v1
kind: Service
metadata:
  name: demo-llm-server
  namespace: default
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: none
  type: ClusterIP
This Service intentionally selects no pods; it exists only to provide a resolvable host (demo-llm-server) that the LLMRoute in the next step can bind to.
-
Run the following command to create the demo-llm-server service.
kubectl apply -f demo-llm-server.yaml
-
Update the LLMRoute.yaml file with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: demo-llm-server
  namespace: default
spec:
  host: demo-llm-server
  rules:
  - backendRefs:
    - providerHost: dashscope.aliyuncs.com
      weight: 50
    - providerHost: api.moonshot.cn
      weight: 50
    name: migrate-rule
-
Run the following command to update the LLMRoute routing rule.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
-
Run the following command multiple times.
kubectl exec deployment/sleep -it -- curl --location 'http://demo-llm-server' --header 'Content-Type: application/json' --data '{ "messages": [ {"role": "user", "content": "Please introduce yourself."} ] }'
Expected output:
{"id":"cmpl-cafd47b181204cdbb4a4xxxxxxxxxxxx","object":"chat.completion","created":1720687132,"model":"moonshot-v1-8k","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am an AI language model named Kimi. My main function is to help people generate human-like text. I can write articles, answer questions, provide advice, and more. I am trained on a massive amount of text data, so I can generate a wide variety of text. My goal is to help people communicate more effectively and solve problems."},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"completion_tokens":59,"total_tokens":70}}
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720687164,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-2443772b-4e41-9ea8-9bed-xxxxxxxxxxxx"}
The output shows that requests are distributed roughly evenly between Moonshot AI and Alibaba Cloud Model Studio, matching the 50:50 weights in the route.
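To check the split over many requests, you can tally the model field across repeated calls. The pipeline below is shown with canned model names standing in for live responses; in practice you would generate its input by running the curl command above in a loop and extracting the model field from each response:

```shell
# Canned stand-ins for the "model" field of four responses. For a live check,
# replace the printf with a loop that runs the kubectl/curl command above and
# extracts "model" from each reply.
printf '%s\n' qwen1.5-72b-chat moonshot-v1-8k qwen1.5-72b-chat moonshot-v1-8k \
  | sort | uniq -c
```

With the 50:50 weights, the two counts should converge toward equality as the number of requests grows; weighted routing is probabilistic per request, not strictly alternating.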