Alibaba Cloud Service Mesh (ASM) enhances the standard HTTP protocol to better support Large Language Model (LLM) requests, providing a simple and efficient way to manage LLM access. You can use ASM to implement canary access, weighted routing, and various observability capabilities for LLM traffic. This topic describes how to configure and use LLM traffic routing.
Prerequisites
-
You have added a cluster to an ASM instance, and the instance version is v1.21.6.88 or later.
-
You have configured sidecar injection policies.
-
You have activated Alibaba Cloud Model Studio and obtained a valid API Key. See Obtain an API Key.
-
You have activated the Moonshot AI API service and obtained a valid API Key. See the Moonshot AI Open Platform.
Setup
Step 1: Create the sleep test application
-
Create a file named sleep.yaml with the following content.
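The file content is not reproduced in this page; a minimal sketch based on the widely used Istio sleep sample follows. The image and command are assumptions; substitute your organization's approved image if needed.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      containers:
      - name: sleep
        image: curlimages/curl  # assumed image; any image with curl works
        command: ["/bin/sleep", "infinity"]
```

The pod only needs to stay running and have curl available, so it can serve as a client for the test requests in the following steps.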
-
Run the following command to create the sleep application.
kubectl apply -f sleep.yaml
Step 2: Create an LLMProvider for Model Studio
-
Create a file named LLMProvider.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat # Qwen open-source series model
        apiKey: ${YOUR_DASHSCOPE_API_KEY}
For more information about open-source models, see Text Generation-Qwen-Open-source Version.
-
Run the following command to create the LLMProvider.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
-
Run the following command to test the configuration.
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --data '{ "messages": [ {"role": "user", "content": "Please introduce yourself."} ] }'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}
Once the LLMProvider is created, you can send plain HTTP requests to dashscope.aliyuncs.com from the sleep pod. The ASM sidecar automatically intercepts the request, converts it to the OpenAI-compatible LLM format, adds the API Key, upgrades the connection to HTTPS, and forwards it to the external LLM provider's server. Alibaba Cloud Model Studio is compatible with the OpenAI LLM protocol.
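The enrichment the sidecar performs on the request body can be pictured with a small local illustration (plain shell; the actual conversion happens transparently inside the sidecar, which also adds an Authorization header built from the configured apiKey):

```shell
# The caller's plain HTTP body carries only "messages".
BODY='{"messages":[{"role":"user","content":"Please introduce yourself."}]}'
MODEL='qwen1.5-72b-chat'

# The mesh merges in the model from the LLMProvider defaultConfig (and adds
# an "Authorization: Bearer <apiKey>" header, not shown) before forwarding
# the request to the provider over HTTPS.
ENRICHED="${BODY%\}},\"model\":\"${MODEL}\"}"
echo "$ENRICHED"
# → {"messages":[{"role":"user","content":"Please introduce yourself."}],"model":"qwen1.5-72b-chat"}
```

This is why the client never needs to know the model name or the API Key: both live in the LLMProvider resource and are injected by the mesh.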
Scenarios
Scenario 1: Route users to different models
-
Create a file named LLMRoute.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: dashscope-route
spec:
  host: dashscope.aliyuncs.com # This must be unique among providers.
  rules:
  - name: vip-route
    matches:
    - headers:
        user-type:
          exact: subscriber # Routing rule for subscribers. A specific configuration will be provided in the provider.
    backendRefs:
    - providerHost: dashscope.aliyuncs.com
  - backendRefs:
    - providerHost: dashscope.aliyuncs.com
This configuration routes requests that contain the user-type: subscriber header to the vip-route routing rule.
-
Run the following command to create the LLMRoute.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
-
Update the LLMProvider.yaml file with the following route-level configuration:
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat # Use the open-source model by default.
        apiKey: ${YOUR_DASHSCOPE_API_KEY}
    routeSpecificConfigs:
      vip-route: # Specific configuration for subscribers.
        openAIConfig:
          model: qwen-turbo # Subscribers use the qwen-turbo model.
          apiKey: ${YOUR_DASHSCOPE_API_KEY}
Run the following command to apply the update to the LLMProvider.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
-
Run the following commands to test the routing.
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --data '{ "messages": [ {"role": "user", "content": "Please introduce yourself."} ] }'
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' --header 'Content-Type: application/json' --header 'user-type: subscriber' --data '{ "messages": [ {"role": "user", "content": "Please introduce yourself."} ] }'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720680044,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1c33b950-3220-9bfe-9066-xxxxxxxxxxxx"}
{"choices":[{"message":{"role":"assistant","content":"Hello, I'm Qwen, a large language model from Alibaba Cloud. As an AI assistant, my goal is to help users get accurate and useful information, and to solve their problems and confusions. I can provide knowledge in various fields, engage in conversation, and even create text. Please note that all the content I provide is based on the data I was trained on and may not include the latest events or personal information. If you have any questions, feel free to ask me at any time!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":11,"completion_tokens":85,"total_tokens":96},"created":1720683416,"system_fingerprint":null,"model":"qwen-turbo","id":"chatcmpl-9cbc7c56-06e9-9639-a50d-xxxxxxxxxxxx"}
The output shows that the first request was served by the default qwen1.5-72b-chat model, while the request from the subscriber was routed to the qwen-turbo model.
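When comparing responses, the model field is the quickest way to confirm which route was taken. It can be pulled out with plain shell; the response string below is shortened sample data standing in for the curl output above:

```shell
# Shortened sample response; in practice, substitute the live curl output.
RESP='{"choices":[{"finish_reason":"stop","index":0}],"object":"chat.completion","model":"qwen-turbo","id":"chatcmpl-xxxx"}'

# Extract the "model" field without needing extra tooling such as jq.
echo "$RESP" | grep -o '"model":"[^"]*"'
# → "model":"qwen-turbo"
```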
Scenario 2: Weighted routing between providers
-
Create a file named LLMProvider-moonshot.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: moonshot
spec:
  host: api.moonshot.cn # This must be unique among providers.
  path: /v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: moonshot-v1-8k
        stream: false
        apiKey: ${YOUR_MOONSHOT_API_KEY}
-
Run the following command to create the LLMProvider for Moonshot.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider-moonshot.yaml
-
Create a file named demo-llm-server.yaml with the following content.
apiVersion: v1
kind: Service
metadata:
  name: demo-llm-server
  namespace: default
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: none
  type: ClusterIP
This Service intentionally selects no pods; it exists only to provide a resolvable host (demo-llm-server) that the LLMRoute in the next step can bind to.
-
Run the following command to create the demo-llm-server service.
kubectl apply -f demo-llm-server.yaml
-
Update the LLMRoute.yaml file with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: demo-llm-server
  namespace: default
spec:
  host: demo-llm-server
  rules:
  - backendRefs:
    - providerHost: dashscope.aliyuncs.com
      weight: 50
    - providerHost: api.moonshot.cn
      weight: 50
    name: migrate-rule
-
Run the following command to update the LLMRoute routing rule.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
-
Run the following command multiple times.
kubectl exec deployment/sleep -it -- curl --location 'http://demo-llm-server' --header 'Content-Type: application/json' --data '{ "messages": [ {"role": "user", "content": "Please introduce yourself."} ] }'
Expected output:
{"id":"cmpl-cafd47b181204cdbb4a4xxxxxxxxxxxx","object":"chat.completion","created":1720687132,"model":"moonshot-v1-8k","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am an AI language model named Kimi. My main function is to help people generate human-like text. I can write articles, answer questions, provide advice, and more. I am trained on a massive amount of text data, so I can generate a wide variety of text. My goal is to help people communicate more effectively and solve problems."},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"completion_tokens":59,"total_tokens":70}}
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1720687164,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-2443772b-4e41-9ea8-9bed-xxxxxxxxxxxx"}
The output shows that requests are distributed roughly evenly between Moonshot AI and Alibaba Cloud Model Studio, matching the 50:50 weights in the route.
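To check the split over many requests, you can tally the model field across repeated calls. The pipeline below is shown with canned model names standing in for live responses; in practice you would generate its input by running the curl command above in a loop and extracting the model field from each response:

```shell
# Canned stand-ins for the "model" field of four responses. For a live check,
# replace the printf with a loop that runs the kubectl/curl command above and
# extracts "model" from each reply.
printf '%s\n' qwen1.5-72b-chat moonshot-v1-8k qwen1.5-72b-chat moonshot-v1-8k \
  | sort | uniq -c
```

With the 50:50 weights, the two counts should converge toward equality as the number of requests grows; weighted routing is probabilistic per request, not strictly alternating.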