Alibaba Cloud Service Mesh (ASM) extends its HTTP support to handle large language model (LLM) request protocols. It supports the protocol standards of common LLM providers, which gives users a simple and efficient integration experience. With ASM, users can implement canary access, proportional traffic routing, and various observability features for LLM traffic. This topic describes how to manage LLM traffic in ASM from the perspective of traffic routing.
Prerequisites
A cluster is added to an ASM instance of version 1.21.6.88 or later.
Alibaba Cloud Model Studio is activated and an available API_KEY is obtained. For more information, see Obtain an API Key.
The API service of Moonshot is activated and an available API_KEY is obtained. For more information, see Moonshot AI Open Platform.
Preparations
Step 1: Create a test application named sleep
Create a file named sleep.yaml with the following content.
Run the following command to create an application named sleep.
kubectl apply -f sleep.yaml
Step 2: Create an LLMProvider for Alibaba Cloud Model Studio
Create a file named LLMProvider.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat  # Qwen open-source LLM
        apiKey: ${dashscope API_KEY}
For more open-source models, see Text Generation-Qwen-Open-source Version.
Run the following command to create an LLMProvider.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
Run the following command to test the LLMProvider.
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1735021898,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}
Once the LLMProvider is created, the pod where the sleep application resides can access dashscope.aliyuncs.com directly over plain HTTP. The ASM sidecar automatically converts the request into a format that conforms to the OpenAI LLM protocol (Alibaba Cloud Model Studio is compatible with the OpenAI LLM protocol), adds the API key to the request, upgrades the connection from HTTP to HTTPS, and finally sends the request to the LLM provider's server outside the cluster.
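To make the sidecar's role more concrete, the following Python sketch models the rewrite described above: it fills in the model, attaches the API key, and switches the target from HTTP to HTTPS on the provider's path. It is only an illustration and not ASM's implementation. The function name rewrite_for_provider, the dictionary layout, the OpenAI-style "Authorization: Bearer" header, and the choice to add the model only when the caller omitted it are all assumptions made for this example.

```python
import copy


def rewrite_for_provider(request, provider):
    """Return the outbound request a sidecar-like proxy might send.

    Hypothetical model of the behavior described in this topic, not ASM code.
    """
    body = copy.deepcopy(request["body"])
    # Assumption: the configured model is filled in when the caller omits it.
    body.setdefault("model", provider["model"])

    headers = dict(request["headers"])
    # Assumption: the API key travels as an OpenAI-style bearer token.
    headers["Authorization"] = "Bearer " + provider["api_key"]

    return {
        # Plain HTTP inside the mesh becomes HTTPS toward the provider.
        "url": "https://" + provider["host"] + provider["path"],
        "headers": headers,
        "body": body,
    }


provider = {
    "host": "dashscope.aliyuncs.com",
    "path": "/compatible-mode/v1/chat/completions",
    "model": "qwen1.5-72b-chat",
    "api_key": "sk-placeholder",  # placeholder, not a real key
}
request = {
    "url": "http://dashscope.aliyuncs.com",
    "headers": {"Content-Type": "application/json"},
    "body": {"messages": [{"role": "user", "content": "Please introduce yourself."}]},
}

outbound = rewrite_for_provider(request, provider)
print(outbound["url"])
print(outbound["body"]["model"])
```

The application itself never sees the API key or the provider path, which is why the curl command in this step can target the bare host over HTTP.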
Scenario demonstrations
Scenario 1: Create an LLMRoute to use different models for different types of users
Create a file named LLMRoute.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: dashscope-route
spec:
  host: dashscope.aliyuncs.com  # Different providers cannot have the same host.
  rules:
  - name: vip-route
    matches:
    - headers:
        user-type:
          exact: subscriber  # Dedicated route for subscribed users. Its route-specific configuration is added to the provider later.
    backendRefs:
    - providerHost: dashscope.aliyuncs.com
  - backendRefs:
    - providerHost: dashscope.aliyuncs.com
With this configuration, requests that carry the header user-type: subscriber match the vip-route routing rule.
Run the following command to create the LLMRoute.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
Update the LLMProvider.yaml file with the following content to add route-level configurations.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat  # Open-source models are used by default.
        apiKey: ${dashscope API_KEY}
    routeSpecificConfigs:
      vip-route:  # A dedicated route for subscribed users.
        openAIConfig:
          model: qwen-turbo  # The qwen-turbo model for subscribed users.
          apiKey: ${dashscope API_KEY}
Run the following command to update the LLMProvider.
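kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider.yaml
Run the following commands to test the configuration. The first request carries no user-type header, and the second carries user-type: subscriber.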
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'
kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--header 'user-type: subscriber' \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself."}
    ]
}'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"I am a pre-trained language model developed by Alibaba Cloud. I am Qwen. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1735021898,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-06aed84b6715"}
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1735021898,"system_fingerprint":null,"model":"qwen-turbo","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-06aed84b6715"}
The model field of each response shows that the request without the header is served by the default qwen1.5-72b-chat model, and the request from a subscribed user is served by the qwen-turbo model.
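The selection logic in Scenario 1 can be sketched in a few lines of Python. This is a simplified model for illustration, not ASM's routing code. The function names, the first-match-wins rule order, and the treatment of a rule without matches as a catch-all are assumptions, and header-name case handling is ignored.

```python
# Simplified stand-ins for the LLMRoute rules and LLMProvider configs above.
RULES = [
    {"name": "vip-route", "matches": [{"user-type": "subscriber"}]},
    {"matches": []},  # unnamed fallback rule
]
CONFIGS = {
    "defaultConfig": {"model": "qwen1.5-72b-chat"},
    "routeSpecificConfigs": {"vip-route": {"model": "qwen-turbo"}},
}


def select_route(headers, rules):
    """Return the name of the first matching rule (None if it is unnamed)."""
    for rule in rules:
        matches = rule.get("matches", [])
        # Assumption: a rule without matches accepts every request.
        if not matches or any(
            all(headers.get(name) == value for name, value in match.items())
            for match in matches
        ):
            return rule.get("name")
    return None


def select_config(route_name, configs):
    """Use the route-specific config if one exists, else the default."""
    return configs["routeSpecificConfigs"].get(route_name, configs["defaultConfig"])


def model_for(headers):
    return select_config(select_route(headers, RULES), CONFIGS)["model"]


print(model_for({"user-type": "subscriber"}))
print(model_for({}))
```

Keeping the rule name as the only link between the route and the provider is what lets the vip-route entry under routeSpecificConfigs change the model without touching the routing rule itself.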
Scenario 2: Configure LLMProvider and LLMRoute to distribute traffic in proportion
Create a file named LLMProvider-moonshot.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: moonshot
spec:
  host: api.moonshot.cn  # Different providers cannot have the same host.
  path: /v1/chat/completions
  configs:
    defaultConfig:
      openAIConfig:
        model: moonshot-v1-8k
        stream: false
        apiKey: ${Moonshot API_KEY}
Run the following command to create an LLMProvider for Moonshot.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMProvider-moonshot.yaml
Create a file named demo-llm-server.yaml with the following content.
apiVersion: v1
kind: Service
metadata:
  name: demo-llm-server
  namespace: default
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: none
  type: ClusterIP
Run the following command to create the demo-llm-server service.
kubectl apply -f demo-llm-server.yaml
Update the LLMRoute.yaml file with the following content.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: demo-llm-server
  namespace: default
spec:
  host: demo-llm-server
  rules:
  - backendRefs:
    - providerHost: dashscope.aliyuncs.com
      weight: 50
    - providerHost: api.moonshot.cn
      weight: 50
    name: migrate-rule
Run the following command to update the LLMRoute routing rule.
kubectl --kubeconfig=${PATH_TO_ASM_KUBECONFIG} apply -f LLMRoute.yaml
Run the following test multiple times.
kubectl exec deployment/sleep -it -- curl --location 'http://demo-llm-server' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself"}
    ]
}'
Expected output:
{"id":"chatcmpl-676a6599045dxxxxxxxxxxxx","object":"chat.completion","created":1735026073,"model":"moonshot-v1-8k","choices":[{"index":0,"message":{"role":"assistant","content":"Hello there! I'm Kimi, your AI assistant crafted by the innovative minds at Moonshot AI. I'm here to lend a digital hand with your queries, providing safe, helpful, and accurate responses. Whether it's a dash of information or a deep dive into data, I'm your go-to for a chat. Let's make today awesome! How can I assist you?"},"finish_reason":"stop"}],"usage":{"prompt_tokens":78,"completion_tokens":78,"total_tokens":156}}
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1735021898,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-06aed84b6715"}
Across repeated requests, the model field shows that approximately 50% of the requests are sent to Moonshot and approximately 50% to Alibaba Cloud Model Studio.
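Because the 50/50 split is statistical, a handful of requests may look uneven, and the proportion only becomes visible over many requests. The following Python sketch simulates the two weighted backendRefs to show this. It assumes independent weighted random selection per request, which is an assumption about the behavior and not a description of ASM's actual load-balancing algorithm. The names pick_backend and simulate are hypothetical.

```python
import random
from collections import Counter

# Weights copied from the backendRefs of the migrate-rule rule above.
BACKENDS = {"dashscope.aliyuncs.com": 50, "api.moonshot.cn": 50}


def pick_backend(rng, backends):
    """Pick one provider host with probability proportional to its weight."""
    hosts = list(backends)
    weights = [backends[host] for host in hosts]
    return rng.choices(hosts, weights=weights, k=1)[0]


def simulate(n, seed=0):
    """Count how many of n simulated requests reach each provider."""
    rng = random.Random(seed)  # seeded so that repeated runs agree
    return Counter(pick_backend(rng, BACKENDS) for _ in range(n))


counts = simulate(1000)
for host, count in sorted(counts.items()):
    print(f"{host}: {count / 1000:.1%}")
```

Changing the two weight values, for example to 90 and 10, gives a quick feel for how many test requests are needed before a canary-style split becomes observable.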