In LLM scenarios, business applications require connectivity to foundational model services, both internal and external. Service Mesh (ASM) enables simultaneous connections to multiple foundational model services and offers an automatic rollback to an alternate service in the event of a failure, ensuring high availability for LLM applications. This topic describes how to leverage the traffic rollback feature for LLM service connections.
Prerequisites
A cluster is added to an ASM instance of version 1.22.6.72 or later.
Sidecar proxy injection is enabled for the specified namespaces. For more information, see Manage global namespaces.
Step 1: Create two LLMProviders
Create a file named provider.yaml with the following content. This YAML can be used to create two LLMProviders within ASM: a test provider for simulating service outages and a normally functioning Qwen provider.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: asm-llm-provider-test
spec:
  host: asm-llm-provider-test.com
  path: /compatible-mode/v1/chat/completions
  workloadSelector:
    labels:
      app: sleep
  configs:
    defaultConfig:
      openAIConfig:
        model: test-model
        stream: false
        apiKey: test-api-key
---
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  workloadSelector:
    labels:
      app: sleep
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat # Qwen open-source LLM
        stream: false
        apiKey: ${API_KEY of dashscope}

Modify .spec.configs.defaultConfig.openAIConfig.model as needed to explore different models. For additional Qwen open-source models, refer to Text generation - Qwen - open source.

Run the following command to deploy the LLMProviders by using the kubeconfig file of the ASM instance.
kubectl apply -f provider.yaml
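The apiKey placeholder in provider.yaml has to be replaced with a real key before the file is applied. The following is a minimal sketch of one way to do that without writing the key into the template by hand. The function name render_provider, the environment variable DASHSCOPE_API_KEY, and the output file name are assumptions made for illustration; they are not part of ASM.

```shell
# Sketch: substitute the apiKey placeholder with a key taken from the environment.
# render_provider and DASHSCOPE_API_KEY are hypothetical names, not ASM features.
render_provider() {
  in_file=$1
  out_file=$2
  # Abort with a message if the key is not set.
  : "${DASHSCOPE_API_KEY:?set DASHSCOPE_API_KEY first}"
  # Escape characters that are special in a sed replacement (&, |, backslash).
  escaped_key=$(printf '%s' "$DASHSCOPE_API_KEY" | sed 's/[&|\\]/\\&/g')
  # Replace the literal placeholder text used in provider.yaml.
  sed 's|[$]{API_KEY of dashscope}|'"$escaped_key"'|' "$in_file" > "$out_file"
}

# Usage (assumed file names):
#   export DASHSCOPE_API_KEY=<your key>
#   render_provider provider.yaml provider.rendered.yaml
#   kubectl apply -f provider.rendered.yaml
```

Keeping the template and the rendered file separate makes it easier to keep the real key out of version control.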
Step 2: Configure eviction policy and rollback policy for abnormal endpoint
To avoid service disruptions, configure an eviction policy for abnormal endpoints in the destination rule. After the eviction and rollback policies are configured, requests can be redirected to an operational provider if an LLM service fails. By default, the LLMProvider resource automatically creates the corresponding destination rule. To implement the eviction policy, modify this existing destination rule.
To enable custom destination rules and prevent the rules from being overwritten by the ASM control plane, annotate the LLMProvider asm-llm-provider-test by running the following command.
kubectl annotate llmprovider asm-llm-provider-test asm.alibabacloud.com/custom-destinationrule=true
Add the abnormal endpoint eviction policy by modifying the destination rule with the following command.
kubectl edit DestinationRule/asm-llm-provider-test
The updated destination rule is as follows:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: asm-llm-provider-test
  namespace: default
spec:
  host: asm-llm-provider-test.com
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 80
      tls:
        mode: SIMPLE
        sni: asm-llm-provider-test
      outlierDetection:
        consecutive5xxErrors: 1
        interval: 1s
        baseEjectionTime: 10s
        maxEjectionPercent: 100
        minHealthPercent: 0

This modification introduces an outlierDetection configuration to the destination rule. With these settings, a single 5xx error is enough to evict the endpoint: errors are evaluated every 1 second, and an evicted endpoint is temporarily removed for 10 seconds.

The outlierDetection configuration is outlined below:

| Configuration Item | Description |
| --- | --- |
| consecutive5xxErrors | The number of consecutive 5xx errors that triggers eviction. The default value is 5, meaning that if five consecutive requests return 5xx errors, the service is marked as unhealthy. |
| interval | The detection interval, which is how often the service is checked. The default value is 10s. For example, 5s means checking every five seconds. The supported format is 1h/1m/1s/1ms, and the value must be at least 1ms. |
| baseEjectionTime | The base time for which the service is evicted. After a service is marked as unhealthy, it is not reused within this time. The default value is 30s. The supported format is 1h/1m/1s/1ms, and the value must be at least 1ms. |
| maxEjectionPercent | The maximum percentage of services that can be evicted, which prevents too many services from being excluded simultaneously. For example, 100 means all services can be evicted. The default value is 10%. |
| minHealthPercent | The minimum percentage of healthy services. This parameter helps ensure that after some services are evicted, enough healthy services remain to process requests. The default value is 0, which disables this check and allows all services to be marked as unhealthy. |

For more information about outlierDetection, see OutlierDetection.

Create a virtual service to implement the rollback policy.
Create a file named vs.yaml with the following content.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-fallback-llm-vs
  namespace: default
spec:
  hosts:
  - asm-llm-provider-test.com
  http:
  - name: fallback-route
    route:
    - destination:
        host: asm-llm-provider-test.com
        port:
          number: 80
      fallback:
        target:
          host: dashscope.aliyuncs.com

Run the following command to deploy the virtual service.
kubectl apply -f vs.yaml
This setup ensures that a request sent to asm-llm-provider-test.com is rerouted to dashscope.aliyuncs.com if the former is deemed unhealthy.
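The eviction window set by baseEjectionTime is a base value rather than a fixed one. According to the Istio OutlierDetection reference, a host is evicted for the base time multiplied by the number of times it has been evicted. The sketch below works through that arithmetic for the 10s value used in this topic. The function name ejection_seconds is hypothetical, and the 300-second cap is an assumption taken from Envoy's default maximum ejection time; neither appears in the ASM configuration above.

```shell
# Sketch: eviction duration = baseEjectionTime x number of evictions so far,
# capped at a maximum (300s assumed here, based on Envoy's default).
ejection_seconds() {
  base=$1          # baseEjectionTime in seconds
  count=$2         # how many times the endpoint has been evicted
  max=${3:-300}    # assumed cap in seconds
  total=$((base * count))
  if [ "$total" -gt "$max" ]; then
    total=$max
  fi
  echo "$total"
}

# With baseEjectionTime: 10s, repeated evictions last longer each time.
for n in 1 2 3; do
  echo "eviction #$n lasts $(ejection_seconds 10 "$n")s"
done
```

For a provider that keeps failing, the rollback window therefore tends to grow beyond the initial 10 seconds, which is worth keeping in mind when choosing baseEjectionTime.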
Step 3: Create a sleep application for testing
To ensure that asm-llm-provider-test.com can be resolved, add a static DNS entry to the sleep deployment by using hostAliases.
Create a file named sleep.yaml with the following content.
Run the following command by using the kubeconfig file of the cluster on the data plane.
kubectl apply -f sleep.yaml
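The content of sleep.yaml is not reproduced in this topic. As a rough orientation only, the following sketch writes a minimal manifest of the kind this step describes. Everything in it is an assumption except the label app: sleep (required by the workloadSelector of the LLMProviders) and the host name asm-llm-provider-test.com: the IP address is a placeholder from the documentation address range, and the image and command are illustrative choices.

```shell
# Sketch only: a minimal sleep Deployment with a static host entry.
# The IP address, image, and command are assumptions, not values from ASM.
cat > sleep.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep                    # matches the LLMProvider workloadSelector
    spec:
      hostAliases:
      - ip: "192.0.2.1"               # hypothetical placeholder address
        hostnames:
        - "asm-llm-provider-test.com" # resolved statically inside the pod
      containers:
      - name: sleep
        image: curlimages/curl        # assumed image; any image with curl works
        command: ["sleep", "infinity"]
EOF
```

hostAliases adds the entry to the pod's /etc/hosts file, so the test host name resolves inside the pod without a real DNS record.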
Step 4: Verification
From the sleep pod, send the following request to the LLM service twice in quick succession.
The first request is expected to fail against the test provider and trigger the eviction. The eviction duration set in Step 2 is 10 seconds, so the second request must be sent within 10 seconds of the first. Adjust the outlierDetection settings as needed for your business requirements.
kubectl exec deployment/sleep -it -- curl http://asm-llm-provider-test.com \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{"role": "user", "content": "Please introduce yourself"}
]
}'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1730261854,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}
The model field in the output is qwen1.5-72b-chat, which shows that the request was automatically redirected to dashscope.aliyuncs.com.
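To check the result without reading the whole response, the model field can be extracted from the JSON. The following is a minimal sketch that assumes the response is compact, single-line JSON like the expected output above. The function name response_model and the trimmed sample response are hypothetical; a JSON-aware tool would be more robust where one is available.

```shell
# Sketch: print the value of the "model" field from a compact JSON response.
# response_model is a hypothetical helper based on plain text matching.
response_model() {
  sed -n 's/.*"model":"\([^"]*\)".*/\1/p'
}

# Trimmed sample response used only for illustration.
sample='{"object":"chat.completion","model":"qwen1.5-72b-chat","id":"chatcmpl-example"}'
printf '%s\n' "$sample" | response_model

# Possible usage against the cluster (no -it, so the output can be piped):
#   kubectl exec deployment/sleep -- curl -s http://asm-llm-provider-test.com \
#     --header 'Content-Type: application/json' \
#     --data '{"messages":[{"role":"user","content":"Please introduce yourself"}]}' \
#     | response_model
```

If the printed model is the one configured for dashscope-qwen rather than test-model, the request was served by the fallback provider.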
References
Traffic shifting is a crucial strategy for managing traffic routing. When an LLM provider in the request trace experiences a temporary outage, the configured eviction and rollback policies can ensure service continuity. For more information about traffic routing, including scenarios where different users access different LLM providers, see Traffic routing: Use ASM to manage LLM traffic.