Use the ASM rollback feature to create a high-availability LLM service - Alibaba Cloud Service Mesh

In LLM scenarios, business applications require connectivity to foundational model services, both internal and external. Service Mesh (ASM) enables simultaneous connections to multiple foundational model services and offers an automatic rollback to an alternate service in the event of a failure, ensuring high availability for LLM applications. This topic describes how to leverage the traffic rollback feature for LLM service connections.

Prerequisites

A cluster is added to an ASM instance of version 1.22.6.72 or later.
Sidecar proxy injection is enabled for the specified namespaces. For more information, see Manage global namespaces.

Step 1: Create two LLMProviders

Create a file named provider.yaml with the following content. This YAML can be used to create two LLMProviders within ASM: a test provider for simulating service outages and a normally functioning Qwen provider.

apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:  
  name: asm-llm-provider-test
spec:
  host: asm-llm-provider-test.com
  path: /compatible-mode/v1/chat/completions
  workloadSelector:
    labels:
      app: sleep
  configs:
    defaultConfig:
      openAIConfig:
        model: test-model
        stream: false
        apiKey: test-api-key
---
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:  
  name: dashscope-qwen
spec:
  host: dashscope.aliyuncs.com
  path: /compatible-mode/v1/chat/completions
  workloadSelector:
    labels:
      app: sleep
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen1.5-72b-chat  # Qwen open-source LLM
        stream: false
        apiKey: ${API_KEY of dashscope}

Modify .spec.configs.defaultConfig.openAIConfig.model as needed to explore different models. For additional Qwen open-source models, refer to Text generation - Qwen - open source.

Run the following command by using the kubeconfig file of the ASM instance to deploy the two LLMProviders.
```
kubectl apply -f provider.yaml
```
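To confirm that both resources were accepted, you can look them up by name. The following is a minimal sketch, not an official procedure. It assumes the `llmprovider` resource name that the `kubectl annotate` command in Step 2 uses, and the helper name `check_llm_providers` is hypothetical.

```shell
# Hypothetical helper: report whether each expected LLMProvider exists.
# Assumes kubectl is using the kubeconfig file of the ASM instance.
check_llm_providers() {
  providers_missing=0
  for provider_name in asm-llm-provider-test dashscope-qwen; do
    if kubectl get llmprovider "$provider_name" >/dev/null 2>&1; then
      echo "found: $provider_name"
    else
      echo "missing: $provider_name"
      providers_missing=1
    fi
  done
  # Non-zero exit status if at least one provider was not found.
  return "$providers_missing"
}
```

Call `check_llm_providers` after `kubectl apply` finishes. A non-zero exit status means that at least one of the two providers was not found.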

Step 2: Configure the eviction policy and rollback policy for abnormal endpoints

To avoid service disruptions, configure the abnormal endpoint eviction policy in the destination rule. After the eviction and rollback policies are configured, requests can be redirected to an operational provider if an LLM service fails. By default, the LLMProvider resource automatically creates the corresponding destination rule. To implement the eviction policy, modify this existing destination rule.

To enable custom destination rules and prevent them from being overwritten by the ASM control plane, annotate the LLMProvider asm-llm-provider-test by running the following command.
```
kubectl annotate llmprovider asm-llm-provider-test asm.alibabacloud.com/custom-destinationrule=true
```

Add the abnormal endpoint eviction policy by modifying the destination rule with the following command.

```
kubectl edit DestinationRule/asm-llm-provider-test
```

The updated destination rule is as follows:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: asm-llm-provider-test
  namespace: default
spec:
  host: asm-llm-provider-test.com
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 80
      tls:
        mode: SIMPLE
        sni: asm-llm-provider-test
      outlierDetection:
        consecutive5xxErrors: 1
        interval: 1s
        baseEjectionTime: 10s
        maxEjectionPercent: 100
        minHealthPercent: 0

This modification adds an outlierDetection configuration to the destination rule. With these settings, endpoints are evaluated every 1 second, a single 5xx error is enough to evict an endpoint, and an evicted endpoint is removed from load balancing for 10 seconds.
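The same outlierDetection settings can also be applied without an interactive editor. The following is a minimal sketch that uses a JSON patch. It assumes that the first entry of `portLevelSettings` is the port 80 entry of the destination rule in the `default` namespace, and the helper name `patch_outlier_detection` is hypothetical.

```shell
# Hypothetical helper: add the outlierDetection block with a JSON patch.
# Assumes portLevelSettings[0] is the port 80 entry of the destination rule.
patch_outlier_detection() {
  kubectl patch destinationrule asm-llm-provider-test -n default --type json -p '[
    {"op": "add",
     "path": "/spec/trafficPolicy/portLevelSettings/0/outlierDetection",
     "value": {
       "consecutive5xxErrors": 1,
       "interval": "1s",
       "baseEjectionTime": "10s",
       "maxEjectionPercent": 100,
       "minHealthPercent": 0
     }}
  ]'
}
```

If the generated destination rule lists its ports in a different order, adjust the index in the patch path before you call `patch_outlier_detection`.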

The outlierDetection configuration is outlined below:

Configuration Item	Description
consecutive5xxErrors	The number of consecutive 5xx errors that triggers eviction. The default value is `5`, which means that an endpoint is marked as unhealthy after five consecutive requests return 5xx errors.
interval	The detection interval, which is how often endpoints are checked. The default value is `10s`. For example, `5s` means a check every five seconds. The supported format is `1h`/`1m`/`1s`/`1ms`, and the value must be at least `1ms`.
baseEjectionTime	The base duration of an eviction. After an endpoint is marked as unhealthy, it is not used again during this time. The default value is `30s`. The supported format is `1h`/`1m`/`1s`/`1ms`, and the value must be at least `1ms`.
maxEjectionPercent	The maximum percentage of endpoints that can be evicted, which prevents too many endpoints from being excluded at the same time. For example, `100` means that all endpoints can be evicted. The default value is `10%`.
minHealthPercent	The minimum percentage of healthy endpoints. This parameter helps ensure that enough healthy endpoints remain available to process requests after some endpoints are evicted. The default value is `0`, which disables this check and allows all endpoints to be marked as unhealthy.

For more information about outlierDetection, see OutlierDetection.
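As a worked example of how baseEjectionTime behaves over time: the Istio OutlierDetection reference describes the eviction duration as the base time multiplied by the number of times the endpoint has been evicted. The sketch below only reproduces that arithmetic. The helper name `ejection_seconds` is hypothetical, and any upper cap that the proxy applies is not modeled.

```shell
# Hypothetical helper: eviction duration in seconds for the Nth eviction,
# computed as base time multiplied by the number of evictions so far.
# Any upper cap applied by the proxy is not modeled here.
ejection_seconds() {
  base_seconds=$1
  ejection_count=$2
  echo $(( base_seconds * ejection_count ))
}

ejection_seconds 10 1   # first eviction with baseEjectionTime: 10s, prints 10
ejection_seconds 10 3   # third eviction of the same endpoint, prints 30
```

With the 10-second base time used in this topic, an endpoint that keeps failing stays out of rotation for progressively longer periods.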

Create a virtual service to implement the rollback policy.

Create a file named vs.yaml with the following content.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-fallback-llm-vs
  namespace: default
spec:
  hosts:
  - asm-llm-provider-test.com
  http:
  - name: fallback-route
    route:
    - destination:
        host: asm-llm-provider-test.com
        port:
          number: 80
      fallback:
        target:
          host: dashscope.aliyuncs.com

Run the following command to deploy a virtual service.
```
kubectl apply -f vs.yaml
```
This setup ensures that requests sent to asm-llm-provider-test.com are rerouted to dashscope.aliyuncs.com if the former is deemed unhealthy.

Step 3: Create a sleep application for testing

To ensure that asm-llm-provider-test.com can be resolved inside the pod, add a static DNS entry to the sleep Deployment by using hostAliases.

Create a file named sleep.yaml with the following content.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      hostAliases:
      - hostnames:
        - asm-llm-provider-test.com
        ip: 1.2.3.4
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true
---

Run the following command by using the kubeconfig file of the data plane cluster to deploy the sleep application.
```
kubectl apply -f sleep.yaml
```
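Before you move on to verification, you may want to check that the pod is running and that the static host entry took effect. The following is a minimal sketch. It assumes that the Deployment runs in the current namespace of the data plane kubeconfig file, and the helper name `check_sleep_pod` is hypothetical.

```shell
# Hypothetical helper: wait for the sleep Deployment, then print its hosts file.
# Assumes kubectl is using the kubeconfig file of the data plane cluster.
check_sleep_pod() {
  kubectl rollout status deployment/sleep --timeout=120s || return 1
  # hostAliases entries are written to /etc/hosts inside the pod.
  kubectl exec deployment/sleep -c sleep -- cat /etc/hosts
}
```

In the output of `check_sleep_pod`, look for a line that maps 1.2.3.4 to asm-llm-provider-test.com.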

Step 4: Verification

Send the following request twice in quick succession from the sleep pod to the LLM service.

The first request is expected to fail and trigger the eviction of asm-llm-provider-test.com. The eviction duration configured in Step 2 is 10 seconds, so the second request must be sent within 10 seconds of the first. Adjust the outlierDetection settings as needed for your business requirements.

```
kubectl exec deployment/sleep -it -- curl http://asm-llm-provider-test.com \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself"}
    ]
}'
```

Expected output:

{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1730261854,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}

The output shows that the request was automatically redirected to dashscope.aliyuncs.com.
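To make the two-request sequence easier to repeat, the requests can be wrapped in a small loop that prints only the HTTP status code of each attempt. This is a minimal sketch. It assumes that the first request fails with a 5xx error and triggers the eviction, and the helper name `send_twice` is hypothetical.

```shell
# Hypothetical helper: send the same chat request twice and print each status code.
# Assumes the first request triggers the eviction, so the second one,
# sent within the 10-second eviction window, reaches the fallback provider.
send_twice() {
  for attempt in 1 2; do
    code=$(kubectl exec deployment/sleep -- curl -s -o /dev/null -w '%{http_code}' \
      http://asm-llm-provider-test.com \
      --header 'Content-Type: application/json' \
      --data '{"messages": [{"role": "user", "content": "Please introduce yourself"}]}')
    echo "attempt $attempt: HTTP $code"
  done
}
```

Run `send_twice` and compare the two status codes. To see the response body of the second request, use the full curl command shown earlier in this step.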

References

Traffic shifting is a crucial strategy for managing traffic routing. When an LLM provider in the request path experiences a temporary outage, the configured eviction and rollback policies can ensure service continuity. For more information about traffic routing, including scenarios where different users access different LLM providers, see Traffic routing: Use ASM to manage LLM traffic.