Alibaba Cloud Service Mesh: Use the ASM rollback feature to create a high-availability LLM service

Last Updated: Mar 24, 2025

In LLM scenarios, business applications need to connect to foundation model services, both internal and external. Service Mesh (ASM) can connect to multiple foundation model services at the same time and automatically roll traffic back (fall back) to an alternate service when one of them fails, which keeps LLM applications highly available. This topic describes how to use the traffic rollback feature for LLM service connections.

Prerequisites

Step 1: Create two LLMProviders

  1. Create a file named provider.yaml with the following content. This YAML can be used to create two LLMProviders within ASM: a test provider for simulating service outages and a normally functioning Qwen provider.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:  
      name: asm-llm-provider-test
    spec:
      host: asm-llm-provider-test.com
      path: /compatible-mode/v1/chat/completions
      workloadSelector:
        labels:
          app: sleep
      configs:
        defaultConfig:
          openAIConfig:
            model: test-model
            stream: false
            apiKey: test-api-key
    ---
    apiVersion: istio.alibabacloud.com/v1beta1
    kind: LLMProvider
    metadata:  
      name: dashscope-qwen
    spec:
      host: dashscope.aliyuncs.com
      path: /compatible-mode/v1/chat/completions
      workloadSelector:
        labels:
          app: sleep
      configs:
        defaultConfig:
          openAIConfig:
            model: qwen1.5-72b-chat  # Qwen open-source LLM
            stream: false
            apiKey: ${API_KEY of dashscope}

    To try a different model, change .spec.configs.defaultConfig.openAIConfig.model. For other Qwen open-source models, see Text generation - Qwen - open source.

  2. Run the following command to deploy the two LLMProviders by using the kubeconfig file of the ASM instance.

    kubectl apply -f provider.yaml
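
Before you continue, it can help to confirm that both resources exist. The following is a minimal sketch, not part of the official procedure: check_llmproviders is a hypothetical helper name, and the sketch assumes that kubectl uses the kubeconfig file of the ASM instance and that the LLMProviders were created in the current namespace.

```shell
# Hypothetical helper: report whether each LLMProvider from provider.yaml exists.
# Assumes kubectl uses the kubeconfig file of the ASM instance.
check_llmproviders() {
  for name in asm-llm-provider-test dashscope-qwen; do
    if kubectl get llmprovider "$name" >/dev/null 2>&1; then
      echo "$name: found"
    else
      echo "$name: missing"
    fi
  done
}

check_llmproviders
```

If either name is reported as missing, re-check the kubeconfig file in use and the output of the kubectl apply command.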

Step 2: Configure the eviction policy and rollback policy for abnormal endpoints

To avoid service disruptions, configure an eviction policy for abnormal endpoints in the destination rule. After the eviction and rollback policies are configured, requests are redirected to an operational provider when an LLM service fails. By default, each LLMProvider resource automatically creates a corresponding destination rule. To implement the eviction policy, modify that existing destination rule.

  1. To enable custom destination rules and prevent them from being overwritten by the ASM control plane, annotate the LLMProvider asm-llm-provider-test by running the following command.

    kubectl annotate llmprovider asm-llm-provider-test asm.alibabacloud.com/custom-destinationrule=true

  2. Add the abnormal endpoint eviction policy by modifying the destination rule with the following command.

    kubectl edit DestinationRule/asm-llm-provider-test

    The updated destination rule is as follows:

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: asm-llm-provider-test
      namespace: default
    spec:
      host: asm-llm-provider-test.com
      trafficPolicy:
        portLevelSettings:
        - port:
            number: 80
          tls:
            mode: SIMPLE
            sni: asm-llm-provider-test
          outlierDetection:
            consecutive5xxErrors: 1
            interval: 1s
            baseEjectionTime: 10s
            maxEjectionPercent: 100
            minHealthPercent: 0

    This modification adds an outlierDetection configuration to the destination rule. With these settings, a single 5xx error is enough to mark the endpoint as unhealthy, detection runs every second, and an evicted endpoint is removed for a base duration of 10 seconds.

    The outlierDetection configuration is outlined below:

    Configuration Item      Description

    consecutive5xxErrors    The number of consecutive 5xx errors after which an endpoint is evicted. The default value is 5, which means that an endpoint is marked as unhealthy after five consecutive requests return 5xx errors.

    interval                The detection interval, that is, how often endpoints are checked. The default value is 10s. For example, 5s means a check every five seconds. The supported format is 1h/1m/1s/1ms, and the value must be at least 1ms.

    baseEjectionTime        The base duration for which an endpoint is evicted. After an endpoint is marked as unhealthy, it is not used again within this time. The default value is 30s. The supported format is 1h/1m/1s/1ms, and the value must be at least 1ms.

    maxEjectionPercent      The maximum percentage of endpoints that can be evicted, which prevents too many endpoints from being excluded at the same time. For example, 100 means that all endpoints can be evicted. The default value is 10%.

    minHealthPercent        The minimum percentage of healthy endpoints. This parameter helps ensure that enough healthy endpoints remain to process requests after some are evicted. The default value is 0, which disables this check and allows all endpoints to be marked as unhealthy.

    For more information about outlierDetection, see OutlierDetection.

  3. Create a virtual service to implement the rollback policy.

    1. Create a file named vs.yaml with the following content.

      apiVersion: networking.istio.io/v1alpha3
      kind: VirtualService
      metadata:
        name: test-fallback-llm-vs
        namespace: default
      spec:
        hosts:
        - asm-llm-provider-test.com
        http:
        - name: fallback-route
          route:
          - destination:
              host: asm-llm-provider-test.com
              port:
                number: 80
            fallback:
              target:
                host: dashscope.aliyuncs.com

    2. Run the following command to deploy the virtual service.

      kubectl apply -f vs.yaml

      With this configuration, requests sent to asm-llm-provider-test.com are rerouted to dashscope.aliyuncs.com when the former is deemed unhealthy.
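
How the eviction and rollback policies in this step interact can be illustrated with a toy model. The sketch below is not the proxy's actual implementation: simulate_routing is a hypothetical function, timing (interval and baseEjectionTime) is ignored, and every request is assumed to arrive while an eviction is still in effect. It only shows that, with consecutive5xxErrors set to 1, the first failed request triggers the eviction and the next request goes to the fallback target.

```shell
# Toy model of the Step 2 behavior, not the proxy's actual implementation.
# Usage: simulate_routing THRESHOLD STATUS...
#   THRESHOLD  value of consecutive5xxErrors
#   STATUS...  status code the primary provider would return for each request
# Timing is ignored: once evicted, the primary stays evicted for the whole run.
simulate_routing() {
  threshold=$1
  shift
  streak=0    # consecutive 5xx responses seen from the primary
  ejected=0   # becomes 1 once the primary has been evicted
  n=0
  for status in "$@"; do
    n=$((n + 1))
    if [ "$ejected" -eq 1 ]; then
      echo "request $n -> dashscope.aliyuncs.com (fallback)"
      continue
    fi
    echo "request $n -> asm-llm-provider-test.com ($status)"
    if [ "$status" -ge 500 ] && [ "$status" -le 599 ]; then
      streak=$((streak + 1))
    else
      streak=0
    fi
    if [ "$streak" -ge "$threshold" ]; then
      ejected=1
    fi
  done
}

# Two requests against a primary that always fails, with consecutive5xxErrors: 1.
simulate_routing 1 503 503
```

In this model, the first request reaches asm-llm-provider-test.com and fails, and the second is served by dashscope.aliyuncs.com. With the default threshold of 5 (simulate_routing 5 503 503), both requests would still go to the failing provider, which is why this step lowers consecutive5xxErrors to 1.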

Step 3: Create a sleep application for testing

asm-llm-provider-test.com is a test domain. To make sure that it resolves inside the sleep pod, add a static DNS entry to the sleep Deployment by using hostAliases.

  1. Create a file named sleep.yaml with the following content.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          hostAliases:
          - hostnames:
            - asm-llm-provider-test.com
            ip: 1.2.3.4
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
    ---
  2. Run the following command to deploy the sleep application by using the kubeconfig file of the cluster on the data plane.

    kubectl apply -f sleep.yaml
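
To confirm that the static entry reached the pod, you can inspect /etc/hosts in the sleep container, which is where Kubernetes writes hostAliases entries. The following is a minimal sketch; check_host_alias is a hypothetical helper name, and the sketch assumes that kubectl uses the kubeconfig file of the data plane cluster and that the sleep pod is already running.

```shell
# Hypothetical helper: check that the hostAliases entry is present in the pod.
# Assumes kubectl uses the kubeconfig file of the data plane cluster.
check_host_alias() {
  if kubectl exec deployment/sleep -- cat /etc/hosts 2>/dev/null \
      | grep -q 'asm-llm-provider-test.com'; then
    echo "host alias present"
  else
    echo "host alias missing"
  fi
}

check_host_alias
```

If the entry is missing, check the hostAliases section in sleep.yaml and re-apply the file.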

Step 4: Verification

From the sleep pod, send the following request to the LLM service twice in quick succession.

The eviction duration configured in Step 2 is 10 seconds, so the second request must be sent within 10 seconds of the first. Adjust the outlierDetection settings as needed for your business requirements.

kubectl exec deployment/sleep -it -- curl http://asm-llm-provider-test.com \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself"}
    ]
}'

Expected output:

{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1730261854,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}

The model field in the response is qwen1.5-72b-chat, which shows that the request was automatically redirected to dashscope.aliyuncs.com.
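
When the response is long, reading the model field by eye is error-prone. The following is a minimal sketch for extracting it; response_model is a hypothetical helper name, and the sed expression assumes the compact single-line JSON shown in the expected output. A JSON-aware tool such as jq is a sturdier choice if it is available.

```shell
# Hypothetical helper: print the "model" field of a chat completion response
# read from standard input. Assumes compact, single-line JSON.
response_model() {
  sed -n 's/.*"model":"\([^"]*\)".*/\1/p'
}

# Abbreviated sample based on the expected output (most fields omitted).
sample='{"choices":[],"object":"chat.completion","model":"qwen1.5-72b-chat","id":"chatcmpl-xxxx"}'
printf '%s\n' "$sample" | response_model
```

For the sample, the helper prints qwen1.5-72b-chat. To use it against the live service, save or pipe the response body of the request in this step into response_model. A value of qwen1.5-72b-chat indicates that the Qwen provider answered.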

References

Traffic shifting is a key strategy for managing traffic routing. When an LLM provider on the request path experiences a temporary outage, the configured eviction and rollback policies keep the service available. For more information about traffic routing, including scenarios where different users access different LLM providers, see Traffic routing: Use ASM to manage LLM traffic.