
Use ACK Gateway with Inference Extension to configure traffic circuit breaker rules for inference services

Last Updated: Jul 28, 2025

The ACK Gateway with Inference Extension component supports circuit breaker rules in addition to intelligent load balancing for inference services. When a service becomes abnormal, the circuit breaker mechanism automatically cuts off connections to the problematic service to prevent fault propagation. This topic describes how to use ACK Gateway with Inference Extension to configure traffic circuit breaker rules for inference services.

Important

Before reading this topic, make sure you understand the concepts of InferencePool and InferenceModel.

Prerequisites

The ACK Gateway with Inference Extension component is installed in your cluster.

Note

For the image used in this topic, we recommend A10 GPUs for ACK clusters and GN8IS GPU compute power for Alibaba Cloud Container Compute Service (ACS).

Because the LLM image is large, we recommend that you push it to Container Registry in advance and pull it over the internal network. The speed of pulling over the public network depends on the bandwidth of the cluster's elastic IP address (EIP), which may result in long wait times.

Workflow

In this example, the following resources are used:

  • The vllm-llama2-7b-pool inference service, which serves as the application.

  • A gateway whose Service type is ClusterIP.

  • An HTTPRoute, which is configured with specific traffic forwarding rules and a circuit breaker rule that limits the number of pending requests to 1 and the number of concurrent requests to 1.

    This topic uses a small concurrency limit for demonstration purposes. In production environments, adjust the limit to your needs.
  • An InferencePool and a corresponding InferenceModel, which enable intelligent load balancing for the application.

  • A sleep application, which serves as the test client.

The following figure illustrates the request path for the traffic circuit breaker mechanism.

[Figure: request path of the traffic circuit breaker mechanism]
  • The client initiates request ① and, before the response to request ① returns, initiates request ②.

  • When request ① arrives, the circuit breaker rule determines that no requests are pending, so request ① is forwarded to the application.

  • When request ② arrives, the circuit breaker rule determines that request ① is still being processed, so it blocks request ② and returns circuit breaker information ③ to the client. In this example, the request is rejected.

  • After the application finishes processing request ①, response ④ is returned to the client.

Procedure

  1. Deploy an inference service named vllm-llama2-7b-pool.


    # =============================================================
    # inference_app.yaml
    # =============================================================
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: chat-template
    data:
      llama-2-chat.jinja: |
        {% if messages[0]['role'] == 'system' %}
          {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
          {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}
    
        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}
    
            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}
            {% if message['role'] == 'user' %}
                {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
            {% elif message['role'] == 'assistant' %}
                {{ ' ' + content | trim + ' ' + eos_token }}
            {% endif %}
        {% endfor %}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama2-7b-pool
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-llama2-7b-pool
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: vllm-llama2-7b-pool
        spec:
          containers:
            - name: lora
              image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "/model/llama2"
              - "--tensor-parallel-size"
              - "1"
              - "--port"
              - "8000"
              - '--gpu-memory-utilization'
              - '0.8'
              - "--enable-lora"
              - "--max-loras"
              - "4"
              - "--max-cpu-loras"
              - "12"
              - "--lora-modules"
              - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
              - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
              - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
              - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
              - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
              - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
              - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
              - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
              - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
              - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
              - '--chat-template'
              - '/etc/vllm/llama-2-chat.jinja'
              env:
                - name: PORT
                  value: "8000"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 2400
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 6000
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - mountPath: /etc/vllm
                  name: chat-template
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: chat-template
              configMap:
                name: chat-template
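
    If the manifest above is saved as inference_app.yaml (the file name is an assumption in this example), you can apply it and wait for the pod to pass its readiness probe. Model loading can take several minutes:

    kubectl apply -f inference_app.yaml
    # Wait until the Deployment finishes rolling out; the timeout is illustrative.
    kubectl rollout status deployment/vllm-llama2-7b-pool --timeout=30m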
  2. Deploy the InferencePool and InferenceModel resources.

    # =============================================================
    # inference_rules.yaml
    # =============================================================
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: vllm-llama2-7b-pool
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-sample
    spec:
      modelName: /model/llama2
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-llama2-7b-pool
      targetModels:
      - name: /model/llama2
        weight: 100
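
    A minimal sketch of applying and checking these resources, assuming the manifest is saved as inference_rules.yaml:

    kubectl apply -f inference_rules.yaml
    # Confirm that the InferencePool and InferenceModel resources exist.
    kubectl get inferencepool,inferencemodel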
  3. Deploy a Gateway and an HTTPRoute, and configure circuit breaker rules.

    The Service type of the gateway is ClusterIP, which means the gateway can be accessed only from within the cluster. You can change the type to LoadBalancer based on your business requirements.
    # =============================================================
    # gateway.yaml
    # =============================================================
    kind: GatewayClass
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: example-gateway-class
      labels:
        example: http-routing
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      labels:
        example: http-routing
      name: example-gateway
      namespace: default
    spec:
      gatewayClassName: example-gateway-class
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: test-httproute
      labels:
        example: http-routing
    spec:
      parentRefs:
        - name: example-gateway
      hostnames:
        - "example.com"
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /
          backendRefs:
          - group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool
            weight: 1
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: circuitbreaker-for-route
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: HTTPRoute
          name: test-httproute
      circuitBreaker:
        maxPendingRequests: 1
        maxParallelRequests: 1 # Limit concurrent requests to 1
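
    Assuming the manifests above are saved as gateway.yaml (an assumed file name), you can apply them and wait for the gateway to become ready:

    kubectl apply -f gateway.yaml
    # The Gateway API reports a Programmed condition once the gateway is ready.
    kubectl wait gateway/example-gateway --for=condition=Programmed --timeout=5m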
  4. Deploy the sleep application.

    # =============================================================
    # sleep.yaml
    # =============================================================
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
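
    A quick check that the test client is running, assuming the manifest is saved as sleep.yaml:

    kubectl apply -f sleep.yaml
    # Wait until the sleep Deployment is available.
    kubectl wait deployment/sleep --for=condition=Available --timeout=5m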
  5. Verify the traffic circuit breaker configuration.

    1. Obtain the gateway address.

      export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
    2. Open two terminal windows. Run the following request in window 1 and, before it returns, run the same request in window 2.

      kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions -H 'Content-Type: application/json' -H "host: example.com" -d '{
          "model": "/model/llama2",
          "max_completion_tokens": 100,
          "temperature": 0,
          "messages": [
            {
              "role": "user",
              "content": "introduce yourself"
            }
          ]
      }'

      Expected output in window 1:

      {"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n        ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}%

      Expected output in window 2:

      upstream connect error or disconnect/reset before headers. reset reason: overflow

      After you configure the circuit breaker rule, if the number of concurrent requests exceeds the configured limit of 1, subsequent requests trigger the circuit breaker and are rejected.
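
      To relax the limits for production traffic, you can update the BackendTrafficPolicy in place. The values below are illustrative only:

      # Raise both circuit breaker thresholds, for example to 64.
      kubectl patch backendtrafficpolicy circuitbreaker-for-route --type merge \
        -p '{"spec":{"circuitBreaker":{"maxPendingRequests":64,"maxParallelRequests":64}}}'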