The ACK Gateway with Inference Extension component supports configuring circuit breaker rules while enabling intelligent load balancing for inference services. When a service becomes abnormal, the circuit breaker mechanism automatically disconnects problematic service connections to prevent fault propagation. This topic describes how to use ACK Gateway with Inference Extension to configure traffic circuit breaker rules for inference services.
Before reading this topic, make sure you understand the concepts of InferencePool and InferenceModel.
Prerequisites
An ACK managed cluster with a GPU node pool is created. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU compute power.
ACK Gateway with Inference Extension is installed and Enable Gateway API Inference Extension is selected when you create the cluster. For more information, see Step 2: Install the Gateway with Inference Extension component.
For the image used in this topic, we recommend that you use A10 cards for ACK clusters and GN8IS cards for Alibaba Cloud Container Compute Service (ACS) GPU computing power.
Due to the large size of the LLM image, we recommend that you transfer it to Container Registry in advance and pull it using the internal network address. The speed of pulling from the public network depends on the bandwidth configuration of the cluster elastic IP address (EIP), which may result in longer wait times.
Workflow
In this example, the following resources are used:
The vllm-llama2-7b-pool inference service (Application).
A gateway with a Service type of ClusterIP.
An HTTPRoute with traffic forwarding rules, plus a circuit breaker rule (a BackendTrafficPolicy attached to the route) that limits the number of pending requests and the number of concurrent requests to 1 each.
This topic uses a small concurrency limit for demonstration purposes. In actual environments, modify the limit according to your needs.
An InferencePool and its corresponding InferenceModel, which enable intelligent load balancing for the application.
A sleep application, which serves as the test client.
The following figure illustrates the request path for the traffic circuit breaker mechanism.
The client initiates request ① and then initiates request ② before the response of request ① returns.
The circuit breaker rule determines that there are no pending requests before request ①, so request ① is forwarded to the application.
When request ② arrives, the circuit breaker rule determines that request ① is still being processed, so it blocks request ② and returns circuit breaker information ③ to the client. In this example, request ② is rejected.
After request ① is processed by the application, the response ④ is returned to the client.
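The decision logic in the steps above can be modeled with a short shell sketch. This is a toy model for illustration, not the gateway's implementation: the names MAX_PARALLEL, in_flight, admit, and complete are hypothetical, request 3 is an extra request added to show recovery, and the model ignores the separate pending-request queue.

```shell
#!/usr/bin/env bash
# Toy model of the admission decision described in the workflow.
# NOT the gateway's implementation; all names here are hypothetical.
MAX_PARALLEL=1   # mirrors the demo limit of 1 concurrent request
in_flight=0      # requests currently being processed by the application
decision=""      # outcome of the most recent admit call

# admit <id>: forward the request if capacity remains, otherwise reject it.
admit() {
  if [ "$in_flight" -ge "$MAX_PARALLEL" ]; then
    decision="rejected"
  else
    in_flight=$((in_flight + 1))
    decision="forwarded"
  fi
  echo "request $1: $decision"
}

# complete: the application finished one request, which frees a slot.
complete() {
  if [ "$in_flight" -gt 0 ]; then
    in_flight=$((in_flight - 1))
  fi
}

admit 1    # nothing in flight, so request 1 is forwarded
admit 2    # request 1 is still in flight, so request 2 is rejected
complete   # the response to request 1 has been returned
admit 3    # capacity is free again, so a later request is forwarded
```

Running the sketch prints "forwarded" for request 1, "rejected" for request 2, and "forwarded" for request 3, which matches the order of events ① to ④ in the workflow.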
Procedure
Deploy an inference service named vllm-llama2-7b-pool.
Deploy the InferencePool and InferenceModel resources.
# =============================================================
# inference_rules.yaml
# =============================================================
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: /model/llama2
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: /model/llama2
    weight: 100

Deploy a Gateway and an HTTPRoute, and configure circuit breaker rules.
The Service type of the gateway is ClusterIP, which can only be accessed from within the cluster. You can modify it to LoadBalancer based on your business requirements.
# =============================================================
# gateway.yaml
# =============================================================
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: example-gateway-class
  labels:
    example: http-routing
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    example: http-routing
  name: example-gateway
  namespace: default
spec:
  gatewayClassName: example-gateway-class
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: test-httproute
  labels:
    example: http-routing
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama2-7b-pool
      weight: 1
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: circuitbreaker-for-route
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: test-httproute
  circuitBreaker:
    maxPendingRequests: 1
    maxParallelRequests: 1 # Limit concurrent requests to 1

Deploy the sleep application.
# =============================================================
# sleep.yaml
# =============================================================
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true

Verify the traffic circuit breaker configuration.
Obtain the gateway address.
export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')

Open two terminal windows. In window 1, initiate a test request. Before that request returns, initiate the same request in window 2.

kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "host: example.com" \
  -d '{
    "model": "/model/llama2",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
  }'

Expected output in window 1:

{"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}

Expected output in window 2:

upstream connect error or disconnect/reset before headers. reset reason: overflow

After you configure the circuit breaker rule, if the number of concurrent requests exceeds the configured limit of 1, subsequent requests trigger the circuit breaker.
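If opening two terminal windows is inconvenient, the same overlap can be produced from one terminal by running the requests in the background. The following is a minimal sketch under stated assumptions: the function names send_request and run_pair and the KUBECTL override variable are hypothetical helpers, the one-second delay is an arbitrary choice, and GATEWAY_ADDRESS is assumed to be exported as in the earlier step. The `-it` flags are dropped because the commands run in the background, and curl prints only the HTTP status code.

```shell
#!/usr/bin/env bash
# Sketch: fire two overlapping requests from a single terminal.
# KUBECTL is a hypothetical override so the functions can be dry-run with a stub.
KUBECTL="${KUBECTL:-kubectl}"

# send_request <label>: send one chat completion request and print its HTTP status.
send_request() {
  local label="$1" code
  code=$("$KUBECTL" exec deployment/sleep -- curl -s -o /dev/null -w '%{http_code}' \
    -X POST "${GATEWAY_ADDRESS}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -H 'host: example.com' \
    -d '{"model": "/model/llama2", "max_completion_tokens": 100, "temperature": 0, "messages": [{"role": "user", "content": "introduce yourself"}]}')
  echo "request ${label}: HTTP ${code}"
}

# run_pair: start request 1, then start request 2 while request 1 is still running.
run_pair() {
  send_request 1 &
  sleep 1   # arbitrary delay so that request 1 reaches the gateway first
  send_request 2 &
  wait      # block until both background requests have finished
}
```

After loading these functions into the shell, call `run_pair`. With the demo limit of 1, request 1 would be expected to succeed while request 2 returns an error status. Envoy generally reports circuit breaker overflow as HTTP 503, but confirm the exact status in your environment.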