Large language model (LLM) applications typically require GPUs to run, and GPU-accelerated nodes or virtual nodes are more expensive than CPU nodes. Gateway with Inference Extension lets you use CPU computing power to quickly try out intelligent load balancing for LLM inference scenarios. This topic describes how to build a mock environment with Gateway with Inference Extension to demonstrate the intelligent load balancing capability for inference services.
Applicability
Gateway with Inference Extension is installed, and Enable Gateway API Inference Extension is selected when you create the cluster. For more information, see Step 2: Install the Gateway with Inference Extension component.
The mock environment built in this topic is intended only for trying out basic AI capabilities of the Gateway with Inference Extension component, such as phased release, request circuit breaking, and traffic mirroring. It is not suitable for stress testing and is not recommended for production environments.
Procedure
Step 1: Deploy the mock sample application
Create a mock-vllm.yaml file with the following content.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mock-vllm
---
apiVersion: v1
kind: Service
metadata:
  name: mock-vllm
  labels:
    app: mock-vllm
    service: mock-vllm
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  selector:
    app: mock-vllm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mock-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mock-vllm
  template:
    metadata:
      labels:
        app: mock-vllm
    spec:
      serviceAccountName: mock-vllm
      containers:
      - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/mock-vllm:v0.1.7-g3cffa27-aliyun
        imagePullPolicy: IfNotPresent
        name: mock-vllm
        ports:
        - containerPort: 8000
```
Deploy the sample application.
```shell
kubectl apply -f mock-vllm.yaml
```
Create a sleep.yaml file with the following content.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
```
Deploy the test application, which is used to send test requests to the sample application.
```shell
kubectl apply -f sleep.yaml
```
Step 2: Deploy inference resources
Create an inference-rule.yaml file with the following content.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: mock-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: mock-ext-proc
  selector:
    app: mock-vllm
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: mock-model
spec:
  criticality: Critical
  modelName: mock
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: mock-pool
  targetModels:
  - name: mock
    weight: 100
```
Deploy the InferencePool and InferenceModel.
```shell
kubectl apply -f inference-rule.yaml
```
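The targetModels field is what enables the phased release capability this environment is meant to demonstrate: when several backend model versions are listed with weights, requests for one public model name are split across them proportionally. A hypothetical sketch is shown below; the mock-v2 target name is an assumption for illustration and is not served by this mock environment.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: mock-model-canary
spec:
  criticality: Critical
  modelName: mock
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: mock-pool
  targetModels:
  - name: mock      # stable version receives 90% of requests
    weight: 90
  - name: mock-v2   # hypothetical canary version receives 10%
    weight: 10
```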
Step 3: Deploy the gateway and routing rule
By default, a GatewayClass is created when you install Gateway with Inference Extension. Run the following command to verify its existence.
```shell
kubectl get gatewayclass
```
If the GatewayClass resource is not found, create it manually.
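A minimal GatewayClass sketch for the manual case is shown below. The controllerName value is an assumption based on the Envoy Gateway default; it must match the controller actually deployed by the component, so verify it against your installation before applying.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: ack-gateway
spec:
  # Assumption: the controller name used by the installed component.
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
```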
Create a gateway.yaml file with the following content.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: mock-gateway
spec:
  gatewayClassName: ack-gateway
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 80
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: mock-client-buffer-limit
spec:
  connection:
    bufferLimit: 20Mi
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: mock-gateway
```
Create an httproute.yaml file with the following content.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: mock-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: mock-pool
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
```
Deploy the gateway and routing rule.
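Traffic mirroring, another capability this environment is meant to demonstrate, can be configured on a route with the standard RequestMirror filter from the Gateway API. The sketch below is an assumption-laden example, not part of this environment: it mirrors a copy of each request to the mock-vllm Service directly while the live response still comes from the InferencePool backend.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-route-mirrored   # hypothetical route for illustration
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: mock-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: mock-pool
      weight: 1
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          kind: Service
          name: mock-vllm
          port: 8000
    matches:
    - path:
        type: PathPrefix
        value: /
```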
```shell
kubectl apply -f gateway.yaml
kubectl apply -f httproute.yaml
```
Step 4: Send a test request
Retrieve the gateway IP address.
```shell
export GATEWAY_ADDRESS=$(kubectl get gateway/mock-gateway -o jsonpath='{.status.addresses[0].value}')
echo ${GATEWAY_ADDRESS}
```
Send a request from the test application.
```shell
kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
  -H 'Content-Type: application/json' -H "Host: example.com" -v -d '{
    "model": "mock",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
}'
```
Expected output:
```
*   Trying 192.168.12.230:80...
* Connected to 192.168.12.230 (192.168.12.230) port 80
> POST /v1/chat/completions HTTP/1.1
> Host: example.com
> User-Agent: curl/8.8.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 184
>
* upload completely sent off: 184 bytes
< HTTP/1.1 200 OK
< date: Tue, 27 May 2025 08:21:37 GMT
< server: uvicorn
< content-length: 354
< content-type: application/json
<
* Connection #0 to host 192.168.12.230 left intact
{"id":"3bcc1fdd-e514-4a06-95aa-36c904015639","object":"chat.completion","created":1748334097.297188,"model":"mock","choices":[{"index":"0","message":{"role":"assistant","content":"As a mock AI Assitant, I can only echo your last message: introduce yourself"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":76,"total_tokens":94}}
```
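The response body follows the OpenAI-style chat completions schema. As a minimal sketch of how a client might consume it, the Python snippet below parses the sample body copied from the expected output above and extracts the assistant message and token usage.

```python
import json

# Sample response body copied verbatim from the expected output above.
response_body = '{"id":"3bcc1fdd-e514-4a06-95aa-36c904015639","object":"chat.completion","created":1748334097.297188,"model":"mock","choices":[{"index":"0","message":{"role":"assistant","content":"As a mock AI Assitant, I can only echo your last message: introduce yourself"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":76,"total_tokens":94}}'

completion = json.loads(response_body)

# The assistant's reply lives at choices[0].message.content.
reply = completion["choices"][0]["message"]["content"]
usage = completion["usage"]

print(reply)
print(f"tokens: prompt={usage['prompt_tokens']} "
      f"completion={usage['completion_tokens']} total={usage['total_tokens']}")
```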
Step 5: Clean up the environment
If you no longer need this environment, clean it up:
Delete the cluster resources:
```shell
# Delete the gateway and route.
kubectl delete -f gateway.yaml
kubectl delete -f httproute.yaml
# Delete the test application.
kubectl delete -f sleep.yaml
# Delete the backend application.
kubectl delete -f mock-vllm.yaml
kubectl delete -f inference-rule.yaml
```
On the Component Management page, search for Gateway with Inference Extension, and then click Uninstall on the component card.