Large language model (LLM) applications typically require GPUs to run, and GPU-accelerated nodes or virtual nodes often cost more than CPU nodes. For this reason, Gateway with Inference Extension provides a way to quickly experience its intelligent load balancing capability for LLM inference using only CPU computing power. This topic describes how to build a mock environment to experience this capability.
Prerequisites
You have installed Gateway with Inference Extension 1.4.0 or later and have selected the Enable Gateway API Inference Extension option during setup. For instructions, see Install Gateway with Inference Extension.
The mock environment created in this topic is intended only for experiencing the basic AI capabilities of Gateway with Inference Extension, such as canary release, request circuit breaking, and traffic mirroring. It is not suitable for performance testing and is not recommended for use in production environments.
Procedure
Step 1: Deploy the mock application
Deploy a mock LLM inference service (mock-vllm) and a client application (sleep). The client application is used for sending test requests.
Create a file named `mock-vllm.yaml`.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mock-vllm
---
apiVersion: v1
kind: Service
metadata:
  name: mock-vllm
  labels:
    app: mock-vllm
    service: mock-vllm
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  selector:
    app: mock-vllm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mock-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mock-vllm
  template:
    metadata:
      labels:
        app: mock-vllm
    spec:
      serviceAccountName: mock-vllm
      containers:
      - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/mock-vllm:v0.1.7-g3cffa27-aliyun
        imagePullPolicy: IfNotPresent
        name: mock-vllm
        ports:
        - containerPort: 8000
```

Deploy the mock inference service.
```bash
kubectl apply -f mock-vllm.yaml
```

Create a file named `sleep.yaml` to deploy the client application.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true
```

Deploy the client application.
```bash
kubectl apply -f sleep.yaml
```
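Optionally, verify that both applications are up before you continue. This check is not part of the required procedure; it is a quick sanity check that uses the `app` labels defined in the manifests above.

```bash
# Both pods should reach the Running state before you proceed.
kubectl get pods -l 'app in (mock-vllm, sleep)'
```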
Step 2: Configure inference resources
Create the InferencePool and InferenceModel custom resources to represent your mock service. The InferencePool selects the mock-vllm pods as inference backends, and the InferenceModel maps the model name `mock` to that pool.
Create a file named `inference-rule.yaml`.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: mock-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: mock-ext-proc
  selector:
    app: mock-vllm
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: mock-model
spec:
  criticality: Critical
  modelName: mock
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: mock-pool
  targetModels:
  - name: mock
    weight: 100
```

Deploy the resources.
```bash
kubectl apply -f inference-rule.yaml
```
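To confirm that the resources were created, you can list them. This assumes the extension registers the CRDs under the plural names `inferencepools` and `inferencemodels`, which is the upstream default for the Gateway API Inference Extension.

```bash
# List the inference routing resources created above.
kubectl get inferencepools,inferencemodels
```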
Step 3: Deploy the gateway and routing rule
Create a file named `gateway.yaml` to define the `Gateway`, `GatewayClass`, and associated policies.

```yaml
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: mock-gateway
spec:
  gatewayClassName: inference-gateway
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 80
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: mock-client-buffer-limit
spec:
  connection:
    bufferLimit: 20Mi
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: mock-gateway
```

Create a file named `httproute.yaml` to route traffic to the `InferencePool`.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: mock-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: mock-pool
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
```

Deploy the gateway and the routing rule.
```bash
kubectl apply -f gateway.yaml
kubectl apply -f httproute.yaml
```
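Before testing, you can optionally wait for the gateway to become ready. This sketch assumes the controller reports the standard Gateway API `Programmed` condition.

```bash
# Block until the gateway is programmed and has an address (up to 2 minutes).
kubectl wait --for=condition=Programmed gateway/mock-gateway --timeout=120s
# The ADDRESS column should now show the gateway's cluster IP.
kubectl get gateway mock-gateway
```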
Step 4: Send a test request
Get the IP address of the gateway. Because the EnvoyProxy configuration in this example sets the Envoy service type to ClusterIP, the address is reachable only from inside the cluster, which is why the test request is sent from the sleep pod.
```bash
export GATEWAY_ADDRESS=$(kubectl get gateway/mock-gateway -o jsonpath='{.status.addresses[0].value}')
echo ${GATEWAY_ADDRESS}
```

Send a request to the mock service from the `sleep` client application.
```bash
kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
  -H 'Content-Type: application/json' -H "host: example.com" -v -d '{
    "model": "mock",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
  }'
```

Expected output:
```
*   Trying 192.168.12.230:80...
* Connected to 192.168.12.230 (192.168.12.230) port 80
> POST /v1/chat/completions HTTP/1.1
> Host: example.com
> User-Agent: curl/8.8.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 184
>
* upload completely sent off: 184 bytes
< HTTP/1.1 200 OK
< date: Tue, 27 May 2025 08:21:37 GMT
< server: uvicorn
< content-length: 354
< content-type: application/json
<
* Connection #0 to host 192.168.12.230 left intact
{"id":"3bcc1fdd-e514-4a06-95aa-36c904015639","object":"chat.completion","created":1748334097.297188,"model":"mock","choices":[{"index":"0","message":{"role":"assistant","content":"As a mock AI Assitant, I can only echo your last message: introduce yourself"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":76,"total_tokens":94}}
```
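As the response shows, the mock service simply echoes the last user message. To get a first impression of the load balancing behavior this environment is built for, you can optionally scale the mock service out and send several requests in a row. The following loop is a minimal sketch, not part of the required procedure; it reuses the `GATEWAY_ADDRESS` variable from the previous step.

```bash
# Optional: give the load balancer multiple endpoints to choose from.
kubectl scale deployment/mock-vllm --replicas=3

# Send a few requests through the gateway; each is answered by one of the replicas.
for i in $(seq 1 5); do
  kubectl exec deployment/sleep -- curl -s -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
    -H 'Content-Type: application/json' -H "host: example.com" \
    -d '{"model": "mock", "messages": [{"role": "user", "content": "request '"$i"'"}]}'
  echo
done

# Scale back down when you are done experimenting.
kubectl scale deployment/mock-vllm --replicas=1
```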
Step 5: Clean up the environment
After you're finished, delete the resources created in this topic by placing all the YAML files in a directory and running the following command from that directory:
```bash
kubectl delete -f .
```
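If you prefer not to rely on the working directory, an equivalent approach is to delete the files explicitly; the file names below are the ones used in this topic.

```bash
kubectl delete -f httproute.yaml -f gateway.yaml -f inference-rule.yaml -f sleep.yaml -f mock-vllm.yaml
```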