
Container Service for Kubernetes: Quick start for Gateway with Inference Extension intelligent inference routing

Last Updated: Oct 23, 2025

Large language model (LLM) applications typically require GPUs to run, and GPU-accelerated nodes or virtual nodes are more expensive than CPU nodes. Gateway with Inference Extension lets you use CPU computing power to quickly try out its intelligent load balancing for LLM inference scenarios. This topic describes how to build a mock environment with Gateway with Inference Extension that demonstrates intelligent load balancing for inference services.

Prerequisites

Gateway with Inference Extension is installed in the cluster, with Enable Gateway API Inference Extension selected when you create the cluster. For more information, see Step 2: Install the Gateway with Inference Extension component.

Important

The mock environment built in this topic is intended only for experiencing some basic AI capabilities of the Gateway with Inference Extension component, such as phased release, request circuit breaking, and traffic mirroring. It is not suitable for stress testing scenarios and is not recommended for use in production environments.

Procedure

Step 1: Deploy the mock sample application

  1. Create mock-vllm.yaml.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mock-vllm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: mock-vllm
      labels:
        app: mock-vllm
        service: mock-vllm
    spec:
      ports:
      - name: http
        port: 8000
        targetPort: 8000
      selector:
        app: mock-vllm
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mock-vllm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mock-vllm
      template:
        metadata:
          labels:
            app: mock-vllm
        spec:
          serviceAccountName: mock-vllm
          containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/mock-vllm:v0.1.7-g3cffa27-aliyun
            imagePullPolicy: IfNotPresent
            name: mock-vllm
            ports:
            - containerPort: 8000
  2. Deploy the sample application.

    kubectl apply -f mock-vllm.yaml
  3. Create a sleep.yaml file with the following content.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
  4. Deploy the test application to send test requests to the sample application.

    kubectl apply -f sleep.yaml
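Before continuing, you can optionally confirm that both sample workloads are ready. The following is a quick check, assuming the default namespace used above:

```shell
# Wait for both deployments from the manifests above to become available.
kubectl wait --for=condition=Available deployment/mock-vllm --timeout=120s
kubectl wait --for=condition=Available deployment/sleep --timeout=120s
# List the pods of both applications.
kubectl get pods -l 'app in (mock-vllm, sleep)'
```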

Step 2: Deploy inference resources

  1. Create inference-rule.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: mock-pool
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: mock-ext-proc
      selector:
        app: mock-vllm
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: mock-model
    spec:
      criticality: Critical
      modelName: mock
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: mock-pool
      targetModels:
      - name: mock
        weight: 100
  2. Deploy the InferencePool and InferenceModel.

    kubectl apply -f inference-rule.yaml
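You can optionally verify that the inference custom resources were created:

```shell
# List the InferencePool and InferenceModel created from inference-rule.yaml.
kubectl get inferencepools,inferencemodels
```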

Step 3: Deploy the gateway and routing rule

  1. By default, a GatewayClass is created when you install Gateway with Inference Extension. Run the following command to verify its existence.

    kubectl get gatewayclass

    If the GatewayClass resource is not found, create it manually: save the following YAML content to a file named gatewayclass.yaml, and then run the kubectl apply -f gatewayclass.yaml command.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: ack-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
  2. Create gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: mock-gateway
    spec:
      gatewayClassName: ack-gateway
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 80
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: mock-client-buffer-limit
    spec:
      connection:
        bufferLimit: 20Mi
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: mock-gateway
    ---
  3. Create httproute.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: mock-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: mock-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mock-pool
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
  4. Deploy the gateway and routing rule.

    kubectl apply -f gateway.yaml
    kubectl apply -f httproute.yaml
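Provisioning the gateway can take a short while. Before sending traffic, you can optionally wait for the gateway's Programmed condition to become True:

```shell
# The gateway is usable once its Programmed condition is True and an address is assigned.
kubectl wait --for=condition=Programmed gateway/mock-gateway --timeout=120s
kubectl get gateway mock-gateway
```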

Step 4: Send a test request

  1. Retrieve the gateway IP address.

    export GATEWAY_ADDRESS=$(kubectl get gateway/mock-gateway -o jsonpath='{.status.addresses[0].value}')
    echo ${GATEWAY_ADDRESS}
  2. Send a request from the test application.

    kubectl exec -it deployment/sleep -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
      -H 'Content-Type: application/json' -H "Host: example.com" -v -d '{
        "model": "mock",
        "max_completion_tokens": 100,
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "introduce yourself"
          }
        ]
    }'

    Expected output:

    *   Trying 192.168.12.230:80...
    * Connected to 192.168.12.230 (192.168.12.230) port 80
    > POST /v1/chat/completions HTTP/1.1
    > Host: example.com
    > User-Agent: curl/8.8.0
    > Accept: */*
    > Content-Type: application/json
    > Content-Length: 184
    > 
    * upload completely sent off: 184 bytes
    < HTTP/1.1 200 OK
    < date: Tue, 27 May 2025 08:21:37 GMT
    < server: uvicorn
    < content-length: 354
    < content-type: application/json
    < 
    * Connection #0 to host 192.168.12.230 left intact
    {"id":"3bcc1fdd-e514-4a06-95aa-36c904015639","object":"chat.completion","created":1748334097.297188,"model":"mock","choices":[{"index":"0","message":{"role":"assistant","content":"As a mock AI Assitant, I can only echo your last message: introduce yourself"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":76,"total_tokens":94}}

Step 5: Clean up the environment

If you no longer need this environment, clean it up:

  • Delete the cluster resources:

    # Delete the gateway and route.
    kubectl delete -f gateway.yaml
    kubectl delete -f httproute.yaml
    # Delete the test application.
    kubectl delete -f sleep.yaml
    # Delete the backend application.
    kubectl delete -f mock-vllm.yaml
    kubectl delete -f inference-rule.yaml
  • On the Component Management page, search for Gateway with Inference Extension, and then click Uninstall on the component card.