Container Compute Service: Quick start of intelligent inference routing using Gateway with Inference Extension

Last Updated: Sep 26, 2025

Large language model (LLM) applications typically require GPUs to run, and GPU-accelerated nodes or virtual nodes are usually more expensive than CPU nodes. For this reason, Gateway with Inference Extension provides a way to quickly experience its intelligent load balancing capabilities for LLM inference using only CPU computing power. This topic describes how to build a mock environment to try out this capability.

Prerequisites

You have installed Gateway with Inference Extension 1.4.0 or later and have selected the Enable Gateway API Inference Extension option during setup. For instructions, see Install Gateway with Inference Extension.
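
If you want to confirm that the inference extension is installed before you continue, you can check for its custom resource definitions. This is a minimal sketch; the CRD names are assumed from the upstream Gateway API Inference Extension project (group inference.networking.x-k8s.io) and may differ in your installation.

    # Check for the InferencePool and InferenceModel CRDs (names assumed from the upstream project).
    kubectl get crd inferencepools.inference.networking.x-k8s.io inferencemodels.inference.networking.x-k8s.io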

Important

The mock environment created in this topic is intended only for experiencing the basic AI capabilities of Gateway with Inference Extension, such as canary release, request circuit breaking, and traffic mirroring. It is not suitable for performance testing and is not recommended for use in production environments.

Procedure

Step 1: Deploy the mock application

Deploy a mock LLM inference service (mock-vllm) and a client application (sleep). The client application is used to send test requests. After both workloads are deployed, you can verify them as described at the end of this step.

  1. Create a file named mock-vllm.yaml.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mock-vllm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: mock-vllm
      labels:
        app: mock-vllm
        service: mock-vllm
    spec:
      ports:
      - name: http
        port: 8000
        targetPort: 8000
      selector:
        app: mock-vllm
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mock-vllm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mock-vllm
      template:
        metadata:
          labels:
            app: mock-vllm
        spec:
          serviceAccountName: mock-vllm
          containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/mock-vllm:v0.1.7-g3cffa27-aliyun
            imagePullPolicy: IfNotPresent
            name: mock-vllm
            ports:
            - containerPort: 8000
  2. Deploy the mock inference service.

    kubectl apply -f mock-vllm.yaml
  3. Create a file named sleep.yaml to define the client application.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
  4. Deploy the client application.

    kubectl apply -f sleep.yaml
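
Optionally, verify that both workloads are running before you continue. The label selectors below come from the YAML files that you created in this step.

    # Check the pods deployed by mock-vllm.yaml and sleep.yaml.
    kubectl get pods -l app=mock-vllm
    kubectl get pods -l app=sleep

Each command should report a pod in the Running state.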

Step 2: Configure inference resources

Create the InferencePool and InferenceModel custom resources to represent your mock service.

  1. Create a file named inference-rule.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: mock-pool
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: mock-ext-proc
      selector:
        app: mock-vllm
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: mock-model
    spec:
      criticality: Critical
      modelName: mock
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: mock-pool
      targetModels:
      - name: mock
        weight: 100
  2. Deploy the resources.

    kubectl apply -f inference-rule.yaml
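
Optionally, confirm that the custom resources were created:

    # List the inference resources created in this step.
    kubectl get inferencepool mock-pool
    kubectl get inferencemodel mock-model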

Step 3: Deploy the gateway and routing rule

  1. Create a file named gateway.yaml to define the Gateway, GatewayClass, and associated policies.

    kind: GatewayClass
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: mock-gateway
    spec:
      gatewayClassName: inference-gateway
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 80
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: mock-client-buffer-limit
    spec:
      connection:
        bufferLimit: 20Mi
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: mock-gateway
    ---
  2. Create a file named httproute.yaml to route traffic to the InferencePool.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: mock-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: mock-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mock-pool
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
  3. Deploy the gateway and the routing rule.

    kubectl apply -f gateway.yaml
    kubectl apply -f httproute.yaml
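
Optionally, check the status of the gateway and the route. With the standard Gateway API status columns, the gateway is typically ready when the PROGRAMMED column shows True and an address has been assigned.

    # Check that the gateway is programmed and the route is attached.
    kubectl get gateway mock-gateway
    kubectl get httproute mock-route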

Step 4: Send a test request

  1. Get the IP address of the gateway. Because the gateway's Envoy service uses the ClusterIP type in this example, the address is reachable only from within the cluster, which is why the test request is sent from the sleep pod.

    export GATEWAY_ADDRESS=$(kubectl get gateway/mock-gateway -o jsonpath='{.status.addresses[0].value}')
    echo ${GATEWAY_ADDRESS}
  2. Send a request to the mock service from the sleep client application.

    kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
      -H 'Content-Type: application/json' -H "host: example.com" -v -d '{
        "model": "mock",
        "max_completion_tokens": 100,
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "introduce yourself"
          }
        ]
    }'

    Expected output:

    *   Trying 192.168.12.230:80...
    * Connected to 192.168.12.230 (192.168.12.230) port 80
    > POST /v1/chat/completions HTTP/1.1
    > Host: example.com
    > User-Agent: curl/8.8.0
    > Accept: */*
    > Content-Type: application/json
    > Content-Length: 184
    > 
    * upload completely sent off: 184 bytes
    < HTTP/1.1 200 OK
    < date: Tue, 27 May 2025 08:21:37 GMT
    < server: uvicorn
    < content-length: 354
    < content-type: application/json
    < 
    * Connection #0 to host 192.168.12.230 left intact
    {"id":"3bcc1fdd-e514-4a06-95aa-36c904015639","object":"chat.completion","created":1748334097.297188,"model":"mock","choices":[{"index":"0","message":{"role":"assistant","content":"As a mock AI Assitant, I can only echo your last message: introduce yourself"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":76,"total_tokens":94}}

Step 5: Clean up the environment

After you finish, delete the resources created in this topic. Place all the YAML files that you created in a single directory, and then run the following command from that directory:

kubectl delete -f .