Container Compute Service: Quick start of intelligent inference routing using Gateway with Inference Extension

Last Updated: Sep 26, 2025

Large language model (LLM) applications typically require GPUs to run, and GPU-accelerated nodes or virtual nodes are usually more expensive than CPU nodes. For this reason, Gateway with Inference Extension provides a way to quickly experience its intelligent load balancing capabilities for LLM inference using only CPU computing power. This topic describes how to build a mock environment to try out this capability.

Prerequisites

You have installed Gateway with Inference Extension 1.4.0 or later and have selected the Enable Gateway API Inference Extension option during setup. For instructions, see Install Gateway with Inference Extension.
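
If you want to confirm that the inference extension is installed before you continue, you can check for its custom resource definitions. This is a minimal sketch; the CRD names are assumed from the upstream Gateway API Inference Extension project (group inference.networking.x-k8s.io) and may differ in your installation.

    # Check for the InferencePool and InferenceModel CRDs (names assumed from the upstream project).
    kubectl get crd inferencepools.inference.networking.x-k8s.io inferencemodels.inference.networking.x-k8s.io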

Important

The mock environment created in this topic is intended only for experiencing the basic AI capabilities of Gateway with Inference Extension, such as canary release, request circuit breaking, and traffic mirroring. It is not suitable for performance testing and is not recommended for use in production environments.

Procedure

Step 1: Deploy the mock application

Deploy a mock LLM inference service (mock-vllm) and a client application (sleep). The client application is used to send test requests. After both workloads are deployed, you can verify them as described at the end of this step.

  1. Create a file named mock-vllm.yaml.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mock-vllm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: mock-vllm
      labels:
        app: mock-vllm
        service: mock-vllm
    spec:
      ports:
      - name: http
        port: 8000
        targetPort: 8000
      selector:
        app: mock-vllm
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mock-vllm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mock-vllm
      template:
        metadata:
          labels:
            app: mock-vllm
        spec:
          serviceAccountName: mock-vllm
          containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/mock-vllm:v0.1.7-g3cffa27-aliyun
            imagePullPolicy: IfNotPresent
            name: mock-vllm
            ports:
            - containerPort: 8000
  2. Deploy the mock inference service.

    kubectl apply -f mock-vllm.yaml
  3. Create a file named sleep.yaml to define the client application.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
  4. Deploy the client application.

    kubectl apply -f sleep.yaml
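
Optionally, verify that both workloads are running before you continue. The label selectors below come from the YAML files that you created in this step.

    # Check the pods deployed by mock-vllm.yaml and sleep.yaml.
    kubectl get pods -l app=mock-vllm
    kubectl get pods -l app=sleep

Each command should report a pod in the Running state.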

Step 2: Configure inference resources

Create the InferencePool and InferenceModel custom resources to represent your mock service.

  1. Create a file named inference-rule.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: mock-pool
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: mock-ext-proc
      selector:
        app: mock-vllm
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: mock-model
    spec:
      criticality: Critical
      modelName: mock
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: mock-pool
      targetModels:
      - name: mock
        weight: 100
  2. Deploy the resources.

    kubectl apply -f inference-rule.yaml
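
Optionally, confirm that the custom resources were created:

    # List the inference resources created in this step.
    kubectl get inferencepool mock-pool
    kubectl get inferencemodel mock-model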

Step 3: Deploy the gateway and routing rule

  1. Create a file named gateway.yaml to define the Gateway, GatewayClass, and associated policies.

    kind: GatewayClass
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: mock-gateway
    spec:
      gatewayClassName: inference-gateway
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 80
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: mock-client-buffer-limit
    spec:
      connection:
        bufferLimit: 20Mi
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: mock-gateway
    ---
  2. Create a file named httproute.yaml to route traffic to the InferencePool.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: mock-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: mock-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mock-pool
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
  3. Deploy the gateway and the routing rule.

    kubectl apply -f gateway.yaml
    kubectl apply -f httproute.yaml
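
Optionally, check the status of the gateway and the route. With the standard Gateway API status columns, the gateway is typically ready when the PROGRAMMED column shows True and an address has been assigned.

    # Check that the gateway is programmed and the route is attached.
    kubectl get gateway mock-gateway
    kubectl get httproute mock-route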

Step 4: Send a test request

  1. Get the IP address of the gateway. Because the gateway's Envoy service uses the ClusterIP type in this example, the address is reachable only from within the cluster, which is why the test request is sent from the sleep pod.

    export GATEWAY_ADDRESS=$(kubectl get gateway/mock-gateway -o jsonpath='{.status.addresses[0].value}')
    echo ${GATEWAY_ADDRESS}
  2. Send a request to the mock service from the sleep client application.

    kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
      -H 'Content-Type: application/json' -H "host: example.com" -v -d '{
        "model": "mock",
        "max_completion_tokens": 100,
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "introduce yourself"
          }
        ]
    }'

    Expected output:

    *   Trying 192.168.12.230:80...
    * Connected to 192.168.12.230 (192.168.12.230) port 80
    > POST /v1/chat/completions HTTP/1.1
    > Host: example.com
    > User-Agent: curl/8.8.0
    > Accept: */*
    > Content-Type: application/json
    > Content-Length: 184
    > 
    * upload completely sent off: 184 bytes
    < HTTP/1.1 200 OK
    < date: Tue, 27 May 2025 08:21:37 GMT
    < server: uvicorn
    < content-length: 354
    < content-type: application/json
    < 
    * Connection #0 to host 192.168.12.230 left intact
    {"id":"3bcc1fdd-e514-4a06-95aa-36c904015639","object":"chat.completion","created":1748334097.297188,"model":"mock","choices":[{"index":"0","message":{"role":"assistant","content":"As a mock AI Assitant, I can only echo your last message: introduce yourself"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":76,"total_tokens":94}}

Step 5: Clean up the environment

After you finish, delete the resources created in this topic. Place all the YAML files that you created in a single directory, and then run the following command from that directory:

kubectl delete -f .