
Container Service for Kubernetes: Quick start for Gateway with Inference Extension intelligent inference routing

Last Updated: Oct 23, 2025

Large language model (LLM) applications typically require GPUs to run, and GPU-accelerated nodes or virtual nodes are more expensive than CPU nodes. Gateway with Inference Extension lets you use CPU computing power to quickly try out its intelligent load balancing for LLM inference scenarios. This topic describes how to build a mock environment with Gateway with Inference Extension that demonstrates intelligent load balancing for inference services.

Prerequisites

Gateway with Inference Extension is installed in the cluster, with Enable Gateway API Inference Extension selected when you create the cluster. For more information, see Step 2: Install the Gateway with Inference Extension component.

Important

The mock environment built in this topic is intended only for experiencing some basic AI capabilities of the Gateway with Inference Extension component, such as phased release, request circuit breaking, and traffic mirroring. It is not suitable for stress testing scenarios and is not recommended for use in production environments.

Procedure

Step 1: Deploy the mock sample application

  1. Create mock-vllm.yaml.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mock-vllm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: mock-vllm
      labels:
        app: mock-vllm
        service: mock-vllm
    spec:
      ports:
      - name: http
        port: 8000
        targetPort: 8000
      selector:
        app: mock-vllm
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mock-vllm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mock-vllm
      template:
        metadata:
          labels:
            app: mock-vllm
        spec:
          serviceAccountName: mock-vllm
          containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/mock-vllm:v0.1.7-g3cffa27-aliyun
            imagePullPolicy: IfNotPresent
            name: mock-vllm
            ports:
            - containerPort: 8000
  2. Deploy the sample application.

    kubectl apply -f mock-vllm.yaml
  3. Create a sleep.yaml file with the following content.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
  4. Deploy the test application to send test requests to the sample application.

    kubectl apply -f sleep.yaml
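Before continuing, you can optionally confirm that both sample workloads are ready. The following is a quick check, assuming the default namespace used above:

```shell
# Wait for both deployments from the manifests above to become available.
kubectl wait --for=condition=Available deployment/mock-vllm --timeout=120s
kubectl wait --for=condition=Available deployment/sleep --timeout=120s
# List the pods of both applications.
kubectl get pods -l 'app in (mock-vllm, sleep)'
```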

Step 2: Deploy inference resources

  1. Create inference-rule.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: mock-pool
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: mock-ext-proc
      selector:
        app: mock-vllm
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: mock-model
    spec:
      criticality: Critical
      modelName: mock
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: mock-pool
      targetModels:
      - name: mock
        weight: 100
  2. Deploy the InferencePool and InferenceModel.

    kubectl apply -f inference-rule.yaml
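You can optionally verify that the inference custom resources were created:

```shell
# List the InferencePool and InferenceModel created from inference-rule.yaml.
kubectl get inferencepools,inferencemodels
```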

Step 3: Deploy the gateway and routing rule

  1. By default, a GatewayClass is created when you install Gateway with Inference Extension. Run the following command to verify its existence.

    kubectl get gatewayclass

    If the GatewayClass resource is not found, create it manually: save the following YAML content to a file named gatewayclass.yaml, and then run the kubectl apply -f gatewayclass.yaml command.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: ack-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
  2. Create gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: mock-gateway
    spec:
      gatewayClassName: ack-gateway
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 80
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: mock-client-buffer-limit
    spec:
      connection:
        bufferLimit: 20Mi
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: mock-gateway
    ---
  3. Create httproute.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: mock-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: mock-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mock-pool
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
  4. Deploy the gateway and routing rule.

    kubectl apply -f gateway.yaml
    kubectl apply -f httproute.yaml
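Provisioning the gateway can take a short while. Before sending traffic, you can optionally wait for the gateway's Programmed condition to become True:

```shell
# The gateway is usable once its Programmed condition is True and an address is assigned.
kubectl wait --for=condition=Programmed gateway/mock-gateway --timeout=120s
kubectl get gateway mock-gateway
```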

Step 4: Send a test request

  1. Retrieve the gateway IP address.

    export GATEWAY_ADDRESS=$(kubectl get gateway/mock-gateway -o jsonpath='{.status.addresses[0].value}')
    echo ${GATEWAY_ADDRESS}
  2. Send a request from the test application.

    kubectl exec -it deployment/sleep -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
      -H 'Content-Type: application/json' -H "Host: example.com" -v -d '{
        "model": "mock",
        "max_completion_tokens": 100,
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "introduce yourself"
          }
        ]
    }'

    Expected output:

    *   Trying 192.168.12.230:80...
    * Connected to 192.168.12.230 (192.168.12.230) port 80
    > POST /v1/chat/completions HTTP/1.1
    > Host: example.com
    > User-Agent: curl/8.8.0
    > Accept: */*
    > Content-Type: application/json
    > Content-Length: 184
    > 
    * upload completely sent off: 184 bytes
    < HTTP/1.1 200 OK
    < date: Tue, 27 May 2025 08:21:37 GMT
    < server: uvicorn
    < content-length: 354
    < content-type: application/json
    < 
    * Connection #0 to host 192.168.12.230 left intact
    {"id":"3bcc1fdd-e514-4a06-95aa-36c904015639","object":"chat.completion","created":1748334097.297188,"model":"mock","choices":[{"index":"0","message":{"role":"assistant","content":"As a mock AI Assitant, I can only echo your last message: introduce yourself"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":76,"total_tokens":94}}

Step 5: Clean up the environment

If you no longer need this environment, clean it up:

  • Delete the cluster resources:

    # Delete the gateway and route.
    kubectl delete -f gateway.yaml
    kubectl delete -f httproute.yaml
    # Delete the test application.
    kubectl delete -f sleep.yaml
    # Delete the backend application.
    kubectl delete -f mock-vllm.yaml
    kubectl delete -f inference-rule.yaml
  • On the Component Management page, search for Gateway with Inference Extension, and then click Uninstall on the component card.