Container Service for Kubernetes: Custom inference extension configuration

Last Updated: Mar 26, 2026

Gateway with Inference Extension supports two methods for customizing the routing behavior of an InferencePool: adding annotations directly to the InferencePool resource, or creating a ConfigMap to override the default deployment configuration.

Prerequisites

Before you begin, ensure that you have:

  • A running ACK cluster with Gateway with Inference Extension deployed

  • An InferencePool resource configured for your AI model service

  • kubectl access to the cluster with permissions to update InferencePool resources and create ConfigMaps

Choose a configuration method

Gateway with Inference Extension is associated with an InferencePool resource, which logically groups and manages the resources for an AI model service. Both configuration methods support hot updates: changes take effect without restarting the inference extension.

| Aspect | Annotations | ConfigMap |
| --- | --- | --- |
| Scope | Load balancing policy, request queuing, and inference framework settings | Full override of the inference extension deployment, including Deployment, Service, and PodDisruptionBudget |
| Update mechanism | Hot update | Hot update |
| Complexity | One annotation per setting. Simple for a small number of changes; grows with the number of items to configure | One annotation on the InferencePool pointing to the ConfigMap. Subsequent changes only require updating the ConfigMap |
| Version requirement | All versions | Gateway with Inference Extension 1.4.0-aliyun.2 or later |

Use annotations for targeted changes such as switching the load balancing policy or declaring the inference framework. Use a ConfigMap for comprehensive deployment-level customizations.

Method 1: Customize with annotations

Add annotations to an InferencePool resource to modify its routing behavior. The following example sets the load balancing policy to prefix-aware:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Set the load balancing policy to prefix-aware
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc

Supported annotations

Load balancing policy

| Annotation | Type | Default | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/routing-strategy | string | DEFAULT | Load balancing policy. DEFAULT routes based on inference server load. PREFIX_CACHE additionally directs requests that share a prompt prefix to the same pod, improving KV cache reuse. For details, see Use intelligent inference routing to implement prefix-aware load balancing. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-hash-block-size | int64 | 64 | Block size (in characters) used to split prompts when computing prefix hashes. Adjust this value to match the internal block splitting policy of your model server. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-max-prefix-blocks | int64 | 128 | Maximum number of prefix blocks matched for a single request. Requests exceeding this limit are processed based on the limit, and the extra matched parts are ignored. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-lru-capacity | int64 | 50000 | Maximum number of blocks a single prefix record can hold in the cache. Larger values require more memory. |
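
As a minimal sketch, the prefix-cache tuning annotations can be set alongside the routing strategy on the same InferencePool. The manifest below reuses the vllm-app-pool example from this page. The tuning values are just the documented defaults written out explicitly, not recommendations. Kubernetes annotation values must be strings, so the numeric settings are quoted.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Enable prefix-aware load balancing
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    # The three values below are the documented defaults, shown for illustration only.
    # Annotation values are strings, so numbers are quoted.
    inference-epp-env.networking.x-k8s.io/prefix-cache-hash-block-size: "64"
    inference-epp-env.networking.x-k8s.io/prefix-cache-max-prefix-blocks: "128"
    inference-epp-env.networking.x-k8s.io/prefix-cache-lru-capacity: "50000"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc
```

Going by the descriptions above, these defaults mean that at most 64 × 128 = 8,192 characters of a prompt take part in prefix matching; anything beyond that is ignored for matching purposes.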

Request queuing

Enable request queuing when your inference service receives burst traffic that exceeds backend capacity. Queuing holds excess requests centrally in the inference extension rather than letting them pile up on individual pods, which prevents memory overflow and reduces unnecessary client timeouts.

| Annotation | Type | Default | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/queueing | string | disabled | Set to enabled to turn on inference request queuing. For details on queuing behavior and priority scheduling, see Use intelligent inference routing to implement inference request queuing and priority scheduling. |
| inference-epp-env.networking.x-k8s.io/total-queue-capacity | int64 | 104857600 | Total queue capacity in bytes (sum of all prompt sizes). When the queue reaches this limit, the oldest requests are discarded to prevent memory overflow. |
| inference-epp-env.networking.x-k8s.io/queue-ttl | Duration | 30s | Maximum time a request can wait in the queue. Requests that exceed this limit are dropped to free resources and avoid indefinite client waits. The value is a duration string such as 300ms, 1.5h, or 2h45m. Valid units: ns, us (or µs), ms, s, m, h. |
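
The following sketch shows how the queuing annotations might be combined on the vllm-app-pool example used earlier on this page. The capacity and TTL values are illustrative assumptions chosen to show the value formats, not tuned recommendations. For reference, the default capacity of 104857600 bytes is 100 MiB (100 × 1024 × 1024).

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Turn on centralized request queuing in the inference extension
    inference.networking.x-k8s.io/queueing: "enabled"
    # Illustrative value: 52428800 bytes = 50 MiB (50 x 1024 x 1024)
    inference-epp-env.networking.x-k8s.io/total-queue-capacity: "52428800"
    # Illustrative value: drop requests that wait longer than 10 seconds
    inference-epp-env.networking.x-k8s.io/queue-ttl: "10s"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc
```

A shorter TTL than the client-side timeout lets the queue shed stale requests before clients give up on them; whether 10s is appropriate depends on your workload.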

Inference framework support

Declare the inference framework of the model servers in an InferencePool to enable framework-specific optimizations in the inference extension.

| Annotation | Type | Default | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/model-server-runtime | string | vllm | Inference framework used by the model servers. Valid values: vllm (vLLM v0 and v1), sglang (SGLang), trt-llm (Triton with TensorRT-LLM backend). For supported framework versions, see Inference service framework support. |
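
As a sketch, an InferencePool fronting SGLang model servers could declare its framework as shown below. The pool name sglang-app-pool, the app: sglang-app selector label, and the extensionRef name are hypothetical placeholders, not names from this page's deployments; substitute the values from your own cluster.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: sglang-app-pool            # hypothetical pool name
  annotations:
    # Declare the framework so the extension can apply SGLang-specific handling
    inference.networking.x-k8s.io/model-server-runtime: "sglang"
spec:
  targetPortNumber: 8000
  selector:
    app: sglang-app                # hypothetical label on the model server pods
  extensionRef:
    name: inference-gateway-ext-proc   # assumed extension name; use your own
```

Because vllm is the default, this annotation is only needed when the model servers run SGLang or Triton with the TensorRT-LLM backend.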

Method 2: Customize with a ConfigMap

Important

This method requires Gateway with Inference Extension version 1.4.0-aliyun.2 or later.

Use this method to override the full deployment configuration of the inference extension, including container resources, pod affinity rules, Service settings, and PodDisruptionBudget.

The inference extension and the gateway run in the envoy-gateway-system namespace. To find the inference extension Deployment for an InferencePool, use the inference-pool and inference-pool-namespace label selectors. For example, to find the Deployment for an InferencePool named qwen-pool in the default namespace:

kubectl get deployments -n envoy-gateway-system -l inference-pool=qwen-pool,inference-pool-namespace=default

Apply a ConfigMap override

  1. Create a ConfigMap containing the configuration to override. The following ConfigMap overrides the container resource limits and adds a podAntiAffinity rule to spread inference extension pods across nodes:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: custom-epp
    data:
      deployment: |-
        spec:
          replicas: 1
          template:
            spec:
              affinity:
                podAntiAffinity:
                  preferredDuringSchedulingIgnoredDuringExecution:
                  - weight: 100
                    podAffinityTerm:
                      labelSelector:
                        matchLabels:
                          inference-pool: qwen-pool
                          inference-pool-namespace: default
                      topologyKey: kubernetes.io/hostname
              containers:
                - name: inference-gateway-ext-proc
                  resources:
                    limits:
                      cpu: '4'
                      memory: 4G
                    requests:
                      cpu: 500m
                      memory: 1G

  2. Add the inference.networking.x-k8s.io/epp-overlay annotation to your InferencePool, specifying the name of the ConfigMap.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-pool
      annotations:
        inference.networking.x-k8s.io/epp-overlay: custom-epp # Name of the ConfigMap containing the override configuration
    spec:
      extensionRef:
        group: ''
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000

    The inference extension picks up the ConfigMap and applies the overrides automatically.
