Container Service for Kubernetes: Custom inference extension configuration

Last Updated: Sep 12, 2025

Gateway with Inference Extension enhances generative AI inference services through extended management capabilities. You can customize routing policies by adding annotations to your InferencePool resources or by creating a ConfigMap to override the default deployment configuration. This topic describes how to use both methods to customize the inference extension.

Overview of configuration methods

Gateway with Inference Extension is associated with an InferencePool resource, which logically groups and manages the resources for an AI model service. You can customize the routing policies for an InferencePool in two ways:

  1. Using annotations: Apply specific annotations directly to the InferencePool resource.

  2. Using a ConfigMap: Create a custom ConfigMap and link it to the InferencePool.

The following table compares these two methods:

| Aspect | Using annotations | Using a ConfigMap |
| --- | --- | --- |
| Scope | Modifies the load balancing policy, request queuing strategy, and inference framework settings. | Fully overrides the default configuration of the inference extension, including the Deployment, Service, and PodDisruptionBudget. |
| Update mechanism | Hot update. Changes take effect in real time. | Hot update. Changes take effect in real time. |
| Complexity | Simple for individual changes, but effort grows with the number of settings: each setting you want to modify requires its own annotation. | Requires creating a ConfigMap and adding one annotation to the InferencePool. All subsequent changes are made by updating the ConfigMap. |
| Version requirement | Supported by all versions. | Requires Gateway with Inference Extension 1.4.0-aliyun.2 or later. |

Recommendation: Use annotations for simple changes, such as updating the load balancing policy or the supported inference framework. Use a ConfigMap for more advanced or comprehensive customizations.

Method 1: Customize with annotations

You can modify routing policies by adding the inference.networking.x-k8s.io/routing-strategy annotation to an InferencePool resource. For example, the following annotation changes the load balancing policy to be prefix-aware:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Set the load balancing policy to be prefix-aware
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" 
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc

Supported annotations

The following tables describe the annotations supported by the Gateway with Inference Extension:

Load balancing policy

| Annotation | Type | Default value | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/routing-strategy | string | DEFAULT | Specifies the load balancing policy used by the inference extension. Valid values: DEFAULT (the default policy, which is aware of the inference server load) and PREFIX_CACHE (builds on the default policy and routes requests that share the same prefix to the same inference server pod whenever possible). For more information about prefix-aware load balancing, see Use intelligent inference routing to implement prefix-aware load balancing. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-hash-block-size | int64 | 64 | When prefix-aware load balancing is used, the inference extension splits a request into fixed-size blocks and matches them against cached prefixes. This parameter specifies the string length of each block. For optimal load balancing, set it to match the internal block splitting policy of the model server. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-max-prefix-blocks | int64 | 128 | Limits the maximum number of prefix blocks that a single request can match. If the actual number of matches exceeds this limit, the request is processed based on the limit and the extra matched blocks are ignored. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-lru-capacity | int64 | 50000 | Specifies the maximum number of blocks that a single prefix record in the cache can contain. The larger the value, the more memory the cache requires. |
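
For example, a minimal sketch of an InferencePool that enables prefix-aware load balancing and tunes the prefix cache looks like the following. It reuses the vllm-app-pool from the earlier example; the numeric annotation values are illustrative placeholders, not recommended settings, so choose them based on your model server's block splitting policy and the memory available to the inference extension.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Enable prefix-aware load balancing on top of the default load-aware policy
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    # Illustrative tuning values: align the block size with the model server's
    # block splitting policy and size the LRU capacity to the available memory
    inference-epp-env.networking.x-k8s.io/prefix-cache-hash-block-size: "128"
    inference-epp-env.networking.x-k8s.io/prefix-cache-max-prefix-blocks: "256"
    inference-epp-env.networking.x-k8s.io/prefix-cache-lru-capacity: "100000"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc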

Request queuing

| Annotation | Type | Default value | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/queueing | string | disabled | Specifies whether to enable the inference request queuing feature. For more information, see Use intelligent inference routing to implement inference request queuing and priority scheduling. |
| inference-epp-env.networking.x-k8s.io/total-queue-capacity | int64 | 104857600 | Limits the total capacity of the inference request queue (the total size, in bytes, of all queued prompts). If the queue exceeds this limit, the earliest requests are discarded to prevent memory overflow caused by request backlogs. |
| inference-epp-env.networking.x-k8s.io/queue-ttl | Duration | 30s | Specifies the maximum time that a request can wait in the queue. Requests that wait longer than this limit are discarded to keep clients from waiting unnecessarily and to release system resources promptly. The value is a sequence of signed decimal numbers, each with an optional fraction and a unit suffix, such as "300ms", "-1.5h", or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", and "h". |
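
For example, a minimal sketch that turns on request queuing and tightens the queue limits might look like the following. Only the default value disabled is documented above; the value enabled used here to switch the feature on is an assumption, so confirm the exact value in the linked request queuing topic. The limit values are illustrative placeholders.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Assumption: "enabled" switches queuing on; only the default "disabled"
    # is documented here, so verify the enable value for your version
    inference.networking.x-k8s.io/queueing: "enabled"
    # Discard requests that wait longer than 10 seconds in the queue
    inference-epp-env.networking.x-k8s.io/queue-ttl: "10s"
    # Cap the queue at roughly 50 MiB of pending prompt bytes
    inference-epp-env.networking.x-k8s.io/total-queue-capacity: "52428800"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc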

Inference framework support

| Annotation | Type | Default value | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/model-server-runtime | string | vllm | Declares the inference framework used by the model servers behind the InferencePool so that the inference extension can apply framework-specific support. Valid values: vllm (vLLM v0 and vLLM v1), sglang (SGLang), and trt-llm (Triton with the TensorRT-LLM inference backend). For more information, see Inference service framework support. |
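
For example, the following sketch declares that the backend model servers of an InferencePool run SGLang. The pool name, selector, and port are placeholders for your own workload.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: sglang-app-pool  # Placeholder pool name
  annotations:
    # Tell the inference extension that the backend model servers run SGLang
    inference.networking.x-k8s.io/model-server-runtime: "sglang"
spec:
  targetPortNumber: 8000  # Placeholder; use the port your model server listens on
  selector:
    app: sglang-app  # Placeholder selector for your SGLang pods
  extensionRef:
    name: inference-gateway-ext-proc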

Method 2: Customize with a ConfigMap

Important

This method requires Gateway with Inference Extension version 1.4.0-aliyun.2 or later.

The inference extension and the gateway are deployed in the envoy-gateway-system namespace. You can find their associated resources using a label selector. For example, to find the inference extension deployment for an InferencePool named qwen-pool in the default namespace, run the following command:

kubectl get deployments -n envoy-gateway-system -l inference-pool=qwen-pool,inference-pool-namespace=default
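
If you need the associated Service as well, a similar command should work, assuming the Service carries the same inference-pool labels (an assumption; verify the labels on your resources):

kubectl get services -n envoy-gateway-system -l inference-pool=qwen-pool,inference-pool-namespace=default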

Follow these steps to apply a custom configuration using a ConfigMap:

  1. Create and deploy a ConfigMap.

    The following ConfigMap overrides the default container resource configurations and adds a podAntiAffinity rule.

    apiVersion: v1
    data:
      deployment: |- 
        spec:
          replicas: 1
          template:
            spec:
              affinity:
                podAntiAffinity:
                  preferredDuringSchedulingIgnoredDuringExecution:
                  - weight: 100
                    podAffinityTerm:
                      labelSelector:
                        matchLabels:
                          inference-pool: qwen-pool
                          inference-pool-namespace: default
                      topologyKey: kubernetes.io/hostname
              containers:
                - name: inference-gateway-ext-proc
                  resources:
                    limits:
                      cpu: '4'
                      memory: 4G
                    requests:
                      cpu: 500m
                      memory: 1G
    kind: ConfigMap
    metadata:
      name: custom-epp
  2. Add an annotation to your InferencePool to specify the name of the ConfigMap containing your custom configuration.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference.networking.x-k8s.io/epp-overlay: custom-epp # Specify the override configuration for the inference extension
      name: qwen-pool
    spec:
      extensionRef:
        group: ''
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
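  3. Optionally, verify that the override takes effect.

    After both resources are applied, you can inspect the generated Deployment with the label selector shown earlier and check that the replica count, resource settings, and podAntiAffinity rule match your overlay. This is a sketch: the file names are placeholders for wherever you saved the manifests above, and it assumes the ConfigMap is created in the InferencePool's namespace (confirm the expected namespace for your version).

    # Apply the ConfigMap and the updated InferencePool (file names are placeholders)
    kubectl apply -f custom-epp.yaml
    kubectl apply -f qwen-pool.yaml

    # Inspect the inference extension Deployment and confirm the overlay was applied
    kubectl get deployments -n envoy-gateway-system \
      -l inference-pool=qwen-pool,inference-pool-namespace=default -o yaml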