Gateway with Inference Extension supports two methods for customizing the routing behavior of an InferencePool: adding annotations directly to the InferencePool resource, or creating a ConfigMap to override the default deployment configuration.
Prerequisites
Before you begin, ensure that you have:
A running ACK cluster with Gateway with Inference Extension deployed
An InferencePool resource configured for your AI model service
kubectl access to the cluster with permissions to update InferencePool resources and create ConfigMaps
Choose a configuration method
Gateway with Inference Extension is associated with an InferencePool resource, which logically groups and manages the resources for an AI model service. Both configuration methods support hot updates: changes take effect without restarting the inference extension.
| Aspect | Annotations | ConfigMap |
|---|---|---|
| Scope | Load balancing policy, request queuing, and inference framework settings | Full override of the inference extension deployment, including Deployment, Service, and PodDisruptionBudget |
| Update mechanism | Hot update | Hot update |
| Complexity | One annotation per setting. Simple for a small number of changes; grows with the number of items to configure | One annotation on the InferencePool pointing to the ConfigMap. Subsequent changes only require updating the ConfigMap |
| Version requirement | All versions | Gateway with Inference Extension 1.4.0-aliyun.2 or later |
Use annotations for targeted changes such as switching the load balancing policy or declaring the inference framework. Use a ConfigMap for comprehensive deployment-level customizations.
Method 1: Customize with annotations
Add annotations to an InferencePool resource to modify its routing behavior. The following example sets the load balancing policy to prefix-aware:

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: vllm-app-pool
      annotations:
        # Set the load balancing policy to prefix-aware
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        app: vllm-app
      extensionRef:
        name: inference-gateway-ext-proc

Supported annotations
Load balancing policy
| Annotation | Type | Default | Description |
|---|---|---|---|
| inference.networking.x-k8s.io/routing-strategy | string | DEFAULT | Load balancing policy. DEFAULT routes based on inference server load. PREFIX_CACHE additionally directs requests that share a prompt prefix to the same pod, improving KV cache reuse. For details, see Use intelligent inference routing to implement prefix-aware load balancing. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-hash-block-size | int64 | 64 | Block size (in characters) used to split prompts when computing prefix hashes. Adjust this value to match the internal block splitting policy of your model server. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-max-prefix-blocks | int64 | 128 | Maximum number of prefix blocks matched for a single request. If a request's prefix spans more blocks than this limit, matching stops at the limit and the remaining blocks are ignored. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-lru-capacity | int64 | 50000 | Maximum number of blocks a single prefix record can hold in the cache. Larger values require more memory. |
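
As a rough sketch of how the prefix-cache settings above combine on one resource, the following manifest reuses the vllm-app-pool example and sets all four annotations. The numeric values are illustrative assumptions, not tuning recommendations. Kubernetes annotation values must be strings, so the int64 settings are quoted.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Enable prefix-aware load balancing
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    # Illustrative value (default 64): characters per hashed prompt block
    inference-epp-env.networking.x-k8s.io/prefix-cache-hash-block-size: "128"
    # Illustrative value (default 128): prefix blocks matched per request
    inference-epp-env.networking.x-k8s.io/prefix-cache-max-prefix-blocks: "256"
    # Illustrative value (default 50000): larger values use more memory
    inference-epp-env.networking.x-k8s.io/prefix-cache-lru-capacity: "100000"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc
```

If the descriptions in the table are read literally, these values would let up to 128 × 256 = 32,768 characters of a prompt take part in prefix matching, compared with 64 × 128 = 8,192 under the defaults.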
Request queuing
Enable request queuing when your inference service receives burst traffic that exceeds backend capacity. Queuing holds excess requests centrally in the inference extension rather than letting them pile up on individual pods, which prevents memory overflow and reduces unnecessary client timeouts.
| Annotation | Type | Default | Description |
|---|---|---|---|
| inference.networking.x-k8s.io/queueing | string | disabled | Set to enabled to turn on inference request queuing. For details on queuing behavior and priority scheduling, see Use intelligent inference routing to implement inference request queuing and priority scheduling. |
| inference-epp-env.networking.x-k8s.io/total-queue-capacity | int64 | 104857600 | Total queue capacity in bytes (sum of all prompt sizes). When the queue reaches this limit, the oldest requests are discarded to prevent memory overflow. |
| inference-epp-env.networking.x-k8s.io/queue-ttl | Duration | 30s | Maximum time a request can wait in the queue. Requests that exceed this limit are dropped to free resources and avoid indefinite client waits. The value is a duration string such as 300ms, 1.5h, or 2h45m. Valid units: ns, us (or µs), ms, s, m, h. |
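
The queuing annotations above might be combined as in the following sketch, which again reuses the vllm-app-pool example. The capacity and TTL values are assumptions chosen for illustration: 209715200 bytes is 200 MiB (twice the 104857600-byte default), and 45s is an arbitrary duration string in the documented format.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Turn on request queuing (default: disabled)
    inference.networking.x-k8s.io/queueing: "enabled"
    # Illustrative value: 200 MiB of queued prompt bytes, quoted because
    # annotation values must be strings
    inference-epp-env.networking.x-k8s.io/total-queue-capacity: "209715200"
    # Illustrative value: drop requests that wait longer than 45 seconds
    inference-epp-env.networking.x-k8s.io/queue-ttl: "45s"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc
```

When choosing a TTL, it is probably sensible to keep it below the client-side request timeout so that a queued request is dropped before the client gives up on it.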
Inference framework support
Declare the inference framework of the model servers in an InferencePool to enable framework-specific optimizations in the inference extension.
| Annotation | Type | Default | Description |
|---|---|---|---|
| inference.networking.x-k8s.io/model-server-runtime | string | vllm | Inference framework used by the model servers. Valid values: vllm (vLLM v0 and v1), sglang (SGLang), trt-llm (Triton with TensorRT-LLM backend). For supported framework versions, see Inference service framework support. |
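
For instance, a pool backed by SGLang servers could declare its framework as sketched below. The pool name sglang-app-pool and the app: sglang-app label are hypothetical names introduced for this example; the spec fields mirror the vllm-app-pool example and may differ in your environment.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: sglang-app-pool          # hypothetical pool name
  annotations:
    # Declare SGLang as the inference framework (default: vllm)
    inference.networking.x-k8s.io/model-server-runtime: "sglang"
spec:
  targetPortNumber: 8000         # assumed model server port
  selector:
    app: sglang-app              # hypothetical pod label
  extensionRef:
    name: inference-gateway-ext-proc
```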
Method 2: Customize with a ConfigMap
This method requires Gateway with Inference Extension version 1.4.0-aliyun.2 or later.
Use this method to override the full deployment configuration of the inference extension, including container resources, pod affinity rules, Service settings, and PodDisruptionBudget.
The inference extension and the gateway run in the envoy-gateway-system namespace. To find the inference extension Deployment for an InferencePool, use the inference-pool and inference-pool-namespace label selectors. For example, to find the Deployment for an InferencePool named qwen-pool in the default namespace:

    kubectl get deployments -n envoy-gateway-system -l inference-pool=qwen-pool,inference-pool-namespace=default

Apply a ConfigMap override
Create a ConfigMap containing the configuration to override. The following ConfigMap overrides the container resource limits and adds a podAntiAffinity rule to spread inference extension pods across nodes:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: custom-epp
    data:
      deployment: |-
        spec:
          replicas: 1
          template:
            spec:
              affinity:
                podAntiAffinity:
                  preferredDuringSchedulingIgnoredDuringExecution:
                    - weight: 100
                      podAffinityTerm:
                        labelSelector:
                          matchLabels:
                            inference-pool: qwen-pool
                            inference-pool-namespace: default
                        topologyKey: kubernetes.io/hostname
              containers:
                - name: inference-gateway-ext-proc
                  resources:
                    limits:
                      cpu: '4'
                      memory: 4G
                    requests:
                      cpu: 500m
                      memory: 1G

Add the inference.networking.x-k8s.io/epp-overlay annotation to your InferencePool, specifying the name of the ConfigMap.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-pool
      annotations:
        inference.networking.x-k8s.io/epp-overlay: custom-epp # Name of the ConfigMap containing the override configuration
    spec:
      extensionRef:
        group: ''
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000

The inference extension picks up the ConfigMap and applies the overrides automatically.