Gateway with Inference Extension enhances generative AI inference services through extended management capabilities. You can customize routing policies by adding annotations to your InferencePool resources or by creating a ConfigMap to override the default deployment configuration. This topic describes how to use both methods to customize the inference extension.
Overview of configuration methods
Gateway with Inference Extension is associated with an InferencePool resource, which logically groups and manages the resources for an AI model service. You can customize the routing policies for an InferencePool in two ways:
- Using annotations: Apply specific annotations directly to the InferencePool resource.
- Using a ConfigMap: Create a custom ConfigMap and link it to the InferencePool.
The following table compares these two methods:
| Aspect | Using annotations | Using a ConfigMap |
| --- | --- | --- |
| Scope | Modifies the load balancing policy, request queuing strategy, and inference framework settings. | Fully overrides the default configurations of the inference extension, including the Deployment, Service, and PodDisruptionBudget. |
| Update mechanism | Hot update. Changes are applied dynamically in real time. | Hot update. Changes are applied in real time. |
| Complexity | Simple for single changes, but effort grows with the number of configuration items: each setting you want to modify requires its own annotation. | Requires creating a ConfigMap and adding one annotation to the InferencePool. All subsequent changes are made by updating the ConfigMap. |
| Version requirement | Supported by all versions. | Requires Gateway with Inference Extension 1.4.0-aliyun.2 or later. |
Recommendation: Use annotations for simple changes, such as updating the load balancing policy or the supported inference framework. Use a ConfigMap for more advanced or comprehensive customizations.
Method 1: Customize with annotations
You can modify routing policies by adding annotations to an InferencePool resource. For example, the following inference.networking.x-k8s.io/routing-strategy annotation changes the load balancing policy to be prefix-aware:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    # Set the load balancing policy to be prefix-aware
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc
```
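As a usage sketch, you can apply the manifest and read the annotation back; the file name vllm-app-pool.yaml is a placeholder. Because annotation changes are hot updates, the inference extension does not need to be restarted:

```shell
# Apply the manifest above (the file name is a placeholder).
kubectl apply -f vllm-app-pool.yaml

# Read back the routing-strategy annotation to confirm the change.
kubectl get inferencepools vllm-app-pool \
  -o jsonpath='{.metadata.annotations.inference\.networking\.x-k8s\.io/routing-strategy}'
```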
Supported annotations
The following tables describe the annotations supported by Gateway with Inference Extension.
Load balancing policy
| Annotation | Type | Default value | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/routing-strategy | string | DEFAULT | Specifies the load balancing policy used by the inference extension. Valid values include DEFAULT and PREFIX_CACHE (prefix-aware load balancing). For more information about prefix-aware load balancing, see Use intelligent inference routing to implement prefix-aware load balancing. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-hash-block-size | int64 | 64 | When prefix-aware load balancing is used, the inference extension splits a request into fixed-size blocks and matches them against the prefixes in the cache. This parameter specifies the string length of each block. For optimal load balancing, set this value to match the internal block-splitting policy of the model server. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-max-prefix-blocks | int64 | 128 | Limits the maximum number of prefix blocks that a single request can match. If the actual number of matches exceeds this limit, the system processes the request based on the limit and ignores the extra matched parts. |
| inference-epp-env.networking.x-k8s.io/prefix-cache-lru-capacity | int64 | 50000 | Specifies the maximum number of blocks that a single prefix record in the cache can contain. The larger the value, the more memory the cache requires. |
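As a sketch, the prefix-cache annotations above can be combined on the earlier vllm-app-pool example as follows; the numeric values are illustrative, not tuned recommendations:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-app-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    # Illustrative values: align the block size with the model server's block-splitting policy.
    inference-epp-env.networking.x-k8s.io/prefix-cache-hash-block-size: "128"
    inference-epp-env.networking.x-k8s.io/prefix-cache-max-prefix-blocks: "256"
    inference-epp-env.networking.x-k8s.io/prefix-cache-lru-capacity: "100000"
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-app
  extensionRef:
    name: inference-gateway-ext-proc
```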
Request queuing
| Annotation | Type | Default value | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/queueing | string | disabled | Specifies whether to enable inference request queuing. For more information, see Use intelligent inference routing to implement inference request queuing and priority scheduling. |
| inference-epp-env.networking.x-k8s.io/total-queue-capacity | int64 | 104857600 | Limits the total capacity of the inference request queue (the total size, in bytes, of all queued prompts). If the total queue size exceeds this limit, the earliest requests are discarded to prevent memory overflow caused by request backlogs. |
| inference-epp-env.networking.x-k8s.io/queue-ttl | Duration | 30s | Specifies the maximum time that a request can wait in the queue. Requests that wait longer than this limit are discarded, which prevents clients from waiting unnecessarily and releases system resources in a timely manner. Note: The value is a sequence of signed decimal numbers, each with an optional fraction and a unit suffix, such as "300ms", "-1.5h", or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", and "h". |
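The following excerpt is a sketch of how queuing could be enabled together with a smaller queue and a shorter wait window. The value "enabled" is assumed to be the counterpart of the documented default "disabled", and the numbers are illustrative only:

```yaml
metadata:
  annotations:
    # Assumed value: counterpart of the default "disabled".
    inference.networking.x-k8s.io/queueing: "enabled"
    # Illustrative limits: 50 MiB total queue capacity and a 10-second wait window.
    inference-epp-env.networking.x-k8s.io/total-queue-capacity: "52428800"
    inference-epp-env.networking.x-k8s.io/queue-ttl: "10s"
```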
Inference framework support
| Annotation | Type | Default value | Description |
| --- | --- | --- | --- |
| inference.networking.x-k8s.io/model-server-runtime | string | vllm | Declares the inference framework used by the model servers behind the InferencePool so that the inference extension can apply framework-specific support. For the supported frameworks and their valid values, see Inference service framework support. |
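For example, the following excerpt declares the runtime explicitly; vllm is the documented default, and identifiers for other frameworks are listed in Inference service framework support:

```yaml
metadata:
  annotations:
    # vllm is the default; replace it with the identifier of your framework
    # as listed in Inference service framework support.
    inference.networking.x-k8s.io/model-server-runtime: "vllm"
```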
Method 2: Customize with a ConfigMap
This method requires Gateway with Inference Extension version 1.4.0-aliyun.2 or later.
The inference extension and the gateway are deployed in the envoy-gateway-system namespace. You can find their associated resources using a label selector. For example, to find the inference extension deployment for an InferencePool named qwen-pool in the default namespace, run the following command:
```shell
kubectl get deployments -n envoy-gateway-system -l inference-pool=qwen-pool,inference-pool-namespace=default
```
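Assuming the Service and PodDisruptionBudget created for the same InferencePool carry the same labels, they can be listed in the same way:

```shell
# Assumes the Service and PodDisruptionBudget use the same inference-pool labels as the Deployment.
kubectl get services,poddisruptionbudgets -n envoy-gateway-system \
  -l inference-pool=qwen-pool,inference-pool-namespace=default
```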
Follow these steps to apply a custom configuration using a ConfigMap:
1. Create and deploy a ConfigMap.
The following ConfigMap overrides the default container resource configurations and adds a podAntiAffinity rule.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-epp
data:
  deployment: |-
    spec:
      replicas: 1
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      inference-pool: qwen-pool
                      inference-pool-namespace: default
                  topologyKey: kubernetes.io/hostname
          containers:
          - name: inference-gateway-ext-proc
            resources:
              limits:
                cpu: '4'
                memory: 4G
              requests:
                cpu: 500m
                memory: 1G
```
2. Add an annotation to your InferencePool to specify the name of the ConfigMap that contains your custom configuration.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/epp-overlay: custom-epp # Specify the override configuration for the inference extension
  name: qwen-pool
spec:
  extensionRef:
    group: ''
    kind: Service
    name: qwen-ext-proc
  selector:
    app: qwen
  targetPortNumber: 8000
```
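To check that the override took effect, you can apply both manifests and then read back the container resources of the inference extension Deployment; the file names are placeholders, and the jsonpath query is only one way to inspect the result:

```shell
# Apply the ConfigMap and the annotated InferencePool (file names are placeholders).
kubectl apply -f custom-epp-configmap.yaml
kubectl apply -f qwen-pool.yaml

# Confirm that the resource overrides from the ConfigMap were applied to the
# inference extension Deployment.
kubectl get deployments -n envoy-gateway-system \
  -l inference-pool=qwen-pool,inference-pool-namespace=default \
  -o jsonpath='{.items[0].spec.template.spec.containers[0].resources}'
```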