Container Service for Kubernetes: Gateway with Inference Extension

Last Updated: Dec 04, 2025

The Gateway with Inference Extension component is an enhanced component based on the Kubernetes Gateway API and its Inference Extension specification. It provides intelligent load balancing for large language model (LLM) inference scenarios and supports Kubernetes Layer 4 and Layer 7 routing services. This topic introduces the Gateway with Inference Extension component, explains how to use it, and provides its change log.

Component information

Built on the Envoy Gateway project, the Gateway with Inference Extension component is compatible with Gateway API features and integrates the Gateway API Inference Extension. It primarily provides load balancing and routing for LLM inference services.
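
To illustrate the Layer 7 routing side, the following is a minimal sketch of a Gateway and an HTTPRoute that the component could serve. The GatewayClass name (example-gateway-class) and the backend Service (llm-backend) are hypothetical placeholders, not names defined by this document; check the GatewayClass that the component actually registers in your cluster.

```yaml
# Minimal sketch: a Layer 7 Gateway and HTTPRoute handled by the component.
# The GatewayClass and backend Service names are placeholders (hypothetical);
# run `kubectl get gatewayclass` to find the class installed in your cluster.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: example-gateway-class   # placeholder name
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1          # e.g. OpenAI-style inference endpoints
    backendRefs:
    - name: llm-backend     # a plain Service; InferencePool backends are shown in the change log entries below
      port: 8000
```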

Usage instructions

  • The Gateway with Inference Extension component depends on the CustomResourceDefinitions (CRDs) provided by the Gateway API component. Before you install the component, make sure that the Gateway API component is installed in your cluster. For more information, see Install components.

  • For more information about how to use the Gateway with Inference Extension component, see Gateway with Inference Extension overview.

Change log

September 2025

Version number: v1.4.0-apsara.3

Change date: September 4, 2025

Changes:

  • Supports inference routing for SGLang services with separate prefill and decode stages.

  • Supports prefix cache-aware routing in precise mode.

  • Supports routing to external Model-as-a-Service (MaaS) services.

  • Supports integration with Alibaba Cloud Content Moderation for AI content review.

  • Supports configuring inference routing policies using the InferenceTrafficPolicy API.

Impact: Upgrading from an earlier version restarts the gateway pod. Perform the upgrade during off-peak hours.

May 2025

Version number: v1.4.0-aliyun.1

Change date: May 27, 2025

Changes:

  • Supports Gateway API v1.3.0.

  • Inference extension:

    • Supports multiple inference service frameworks, such as vLLM, SGLang, and TensorRT-LLM.

    • Supports prefix-aware load balancing.

    • Supports routing inference services based on model names (see the sketch after this entry).

    • Supports inference request queuing and priority scheduling.

  • Provides observability for generative AI requests.

  • Supports global rate limiting.

  • Supports global rate limiting based on tokens in generative AI requests.

  • Supports adding Secret content to specified request headers.

Impact: Upgrading from an earlier version restarts the gateway pod. Perform the upgrade during off-peak hours.
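
The model-name-based routing listed above builds on the InferencePool and InferenceModel resources from the open source Gateway API Inference Extension. The sketch below follows the upstream v1alpha2 schema; the apiVersion and all object names are assumptions to verify against the CRDs actually shipped with the component.

```yaml
# Sketch of model-name-based inference routing (upstream v1alpha2 schema;
# verify against the CRDs installed by the component; names are illustrative).
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-worker            # label on the inference server pods
  extensionRef:
    name: endpoint-picker       # the endpoint-picker extension Service
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-model
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct   # matched against the "model" field in requests
  criticality: Critical                          # feeds request queuing and priority scheduling
  poolRef:
    name: vllm-pool
```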

April 2025

Version number: v1.3.0-aliyun.2

Change date: May 7, 2025

Changes:

  • Supports ACS clusters.

  • Inference extension enhancements: Supports referencing InferencePool resources in HTTPRoute, and supports InferencePool-level features such as weighted routing, traffic mirroring, and circuit breaking (see the sketch after this entry).

  • Supports prefix-aware load balancing.

Impact: Upgrading from an earlier version restarts the gateway pod. Perform the upgrade during off-peak hours.
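
Referencing an InferencePool from an HTTPRoute uses standard Gateway API backendRef semantics, which is also how InferencePool-level weighted routing is expressed. A minimal sketch, assuming the open source group and kind for the inference extension and illustrative pool names:

```yaml
# Weighted routing across two InferencePool backends (illustrative names).
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-canary-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io   # group/kind per the open source inference extension
      kind: InferencePool
      name: vllm-pool-stable
      weight: 90                              # ~90% of inference traffic
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-pool-canary
      weight: 10                              # ~10% canary traffic
```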

March 2025

Version number: v1.3.0-aliyun.1

Change date: March 12, 2025

Changes:

  • Supports Gateway API v1.2.

  • Supports Inference Extension, which provides intelligent load balancing for LLM inference scenarios.

Impact: This upgrade does not affect your services.