
Container Compute Service: Gateway with Inference Extension

Last Updated: Oct 14, 2025

Gateway with Inference Extension enhances the Kubernetes Gateway API with the Inference Extension specifications. It provides Layer 4 and Layer 7 routing services in Kubernetes and delivers intelligent load balancing for large language model (LLM) inference scenarios. This topic describes the usage notes and release notes of Gateway with Inference Extension.

Introduction

Gateway with Inference Extension is built on the Envoy Gateway project. It remains compatible with the Gateway API while integrating the Gateway API Inference Extension. The add-on primarily provides load balancing and routing capabilities for LLM inference services.

Usage notes

The installation and use of Gateway with Inference Extension depend on the custom resource definitions (CRDs) provided by Gateway API. Before installation, make sure that Gateway API is installed in the cluster.
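
You can verify this prerequisite by checking for the Gateway API CRDs. The following is a minimal sketch that assumes the standard upstream CRD names in the gateway.networking.k8s.io group:

    # List the Gateway API CRDs; non-empty output means Gateway API is installed.
    kubectl get crd | grep gateway.networking.k8s.io

    # Alternatively, check a single core CRD explicitly, for example HTTPRoute.
    kubectl get crd httproutes.gateway.networking.k8s.io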

Release notes

May 2025

Version number: 1.4.0-aliyun.1

Release date: 2025-05-27

Description:

  • Gateway API 1.3.0 is supported.

  • Inference Extension enhancements:

    • Multiple inference frameworks (such as vLLM, SGLang, and TensorRT-LLM) are supported.

    • Prefix-aware load balancing is optimized.

    • Routing for inference services can be implemented based on model names (see the sketch following this entry).

    • Request queuing and priority scheduling for inference workloads are supported.

  • Observability for generative AI requests is available.

  • Global throttling is supported.

  • Token-based global throttling for generative AI requests is available.

  • Secret content injection into specified request headers is supported.

Impact: Gateway pod restarts will occur during updates. We recommend performing these updates during off-peak hours.
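
To illustrate model name-based routing and priority scheduling, the following sketch uses the upstream Gateway API Inference Extension resources (inference.networking.x-k8s.io/v1alpha2). All names, labels, and the model identifier are placeholder assumptions, and the fields available in this add-on's build may differ:

    # Hypothetical InferencePool that groups vLLM pods serving one base model.
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: vllm-llama3-pool            # placeholder pool name
    spec:
      targetPortNumber: 8000            # port the inference server listens on
      selector:
        app: vllm-llama3                # placeholder pod label
      extensionRef:
        name: vllm-llama3-epp           # placeholder endpoint picker service
    ---
    # Hypothetical InferenceModel: requests that carry this model name are
    # routed to the pool above; criticality drives priority scheduling.
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: llama3-8b
    spec:
      modelName: meta-llama/Llama-3-8B  # placeholder model name
      criticality: Critical             # scheduled ahead of Sheddable workloads
      poolRef:
        name: vllm-llama3-pool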

April 2025

Version number: 1.3.0-aliyun.2

Release date: 2025-04-07

Description:

  • Alibaba Cloud Container Compute Service (ACS) clusters are supported.

  • Inference Extension enhancements:

    • InferencePool resource referencing is enabled in HTTPRoute (see the sketch following this entry).

    • Weight-based routing, traffic mirroring, and circuit breaking capabilities can be implemented at the InferencePool level.

  • Prefix-aware load balancing is supported.

Impact: Gateway pod restarts will occur during updates. We recommend performing these updates during off-peak hours.
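
As a sketch of InferencePool referencing and weight-based routing, the following HTTPRoute splits traffic between two hypothetical pools. The gateway, route, and pool names are placeholders; the backendRefs shape follows the upstream Gateway API and Inference Extension specifications:

    # Hypothetical HTTPRoute that targets InferencePool backends instead of
    # regular Services and splits traffic by weight.
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route             # placeholder route name
    spec:
      parentRefs:
      - name: inference-gateway         # placeholder Gateway name
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1                  # for example, an OpenAI-style API prefix
        backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-pool-stable        # placeholder pool
          weight: 90                    # weight-based traffic split
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-pool-canary        # placeholder pool
          weight: 10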