Container Service for Kubernetes: Inference Service Framework compatibility specifications

Last Updated: Jun 05, 2025

Gateway with Inference Extension supports multiple generative AI inference service frameworks and provides consistent capabilities, including phased release policies, inference load balancing, and model-based routing, for generative AI inference services regardless of the framework they are deployed on. This topic describes how Gateway with Inference Extension supports each of these inference service frameworks.
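
For example, model-based routing and phased release are configured through InferenceModel resources that reference an InferencePool. The following is a minimal sketch that assumes the upstream v1alpha2 InferenceModel API; the model names, weights, and pool name are illustrative placeholders.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: deepseek-r1-model                  # illustrative name
spec:
  modelName: deepseek-r1                   # model name that clients request
  criticality: Critical
  poolRef:                                 # InferencePool that serves this model
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: deepseek-sglang-pool
  targetModels:                            # weighted split for a phased release
  - name: deepseek-r1
    weight: 90
  - name: deepseek-r1-canary
    weight: 10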

Supported inference service frameworks

Inference service framework            Version requirements
vLLM v0                                v0.6.4 and later
vLLM v1                                v0.8.0 and later
SGLang                                 v0.3.6 and later
Triton with TensorRT-LLM backend       25.03 and later

vLLM support

vLLM is the default backend inference framework supported by Gateway with Inference Extension. When your inference services are built on vLLM, you can use the enhanced generative AI service capabilities without any additional configuration.
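
For reference, the following is a minimal sketch of an InferencePool for a vLLM-backed service. It follows the same structure as the examples later in this topic, only without the model-server-runtime annotation; the resource name, extension Service, and pod label are placeholders, and 8000 is vLLM's default serving port.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-vllm-pool                # illustrative name
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-vllm-ext-proc          # illustrative extension processing service
  selector:
    app: qwen-vllm                    # illustrative pod label
  targetPortNumber: 8000              # vLLM's default serving port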

SGLang support

When you use generative AI inference services built on SGLang, you can enable smart routing and load balancing for the SGLang inference service framework by adding the inference.networking.x-k8s.io/model-server-runtime: sglang annotation to the InferencePool resource.

The following is an example of an InferencePool for SGLang. No additional changes to other resources are required.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
  name: deepseek-sglang-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: deepseek-sglang-ext-proc
  selector:
    app: deepseek-r1-sglang
  targetPortNumber: 30000

TensorRT-LLM support

TensorRT-LLM is an open source engine provided by NVIDIA for optimizing LLM inference performance on NVIDIA GPUs. It is used to define LLMs and build optimized TensorRT engines, and it can be integrated with Triton as the TensorRT-LLM backend. Models built with TensorRT-LLM can run on one or more GPUs and support tensor parallelism and pipeline parallelism.

When you use generative AI inference services built with the Triton inference server and the TensorRT-LLM backend, you can enable smart routing and load balancing for TensorRT-LLM by adding the inference.networking.x-k8s.io/model-server-runtime: trt-llm annotation to the InferencePool resource.

The following is an example of an InferencePool for TensorRT-LLM. No additional changes to other resources are required.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: trt-llm
  name: qwen-trt-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: trt-llm-ext-proc
  selector:
    app: qwen-trt-llm
  targetPortNumber: 8000