Gateway with Inference Extension supports multiple generative AI inference service frameworks and provides consistent capabilities, such as phased release policies, inference load balancing, and model-based routing, for generative AI inference services deployed on any of them. This topic describes how Gateway with Inference Extension works with the different inference service frameworks.
Supported inference service frameworks
| Inference service framework | Version requirements |
| --- | --- |
| vLLM v0 | v0.6.4 and later |
| vLLM v1 | v0.8.0 and later |
| SGLang | v0.3.6 and later |
| Triton with TensorRT-LLM backend | 25.03 and later |
vLLM support
vLLM is the default backend inference framework supported by Gateway with Inference Extension. When you use inference services built on vLLM, you can utilize the enhanced generative AI service capabilities without additional configuration.
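For comparison with the SGLang and TensorRT-LLM examples below, the following is a minimal InferencePool sketch for a vLLM-backed service; note that no model-server-runtime annotation is needed. The resource name, extension Service name, label selector, and port (qwen-vllm-pool, qwen-vllm-ext-proc, app: qwen-vllm, 8000) are placeholders for illustration only; replace them with the values of your own deployment.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-vllm-pool            # placeholder name
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-vllm-ext-proc      # placeholder endpoint picker Service
  selector:
    app: qwen-vllm                # placeholder label on your vLLM Pods
  targetPortNumber: 8000          # port your vLLM server listens on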
SGLang support
When you use generative AI inference services built on SGLang, you can enable smart routing and load balancing capabilities for the SGLang inference service framework by adding the inference.networking.x-k8s.io/model-server-runtime: sglang annotation to the InferencePool.
The following is an example of an InferencePool configured for SGLang. No additional changes to other resources are required.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
  name: deepseek-sglang-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: deepseek-sglang-ext-proc
  selector:
    app: deepseek-r1-sglang
  targetPortNumber: 30000
TensorRT-LLM support
TensorRT-LLM is an open source engine provided by NVIDIA for optimizing LLM inference. It is used to define LLMs and build TensorRT engines that accelerate inference on NVIDIA GPUs, and it can be integrated with the Triton model server as the TensorRT-LLM backend. Models built with TensorRT-LLM can run on one or more GPUs and support tensor parallelism and pipeline parallelism.
When you use generative AI inference services built with the Triton model server and the TensorRT-LLM backend, you can enable smart routing and load balancing capabilities for TensorRT-LLM by adding the inference.networking.x-k8s.io/model-server-runtime: trt-llm annotation to the InferencePool.
The following is an example of an InferencePool configured for TensorRT-LLM. No additional changes to other resources are required.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: trt-llm
  name: qwen-trt-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: trt-llm-ext-proc
  selector:
    app: qwen-trt-llm
  targetPortNumber: 8000