Gateway with Inference Extension supports multiple generative AI inference frameworks and delivers consistent capabilities for AI inference services deployed on top of different inference frameworks. The capabilities include canary release strategies, inference load balancing, and model name-based inference routing. This topic introduces the generative AI inference frameworks supported by Gateway with Inference Extension and describes how to use the frameworks.
Supported inference frameworks
| Inference framework | Required version |
| --- | --- |
| vLLM v0 | ≥ v0.6.4 |
| vLLM v1 | ≥ v0.8.0 |
| SGLang | ≥ v0.3.6 |
| Triton with a TensorRT-LLM backend | ≥ 25.03 |
vLLM support
vLLM is the default backend inference framework supported by Gateway with Inference Extension. When you use vLLM-based inference services, no additional configuration is required to leverage generative AI-enhanced capabilities.
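For reference, the following is a minimal sketch of an InferencePool for a vLLM-based inference service; because vLLM is the default runtime, no model-server-runtime annotation is needed. The workload label app: qwen-vllm, the pool name, the extension Service name, and the port are placeholder assumptions for illustration.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  # No model-server-runtime annotation: vLLM is the default runtime.
  name: qwen-vllm-pool        # placeholder name
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-vllm-ext-proc  # placeholder extension Service
  selector:
    app: qwen-vllm            # placeholder workload label
  targetPortNumber: 8000      # assumed vLLM serving port
```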
SGLang support
When you deploy generative AI inference services with SGLang, add the inference.networking.x-k8s.io/model-server-runtime: sglang annotation to the InferencePool resource to enable intelligent routing and load balancing for inference services deployed on top of the SGLang framework.
The following code block shows a sample InferencePool configuration for an inference service deployed with SGLang. No additional configuration on other resources is required.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
  name: deepseek-sglang-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: deepseek-sglang-ext-proc
  selector:
    app: deepseek-r1-sglang
  targetPortNumber: 30000
```
TensorRT-LLM support
TensorRT-LLM is an open source engine provided by NVIDIA that is used to define LLMs and build TensorRT engines that optimize inference performance on NVIDIA GPUs. TensorRT-LLM can be integrated with Triton as its backend through the TensorRT-LLM Backend. Models built with TensorRT-LLM can run on one or more GPUs and support tensor parallelism and pipeline parallelism.
When you deploy generative AI inference services by using Triton with a TensorRT-LLM backend, add the inference.networking.x-k8s.io/model-server-runtime: trt-llm annotation to the InferencePool resource to enable intelligent routing and load balancing for inference services deployed on top of TensorRT-LLM.
The following code block shows a sample InferencePool configuration for an inference service deployed with TensorRT-LLM. No additional configuration on other resources is required.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: trt-llm
  name: qwen-trt-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: trt-llm-ext-proc
  selector:
    app: qwen-trt-llm
  targetPortNumber: 8000
```
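Regardless of the inference framework, traffic reaches an InferencePool through an HTTPRoute that references the pool as a backend. The sketch below routes requests on a Gateway to the qwen-trt-pool defined above; the Gateway name inference-gateway and the path match are assumptions for illustration.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-trt-route
spec:
  parentRefs:
  - name: inference-gateway   # assumed Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /              # forward all inference requests
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwen-trt-pool     # the pool defined above
```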