Generative AI inference framework support - Container Service for Kubernetes (ACK) - Alibaba Cloud

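Gateway with Inference Extension supports multiple generative AI inference frameworks and provides the same capabilities for an inference service regardless of the framework it is deployed on. These capabilities include canary release policies, inference load balancing, and routing based on model name. This topic describes how Gateway with Inference Extension supports each inference framework and how to configure it.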

Supported inference frameworks

Inference framework	Required version
vLLM v0	v0.6.4 or later
vLLM v1	v0.8.0 or later
SGLang	v0.3.6 or later
Triton with the TensorRT-LLM backend	25.03 or later

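vLLM support

vLLM is the default backend inference framework for Gateway with Inference Extension. If your inference service is built on vLLM, you can use the enhanced generative AI service capabilities without any additional configuration.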
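For comparison with the annotated samples in the following sections, here is a minimal sketch of what an InferencePool for a vLLM service might look like. Because vLLM is the default, it carries no model-server-runtime annotation. The resource name, the extensionRef Service name, the selector label, and the port are hypothetical placeholders, not values from this topic.

```yaml
# Sketch only: all names and the port below are assumed for illustration.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  # No model-server-runtime annotation: vLLM is the default runtime.
  name: example-vllm-pool            # hypothetical pool name
spec:
  extensionRef:
    group: ""
    kind: Service
    name: example-vllm-ext-proc      # hypothetical extension Service
  selector:
    app: example-vllm                # hypothetical label on the vLLM pods
  targetPortNumber: 8000             # assumed serving port of the vLLM pods
```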

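SGLang support

If your generative AI inference service is built on SGLang, add the annotation inference.networking.x-k8s.io/model-server-runtime: sglang to the InferencePool to enable intelligent routing and load balancing for the SGLang inference framework.

The following is a sample InferencePool for SGLang. No changes to other resources are required.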

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
  name: deepseek-sglang-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: deepseek-sglang-ext-proc
  selector:
    app: deepseek-r1-sglang
  targetPortNumber: 30000
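
The annotation is set only on the InferencePool, so resources that reference the pool keep their usual form. As an illustration, the following sketch shows an InferenceModel that points at the pool above, assuming the upstream v1alpha2 InferenceModel schema. The resource name and the modelName value are hypothetical.

```yaml
# Sketch only: the resource name and modelName are assumed for illustration.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: example-deepseek-model       # hypothetical resource name
spec:
  modelName: example-deepseek-r1     # hypothetical model name that clients send in requests
  criticality: Standard
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: deepseek-sglang-pool       # the SGLang pool defined above
```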

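TensorRT-LLM support

TensorRT-LLM is NVIDIA's open source optimization engine for large language models (LLMs). It is used to define an LLM and build it into a TensorRT engine, which improves inference efficiency on NVIDIA GPUs. TensorRT-LLM can also be combined with Triton as one of its backends, the TensorRT-LLM Backend. Models built with TensorRT-LLM can run on one or more GPUs and support tensor parallelism and pipeline parallelism.

If your generative AI inference service is built on a Triton model server with the TensorRT-LLM backend, add the annotation inference.networking.x-k8s.io/model-server-runtime: trt-llm to the InferencePool to enable intelligent routing and load balancing for TensorRT-LLM.

The following is a sample InferencePool for TensorRT-LLM. No changes to other resources are required.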

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: trt-llm
  name: qwen-trt-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: trt-llm-ext-proc
  selector:
    app: qwen-trt-llm
  targetPortNumber: 8000
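
To show how traffic might reach this pool, the following sketch is an HTTPRoute that uses the InferencePool as its backend, in the style of the upstream Gateway API Inference Extension examples. The route name and the Gateway name in parentRefs are hypothetical. The route does not depend on the runtime annotation, which is consistent with the note that other resources need no changes.

```yaml
# Sketch only: the route name and Gateway name are assumed for illustration.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-trt-llm-route        # hypothetical route name
spec:
  parentRefs:
  - name: example-inference-gateway  # hypothetical Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwen-trt-pool            # the TensorRT-LLM pool defined above
```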