Configure smart routing for LLM inference services by using the inference gateway - Container Service for Kubernetes

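Traditional load balancing algorithms can distribute standard HTTP requests evenly across workloads. For large language model (LLM) inference services, however, the load that each request places on the backend is hard to predict. Gateway with Inference Extension is an extension component built on the Kubernetes Gateway API and its Inference Extension specification. It uses smart routing to improve load balancing across multiple inference service workloads. The gateway provides load balancing policies for different LLM inference scenarios and supports features such as phased releases and inference request queuing.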

Prerequisites

You have deployed the Gateway with Inference Extension component.
You have deployed a single-node LLM inference service or a multi-node distributed inference service.

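Step 1: Configure smart routing for the inference service

Gateway with Inference Extension provides two smart routing load balancing policies to meet the needs of different inference services.

Load balancing based on request queue length and GPU cache utilization (the default policy).
Prefix-aware load balancing (Prefix Cache Aware Routing).

To enable the smart routing feature of the inference gateway, declare InferencePool and InferenceModel resources for the inference service. Adjust the configuration of these resources to match the deployment method of the backend and the load balancing policy that you choose.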

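Load balancing based on request queue length and GPU cache utilization

If an InferencePool does not set the routing-strategy annotation (for example, when its annotations are empty), the smart routing policy based on request queue length and GPU cache utilization is used by default. This policy dynamically assigns requests based on the real-time load of the backend inference service, including request queue length and GPU cache utilization, to balance load across the backends.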
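
The idea behind this policy can be illustrated with a minimal Python sketch. It is not the gateway's implementation: the `PodLoad` type, the `pick_pod` function, the `max_queue` cap, and the equal weights are all hypothetical and chosen only to show how queue length and GPU cache utilization could be combined into one score.

```python
from dataclasses import dataclass


@dataclass
class PodLoad:
    name: str
    queue_length: int        # requests waiting on the pod
    gpu_cache_usage: float   # KV cache utilization, 0.0 to 1.0


def pick_pod(pods, max_queue=32, queue_weight=0.5, cache_weight=0.5):
    """Return the pod with the lowest combined load score (hypothetical scoring)."""
    def score(pod):
        # Normalize the queue length to 0..1 so both terms are comparable.
        queue_term = min(pod.queue_length / max_queue, 1.0)
        return queue_weight * queue_term + cache_weight * pod.gpu_cache_usage
    return min(pods, key=score)


pods = [
    PodLoad("pod-a", queue_length=8, gpu_cache_usage=0.90),
    PodLoad("pod-b", queue_length=2, gpu_cache_usage=0.40),
    PodLoad("pod-c", queue_length=0, gpu_cache_usage=0.95),
]
chosen = pick_pod(pods)
print(chosen.name)
```

In this sketch an idle pod with a nearly full cache (pod-c) still loses to a lightly loaded pod with free cache (pod-b), which is the kind of trade-off a load-aware policy has to make.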

Create a file named inference_networking.yaml.

Single-node vLLM deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Single-node SGLang deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Distributed vLLM deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Distributed SGLang deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

SGLang prefill-decode (PD) disaggregated deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # Specifies SGLang as the runtime framework of the backend service
  profile:
    pd:  # Specifies that the backend service is deployed in PD-disaggregated mode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Uses a pod label to distinguish the prefill and decode roles in the InferencePool
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used for KV cache transfer in the SGLang PD-disaggregated service. It must match the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.

Create the load balancing configuration based on request queue length and GPU cache utilization.
```
kubectl create -f inference_networking.yaml
```

Prefix-aware load balancing (Prefix Cache Aware Routing)

The Prefix Cache Aware Routing policy sends requests that share the same prefix content to the same inference server pod whenever possible. When automatic prefix caching (APC) is enabled on the model server, this policy can increase the prefix cache hit rate and reduce response time.

Important

vLLM v0.9.2 and the SGLang framework used in this document enable prefix caching by default. You do not need to redeploy the service to enable prefix caching.
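
As a rough illustration of prefix-aware routing, the following Python sketch remembers which pod served each prompt prefix and reuses that pod when a later request shares the prefix. It is a simplified model under stated assumptions, not the gateway's algorithm: the `PrefixRouter` class, the character-based `BLOCK_SIZE`, and the round-robin fallback are hypothetical, and a real router would typically work on tokens and also weigh pod load.

```python
import hashlib

BLOCK_SIZE = 16  # characters per prefix block (assumed value)


def prefix_hashes(prompt, block_size=BLOCK_SIZE):
    """Hash each cumulative prefix of the prompt, one hash per full block."""
    hashes = []
    for end in range(block_size, len(prompt) + 1, block_size):
        hashes.append(hashlib.sha256(prompt[:end].encode("utf-8")).hexdigest())
    return hashes


class PrefixRouter:
    def __init__(self, pods):
        self.pods = list(pods)
        self.seen = {}      # prefix hash -> pod that last served it
        self.next_pod = 0   # round-robin cursor used when no prefix matches

    def route(self, prompt):
        hashes = prefix_hashes(prompt)
        pod = None
        # Walk from the longest prefix to the shortest and reuse the first match.
        for h in reversed(hashes):
            if h in self.seen:
                pod = self.seen[h]
                break
        if pod is None:
            pod = self.pods[self.next_pod % len(self.pods)]
            self.next_pod += 1
        # Record every prefix of this prompt as cached on the chosen pod.
        for h in hashes:
            self.seen[h] = pod
        return pod


router = PrefixRouter(["pod-a", "pod-b"])
system_prompt = "You are a helpful assistant playing zork. "
first = router.route(system_prompt + "Are you ready?")
second = router.route(system_prompt + "Open the mailbox.")
other = router.route("Completely different request text here.")
print(first, second, other)
```

The two requests that start with the same system prompt land on the same pod, while the unrelated request falls through to the next pod in the rotation.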

To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"

Create a file named Prefix_Cache.yaml.

Single-node vLLM deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Single-node SGLang deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Distributed vLLM deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Distributed SGLang deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

SGLang prefill-decode (PD) disaggregated deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # Specifies SGLang as the runtime framework of the backend service
  profile:
    pd:  # Specifies that the backend service is deployed in PD-disaggregated mode
      trafficPolicy:
        prefixCache: # Declares the prefix cache load balancing policy
          mode: estimate
      prefillPolicyRef: prefixCache
      decodePolicyRef: prefixCache # Applies prefix-aware load balancing to both prefill and decode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Uses a pod label to distinguish the prefill and decode roles in the InferencePool
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used for KV cache transfer in the SGLang PD-disaggregated service. It must match the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.

Create the prefix-aware load balancing configuration.
```
kubectl create -f Prefix_Cache.yaml
```

The following table describes the configuration items of InferencePool and InferenceModel.

Configuration item	Type	Description	Default value
metadata.annotations.inference.networking.x-k8s.io/model-server-runtime	string	The model server runtime, such as sglang.	None
metadata.annotations.inference.networking.x-k8s.io/routing-strategy	string	The routing policy. Valid values: DEFAULT and PREFIX_CACHE.	The smart routing policy based on request queue length and GPU cache utilization
spec.targetPortNumber	int	The port number of the inference service.	None
spec.selector	map[string]string	The selector used to match the pods of the inference service.	None
spec.extensionRef	ObjectReference	The declaration of the inference extension service.	None
spec.modelName	string	The model name used for route matching.	None
spec.criticality	string	The criticality level of the model. Valid values: Critical and Standard.	None
spec.poolRef	PoolReference	The associated InferencePool resource.	None
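
To show how the annotations and spec fields fit together, here is a small Python sketch that assembles an InferencePool manifest from a few parameters and prints it as JSON, a format that kubectl also accepts. The `build_inference_pool` helper and its parameter names are hypothetical; the field names, annotation keys, and values are copied from the manifests in this document.

```python
import json

ANNOTATION_PREFIX = "inference.networking.x-k8s.io"


def build_inference_pool(name, selector, port=8000, runtime=None, routing_strategy=None):
    """Build an InferencePool manifest as a plain dict (hypothetical helper)."""
    annotations = {}
    if runtime:
        annotations[f"{ANNOTATION_PREFIX}/model-server-runtime"] = runtime
    if routing_strategy:
        annotations[f"{ANNOTATION_PREFIX}/routing-strategy"] = routing_strategy
    metadata = {"name": name}
    if annotations:
        # Omit the annotations key entirely when nothing is set,
        # which leaves the pool on the default routing policy.
        metadata["annotations"] = annotations
    return {
        "apiVersion": "inference.networking.x-k8s.io/v1alpha2",
        "kind": "InferencePool",
        "metadata": metadata,
        "spec": {
            "targetPortNumber": port,
            "selector": dict(selector),
            "extensionRef": {"name": "inference-gateway-ext-proc"},
        },
    }


pool = build_inference_pool(
    "qwen-inference-pool",
    {"alibabacloud.com/inference-workload": "sgl-inference"},
    runtime="sglang",
    routing_strategy="PREFIX_CACHE",
)
print(json.dumps(pool, indent=2))
```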

Step 2: Deploy the gateway

Create a file named gateway_networking.yaml.

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io

Create the GatewayClass, Gateway, and HTTPRoute resources to set up a route to the LLM inference service on port 8080.
```
kubectl create -f gateway_networking.yaml
```

Step 3: Verify the inference gateway configuration

Run the following command to obtain the external endpoint of the gateway:

export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')

Use curl to test access to the service on port 8080:

curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
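
The same request can also be sent from Python with only the standard library. This is a minimal sketch: the `build_chat_request` helper is hypothetical, the placeholder host is used only when GATEWAY_HOST is not set, and the response parsing assumes that the backend returns an OpenAI-compatible chat completion body.

```python
import json
import os
import urllib.request


def build_chat_request(host, prompt, model="/models/Qwen3-32B", max_tokens=50, port=8080):
    """Build a POST request equivalent to the curl command above (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


host = os.environ.get("GATEWAY_HOST")
request = build_chat_request(host or "gateway.example.invalid", "Hello, this is a test")

# Only send the request when GATEWAY_HOST is set, as in the export command above.
if host:
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read())
        # Assumes an OpenAI-compatible response shape.
        print(body["choices"][0]["message"]["content"])
```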

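Verify the load balancing policies.

Verify the load balancing policy based on request queue length and GPU cache utilization

The default policy performs smart routing based on request queue length and GPU cache utilization. You can observe its behavior by stress testing the inference service and monitoring the time to first token (TTFT) and throughput metrics.

For details about the test method, see "Configure observability metrics and dashboards for LLM services".

Verify prefix-aware load balancing

Create test files to verify that prefix-aware load balancing works.

Generate round1.txt: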

cat > round1.txt <<'EOF'
{"max_tokens":24,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}
EOF

Generate round2.txt:

cat > round2.txt <<'EOF'
{"max_tokens":3,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}
EOF

Run the following commands to perform the test:

curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt

Check the logs of the Inference Extension Processor to confirm that prefix-aware load balancing works:
```
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"
```
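If both log entries show the same pod name, prefix-aware load balancing is working.
For details about the test method and results for prefix-aware load balancing, see "Evaluate inference service performance using multi-turn conversation tests".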
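
The two test requests are expected to land on the same pod because the second conversation begins with the full content of the first. The following Python sketch estimates how much of a second request repeats the first. It uses shortened stand-in messages and a hypothetical `flatten` format; the model server applies its own chat template and tokenizer, so the real overlap is measured in tokens, not characters.

```python
import os


def flatten(messages):
    """Join chat messages into one string (a stand-in for the real chat template)."""
    return "".join(f"{m['role']}:{m['content']}\n" for m in messages)


def shared_prefix_ratio(first_prompt, second_prompt):
    """Fraction of the second prompt that is a character prefix shared with the first."""
    common = os.path.commonprefix([first_prompt, second_prompt])
    return len(common) / len(second_prompt) if second_prompt else 0.0


# Shortened versions of the round1.txt and round2.txt conversations.
round1 = [{"role": "user", "content": "Hi, here's some system prompt. Are you ready?"}]
round2 = round1 + [
    {"role": "assistant", "content": "I'm ready to play Zork!"},
    {"role": "user", "content": "Opening the mailbox reveals: A leaflet."},
]

ratio = shared_prefix_ratio(flatten(round1), flatten(round2))
print(f"{ratio:.0%} of round 2 repeats round 1")
```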