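Safely mirror inference traffic with ACK Gateway with Inference Extension

ACK Gateway with Inference Extension supports traffic mirroring for inference requests in addition to intelligent load balancing for inference services. When you deploy a new inference model to production, you can mirror production traffic to it to evaluate the new model and confirm that its performance and stability meet your requirements before the official release. This topic describes how to implement traffic mirroring for inference requests with ACK Gateway with Inference Extension.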

Important

Before you read this topic, make sure that you understand the concepts of InferencePool and InferenceModel.

Prerequisites

An ACK managed cluster with a GPU node pool is created. Alternatively, you can install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
ACK Gateway with Inference Extension is installed and Enable Gateway API Inference Extension is selected. For details, see "Step 2: Install the ACK Gateway with Inference Extension component".

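Note

The image used in this guide requires more than 16 GiB of GPU memory. A T4 card, which has 16 GiB of memory, may not be sufficient. We therefore recommend the A10 card type for ACK clusters and the 8th-generation GPU B type for ACS GPU computing power.

Because LLM images are large, we recommend that you transfer the image to ACR in advance and pull it over the internal network address. The speed of pulling from the public network depends on the bandwidth of the cluster's elastic IP address (EIP) and may lead to long wait times.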

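Workflow

This example deploys the following resources:

Two inference services: vllm-llama2-7b-pool and vllm-llama2-7b-pool-1 (APP and APP1 in the figure below).
A gateway that is exposed through a ClusterIP Service.
An HTTPRoute that configures the traffic forwarding and mirroring rules.
An InferencePool and the corresponding InferenceModel, which enable intelligent load balancing for APP, and an ordinary Service for APP1. Intelligent load balancing is currently not supported for mirrored traffic, so APP1 must be reached through an ordinary Service.
A sleep application that serves as the test client.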

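The following figure shows the traffic mirroring process.

When a client accesses the gateway, the HTTPRoute identifies production traffic based on a prefix match rule.
After the rule matches:
- Production traffic is forwarded as usual to the corresponding InferencePool and, after intelligent load balancing, to the backend APP.
- The RequestMirror filter in the rule sends a copy of the traffic to the specified Service, which forwards it to the backend APP1.
Both APP and APP1 return normal responses, but the gateway processes only the response returned by the InferencePool and ignores the response from the mirrored Service. The client sees only the result produced by the primary service.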

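Procedure

Deploy the sample inference services vllm-llama2-7b-pool and vllm-llama2-7b-pool-1.

This step provides only the YAML file for vllm-llama2-7b-pool. The configuration of vllm-llama2-7b-pool-1 is identical except for the name. To deploy the vllm-llama2-7b-pool-1 inference service, change the corresponding fields in the following YAML file: the Deployment name and the app label in the selector and the Pod template.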

# =============================================================
# inference_app.yaml
# =============================================================
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
      {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
      {% set messages = messages[1:] %}
    {% else %}
        {% set system_message = '' %}
    {% endif %}

    {% for message in messages %}
        {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
            {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
        {% endif %}

        {% if loop.index0 == 0 %}
            {% set content = system_message + message['content'] %}
        {% else %}
            {% set content = message['content'] %}
        {% endif %}
        {% if message['role'] == 'user' %}
            {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
        {% elif message['role'] == 'assistant' %}
            {{ ' ' + content | trim + ' ' + eos_token }}
        {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
          - "--model"  # path of the base model inside the image
          - "/model/llama2"
          - "--tensor-parallel-size"
          - "1"
          - "--port"  # must match targetPortNumber of the InferencePool
          - "8000"
          - '--gpu_memory_utilization'
          - '0.8'
          - "--enable-lora"
          - "--max-loras"
          - "4"
          - "--max-cpu-loras"
          - "12"
          - "--lora-modules"  # name=path pairs of the LoRA adapters to serve
          - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
          - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
          - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
          - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
          - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
          - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
          - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
          - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
          - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
          - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
          - '--chat-template'  # file mounted from the chat-template ConfigMap
          - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory  # memory-backed /dev/shm
        - name: chat-template
          configMap:
            name: chat-template
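
Because the second service differs from the first only by name, one way to produce its manifest is to rewrite the name in the file above. The following is a minimal sketch rather than part of the documented procedure: the helper name rename_pool is hypothetical, and the file name inference_app.yaml is assumed from the header comment of the manifest.

```shell
# Hypothetical helper: reads a manifest on stdin and appends "-1" to every
# occurrence of the pool name (the Deployment name and the app labels).
rename_pool() {
  sed 's/vllm-llama2-7b-pool/vllm-llama2-7b-pool-1/g'
}

# Usage sketch, assuming the manifest above is saved as inference_app.yaml:
#   kubectl apply -f inference_app.yaml
#   rename_pool < inference_app.yaml | kubectl apply -f -
```

The chat-template ConfigMap does not contain the pool name, so the rewritten manifest applies it again unchanged.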

Deploy the InferencePool and InferenceModel, and the Service for the vllm-llama2-7b-pool-1 application.

# =============================================================
# inference_rules.yaml
# =============================================================
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: /model/llama2
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: /model/llama2
    weight: 100
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool-1
spec:
  selector:
    app: vllm-llama2-7b-pool-1
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
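
Loading the model can take a while, so it may help to wait until both Deployments have finished rolling out before you create the gateway. The following is a sketch under stated assumptions: the helper name wait_for_inference_services and the 30m default timeout are hypothetical choices, not values from this topic.

```shell
# Hypothetical helper: block until both sample Deployments report a completed
# rollout. The optional first argument is the kubectl timeout (assumed
# default: 30m). Returns non-zero as soon as one rollout check fails.
wait_for_inference_services() {
  rollout_timeout="${1:-30m}"
  for name in vllm-llama2-7b-pool vllm-llama2-7b-pool-1; do
    kubectl rollout status "deployment/${name}" --timeout="${rollout_timeout}" || return 1
  done
}

# Usage sketch:
#   wait_for_inference_services 45m && kubectl get endpoints vllm-llama2-7b-pool-1
```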

Deploy the gateway and the HTTPRoute.

The gateway uses a ClusterIP Service, which can be accessed only from within the cluster. You can change the Service type to LoadBalancer if your scenario requires it.

# =============================================================
# gateway.yaml
# =============================================================
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: example-gateway-class
  labels:
    example: http-routing
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    example: http-routing
  name: example-gateway
  namespace: default
spec:
  gatewayClassName: example-gateway-class
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mirror-route
  labels:
    example: http-routing
spec:
  parentRefs:
    - name: example-gateway
  hostnames:
    - "example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
      - group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-llama2-7b-pool
        weight: 1
      filters:
      - type: RequestMirror
        requestMirror:
          backendRef:
            kind: Service
            name: vllm-llama2-7b-pool-1
            port: 8000
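
After the route is applied, its status conditions indicate whether the gateway accepted it and whether the backend references, including the mirror Service, were resolved. The following sketch reads one condition with a kubectl JSONPath filter. The helper name route_condition_status is hypothetical; the Accepted and ResolvedRefs condition types come from the Gateway API, and the sketch assumes the route has a single parent.

```shell
# Hypothetical helper: print the status ("True" or "False") of one condition
# that the gateway controller reports for an HTTPRoute.
#   $1: route name (default: mirror-route)
#   $2: condition type (default: Accepted)
route_condition_status() {
  kubectl get httproute "${1:-mirror-route}" \
    -o jsonpath="{.status.parents[0].conditions[?(@.type==\"${2:-Accepted}\")].status}"
}

# Usage sketch:
#   [ "$(route_condition_status)" = "True" ] || kubectl describe httproute mirror-route
#   route_condition_status mirror-route ResolvedRefs
```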

Deploy the sleep application.

# =============================================================
# sleep.yaml
# =============================================================
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]  # keep the Pod running as a curl client
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true

Verify traffic mirroring.

Obtain the gateway address.

export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
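
If the gateway has not been assigned an address yet, the JSONPath query above returns an empty string and the test request would fail with a confusing error. The following guard is a small sketch of one way to catch that early; the helper name require_gateway_address and its messages are hypothetical.

```shell
# Hypothetical helper: fail early with a message when GATEWAY_ADDRESS is
# empty, otherwise print the address that will be used.
require_gateway_address() {
  if [ -z "${GATEWAY_ADDRESS:-}" ]; then
    echo "GATEWAY_ADDRESS is empty; the gateway may not have an address yet" >&2
    return 1
  fi
  echo "Using gateway address: ${GATEWAY_ADDRESS}"
}

# Usage sketch:
#   require_gateway_address || kubectl get gateway/example-gateway -o yaml
```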

Send a test request.

kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions -H 'Content-Type: application/json' -H "host: example.com" -d '{
    "model": "/model/llama2",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
}'

Expected output:

{"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n        ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}

Check the application logs.

echo "original logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool | grep /v1/chat/completions | grep OK
echo "mirror logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool-1 | grep /v1/chat/completions | grep OK

Expected output:

original logs↓↓↓
INFO:     10.2.14.146:39478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.2.14.146:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
mirror logs↓↓↓
INFO:     10.2.14.146:39742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.2.14.146:59976 - "POST /v1/chat/completions HTTP/1.1" 200 OK

The output shows that requests are routed to both vllm-llama2-7b-pool and vllm-llama2-7b-pool-1, which indicates that traffic mirroring works as expected.
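
For a rough numeric check, the successful chat completion requests in each log can be counted and compared. The following is a sketch, not part of the documented procedure: the helper name count_ok_completions is hypothetical, and the pattern assumes the access log format shown in the expected output above.

```shell
# Hypothetical helper: count successful chat completion requests in the
# access log read from stdin. Prints 0 when nothing matches.
count_ok_completions() {
  grep -c '"POST /v1/chat/completions HTTP/1.1" 200 OK' || true
}

# Usage sketch:
#   original=$(kubectl logs deployments/vllm-llama2-7b-pool | count_ok_completions)
#   mirror=$(kubectl logs deployments/vllm-llama2-7b-pool-1 | count_ok_completions)
#   echo "original=${original} mirror=${mirror}"
```

Equal counts suggest that every request was mirrored, provided both Pods have been running, and therefore logging, for the same period.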