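Implement LLM token rate limiting based on request headers in ASM - Alibaba Cloud Service Mesh

Service Mesh (ASM) can limit the number of LLM tokens that a specified client consumes, based on request attributes (such as TCP attributes, HTTP headers, paths, and hosts) and route destinations. This topic demonstrates how to limit token consumption by using an LLM request header. If the quota is exceeded, the proxy returns a response directly and the request is not sent to the external service.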

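Background information

Feature overview

ASM implements LLM token rate limiting as a Wasm plug-in. The feature has two parts: a rate limiting plug-in and a rate limiting service. The plug-in intercepts each request, extracts the rate limiting key, and queries the rate limiting service to determine whether the request should be rate-limited. During the LLM response phase, the plug-in calls the rate limiting service again to update the rate limiting record for that key.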

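Note

As shown in the figure above, step ⑥ (return the LLM response) and step ⑥ (update the rate limiting record) run asynchronously and do not block each other.

You maintain the rate limiting service yourself, and the ASM rate limiting plug-in calls it through a standard HTTP interface. You can choose a rate limiting algorithm (token bucket, leaky bucket, sliding window, and so on) that fits your business scenario, and you can adjust the rate limiting rules dynamically based on the load of the backend service. ASM also provides a default implementation based on the token bucket algorithm, which relies on a Redis database to store rate limiting records.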
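To make the token bucket idea concrete, the following is a minimal in-memory sketch in Python. It is not the ASM implementation: the class name `TokenBucket`, the methods `allow` and `consume`, and the explicit `now` argument are all hypothetical, and a real service would keep its state in Redis rather than in process memory. The sketch assumes the two-phase flow described above (a check before the request is forwarded, then a deduction once the token usage of the LLM response is known), and it assumes that the bucket starts full and that an overdraft is carried as a negative balance.

```python
class TokenBucket:
    """Hypothetical in-memory token bucket; not the ASM implementation."""

    def __init__(self, max_tokens, tokens_per_fill, fill_interval_s, now=0.0):
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval_s = fill_interval_s
        self.tokens = max_tokens      # assumption: the bucket starts full
        self.last_fill = now

    def _refill(self, now):
        # Add tokens for every whole fill interval that has elapsed.
        fills = int((now - self.last_fill) // self.fill_interval_s)
        if fills > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + fills * self.tokens_per_fill)
            self.last_fill += fills * self.fill_interval_s

    def allow(self, now):
        """Request phase: admit the request only if tokens remain."""
        self._refill(now)
        return self.tokens > 0

    def consume(self, used_tokens, now):
        """Response phase: deduct the tokens the LLM reported as used."""
        self._refill(now)
        # Assumption: the balance may go negative and is repaid by later refills.
        self.tokens -= used_tokens


bucket = TokenBucket(max_tokens=200, tokens_per_fill=50, fill_interval_s=30)
print(bucket.allow(now=0))    # True: the bucket is full
bucket.consume(105, now=1)    # first response used 105 tokens
bucket.consume(105, now=2)    # second response used 105 tokens
print(bucket.allow(now=3))    # False: the balance is -10
print(bucket.allow(now=30))   # True: one refill brings the balance to 40
```

Because the token cost of a request is only known after the LLM responds, this sketch checks the balance up front and charges afterwards, so a client can overshoot its quota by at most one response.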

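Scenarios

This capability applies to the following two scenarios:

Clients calling an external LLM service: external LLM services usually bill by the number of tokens that a request consumes. LLM token rate limiting helps keep client spending under control.
Inference service providers: external clients call an inference service that runs in your cluster. Inference services require substantial compute resources, and LLM token rate limiting prevents a single customer from consuming excessive resources in a short period and making the service unavailable to other users.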

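Prerequisites

A cluster has been added to the ASM instance, and the ASM instance is version 1.23 or later.
You have read "Traffic routing: Use ASM to efficiently manage LLM traffic" and followed Step 1 and Step 2 of that topic to deploy the LLMProvider and related resources.
A Redis service is deployed in the cluster or locally. You can also use Tair (Redis OSS-compatible) to quickly create a Redis instance. For details, see the quick start overview.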

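About the example

This example uses a request header to distinguish two types of users, regular-user and subscriber. When the Wasm plug-in receives a request, it reads the request header, extracts the final rate limiting key, and sends the key to the rate limiting service. The rate limiting service applies the configured rules so that subscribers can consume more tokens while regular users can consume only a small number.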

Step 1: Deploy the rate limiting service

Create a file named token-limit.yaml with the following content.


```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: asm-llm-token-rate-limit-example
---
apiVersion: v1
kind: Service
metadata:
  name: asm-llm-token-rate-limit-example
  labels:
    app: asm-llm-token-rate-limit-example
    service: asm-llm-token-rate-limit-example
spec:
  ports:
  - name: http
    port: 80
    targetPort: 8080
  selector:
    app: asm-llm-token-rate-limit-example
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: asm-llm-token-rate-limit-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: asm-llm-token-rate-limit-example
      version: v1
  template:
    metadata:
      labels:
        app: asm-llm-token-rate-limit-example
        version: v1
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      tolerations:
      - key: "node.kubernetes.io/disk-pressure"
        operator: "Equal"
        value: ""
        effect: "NoSchedule"
      serviceAccountName: asm-llm-token-rate-limit-example
      containers:
      - image: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-token-rate-limit-example:v1.23.6.34-g92d6a4b-aliyun
        imagePullPolicy: IfNotPresent
        name: asm-llm-token-rate-limit-example
        ports:
        - containerPort: 8080
        env:
        - name: REDIS_ADDRESS
          value: ${redis-address}:${redis-port}
        - name: REDIS_PASSWORD
          value: "${redis-user}:${password}"
        - name: RATE_LIMIT_CONFIG
          value: |
            [
              {
                "rate_limit_key_regex": "regular-user.*",
                "redis_expired_seconds": 300,
                "fill_interval_second": 30,
                "tokens_per_fill": 50,
                "max_tokens": 200
              },
              {
                "rate_limit_key_regex": "subscriber.*",
                "redis_expired_seconds": 600,
                "fill_interval_second": 60,
                "tokens_per_fill": 100,
                "max_tokens": 1000
              }
            ]
        resources:
          limits:
            memory: 256Mi
            cpu: 200m
          requests:
            memory: 64Mi
            cpu: 50m
```

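The preceding YAML sets the RATE_LIMIT_CONFIG environment variable, which defines the following rules:

When the rate limiting key taken from the request header matches the rate_limit_key_regex regular-user.*, the rule is: the record expires from Redis after 300 seconds, the token bucket is refilled every 30 seconds, each refill adds 50 tokens, and the bucket holds at most 200 tokens.
When the rate limiting key taken from the request header matches the rate_limit_key_regex subscriber.*, the rule is: the record expires from Redis after 600 seconds, the token bucket is refilled every 60 seconds, each refill adds 100 tokens, and the bucket holds at most 1000 tokens.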
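One way to read these two rules is sketched below in Python: pick the first rule whose rate_limit_key_regex matches a key, then derive the long-run refill rate from tokens_per_fill and fill_interval_second. The JSON values are copied from the YAML above. The helper names `find_rule` and `sustained_tokens_per_minute`, the first-match-wins order, and the use of full-string matching are assumptions made for illustration, not documented behavior of the rate limiting service.

```python
import json
import re

RATE_LIMIT_CONFIG = json.loads("""
[
  {"rate_limit_key_regex": "regular-user.*", "redis_expired_seconds": 300,
   "fill_interval_second": 30, "tokens_per_fill": 50, "max_tokens": 200},
  {"rate_limit_key_regex": "subscriber.*", "redis_expired_seconds": 600,
   "fill_interval_second": 60, "tokens_per_fill": 100, "max_tokens": 1000}
]
""")


def find_rule(key, rules=RATE_LIMIT_CONFIG):
    """Return the first rule whose regex matches the whole key, else None.

    First-match-wins and full-string matching are assumptions.
    """
    for rule in rules:
        if re.fullmatch(rule["rate_limit_key_regex"], key):
            return rule
    return None


def sustained_tokens_per_minute(rule):
    """Long-run refill rate once the initial burst (max_tokens) is spent."""
    return rule["tokens_per_fill"] * 60 / rule["fill_interval_second"]


for key in ("regular-user", "subscriber", "anonymous"):
    rule = find_rule(key)
    if rule is None:
        print(key, "-> no rule matched")
    else:
        print(key, "-> burst", rule["max_tokens"],
              "sustained", sustained_tokens_per_minute(rule), "tokens/min")
```

With the values in this example, both rules refill at 100 tokens per minute; what sets the subscriber rule apart is its larger bucket capacity (1000 versus 200), which allows a much bigger burst before rate limiting starts.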

Using the kubeconfig of the data plane cluster, run the following command.
```
kubectl apply -f token-limit.yaml
```
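ASM supports custom rate limiting services and provides a default implementation based on Redis exact matching. For more information, see the code repository. If you have other customization requirements, you can develop your own rate limiting service modeled on this example.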

Step 2: Deploy the rate limiting plug-in

Create a file named wasm.yaml with the following content.

```
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: llm-token-ratelimit
  namespace: default
spec:
  failStrategy: FAIL_OPEN
  imagePullPolicy: IfNotPresent
  selector:
    matchLabels:
      app: sleep
  match:
  - mode: CLIENT
    ports:
    - number: 80
  phase: STATS
  pluginConfig:
    matches:
    - host:
        exact: "dashscope.aliyuncs.com"
    rateLimitKeys:
    - "{{request.headers.user-type}}"
    rateLimitService:
      service: asm-llm-token-rate-limit-example.default.svc.cluster.local
      port: 80
  priority: 10
  url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-token-ratelimit:v1.23.6.34-g92d6a4b-aliyun
```

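Some of the configuration items are described below:

Configuration item	Description
.spec.pluginConfig.matches	Matches the requests to which the rate limiting logic applies. Requests that do not match are allowed through directly.
.spec.pluginConfig.rateLimitKeys	The rule for generating the rate limiting key. For more information, see Attributes. This example uses `{{request.headers.user-type}}`.
.spec.pluginConfig.rateLimitService	Information about the rate limiting service. The service field must contain the fully qualified name of the Service.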
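As an illustration of how a rateLimitKeys template might turn into a concrete key, the following Python sketch substitutes `{{request.headers.<name>}}` placeholders with header values. The function `render_key`, the case-insensitive header lookup, and the choice to render a missing header as an empty string are all hypothetical; the actual plug-in supports more attributes and may behave differently.

```python
import re

# Matches placeholders such as {{request.headers.user-type}}.
PLACEHOLDER = re.compile(r"\{\{\s*request\.headers\.([A-Za-z0-9_-]+)\s*\}\}")


def render_key(template, headers):
    """Replace each {{request.headers.NAME}} with the header value (hypothetical helper)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Assumption: a missing header renders as an empty string.
    return PLACEHOLDER.sub(lambda m: lowered.get(m.group(1).lower(), ""), template)


print(render_key("{{request.headers.user-type}}", {"User-Type": "regular-user"}))
# regular-user
```

Under this reading, a request that carries the header user-type: regular-user yields the key regular-user, which the rate limiting service then compares against the rate_limit_key_regex values configured in Step 1.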

Using the kubeconfig of the control plane cluster, run the following command.
```
kubectl apply -f wasm.yaml
```

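Step 3: Test

Using the kubeconfig of the data plane cluster, run each of the following commands several times, first as regular-user and then as subscriber. The prompt in the request body (介紹一下你自己) asks the model to introduce itself.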

```
kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--header "user-type: regular-user" \
--data '{
    "messages": [
        {"role": "user", "content": "介紹一下你自己"}
    ]
}'
```

```
kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--header "user-type: subscriber" \
--data '{
    "messages": [
        {"role": "user", "content": "介紹一下你自己"}
    ]
}'
```

Expected output:

```
{"choices":[{"message":{"role":"assistant","content":"我是來自阿里雲的大規模語言模型，我叫千問。我是一個能夠回答問題、創作文字，還能表達觀點、撰寫代碼的超大規模語言模型。我的知識來自於互連網上的文本，經過多輪迭代和最佳化，我的能力不斷提升，現在已經能夠回答各種問題，比如科技、文化、歷史、娛樂等各類話題的問題，也可以進行連續的對話。如果您有任何問題或需要協助，請隨時告訴我，我會儘力提供支援。"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":95,"total_tokens":105},"created":1735103573,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-7de0bd64-341a-9196-b676-99b5644ec111"}
regular-user is being rate-limited
```

```
{"choices":[{"message":{"role":"assistant","content":"我是來自阿里雲的大規模語言模型，我叫千問。我是一個能夠回答問題、創作文字，還能表達觀點、撰寫代碼的超大規模語言模型。我的知識來自於互連網上的文本，經過多輪迭代和最佳化，我的能力不斷提升，現在已經能夠回答各種問題，比如科技、文化、歷史、娛樂等各類話題的問題，也可以進行連續的對話。如果您有任何問題或需要協助，請隨時告訴我，我會儘力提供支援。"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":95,"total_tokens":105},"created":1735103890,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1b284b71-f850-95f5-9a7e-12678812763c"}
{"choices":[{"message":{"role":"assistant","content":"我是來自阿里雲的大規模語言模型，我叫千問。我是一個能夠回答問題、創作文字，還能表達觀點、撰寫代碼的超大規模語言模型。我的知識來自於阿里雲的海量文本資料，包含各種書籍、文檔、網頁、論文等，旨在使我能夠理解和回答各種主題。如果您有任何問題或需要協助，請隨時告訴我，我會儘力提供支援。"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":81,"total_tokens":91},"created":1735103895,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-0d29d820-c9c5-9e94-9a5a-d054233ed35a"}
subscriber is being rate-limited
```