
Alibaba Cloud Service Mesh:Implement LLM Token Rate Limiting Based on Request Headers

Last Updated: Dec 01, 2025

Service Mesh (ASM) enables you to limit the number of LLM tokens that specific clients can consume, based on request attributes such as TCP attributes, HTTP headers, paths, hosts, and route destinations. This topic describes how to limit the number of tokens consumed during a request based on LLM request headers. When the token limit is reached, the proxy directly returns a rate limiting response instead of forwarding the request to the external service.

Background information

Function overview

The LLM token rate limiting feature in ASM is implemented using WebAssembly (Wasm) plug-ins and consists of two components: a rate limiting plug-in and a rate limiting service. The rate limiting plug-in intercepts requests, extracts the rate limiting key, and then queries the rate limiting service to determine whether to apply rate limiting to the request. During the LLM response phase, the rate limiting service is called again to update the rate limiting record for the specified key.

[Figure: LLM token rate limiting request flow]
Note

As shown in the preceding figure, returning the LLM response and updating the rate limiting record in Step ⑥ are performed asynchronously and do not block each other.

The rate limiting service is maintained by you, and the ASM rate limiting plug-in calls it over a standard HTTP interface. You can choose different rate limiting algorithms, such as token bucket, leaky bucket, or sliding window, to implement specific rate limiting rules for different business scenarios, and you can dynamically adjust the rules based on the load of your backend services. ASM also provides a default rate limiting implementation that uses the token bucket algorithm and relies on a Redis database to store rate limiting records.
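
The exact interface and payload format that the plug-in expects are defined by the default implementation (see the code repository referenced in Step 1). As a rough illustration of the two-phase interaction, the following Go sketch implements a standalone token bucket service with a check phase and an asynchronous update phase. The /check and /report paths, the JSON field names, and the single hard-coded rule are illustrative assumptions, not the actual contract used by the ASM plug-in.

package main

import (
    "encoding/json"
    "net/http"
    "sync"
    "time"
)

// bucket tracks the remaining tokens for one rate limiting key.
type bucket struct {
    tokens     int
    lastRefill time.Time
}

var (
    mu      sync.Mutex
    buckets = map[string]*bucket{}
)

// Hard-coded rule for illustration; the real service reads RATE_LIMIT_CONFIG.
const (
    maxTokens     = 200
    tokensPerFill = 50
    fillInterval  = 30 * time.Second
)

// refill credits the bucket for key with any fill intervals that have elapsed.
func refill(key string) *bucket {
    b, ok := buckets[key]
    if !ok {
        b = &bucket{tokens: maxTokens, lastRefill: time.Now()}
        buckets[key] = b
    }
    intervals := int(time.Since(b.lastRefill) / fillInterval)
    if intervals > 0 {
        b.tokens += intervals * tokensPerFill
        if b.tokens > maxTokens {
            b.tokens = maxTokens
        }
        b.lastRefill = b.lastRefill.Add(time.Duration(intervals) * fillInterval)
    }
    return b
}

func main() {
    // Phase 1: before forwarding a request, the plug-in asks whether the
    // key still has tokens available.
    http.HandleFunc("/check", func(w http.ResponseWriter, r *http.Request) {
        var req struct {
            Key string `json:"key"`
        }
        json.NewDecoder(r.Body).Decode(&req)
        mu.Lock()
        allowed := refill(req.Key).tokens > 0
        mu.Unlock()
        json.NewEncoder(w).Encode(map[string]bool{"allowed": allowed})
    })
    // Phase 2: after the LLM response arrives, the plug-in reports the tokens
    // actually consumed so the bucket can be decremented. As noted above,
    // this call does not block the response to the client.
    http.HandleFunc("/report", func(w http.ResponseWriter, r *http.Request) {
        var req struct {
            Key    string `json:"key"`
            Tokens int    `json:"tokens"`
        }
        json.NewDecoder(r.Body).Decode(&req)
        mu.Lock()
        refill(req.Key).tokens -= req.Tokens
        mu.Unlock()
        w.WriteHeader(http.StatusOK)
    })
    http.ListenAndServe(":8080", nil)
}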

Scenarios

This feature is applicable to the following two scenarios:

  • Clients call external large language model services: External large language model services are typically billed based on the number of tokens consumed by requests. You can use LLM token rate limiting to control client costs effectively.

  • Inference service providers: External clients call the inference services in the cluster. Inference services require a large amount of computing resources. You can use LLM token rate limiting to prevent a single customer from consuming excessive resources in a short period, which can cause services for other users to become unavailable.

Prerequisites

  • A Redis instance is available and reachable from the data plane cluster. The default rate limiting implementation stores rate limiting records in Redis.

  • A test client with a sidecar proxy injected is deployed in the cluster. This example uses the sleep application (a Deployment named sleep).

Example

This example uses the user-type request header to distinguish two kinds of users: regular-user and subscriber. When the Wasm plug-in processes a request, it reads this header, extracts the rate limiting key, and sends the key to the rate limiting service. The rate limiting service then throttles the request based on the configured rate limiting rules. As a result, subscribers can consume more tokens, while regular users can consume only a small number of tokens.

Step 1: Deploy the rate limiting service

  1. Create a file named token-limit.yaml that contains the following content.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: asm-llm-token-rate-limit-example
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: asm-llm-token-rate-limit-example
      labels:
        app: asm-llm-token-rate-limit-example
        service: asm-llm-token-rate-limit-example
    spec:
      ports:
      - name: http
        port: 80
        targetPort: 8080
      selector:
        app: asm-llm-token-rate-limit-example
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: asm-llm-token-rate-limit-example
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: asm-llm-token-rate-limit-example
          version: v1
      template:
        metadata:
          labels:
            app: asm-llm-token-rate-limit-example
            version: v1
          annotations:
            sidecar.istio.io/inject: "true"
        spec:
          tolerations:
          - key: "node.kubernetes.io/disk-pressure"
            operator: "Equal"
            value: ""
            effect: "NoSchedule"
          serviceAccountName: asm-llm-token-rate-limit-example
          containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-token-rate-limit-example:v1.23.6.34-g92d6a4b-aliyun
            imagePullPolicy: IfNotPresent
            name: asm-llm-token-rate-limit-example
            ports:
            - containerPort: 8080
            env:
            - name: REDIS_ADDRESS
              # Replace with the address and port of your Redis instance.
              value: ${redis-address}:${redis-port}
            - name: REDIS_PASSWORD
              # Replace with your Redis username and password.
              value: "${redis-user}:${password}"
            - name: RATE_LIMIT_CONFIG
              value: |
                [
                  {
                    "rate_limit_key_regex": "regular-user.*",
                    "redis_expired_seconds": 300,
                    "fill_interval_second": 30,
                    "tokens_per_fill": 50,
                    "max_tokens": 200
                  },
                  {
                    "rate_limit_key_regex": "subscriber.*",
                    "redis_expired_seconds": 600,
                    "fill_interval_second": 60,
                    "tokens_per_fill": 100,
                    "max_tokens": 1000
                  }
                ]
            resources:
              limits:
                memory: 256Mi
                cpu: 200m
              requests:
                memory: 64Mi
                cpu: 50m

    The preceding YAML file configures the RATE_LIMIT_CONFIG environment variable. The variable works as follows:

    • If the rate limiting key matches the regular expression regular-user.*, the first rule is applied: the record in Redis expires after 300 seconds, the token bucket is refilled with 50 tokens every 30 seconds, and the bucket holds at most 200 tokens.

    • If the rate limiting key matches the regular expression subscriber.*, the second rule is applied: the record in Redis expires after 600 seconds, the token bucket is refilled with 100 tokens every 60 seconds, and the bucket holds at most 1,000 tokens.
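
    To make the refill arithmetic concrete, the following minimal Go sketch replays the regular-user rule. The constants come from RATE_LIMIT_CONFIG above; the refill logic is a sketch of standard token bucket behavior, not the code that the default implementation runs.

    package main

    import "fmt"

    // tokensAfter returns the tokens available after elapsedSeconds, starting
    // from current tokens, under a bucket with the given fill parameters.
    func tokensAfter(elapsedSeconds, fillInterval, tokensPerFill, maxTokens, current int) int {
        refilled := current + (elapsedSeconds/fillInterval)*tokensPerFill
        if refilled > maxTokens {
            return maxTokens
        }
        return refilled
    }

    func main() {
        // A regular user starts with a full bucket of 200 tokens; one request
        // reports total_tokens=105 in its usage field, leaving 95 tokens.
        remaining := 200 - 105
        // After 60 seconds (two 30-second fill intervals of 50 tokens each),
        // the bucket holds min(95+2*50, 200) = 195 tokens.
        fmt.Println(tokensAfter(60, 30, 50, 200, remaining)) // 195
    }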

  2. Use the kubeconfig file of the data plane cluster to run the following command.

    kubectl apply -f token-limit.yaml
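
    You can confirm that the rate limiting service has started before continuing. The label in the selector comes from the Deployment above:

    kubectl get pods -l app=asm-llm-token-rate-limit-example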

    ASM supports custom rate limiting services and provides a default implementation that uses exact key matching and is backed by Redis. For more information, see the code repository. If you have other custom requirements, you can develop your own rate limiting service based on this example.

Step 2: Deploy the rate limiting plug-in

  1. Create a file named wasm.yaml that contains the following content.

    apiVersion: extensions.istio.io/v1alpha1
    kind: WasmPlugin
    metadata:
      name: llm-token-ratelimit
      namespace: default
    spec:
      failStrategy: FAIL_OPEN
      imagePullPolicy: IfNotPresent
      selector:
        matchLabels:
          app: sleep
      match:
      - mode: CLIENT
        ports:
        - number: 80
      phase: STATS
      pluginConfig:
        matches:
        - host:
            exact: "dashscope.aliyuncs.com"
        rateLimitKeys:
        - "{{request.headers.user-type}}"
        rateLimitService:
          service: asm-llm-token-rate-limit-example.default.svc.cluster.local
          port: 80
      priority: 10
      url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-token-ratelimit:v1.23.6.34-g92d6a4b-aliyun

    The following describes some of the configuration items:

    • .spec.pluginConfig.matches: Matches the requests to which the rate limiting logic applies. Requests that do not match are allowed to pass through without rate limiting.

    • .spec.pluginConfig.rateLimitKeys: The rule for generating the rate limiting key. For more information, see Attributes. In this example, the value is {{request.headers.user-type}}.

    • .spec.pluginConfig.rateLimitService: The information about the rate limiting service. You must specify the fully qualified domain name (FQDN) of the Service.
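
    For example, a request that carries the header user-type: subscriber is rendered by the {{request.headers.user-type}} template into the rate limiting key subscriber, which the rules from Step 1 then match by regular expression. The following is a minimal Go sketch of that rule selection; the two-rule list mirrors RATE_LIMIT_CONFIG, and first-match ordering is an assumption:

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        rules := []string{"regular-user.*", "subscriber.*"} // from RATE_LIMIT_CONFIG
        key := "subscriber" // rendered from {{request.headers.user-type}}
        for _, r := range rules {
            if regexp.MustCompile(r).MatchString(key) {
                fmt.Println("applying rule:", r) // applying rule: subscriber.*
                return
            }
        }
        fmt.Println("no rule matched; the request is not rate limited")
    }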

  2. Use the kubeconfig file of the control plane cluster to run the following command.

    kubectl apply -f wasm.yaml
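
    You can confirm that the WasmPlugin resource has been created:

    kubectl get wasmplugin llm-token-ratelimit -n default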

Step 3: Test the feature

Use the kubeconfig file of the data plane cluster to run the following commands multiple times, once with the user-type header set to regular-user and once with it set to subscriber.

kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--header "user-type: regular-user" \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself"}
    ]
}'
kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--header "user-type: subscriber" \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself"}
    ]
}'

Expected output:

{"choices":[{"message":{"role":"assistant","content":"I am a large language model from Alibaba Cloud. My name is Tongyi Qianwen. I am a super-large language model that can answer questions, create text, express opinions, and write code. My knowledge comes from text on the Internet. After multiple iterations and optimizations, my capabilities have continuously improved. I can now answer questions on various topics, such as technology, culture, history, and entertainment, and can also engage in continuous conversations. If you have any questions or need help, feel free to let me know, and I will do my best to provide support."},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":95,"total_tokens":105},"created":1735103573,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-7de0bd64-341a-9196-b676-99b5644ec111"}%
regular-user is being rate-limited
{"choices":[{"message":{"role":"assistant","content":"I am a large language model from Alibaba Cloud. My name is Tongyi Qianwen. I am a super-large language model that can answer questions, create text, express opinions, and write code. My knowledge comes from text on the Internet. After multiple iterations and optimizations, my capabilities have continuously improved. I can now answer questions on various topics, such as technology, culture, history, and entertainment, and can also engage in continuous conversations. If you have any questions or need help, feel free to let me know, and I will do my best to provide support."},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":95,"total_tokens":105},"created":1735103890,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1b284b71-f850-95f5-9a7e-12678812763c"}%
{"choices":[{"message":{"role":"assistant","content":"I am a large language model from Alibaba Cloud. My name is Tongyi Qianwen. I am a super-large language model that can answer questions, create text, express opinions, and write code. My knowledge comes from the massive text data of Alibaba Cloud, including various books, documents, web pages, and papers, which is intended to enable me to understand and answer various topics. If you have any questions or need help, feel free to let me know, and I will do my best to provide support."},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":81,"total_tokens":91},"created":1735103895,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-0d29d820-c9c5-9e94-9a5a-d054233ed35a"}
subscriber is being rate-limited

The output shows that once a user's token bucket is exhausted, the plug-in no longer forwards requests to the LLM service and instead directly returns the rate limiting message. Because the subscriber rule allows a larger bucket (1,000 tokens) and a faster refill rate, subscribers can consume more tokens before being rate limited than regular users.