
Alibaba Cloud Service Mesh:Implement LLM token rate limiting based on request headers

Last Updated: Mar 11, 2026

Service Mesh (ASM) limits the number of tokens that specific clients can consume when calling large language model (LLM) services. Rate limits are enforced based on request attributes such as TCP attributes, HTTP headers, paths, hosts, and route destinations. When a client exceeds its token budget, the sidecar proxy returns a rate-limiting response instead of forwarding the request to the upstream LLM service.

Use cases

  • Control LLM API costs: External LLM services typically bill by token consumption. Token rate limiting caps spending per client or user tier, preventing unexpected cost spikes.

  • Protect shared inference services: When external clients call inference services in your cluster, token rate limiting prevents any single client from monopolizing computing resources and degrading availability for others.

How it works

ASM implements LLM token rate limiting through two components built on WebAssembly (Wasm):

  • Rate limiting plug-in -- A Wasm plug-in deployed as a sidecar filter. It intercepts each outgoing LLM request, extracts a rate limiting key (for example, the value of a user-type header), and queries the rate limiting service to determine whether the request should be allowed or rejected.

  • Rate limiting service -- A backend service that you deploy and maintain, which tracks token consumption and enforces rate limiting rules. The ASM rate limiting plug-in calls this service over standard HTTP interfaces. ASM provides a default implementation that uses the token bucket algorithm with Redis as the storage backend. To implement custom logic, build your own service using any algorithm (token bucket, leaky bucket, or sliding window). You can also adjust the rate limiting rules dynamically based on the load of the backend services.
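The default implementation's token bucket behavior can be sketched in a few lines. The following is an illustrative in-memory simplification in Python, not the actual ASM service (which keeps bucket state in Redis); the class and method names are invented for this example:

```python
import time

class TokenBucket:
    """In-memory sketch of one bucket; the ASM default service keeps this state in Redis."""

    def __init__(self, max_tokens, tokens_per_fill, fill_interval_second):
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval = fill_interval_second
        self.tokens = max_tokens           # the bucket starts full
        self.last_fill = time.monotonic()

    def _refill(self):
        # Add tokens_per_fill for every full interval elapsed, capped at max_tokens.
        elapsed = time.monotonic() - self.last_fill
        intervals = int(elapsed // self.fill_interval)
        if intervals > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + intervals * self.tokens_per_fill)
            self.last_fill += intervals * self.fill_interval

    def allow(self):
        """Pre-request check: allow the request through while budget remains."""
        self._refill()
        return self.tokens > 0

    def consume(self, used_tokens):
        """Post-response accounting: deduct the token usage reported by the LLM."""
        self._refill()
        self.tokens = max(0, self.tokens - used_tokens)
```

A client is allowed through while its bucket holds tokens; the tokens actually consumed are only known after the LLM responds, which is why the deduction happens after the fact (see the request flow below).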

Architecture diagram

Request flow:

  1. A client sends an LLM request through the sidecar proxy.

  2. The Wasm plug-in extracts the rate limiting key from the request header.

  3. The plug-in queries the rate limiting service to check whether the key has exceeded its token budget.

  4. If budget remains, the request is forwarded to the LLM service. If the budget is exceeded, the proxy returns a rate-limiting response (for example, "regular-user is being rate-limited").

  5. The LLM service processes the request and returns a response that includes token usage data in the usage field (for example, prompt_tokens, completion_tokens, and total_tokens).

  6. The plug-in reports the consumed tokens back to the rate limiting service to update the record.

Note

Step 6 runs asynchronously and does not block the LLM response from reaching the client.
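The token counts reported in step 6 come straight from the response body in step 5. Given an OpenAI-compatible response (the JSON below is a made-up illustration), the accounting amounts to reading the usage field:

```python
import json

# Made-up example of an OpenAI-compatible LLM response body (step 5).
response_body = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20},
})

# Step 6: extract the usage data that is reported to the rate limiting service.
usage = json.loads(response_body).get("usage", {})
consumed = usage.get("total_tokens", 0)  # 20 in this example
```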

Prerequisites

Before you begin, make sure that you have:

Step 1: Deploy the rate limiting service

This step deploys a rate limiting service that uses the token bucket algorithm to enforce per-user token budgets. The example defines two user tiers -- regular-user and subscriber -- with different token allowances.

  1. Create a file named token-limit.yaml with the following content:


    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: asm-llm-token-rate-limit-example
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: asm-llm-token-rate-limit-example
      labels:
        app: asm-llm-token-rate-limit-example
        service: asm-llm-token-rate-limit-example
    spec:
      ports:
      - name: http
        port: 80
        targetPort: 8080
      selector:
        app: asm-llm-token-rate-limit-example
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: asm-llm-token-rate-limit-example
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: asm-llm-token-rate-limit-example
          version: v1
      template:
        metadata:
          labels:
            app: asm-llm-token-rate-limit-example
            version: v1
          annotations:
            sidecar.istio.io/inject: "true"
        spec:
          tolerations:
          - key: "node.kubernetes.io/disk-pressure"
            operator: "Equal"
            value: ""
            effect: "NoSchedule"
          serviceAccountName: asm-llm-token-rate-limit-example
          containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-token-rate-limit-example:v1.23.6.34-g92d6a4b-aliyun
            imagePullPolicy: IfNotPresent
            name: asm-llm-token-rate-limit-example
            ports:
            - containerPort: 8080
            env:
            # Redis connection: replace with your Redis endpoint and port
            - name: REDIS_ADDRESS
              value: ${redis-address}:${redis-port}
            # Redis credentials: replace with your Redis username and password
            - name: REDIS_PASSWORD
              value: "${redis-user}:${password}"
            # Rate limiting rules: one entry per user tier
            - name: RATE_LIMIT_CONFIG
              value: |
                [
                  {
                    "rate_limit_key_regex": "regular-user.*",
                    "redis_expired_seconds": 300,
                    "fill_interval_second": 30,
                    "tokens_per_fill": 50,
                    "max_tokens": 200
                  },
                  {
                    "rate_limit_key_regex": "subscriber.*",
                    "redis_expired_seconds": 600,
                    "fill_interval_second": 60,
                    "tokens_per_fill": 100,
                    "max_tokens": 1000
                  }
                ]
            resources:
              limits:
                memory: 256Mi
                cpu: 200m
              requests:
                memory: 64Mi
                cpu: 50m

    Replace the following placeholders with your actual values:

    Placeholder      | Description        | Example
    ${redis-address} | Redis host address | r-bp1xxxxxx.redis.rds.aliyuncs.com
    ${redis-port}    | Redis port number  | 6379
    ${redis-user}    | Redis username     | default
    ${password}      | Redis password     | MyP@ssw0rd

    The RATE_LIMIT_CONFIG environment variable defines token bucket rules. Each rule matches a rate limiting key by regex and applies a separate bucket:

    Parameter             | Description
    rate_limit_key_regex  | Regex pattern to match the rate limiting key extracted from the request header.
    max_tokens            | Maximum capacity of the token bucket. Requests that would exceed this limit are rejected.
    tokens_per_fill       | Number of tokens added to the bucket at each refill interval.
    fill_interval_second  | Time interval (in seconds) between token refills.
    redis_expired_seconds | Time-to-live (TTL) for the rate limiting record in Redis. After this period, the record expires and the bucket resets.

    The following table shows the rate limiting rules configured in this example:

    Rule         | Key regex      | Bucket capacity | Refill rate          | Redis TTL
    Regular user | regular-user.* | 200 tokens      | 50 tokens every 30s  | 300s
    Subscriber   | subscriber.*   | 1,000 tokens    | 100 tokens every 60s | 600s

    Subscribers get a bucket five times larger (1,000 vs. 200 tokens), which allows much larger bursts of LLM usage. Note that both tiers refill at the same sustained rate of 100 tokens per minute (50 tokens every 30s and 100 tokens every 60s); the subscriber advantage is burst capacity, not sustained throughput.

  2. Apply the configuration using the kubeconfig of the data plane cluster:

    kubectl apply -f token-limit.yaml

Note

This is the default rate limiting implementation provided by ASM. For custom requirements such as different algorithms or storage backends, see the source code on GitHub.
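The rules in RATE_LIMIT_CONFIG can be sanity-checked offline. The following Python sketch (illustrative only, not part of the ASM service; the helper function is invented for this example) parses the example rules, selects a bucket by regex, and computes a tier's sustained refill rate:

```python
import json
import re

# The example RATE_LIMIT_CONFIG from token-limit.yaml (values copied verbatim).
RATE_LIMIT_CONFIG = json.loads("""
[
  {"rate_limit_key_regex": "regular-user.*", "redis_expired_seconds": 300,
   "fill_interval_second": 30, "tokens_per_fill": 50, "max_tokens": 200},
  {"rate_limit_key_regex": "subscriber.*", "redis_expired_seconds": 600,
   "fill_interval_second": 60, "tokens_per_fill": 100, "max_tokens": 1000}
]
""")

def matching_rule(key):
    """Return the first rule whose regex matches the rate limiting key."""
    for rule in RATE_LIMIT_CONFIG:
        if re.match(rule["rate_limit_key_regex"], key):
            return rule
    return None  # unmatched keys get no bucket

rule = matching_rule("regular-user")
burst = rule["max_tokens"]  # 200 tokens of burst capacity
# Sustained rate in tokens per minute: 50 * 60 / 30 == 100.0 for this tier.
rate = rule["tokens_per_fill"] * 60 / rule["fill_interval_second"]
```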

Step 2: Deploy the Wasm plug-in

This step deploys a WasmPlugin that configures the sidecar proxy to intercept LLM requests, extract the rate limiting key from the user-type header, and check it against the rate limiting service.

  1. Create a file named wasm.yaml with the following content:

    apiVersion: extensions.istio.io/v1alpha1
    kind: WasmPlugin
    metadata:
      name: llm-token-ratelimit
      namespace: default
    spec:
      failStrategy: FAIL_OPEN
      imagePullPolicy: IfNotPresent
      selector:
        matchLabels:
          app: sleep
      match:
      - mode: CLIENT
        ports:
        - number: 80
      phase: STATS
      pluginConfig:
        matches:
        - host:
            exact: "dashscope.aliyuncs.com"
        rateLimitKeys:
        - "{{request.headers.user-type}}"
        rateLimitService:
          service: asm-llm-token-rate-limit-example.default.svc.cluster.local
          port: 80
      priority: 10
      url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-token-ratelimit:v1.23.6.34-g92d6a4b-aliyun

    The following table explains the key configuration fields:

    Field                               | Description
    .spec.pluginConfig.matches          | Defines which requests trigger rate limiting. Unmatched requests pass through without rate limit checks.
    .spec.pluginConfig.rateLimitKeys    | Specifies how to extract the rate limiting key, using Envoy request attribute syntax. In this example, {{request.headers.user-type}} extracts the value of the user-type header.
    .spec.pluginConfig.rateLimitService | Specifies the rate limiting service endpoint. Provide the fully qualified domain name (FQDN) of the Kubernetes Service.

  2. Apply the configuration using the kubeconfig of the control plane cluster:

    kubectl apply -f wasm.yaml
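Putting the matches and rateLimitKeys fields together, the plug-in's request selection can be approximated as follows. This is a simplified Python sketch with an invented function name; the real matching is performed by the Wasm filter inside the Envoy sidecar:

```python
def rate_limit_key(host, headers, match_host="dashscope.aliyuncs.com"):
    """Approximate the pluginConfig above: exact host match, then key extraction."""
    if host != match_host:
        return None  # request does not match; it bypasses rate limiting
    # rateLimitKeys: "{{request.headers.user-type}}"
    return headers.get("user-type")
```

A request to dashscope.aliyuncs.com with the header `user-type: subscriber` yields the key "subscriber", while a request to any other host returns no key and is not rate-limited.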

Step 3: Verify the configuration

Send test requests as both user types to confirm that rate limiting works correctly.

Run each of the following commands multiple times using the kubeconfig of the data plane cluster.

As a regular user:

kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header "user-type: regular-user" \
  --data '{
      "messages": [
          {"role": "user", "content": "Please introduce yourself"}
      ]
  }'

As a subscriber:

kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header "user-type: subscriber" \
  --data '{
      "messages": [
          {"role": "user", "content": "Please introduce yourself"}
      ]
  }'

Expected behavior:

  • Initial requests from both user types return a normal LLM response containing a usage field with token counts.

  • After several requests, regular-user hits the rate limit first and receives "regular-user is being rate-limited".

  • subscriber can send more requests before being rate-limited, confirming the higher token budget.

This validates that the rate limiting service differentiates between user tiers and enforces the configured token budgets.
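The expected difference between the two tiers follows directly from the bucket sizes. The following rough simulation (illustrative only; the figure of ~60 tokens per response is an assumption, and real refills between requests would raise both counts) shows why regular-user is limited sooner:

```python
def requests_before_limit(max_tokens, tokens_per_request):
    """Count requests allowed from a full bucket, assuming no refills in between."""
    budget, count = max_tokens, 0
    while budget > 0:                  # allowed while any budget remains
        count += 1
        budget -= tokens_per_request   # usage reported after each response
    return count

# Assuming ~60 tokens per response: the regular user (200-token bucket) is
# limited after far fewer requests than the subscriber (1,000-token bucket).
regular = requests_before_limit(200, 60)      # 4 requests
subscriber = requests_before_limit(1000, 60)  # 17 requests
```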

What's next

  • Customize rate limiting rules: Adjust max_tokens, tokens_per_fill, and fill_interval_second in the RATE_LIMIT_CONFIG to match your production traffic patterns.

  • Build a custom rate limiting service: Fork the example implementation to implement custom algorithms (leaky bucket, sliding window) or use a different storage backend.

  • Extend to more attributes: Change rateLimitKeys to extract keys from other request attributes such as paths, hosts, or TCP attributes. See Envoy attributes for available options.