
Alibaba Cloud Service Mesh:Implement LLM token rate limiting based on request headers

Last Updated: Mar 11, 2026

Service Mesh (ASM) limits the number of tokens that specific clients can consume when calling large language model (LLM) services. Rate limits are enforced based on request attributes such as TCP attributes, HTTP headers, paths, hosts, and route destinations. When a client exceeds its token budget, the sidecar proxy returns a rate-limiting response instead of forwarding the request to the upstream LLM service.

Use cases

  • Control LLM API costs: External LLM services typically bill by token consumption. Token rate limiting caps spending per client or user tier, preventing unexpected cost spikes.

  • Protect shared inference services: When external clients call inference services in your cluster, token rate limiting prevents any single client from monopolizing computing resources and degrading availability for others.

How it works

ASM implements LLM token rate limiting through two components built on WebAssembly (Wasm):

  • Rate limiting plug-in -- A Wasm plug-in deployed as a sidecar filter. It intercepts each outgoing LLM request, extracts a rate limiting key (for example, the value of a user-type header), and queries the rate limiting service to determine whether the request should be allowed or rejected.

  • Rate limiting service -- A backend service that you deploy and maintain, which tracks token consumption and enforces rate limiting rules. The ASM rate limiting plug-in calls this service over standard HTTP interfaces. ASM provides a default implementation that uses the token bucket algorithm with Redis as the storage backend. To implement custom logic, build your own service using any algorithm (token bucket, leaky bucket, or sliding window). You can also adjust the rate limiting rules dynamically based on the load of the backend services.
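The default implementation's token bucket behavior can be sketched in a few lines. The following is an illustrative in-memory simplification in Python, not the actual ASM service (which keeps bucket state in Redis); the class and method names are invented for this example:

```python
import time

class TokenBucket:
    """In-memory sketch of one bucket; the ASM default service keeps this state in Redis."""

    def __init__(self, max_tokens, tokens_per_fill, fill_interval_second):
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval = fill_interval_second
        self.tokens = max_tokens           # the bucket starts full
        self.last_fill = time.monotonic()

    def _refill(self):
        # Add tokens_per_fill for every full interval elapsed, capped at max_tokens.
        elapsed = time.monotonic() - self.last_fill
        intervals = int(elapsed // self.fill_interval)
        if intervals > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + intervals * self.tokens_per_fill)
            self.last_fill += intervals * self.fill_interval

    def allow(self):
        """Pre-request check: allow the request through while budget remains."""
        self._refill()
        return self.tokens > 0

    def consume(self, used_tokens):
        """Post-response accounting: deduct the token usage reported by the LLM."""
        self._refill()
        self.tokens = max(0, self.tokens - used_tokens)
```

A client is allowed through while its bucket holds tokens; the tokens actually consumed are only known after the LLM responds, which is why the deduction happens after the fact (see the request flow below).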

Architecture diagram

Request flow:

  1. A client sends an LLM request through the sidecar proxy.

  2. The Wasm plug-in extracts the rate limiting key from the request header.

  3. The plug-in queries the rate limiting service to check whether the key has exceeded its token budget.

  4. If budget remains, the request is forwarded to the LLM service. If the budget is exceeded, the proxy returns a rate-limiting response (for example, "regular-user is being rate-limited").

  5. The LLM service processes the request and returns a response that includes token usage data in the usage field (for example, prompt_tokens, completion_tokens, and total_tokens).

  6. The plug-in reports the consumed tokens back to the rate limiting service to update the record.

Note

Step 6 runs asynchronously and does not block the LLM response from reaching the client.
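The token counts reported in step 6 come straight from the response body in step 5. Given an OpenAI-compatible response (the JSON below is a made-up illustration), the accounting amounts to reading the usage field:

```python
import json

# Made-up example of an OpenAI-compatible LLM response body (step 5).
response_body = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20},
})

# Step 6: extract the usage data that is reported to the rate limiting service.
usage = json.loads(response_body).get("usage", {})
consumed = usage.get("total_tokens", 0)  # 20 in this example
```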

Prerequisites

Before you begin, make sure that you have:

Step 1: Deploy the rate limiting service

This step deploys a rate limiting service that uses the token bucket algorithm to enforce per-user token budgets. The example defines two user tiers -- regular-user and subscriber -- with different token allowances.

  1. Create a file named token-limit.yaml with the following content:


    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: asm-llm-token-rate-limit-example
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: asm-llm-token-rate-limit-example
      labels:
        app: asm-llm-token-rate-limit-example
        service: asm-llm-token-rate-limit-example
    spec:
      ports:
      - name: http
        port: 80
        targetPort: 8080
      selector:
        app: asm-llm-token-rate-limit-example
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: asm-llm-token-rate-limit-example
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: asm-llm-token-rate-limit-example
          version: v1
      template:
        metadata:
          labels:
            app: asm-llm-token-rate-limit-example
            version: v1
          annotations:
            sidecar.istio.io/inject: "true"
        spec:
          tolerations:
          - key: "node.kubernetes.io/disk-pressure"
            operator: "Equal"
            value: ""
            effect: "NoSchedule"
          serviceAccountName: asm-llm-token-rate-limit-example
          containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-token-rate-limit-example:v1.23.6.34-g92d6a4b-aliyun
            imagePullPolicy: IfNotPresent
            name: asm-llm-token-rate-limit-example
            ports:
            - containerPort: 8080
            env:
            # Redis connection: replace with your Redis endpoint and port
            - name: REDIS_ADDRESS
              value: ${redis-address}:${redis-port}
            # Redis credentials: replace with your Redis username and password
            - name: REDIS_PASSWORD
              value: "${redis-user}:${password}"
            # Rate limiting rules: one entry per user tier
            - name: RATE_LIMIT_CONFIG
              value: |
                [
                  {
                    "rate_limit_key_regex": "regular-user.*",
                    "redis_expired_seconds": 300,
                    "fill_interval_second": 30,
                    "tokens_per_fill": 50,
                    "max_tokens": 200
                  },
                  {
                    "rate_limit_key_regex": "subscriber.*",
                    "redis_expired_seconds": 600,
                    "fill_interval_second": 60,
                    "tokens_per_fill": 100,
                    "max_tokens": 1000
                  }
                ]
            resources:
              limits:
                memory: 256Mi
                cpu: 200m
              requests:
                memory: 64Mi
                cpu: 50m

    Replace the following placeholders with your actual values:

    Placeholder      | Description        | Example
    ${redis-address} | Redis host address | r-bp1xxxxxx.redis.rds.aliyuncs.com
    ${redis-port}    | Redis port number  | 6379
    ${redis-user}    | Redis username     | default
    ${password}      | Redis password     | MyP@ssw0rd

    The RATE_LIMIT_CONFIG environment variable defines token bucket rules. Each rule matches a rate limiting key by regex and applies a separate bucket:

    Parameter             | Description
    rate_limit_key_regex  | Regex pattern to match the rate limiting key extracted from the request header.
    max_tokens            | Maximum capacity of the token bucket. Requests that would exceed this limit are rejected.
    tokens_per_fill       | Number of tokens added to the bucket at each refill interval.
    fill_interval_second  | Time interval (in seconds) between token refills.
    redis_expired_seconds | Time-to-live (TTL) for the rate limiting record in Redis. After this period, the record expires and the bucket resets.

    The following table shows the rate limiting rules configured in this example:

    Rule         | Key regex      | Bucket capacity | Refill rate          | Redis TTL
    Regular user | regular-user.* | 200 tokens      | 50 tokens every 30s  | 300s
    Subscriber   | subscriber.*   | 1,000 tokens    | 100 tokens every 60s | 600s

    Subscribers get a bucket five times larger (1,000 vs. 200 tokens), which allows much larger bursts of LLM usage. Note that both tiers refill at the same sustained rate of 100 tokens per minute (50 tokens every 30s and 100 tokens every 60s); the subscriber advantage is burst capacity, not sustained throughput.

  2. Apply the configuration using the kubeconfig of the data plane cluster:

    kubectl apply -f token-limit.yaml

Note

This is the default rate limiting implementation provided by ASM. For custom requirements such as different algorithms or storage backends, see the source code on GitHub.
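The rules in RATE_LIMIT_CONFIG can be sanity-checked offline. The following Python sketch (illustrative only, not part of the ASM service; the helper function is invented for this example) parses the example rules, selects a bucket by regex, and computes a tier's sustained refill rate:

```python
import json
import re

# The example RATE_LIMIT_CONFIG from token-limit.yaml (values copied verbatim).
RATE_LIMIT_CONFIG = json.loads("""
[
  {"rate_limit_key_regex": "regular-user.*", "redis_expired_seconds": 300,
   "fill_interval_second": 30, "tokens_per_fill": 50, "max_tokens": 200},
  {"rate_limit_key_regex": "subscriber.*", "redis_expired_seconds": 600,
   "fill_interval_second": 60, "tokens_per_fill": 100, "max_tokens": 1000}
]
""")

def matching_rule(key):
    """Return the first rule whose regex matches the rate limiting key."""
    for rule in RATE_LIMIT_CONFIG:
        if re.match(rule["rate_limit_key_regex"], key):
            return rule
    return None  # unmatched keys get no bucket

rule = matching_rule("regular-user")
burst = rule["max_tokens"]  # 200 tokens of burst capacity
# Sustained rate in tokens per minute: 50 * 60 / 30 == 100.0 for this tier.
rate = rule["tokens_per_fill"] * 60 / rule["fill_interval_second"]
```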

Step 2: Deploy the Wasm plug-in

This step deploys a WasmPlugin that configures the sidecar proxy to intercept LLM requests, extract the rate limiting key from the user-type header, and check it against the rate limiting service.

  1. Create a file named wasm.yaml with the following content:

    apiVersion: extensions.istio.io/v1alpha1
    kind: WasmPlugin
    metadata:
      name: llm-token-ratelimit
      namespace: default
    spec:
      failStrategy: FAIL_OPEN
      imagePullPolicy: IfNotPresent
      selector:
        matchLabels:
          app: sleep
      match:
      - mode: CLIENT
        ports:
        - number: 80
      phase: STATS
      pluginConfig:
        matches:
        - host:
            exact: "dashscope.aliyuncs.com"
        rateLimitKeys:
        - "{{request.headers.user-type}}"
        rateLimitService:
          service: asm-llm-token-rate-limit-example.default.svc.cluster.local
          port: 80
      priority: 10
      url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-token-ratelimit:v1.23.6.34-g92d6a4b-aliyun

    The following table explains the key configuration fields:

    Field                               | Description
    .spec.pluginConfig.matches          | Defines which requests trigger rate limiting. Unmatched requests pass through without rate limit checks.
    .spec.pluginConfig.rateLimitKeys    | Specifies how to extract the rate limiting key, using Envoy request attribute syntax. In this example, {{request.headers.user-type}} extracts the value of the user-type header.
    .spec.pluginConfig.rateLimitService | Specifies the rate limiting service endpoint. Provide the fully qualified domain name (FQDN) of the Kubernetes Service.

  2. Apply the configuration using the kubeconfig of the control plane cluster:

    kubectl apply -f wasm.yaml
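Putting the matches and rateLimitKeys fields together, the plug-in's request selection can be approximated as follows. This is a simplified Python sketch with an invented function name; the real matching is performed by the Wasm filter inside the Envoy sidecar:

```python
def rate_limit_key(host, headers, match_host="dashscope.aliyuncs.com"):
    """Approximate the pluginConfig above: exact host match, then key extraction."""
    if host != match_host:
        return None  # request does not match; it bypasses rate limiting
    # rateLimitKeys: "{{request.headers.user-type}}"
    return headers.get("user-type")
```

A request to dashscope.aliyuncs.com with the header `user-type: subscriber` yields the key "subscriber", while a request to any other host returns no key and is not rate-limited.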

Step 3: Verify the configuration

Send test requests as both user types to confirm that rate limiting works correctly.

Run each of the following commands multiple times using the kubeconfig of the data plane cluster.

As a regular user:

kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header "user-type: regular-user" \
  --data '{
      "messages": [
          {"role": "user", "content": "Please introduce yourself"}
      ]
  }'

As a subscriber:

kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header "user-type: subscriber" \
  --data '{
      "messages": [
          {"role": "user", "content": "Please introduce yourself"}
      ]
  }'

Expected behavior:

  • Initial requests from both user types return a normal LLM response containing a usage field with token counts.

  • After several requests, regular-user hits the rate limit first and receives "regular-user is being rate-limited".

  • subscriber can send more requests before being rate-limited, confirming the higher token budget.

This validates that the rate limiting service differentiates between user tiers and enforces the configured token budgets.
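The expected difference between the two tiers follows directly from the bucket sizes. The following rough simulation (illustrative only; the figure of ~60 tokens per response is an assumption, and real refills between requests would raise both counts) shows why regular-user is limited sooner:

```python
def requests_before_limit(max_tokens, tokens_per_request):
    """Count requests allowed from a full bucket, assuming no refills in between."""
    budget, count = max_tokens, 0
    while budget > 0:                  # allowed while any budget remains
        count += 1
        budget -= tokens_per_request   # usage reported after each response
    return count

# Assuming ~60 tokens per response: the regular user (200-token bucket) is
# limited after far fewer requests than the subscriber (1,000-token bucket).
regular = requests_before_limit(200, 60)      # 4 requests
subscriber = requests_before_limit(1000, 60)  # 17 requests
```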

What's next

  • Customize rate limiting rules: Adjust max_tokens, tokens_per_fill, and fill_interval_second in the RATE_LIMIT_CONFIG to match your production traffic patterns.

  • Build a custom rate limiting service: Fork the example implementation to implement custom algorithms (leaky bucket, sliding window) or use a different storage backend.

  • Extend to more attributes: Change rateLimitKeys to extract keys from other request attributes such as paths, hosts, or TCP attributes. See Envoy attributes for available options.