Service Mesh (ASM) limits the number of tokens that specific clients can consume when calling large language model (LLM) services. Rate limits are enforced based on request attributes such as TCP attributes, HTTP headers, paths, hosts, and route destinations. When a client exceeds its token budget, the sidecar proxy returns a rate-limiting response instead of forwarding the request to the upstream LLM service.
Use cases
Control LLM API costs: External LLM services typically bill by token consumption. Token rate limiting caps spending per client or user tier, preventing unexpected cost spikes.
Protect shared inference services: When external clients call inference services in your cluster, token rate limiting prevents any single client from monopolizing computing resources and degrading availability for others.
How it works
ASM implements LLM token rate limiting through two components: a WebAssembly (Wasm) plug-in that runs in the sidecar proxy, and a rate limiting service that the plug-in calls.
Rate limiting plug-in -- A Wasm plug-in deployed as a sidecar filter. It intercepts each outgoing LLM request, extracts a rate limiting key (for example, the value of a user-type header), and queries the rate limiting service to determine whether the request should be allowed or rejected.
Rate limiting service -- A backend service that you deploy and maintain. It tracks token consumption and enforces rate limiting rules. The ASM rate limiting plug-in calls this service over standard HTTP interfaces. ASM provides a default implementation that uses the token bucket algorithm with Redis as the storage backend. To implement custom logic, build your own service using any algorithm (token bucket, leaky bucket, or sliding window). You can also dynamically adjust the rate limiting rules based on the load of the backend services.
Request flow:
1. A client sends an LLM request through the sidecar proxy.
2. The Wasm plug-in extracts the rate limiting key from the request header.
3. The plug-in queries the rate limiting service to check whether the key has exceeded its token budget.
4. If budget remains, the request is forwarded to the LLM service. If the budget is exceeded, the proxy returns a rate-limiting response (for example, regular-user is being rate-limited).
5. The LLM service processes the request and returns a response that includes token usage data in the usage field (for example, prompt_tokens, completion_tokens, and total_tokens).
6. The plug-in reports the consumed tokens back to the rate limiting service to update the record.
Step 6 runs asynchronously and does not block the LLM response from reaching the client.
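The request flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the plug-in's implementation: InMemoryLimiter, handle_request, and call_llm are hypothetical names, the budget is a plain in-memory counter rather than a Redis-backed token bucket, and step 6 runs synchronously here for simplicity.

```python
from typing import Any, Callable, Dict, Union


class InMemoryLimiter:
    """Hypothetical stand-in for the rate limiting service (not the ASM API)."""

    def __init__(self, budgets: Dict[str, int]) -> None:
        self.remaining = dict(budgets)

    def allow(self, key: str) -> bool:
        # Step 3: the key passes only while it still has budget left.
        return self.remaining.get(key, 0) > 0

    def report(self, key: str, tokens: int) -> None:
        # Step 6: deduct the tokens that the LLM response reported.
        self.remaining[key] = self.remaining.get(key, 0) - tokens


def handle_request(
    headers: Dict[str, str],
    limiter: InMemoryLimiter,
    call_llm: Callable[[], Dict[str, Any]],
) -> Union[str, Dict[str, Any]]:
    key = headers.get("user-type", "")      # Step 2: extract the key
    if not limiter.allow(key):              # Steps 3-4: check the budget
        return f"{key} is being rate-limited"
    response = call_llm()                   # Step 5: forward to the LLM
    used = response.get("usage", {}).get("total_tokens", 0)
    limiter.report(key, used)               # Step 6: report consumption
    return response
```

Because the token count is known only after the LLM responds, a check of this kind can let one request overshoot the remaining budget. Whether the default service behaves the same way is an assumption of this sketch.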
Prerequisites
Before you begin, make sure that you have:
An ASM instance (version 1.23 or later) with a cluster added
An LLMProvider and related resources, deployed by completing Steps 1 and 2 in Traffic routing: Use ASM to efficiently manage LLM traffic
A Redis service accessible from the cluster. To create a managed instance, see Tair (Redis OSS-compatible) Quick Start
Step 1: Deploy the rate limiting service
This step deploys a rate limiting service that uses the token bucket algorithm to enforce per-user token budgets. The example defines two user tiers -- regular-user and subscriber -- with different token allowances.
Create a file named token-limit.yaml with the following content:

Replace the following placeholders with your actual values:

| Placeholder | Description | Example |
| --- | --- | --- |
| ${redis-address} | Redis host address | r-bp1xxxxxx.redis.rds.aliyuncs.com |
| ${redis-port} | Redis port number | 6379 |
| ${redis-user} | Redis username | default |
| ${password} | Redis password | MyP@ssw0rd |

The RATE_LIMIT_CONFIG environment variable defines token bucket rules. Each rule matches a rate limiting key by regex and applies a separate bucket:

| Parameter | Description |
| --- | --- |
| rate_limit_key_regex | Regex pattern to match the rate limiting key extracted from the request header. |
| max_tokens | Maximum capacity of the token bucket. Requests that would exceed this limit are rejected. |
| tokens_per_fill | Number of tokens added to the bucket at each refill interval. |
| fill_interval_second | Time interval (in seconds) between token refills. |
| redis_expired_seconds | Time-to-live (TTL) for the rate limiting record in Redis. After this period, the record expires and the bucket resets. |

The following table shows the rate limiting rules configured in this example:

| Rule | Key regex | Bucket capacity | Refill rate | Redis TTL |
| --- | --- | --- | --- | --- |
| Regular user | regular-user.* | 200 tokens | 50 tokens every 30s | 300s |
| Subscriber | subscriber.* | 1,000 tokens | 100 tokens every 60s | 600s |

Subscribers get a 5x larger bucket, which allows much larger bursts of LLM usage. With these example values, both tiers refill at the same average rate (50 tokens every 30s and 100 tokens every 60s both equal 100 tokens per minute). If subscribers also need a higher sustained rate, raise tokens_per_fill or shorten fill_interval_second for that rule.
Apply the configuration using the kubeconfig of the data plane cluster:
kubectl apply -f token-limit.yaml
This is the default rate limiting implementation provided by ASM. For custom requirements such as different algorithms or storage backends, see the source code on GitHub.
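The token bucket rules described in this step can be modeled with a short Python sketch. It is an illustration only, not the service's source code: Rule, find_rule, TokenBucket, and sustained_tokens_per_minute are hypothetical names, state is kept in memory instead of Redis (so redis_expired_seconds is omitted), and the regex is assumed to match the whole key.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    rate_limit_key_regex: str
    max_tokens: int
    tokens_per_fill: int
    fill_interval_second: int


# Values taken from the example rules in this step.
RULES: List[Rule] = [
    Rule(r"regular-user.*", 200, 50, 30),
    Rule(r"subscriber.*", 1000, 100, 60),
]


def find_rule(key: str, rules: List[Rule] = RULES) -> Optional[Rule]:
    """Return the first rule whose regex matches the key (assumed full match)."""
    for rule in rules:
        if re.fullmatch(rule.rate_limit_key_regex, key):
            return rule
    return None


class TokenBucket:
    def __init__(self, rule: Rule, now: float = 0.0) -> None:
        self.rule = rule
        self.tokens = float(rule.max_tokens)  # the bucket starts full
        self.last_fill = now

    def refill(self, now: float) -> None:
        # Add tokens_per_fill for every whole interval elapsed, capped at max_tokens.
        intervals = int((now - self.last_fill) // self.rule.fill_interval_second)
        if intervals > 0:
            added = intervals * self.rule.tokens_per_fill
            self.tokens = min(self.rule.max_tokens, self.tokens + added)
            self.last_fill += intervals * self.rule.fill_interval_second

    def allow(self, now: float) -> bool:
        self.refill(now)
        return self.tokens > 0

    def consume(self, tokens: int, now: float) -> None:
        self.refill(now)
        self.tokens -= tokens


def sustained_tokens_per_minute(rule: Rule) -> float:
    """Average refill rate, independent of bucket capacity."""
    return rule.tokens_per_fill * 60 / rule.fill_interval_second
```

Passing the time explicitly (the now argument) keeps the sketch deterministic, which makes it easy to try different values of max_tokens, tokens_per_fill, and fill_interval_second before changing the real configuration.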
Step 2: Deploy the Wasm plug-in
This step deploys a WasmPlugin that configures the sidecar proxy to intercept LLM requests, extract the rate limiting key from the user-type header, and check it against the rate limiting service.
Create a file named wasm.yaml with the following content:

apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: llm-token-ratelimit
  namespace: default
spec:
  failStrategy: FAIL_OPEN
  imagePullPolicy: IfNotPresent
  selector:
    matchLabels:
      app: sleep
  match:
  - mode: CLIENT
    ports:
    - number: 80
  phase: STATS
  pluginConfig:
    matches:
    - host:
        exact: "dashscope.aliyuncs.com"
    rateLimitKeys:
    - "{{request.headers.user-type}}"
    rateLimitService:
      service: asm-llm-token-rate-limit-example.default.svc.cluster.local
      port: 80
  priority: 10
  url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-token-ratelimit:v1.23.6.34-g92d6a4b-aliyun

The following table explains the key configuration fields:

| Field | Description |
| --- | --- |
| .spec.pluginConfig.matches | Defines which requests trigger rate limiting. Unmatched requests pass through without rate limit checks. |
| .spec.pluginConfig.rateLimitKeys | Specifies how to extract the rate limiting key. Uses Envoy request attributes syntax. In this example, {{request.headers.user-type}} extracts the value of the user-type header. |
| .spec.pluginConfig.rateLimitService | Specifies the rate limiting service endpoint. Provide the fully qualified domain name (FQDN) of the Kubernetes Service. |

Apply the configuration using the kubeconfig of the control plane cluster:
kubectl apply -f wasm.yaml
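To make the rateLimitKeys template more concrete, the following Python sketch shows how a template such as {{request.headers.user-type}} could resolve to a rate limiting key. It is a simplified model, not the plug-in's code: render_key is a hypothetical helper, only the request.headers.<name> form is handled, and returning an empty string for a missing header is an assumption.

```python
import re
from typing import Dict

# Matches placeholders of the form {{request.headers.<header-name>}}.
_HEADER_TEMPLATE = re.compile(r"\{\{\s*request\.headers\.([A-Za-z0-9_-]+)\s*\}\}")


def render_key(template: str, headers: Dict[str, str]) -> str:
    """Substitute header placeholders; header names are compared case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return _HEADER_TEMPLATE.sub(
        lambda match: lowered.get(match.group(1).lower(), ""),
        template,
    )
```

For example, render_key("{{request.headers.user-type}}", {"User-Type": "subscriber"}) yields the key subscriber, which the rate limiting service would then match against its configured key regexes.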
Step 3: Verify the configuration
Send test requests as both user types to confirm that rate limiting works correctly.
Run each of the following commands multiple times using the kubeconfig of the data plane cluster.
As a regular user:
kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header "user-type: regular-user" \
  --data '{
    "messages": [
      {"role": "user", "content": "Please introduce yourself"}
    ]
  }'
As a subscriber:
kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header "user-type: subscriber" \
  --data '{
    "messages": [
      {"role": "user", "content": "Please introduce yourself"}
    ]
  }'
Expected behavior:
Initial requests from both user types return a normal LLM response containing a usage field with token counts.
After several requests, regular-user hits the rate limit first and receives the response "regular-user is being rate-limited".
subscriber can send more requests before being rate-limited, confirming the higher token budget.
This validates that the rate limiting service differentiates between user tiers and enforces the configured token budgets.
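When repeating the test requests many times, it can help to classify each response automatically. The Python sketch below does this for a response body passed in as a string. classify_response is a hypothetical helper, and it assumes that a rejected request contains the text "is being rate-limited" while a successful one is a JSON object with a usage field, as described in the expected behavior above.

```python
import json


def classify_response(body: str) -> str:
    """Label a response body as rate-limited, ok, or unknown."""
    if "is being rate-limited" in body:
        return "rate-limited"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if isinstance(payload, dict) and isinstance(payload.get("usage"), dict):
        total = payload["usage"].get("total_tokens", "?")
        return f"ok ({total} tokens)"
    return "unknown"
```

Counting the "ok" results per user type before the first "rate-limited" result gives a quick check that subscriber requests last longer than regular-user requests.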
What's next
Customize rate limiting rules: Adjust max_tokens, tokens_per_fill, and fill_interval_second in RATE_LIMIT_CONFIG to match your production traffic patterns.
Build a custom rate limiting service: Fork the example implementation to implement custom algorithms (leaky bucket, sliding window) or use a different storage backend.
Extend to more attributes: Change rateLimitKeys to extract keys from other request attributes such as paths, hosts, or TCP attributes. See Envoy attributes for available options.
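As a starting point for a custom algorithm, the following Python sketch shows one way a sliding window limiter for token consumption could work. It is not part of the ASM example implementation: SlidingWindowTokenLimiter and its methods are hypothetical, state is held in memory rather than in a shared store, and the HTTP interface that the plug-in expects is out of scope here.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowTokenLimiter:
    """Allow a key while the tokens used in the last window stay below a cap."""

    def __init__(self, max_tokens: int, window_seconds: float) -> None:
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.used = 0

    def _evict(self, now: float) -> None:
        # Drop consumption records that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def allow(self, now: float) -> bool:
        self._evict(now)
        return self.used < self.max_tokens

    def record(self, tokens: int, now: float) -> None:
        # Called after the LLM response reports its token usage.
        self._evict(now)
        self.events.append((now, tokens))
        self.used += tokens
```

Compared with a token bucket, this approach bounds consumption over any trailing window at the cost of storing one record per request, which matters when choosing a storage backend.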