Service Mesh (ASM) enables you to limit the number of LLM tokens that specific clients can consume based on request attributes, such as TCP attributes, HTTP headers, paths, hosts, and route destinations. This topic describes how to limit the number of tokens consumed by requests based on LLM request headers. If the token limit is reached, the proxy directly returns a rate limiting response instead of forwarding the request to the external service.
Background information
Function overview
The LLM token rate limiting feature in ASM is implemented using WebAssembly (Wasm) plug-ins and consists of two components: a rate limiting plug-in and a rate limiting service. The rate limiting plug-in intercepts requests, extracts the rate limiting key, and then queries the rate limiting service to determine whether to apply rate limiting to the request. During the LLM response phase, the rate limiting service is called again to update the rate limiting record for the specified key.
As shown in the preceding figure, returning the LLM response and updating the rate limiting record in Step ⑥ are performed asynchronously and do not block each other.
The rate limiting service is maintained by the client. The ASM rate limiting plug-in calls this service using standard HTTP interfaces. You can select different rate limiting algorithms, such as token bucket, leaky bucket, or sliding window, to implement specific rate limiting rules for different business scenarios. You can also dynamically adjust the rate limiting rules based on the load of the backend services. Additionally, ASM provides a default rate limiting implementation that uses the token bucket algorithm. This implementation relies on a Redis database to store rate limiting records.
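The following is a minimal, in-memory sketch of the token bucket logic described above, assuming a bucket defined by a maximum capacity, a refill amount, and a refill interval. The class and field names are illustrative assumptions; the default ASM implementation stores this state in Redis rather than in process memory.
import time

# Minimal in-memory token bucket sketch. The default implementation keeps this
# state in Redis so that it can be shared and expired; names are illustrative.
class TokenBucket:
    def __init__(self, max_capacity, tokens_per_refill, refill_interval_seconds):
        self.max_capacity = max_capacity
        self.tokens_per_refill = tokens_per_refill
        self.refill_interval = refill_interval_seconds
        self.tokens = max_capacity            # start with a full bucket
        self.last_refill = time.monotonic()

    def _refill(self):
        # Add tokens_per_refill for every full interval that has elapsed,
        # without exceeding the maximum capacity of the bucket.
        intervals = int((time.monotonic() - self.last_refill) // self.refill_interval)
        if intervals > 0:
            self.tokens = min(self.max_capacity,
                              self.tokens + intervals * self.tokens_per_refill)
            self.last_refill += intervals * self.refill_interval

    def allow(self):
        # Called when the request arrives: admit it only if tokens remain.
        self._refill()
        return self.tokens > 0

    def consume(self, used_tokens):
        # Called after the LLM response arrives, using the token usage reported
        # by the model (for example, usage.total_tokens).
        self._refill()
        self.tokens = max(0, self.tokens - used_tokens)

# Example: 50 tokens refilled every 30 seconds, capped at 200 tokens.
bucket = TokenBucket(max_capacity=200, tokens_per_refill=50, refill_interval_seconds=30)
if bucket.allow():
    bucket.consume(105)    # total_tokens reported in the LLM response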
Scenarios
This feature is applicable to the following two scenarios:
Clients call external large language model services: External large language model services are typically billed based on the number of tokens consumed by requests. You can use LLM token rate limiting to control client costs effectively.
Inference service providers: External clients call the inference services in the cluster. Inference services require a large amount of computing resources. You can use LLM token rate limiting to prevent a single customer from consuming excessive resources in a short period, which can cause services for other users to become unavailable.
Prerequisites
You have added a cluster to an ASM instance, and the ASM instance is version 1.23 or later.
You have read and followed Steps 1 and 2 in Traffic routing: Use ASM to efficiently manage LLM traffic to deploy the LLMProvider and its related resources.
You have deployed a Redis service in the cluster or locally. You can also use Tair (Redis OSS-compatible) to quickly create a Redis instance. For more information, see the Quick Start Overview.
Example
This example uses the user-type request header to distinguish two types of users: regular-user and subscriber. When the Wasm plug-in processes a request, it reads the request header, extracts the rate limiting key, and sends the key to the rate limiting service. The rate limiting service then throttles the request based on the configured rate limiting rules. This allows subscribers to consume more tokens, while regular users can consume only a small number of tokens.
Step 1: Deploy the rate limiting service
Create a file named token-limit.yaml that contains the following content.
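The full manifest is provided in the ASM example code repository referenced after this step. The following is only an illustrative sketch of its overall shape, assuming that the service is exposed as asm-llm-token-rate-limit-example on port 80 (matching the rateLimitService configured in Step 2) and that the rules are passed in through the RATE_LIMIT_CONFIG environment variable. The image address, Redis connection settings, and the field names inside RATE_LIMIT_CONFIG are placeholders, not the actual schema.
# Illustrative sketch only. Obtain the actual manifest from the ASM example
# code repository; the image address, Redis settings, and the field names in
# RATE_LIMIT_CONFIG below are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: asm-llm-token-rate-limit-example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: asm-llm-token-rate-limit-example
  template:
    metadata:
      labels:
        app: asm-llm-token-rate-limit-example
    spec:
      containers:
        - name: rate-limit
          image: <rate-limit-service-image>   # placeholder
          ports:
            - containerPort: 80
          env:
            - name: REDIS_ADDRESS             # placeholder: address of your Redis instance
              value: "redis.default.svc.cluster.local:6379"
            - name: RATE_LIMIT_CONFIG         # the real field names come from the example repository
              value: |
                [
                  {
                    "rate_limit_key_regex": "regular-user.*",
                    "redis_expire_seconds": 300,
                    "tokens_per_refill": 50,
                    "refill_interval_seconds": 30,
                    "max_capacity": 200
                  },
                  {
                    "rate_limit_key_regex": "subscriber.*",
                    "redis_expire_seconds": 600,
                    "tokens_per_refill": 100,
                    "refill_interval_seconds": 60,
                    "max_capacity": 1000
                  }
                ]
---
apiVersion: v1
kind: Service
metadata:
  name: asm-llm-token-rate-limit-example
  namespace: default
spec:
  selector:
    app: asm-llm-token-rate-limit-example
  ports:
    - port: 80
      targetPort: 80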
The preceding YAML file configures the RATE_LIMIT_CONFIG environment variable. The variable works as follows:
If rate_limit_key_regex matches regular-user.* in the request header, the following rate limiting rule is applied: the record in Redis expires after 300 seconds, the token bucket is refilled with 50 tokens every 30 seconds, and the maximum capacity of the token bucket is 200 tokens.
If rate_limit_key_regex matches subscriber.* in the request header, the following rate limiting rule is applied: the record in Redis expires after 600 seconds, the token bucket is refilled with 100 tokens every 60 seconds, and the maximum capacity of the token bucket is 1,000 tokens.
Use the kubeconfig file of the data plane cluster to run the following command.
kubectl apply -f token-limit.yaml
ASM supports custom rate limiting services and provides a default implementation that uses precise matching in Redis. For more information, see the code repository. If you have other custom requirements, you can develop a rate limiting service based on this example.
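As a starting point for a custom implementation, the skeleton below shows the overall shape such a service could take: one endpoint that the plug-in queries when a request arrives and one endpoint that it calls after the LLM response to report the consumed tokens. The routes, JSON fields, and in-memory state are placeholders chosen for illustration; the actual HTTP interface expected by the rate limiting plug-in is defined by the example in the code repository.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder token budgets keyed by rate limiting key. A real service would
# apply token bucket, leaky bucket, or sliding window logic and persist state
# (for example, in Redis) instead of keeping it in process memory.
remaining_tokens = {"regular-user": 200, "subscriber": 1000}

@app.route("/check", methods=["POST"])          # placeholder path
def check():
    # Decide whether the request identified by the rate limiting key may pass.
    body = request.get_json(force=True) or {}
    key = body.get("rate_limit_key", "")
    allowed = remaining_tokens.get(key, 0) > 0
    return jsonify({"allowed": allowed})

@app.route("/update", methods=["POST"])         # placeholder path
def update():
    # Deduct the tokens reported for the LLM response from the key's budget.
    body = request.get_json(force=True) or {}
    key = body.get("rate_limit_key", "")
    used = body.get("used_tokens", 0)
    if key in remaining_tokens:
        remaining_tokens[key] = max(0, remaining_tokens[key] - used)
    return jsonify({"ok": True})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)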
Step 2: Deploy the rate limiting plug-in
Create a file named wasm.yaml that contains the following content.
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: llm-token-ratelimit
  namespace: default
spec:
  failStrategy: FAIL_OPEN
  imagePullPolicy: IfNotPresent
  selector:
    matchLabels:
      app: sleep
  match:
    - mode: CLIENT
      ports:
        - number: 80
  phase: STATS
  pluginConfig:
    matches:
      - host:
          exact: "dashscope.aliyuncs.com"
    rateLimitKeys:
      - "{{request.headers.user-type}}"
    rateLimitService:
      service: asm-llm-token-rate-limit-example.default.svc.cluster.local
      port: 80
  priority: 10
  url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-token-ratelimit:v1.23.6.34-g92d6a4b-aliyun
The following table describes some of the configuration items.
Configuration item
Description
.spec.pluginConfig.matches
Matches the requests for which the rate limiting logic is to be executed. Requests that are not matched are allowed to pass through.
.spec.pluginConfig.rateLimitKeys
The rule for generating the rate limiting key. For more information, see Attributes. In this example, the value is {{request.headers.user-type}}.
.spec.pluginConfig.rateLimitService
The information about the rate limiting service. You must specify the fully qualified domain name (FQDN) of the Service.
Use the kubeconfig file of the control plane cluster to run the following command.
kubectl apply -f wasm.yaml
Step 3: Test the feature
Use the kubeconfig file of the data plane cluster to run the following commands multiple times as a regular-user and a subscriber.
kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--header "user-type: regular-user" \
--data '{
"messages": [
{"role": "user", "content": "Please introduce yourself"}
]
}'
kubectl exec deployment/sleep -it -- curl 'http://dashscope.aliyuncs.com' \
--header 'Content-Type: application/json' \
--header "user-type: subscriber" \
--data '{
"messages": [
{"role": "user", "content": "Please introduce yourself"}
]
}'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"I am a large language model from Alibaba Cloud. My name is Tongyi Qianwen. I am a super-large language model that can answer questions, create text, express opinions, and write code. My knowledge comes from text on the Internet. After multiple iterations and optimizations, my capabilities have continuously improved. I can now answer questions on various topics, such as technology, culture, history, and entertainment, and can also engage in continuous conversations. If you have any questions or need help, feel free to let me know, and I will do my best to provide support."},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":95,"total_tokens":105},"created":1735103573,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-7de0bd64-341a-9196-b676-99b5644ec111"}%
regular-user is being rate-limited{"choices":[{"message":{"role":"assistant","content":"I am a large language model from Alibaba Cloud. My name is Tongyi Qianwen. I am a super-large language model that can answer questions, create text, express opinions, and write code. My knowledge comes from text on the Internet. After multiple iterations and optimizations, my capabilities have continuously improved. I can now answer questions on various topics, such as technology, culture, history, and entertainment, and can also engage in continuous conversations. If you have any questions or need help, feel free to let me know, and I will do my best to provide support."},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":95,"total_tokens":105},"created":1735103890,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-1b284b71-f850-95f5-9a7e-12678812763c"}%
{"choices":[{"message":{"role":"assistant","content":"I am a large language model from Alibaba Cloud. My name is Tongyi Qianwen. I am a super-large language model that can answer questions, create text, express opinions, and write code. My knowledge comes from the massive text data of Alibaba Cloud, including various books, documents, web pages, and papers, which is intended to enable me to understand and answer various topics. If you have any questions or need help, feel free to let me know, and I will do my best to provide support."},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":81,"total_tokens":91},"created":1735103895,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-0d29d820-c9c5-9e94-9a5a-d054233ed35a"}
subscriber is being rate-limited