Application Load Balancer (ALB) Extensible Edition supports token-based rate limiting for AI workloads. You can limit token consumption by dimensions such as user or account within a specified time window. This prevents resource abuse and controls costs when calling large language models (LLMs).
Solution architecture
An ALB Extensible Edition instance receives client requests and matches them to forwarding rules based on the requested domain name. The token rate limiting component is associated with the forwarding rule through a service extension and executes before the forwarding action. The component extracts a rate limiting identifier from an HTTP header and queries the token consumption for that identifier within a specified time window. If the consumption exceeds the threshold, the component returns an HTTP 429 response and blocks the request from reaching the backend AI service. If the consumption is within the threshold, the request is forwarded. After receiving the response, the component extracts the token usage from the response body and updates the consumption statistics for that identifier.
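The check-before-forwarding and record-after-response flow described above can be modeled as a fixed-window counter keyed by the identifier. The following Python sketch is a simplified illustration of that flow under assumed semantics, not the component's actual implementation:

```python
import time

class TokenRateLimiter:
    """Fixed-window token counter keyed by a header value (e.g. x-account-id)."""

    def __init__(self, limit_tokens, window_seconds):
        self.limit = limit_tokens
        self.window = window_seconds
        self.state = {}  # identifier -> (window_start, tokens_used)

    def allow(self, identifier, now=None):
        """Called before forwarding: True if the request may proceed."""
        now = time.time() if now is None else now
        start, used = self.state.get(identifier, (now, 0))
        if now - start >= self.window:   # window expired: reset the counter
            start, used = now, 0
        self.state[identifier] = (start, used)
        return used < self.limit         # once exhausted, reply HTTP 429

    def record(self, identifier, total_tokens, now=None):
        """Called after the response: add usage.total_tokens to the counter."""
        now = time.time() if now is None else now
        start, used = self.state.get(identifier, (now, 0))
        self.state[identifier] = (start, used + total_tokens)
```

Because consumption is recorded only after the response arrives, the request that crosses the threshold still completes; only subsequent requests within the window are blocked.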
The solution involves these resources:
| Resource | Purpose |
| --- | --- |
| ALB Extensible Edition instance | Provides load balancing and traffic forwarding. |
| AI Service-type server group | Connects to a backend LLM service (Alibaba Cloud Model Studio). |
| HTTPS listener | Receives client requests on port 443. |
| Forwarding rule | Matches requests based on the domain name and triggers the service extension. |
| Service extension | Implements token consumption tracking and rate limiting through the token rate limiting component. |
Prerequisites
You have been granted access to the ALB Extensible Edition public preview.
You have created a virtual private cloud (VPC) in the Singapore region with a vSwitch in both Zone A and Zone B. Internet SNAT is enabled on the vSwitches so the AI Service-type server group can call public LLM endpoints.
You have activated Alibaba Cloud Model Studio and obtained an API key.
You have registered a custom domain name.
You have prepared a server certificate that matches your custom domain name. If the certificate was not purchased from Alibaba Cloud, upload it to Alibaba Cloud Certificate Management Service.
Procedure
Step 1: Create an ALB Extensible Edition instance
Go to the ALB console, select the Singapore region, and click Create ALB.
On the buy page, configure the following settings and click Buy Now.
| Parameter | Value |
| --- | --- |
| Region | Singapore |
| Network Type | Internet |
| VPC and Zone | Select the target VPC. Select Zone A and Zone B, choose the corresponding vSwitches, and select Automatically assign EIP. |
| IP Version | IPv4 |
| Edition | Extensible |
On the Confirm Order page, review the instance configuration and click Activate Now.
Step 2: Create an AI Service-type server group
Create an AI Service-type server group to connect to Alibaba Cloud Model Studio.
Go to the Server Groups page and click Create Server Group.
Set Server Group Type to AI Service, enter a name such as sgp-ai-qwen, and click Create.
In the dialog box that appears, click Add Backend Server.
Configure the following settings and click OK.
| Parameter | Value |
| --- | --- |
| Model provider | Alibaba Cloud Model Studio |
| Endpoint | Auto-populated after you select the provider. |
| API Key | Enter the API key for Alibaba Cloud Model Studio. |
Step 3: Create a listener
In the ALB console, click the target instance ID to go to the instance details page. On the Listener tab, click Create Listener.
In the Configure Listener step, set Listener Protocol to HTTPS and Listener Port to 443, then click Next.
In the Configure SSL Certificate step, select the server certificate that matches your custom domain name and click Next.
In the Select Server Group step, select the AI Service type and the sgp-ai-qwen server group, then click Next.
The selected server group is used for the default forwarding rule, which handles requests that do not match any other forwarding rule. Adjust this setting as needed.
In the Configuration Review step, confirm the settings and click Submit.
Step 4: Create a service extension
Create a service extension and add a token rate limiting component. The component identifies requests by an HTTP header and applies rate limiting.
Go to the Service Extensions page and click Create Service Extension.
In the Service Extension Configuration section, enter an Extension name such as ext-token-rate-limit. Extension Type is set to Plug-in by default.
From the Component name drop-down list, select Token Rate Limiting.
Configure the rate limiting policy and click Create.
| Parameter | Value |
| --- | --- |
| Throttling Condition | Select By HTTP Header. Enter x-account-id as the header name. |
| Throttling Range | Select Total Tokens. Set the limit to 100 tokens per minute. |
| Timeout and Processing policy | This tutorial uses the default values. |
The x-account-id value is extracted from the HTTP request header and used as the rate limiting identifier. Token consumption is tracked separately for each unique value, and rate limits are calculated independently. You can use other HTTP header fields or adjust the condition type and match type as needed.
When a service extension contains multiple rate limiting policies, requests are matched from top to bottom. The first matching policy is applied. Subsequent policies are not evaluated.
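The top-to-bottom, first-match evaluation can be sketched as follows. The policy dictionary structure here is hypothetical, used only to illustrate the matching order:

```python
def select_policy(policies, request_headers):
    """Return the first rate limiting policy whose header condition matches.

    Policies are evaluated in their listed order; once one matches, later
    policies are never consulted. A request that matches no policy is
    forwarded without rate limiting.
    """
    for policy in policies:
        if policy["header"] in request_headers:
            return policy
    return None
```

For example, if both an x-account-id policy and an x-team-id policy are configured in that order, a request carrying both headers is limited only by the x-account-id policy.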
Step 5: Configure a forwarding rule
Create a forwarding rule for the listener. Add a domain name match condition and associate the service extension.
In the ALB console, click the target instance ID to go to the instance details page. Click the Listener tab, click the target listener ID, and go to the Forwarding Rules tab.
Click Add New Rule and configure the following settings, then click OK.
| Parameter | Value |
| --- | --- |
| Add Condition | Select Domain Name. Set the match type to Exact Match. Enter the domain name used to access the ALB instance, such as ai.example.com. |
| Service Extension (Optional) | Select Use Existing Service Extension and choose ext-token-rate-limit from the drop-down list. |
| Action | Select Forward to and choose the AI Service-type server group sgp-ai-qwen. |
After the forwarding rule is created, requests whose Host header is ai.example.com match this rule. The service extension checks the x-account-id header value against the rate limit. If the token consumption is within the threshold, the request is forwarded to the sgp-ai-qwen server group.
For production environments, you can use other types of forwarding conditions as needed.
Step 6: Configure DNS resolution
Create a CNAME record to point your custom domain name to the DNS name of the ALB instance. This allows clients to access the ALB instance through your domain name.
This tutorial uses Alibaba Cloud DNS as an example. If your domain name is not registered with Alibaba Cloud, add the domain name to the Alibaba Cloud DNS console first.
In the ALB console, copy the Domain Name of the target instance.
Go to the Alibaba Cloud DNS console. Find the target domain name, and click Settings in the Actions column. On the Settings tab, click Add Record.
Configure the following settings and click OK.
| Parameter | Value |
| --- | --- |
| Record Type | CNAME |
| Hostname | Enter the domain name prefix, such as ai. If your root domain name is example.com, the full domain name for accessing the ALB instance is ai.example.com. This must match the domain name configured in Step 5. |
| Query Source and TTL | Keep the default values. |
| Record Value | Enter the DNS name of the ALB instance. |
In the confirmation dialog box, review the DNS record and click OK.
Step 7: Verify the configuration
Use curl commands to send requests and verify the token rate limiting feature. Requests must meet the following conditions:
Domain name: Access the ALB instance through ai.example.com (the Host header value must match).
Rate limiting header: Include the x-account-id header. The rate limiting policy identifies requests by this header. Each unique header value has an independent token quota.
OpenAI-compatible protocol: The request path must be /v1/completions, /v1/chat/completions, or /v1/embeddings, and the request body must comply with the protocol format.
Replace ai.example.com in the following commands with the actual domain name configured in Step 6. Make sure the DNS record has taken effect before testing.
Test a normal request
Send a single request to verify that the service works:
curl -v \
-H "x-account-id: id" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-turbo",
"messages": [
{
"role": "user",
"content": "Who are you"
}
]
}' \
https://ai.example.com/v1/chat/completions
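The same request can also be issued from Python with only the standard library. This is a sketch: ai.example.com is the placeholder domain from this tutorial, and the urlopen call is commented out so nothing is sent until DNS and the certificate are in place:

```python
import json
import urllib.request

# Build the same POST request the curl command sends.
req = urllib.request.Request(
    "https://ai.example.com/v1/chat/completions",  # replace with your domain
    data=json.dumps({
        "model": "qwen-turbo",
        "messages": [{"role": "user", "content": "Who are you"}],
    }).encode("utf-8"),
    headers={"x-account-id": "12345", "Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment to send over HTTPS
# body = json.load(response)              # parsed JSON, including "usage"
```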
A successful request returns an HTTP 200 status code and the AI service response:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "Hello! I am Qwen, a large language model..."
},
"finish_reason": "stop",
"index": 0
}
],
"object": "chat.completion",
"usage": {
"prompt_tokens": 14,
"completion_tokens": 53,
"total_tokens": 67
},
"model": "qwen-turbo"
}
The usage.total_tokens field shows the number of tokens consumed by this request.
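If you script the verification, the field can be read from the saved response body. A minimal example using the abridged response shown above:

```python
import json

# Example response body (abridged from the successful request above).
response_body = '''{
  "usage": {"prompt_tokens": 14, "completion_tokens": 53, "total_tokens": 67},
  "model": "qwen-turbo"
}'''

usage = json.loads(response_body)["usage"]
print(usage["total_tokens"])  # prints 67
```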
Test rate limit triggering
Send multiple requests in quick succession to exceed the rate limit threshold (100 tokens per minute):
for i in {1..3}; do
curl -v \
-H "x-account-id: 12345" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-turbo",
"messages": [
{
"role": "user",
"content": "Who are you"
}
]
}' \
https://ai.example.com/v1/chat/completions
sleep 2
done
When the cumulative token consumption exceeds 100, subsequent requests return an HTTP 429 status code.
Because token consumption is recorded only after the response is received, the request that exhausts the quota still succeeds and may consume tokens beyond the quota. Only subsequent requests within the window are rejected.
In streaming output scenarios, token usage is calculated when the last data chunk is returned. Even if a request exceeds the quota for the current window, the full response is returned without truncation. Rate limiting applies only to subsequent requests.
The following example shows a rate-limited response:
Response headers:
HTTP/2 429
x-tokenratelimit-reset: 52
content-length: 17
content-type: text/plain
date: Wed, 21 Jan 2026 07:59:38 GMT
Response body:
Too Many Requests
| Header/Field | Description |
| --- | --- |
| HTTP status code 429 | The request was rejected because the token rate limit was exceeded. |
| x-tokenratelimit-reset | Seconds remaining until the rate limit counter resets (52 seconds in this example). After the counter resets, requests are processed normally. |
| Response body | Too Many Requests: a plain-text message indicating the rejection. |
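A client can use the x-tokenratelimit-reset response header to back off for the remaining window instead of retrying immediately. A minimal client-side sketch; the 60-second fallback default is an assumption, not a documented value:

```python
def seconds_until_retry(headers, default=60):
    """Return how long a client should wait before retrying, based on the
    x-tokenratelimit-reset response header (seconds until the counter
    resets). Falls back to `default` if the header is missing or malformed."""
    try:
        return max(0, int(headers.get("x-tokenratelimit-reset")))
    except (TypeError, ValueError):
        return default
```

For the example response above, the client would wait 52 seconds before retrying.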
Billing
| Item | Description |
| --- | --- |
| ALB Extensible Edition | Currently in public preview. No charges apply. |
| Internet access | Internet NAT gateways incur instance fees and capacity unit (CU) fees. EIPs associated with the ALB instance have separate billing rules. |
| Domain name and DNS resolution | Domain name registration fees from your provider, plus public authoritative DNS resolution fees from Alibaba Cloud. |
| Certificate | Server certificate fees if you purchase or upload certificates through Alibaba Cloud Certificate Management Service. |
| Model Studio | API call fees based on the model and token usage. For more information, see Alibaba Cloud Model Studio pricing. |
Supported regions
For the list of regions that support ALB Extensible Edition, see Regions and zones where ALB Extensible Edition is available.
Apply in production
Set the rate limiting threshold: Monitor normal token consumption patterns first, then set a threshold slightly higher than the observed values. This prevents the policy from being too aggressive and blocking legitimate requests.
Select the time window:
For high-frequency scenarios, use a shorter time window for more precise control.
For low-frequency but high-consumption scenarios, use a longer time window to prevent individual requests from being mistakenly rate-limited.
The time window must be longer than the duration of a single model inference. Response-based rate limiting cannot function if the inference completes after the window expires.
FAQ
How are requests handled if they do not match the rate limiting policy?
If a request does not include the x-account-id header, it matches the forwarding rule but not the rate limiting policy. The request is forwarded directly without rate limiting.
Each unique x-account-id value has its own independent token quota.
I configured token rate limiting, but requests are not being limited. What should I check?
Forwarding rule: Verify that the forwarding condition matches the request, the rule has a high enough priority, and a service extension is associated.
Service extension: Verify that the token rate limiting component is configured correctly, the rate limiting condition matches the request, and the rate limit values are appropriate.
Request format: Verify that the request matches both the forwarding condition and the rate limiting condition.
Token consumption: Check the usage.total_tokens field in the response. The rate limiting policy counts cumulative token consumption within the time window. A single request may not reach the threshold.
If the model inference duration exceeds the rate limiting time window, rate limiting cannot be enforced.