Application Load Balancer (ALB) Extensible Edition supports token-based rate limiting for AI workloads. You can limit token consumption by dimensions such as user or account within a specified time window. This prevents resource abuse and controls costs when calling large language models (LLMs).
Solution architecture
An ALB Extensible Edition instance receives client requests and matches them to forwarding rules based on the requested domain name. The token rate limiting component is associated with the forwarding rule through a service extension and executes before the forwarding action. The component extracts a rate limiting identifier from an HTTP header and queries the token consumption for that identifier within a specified time window. If the consumption exceeds the threshold, the component returns an HTTP 429 response and blocks the request from reaching the backend AI service. If the consumption is within the threshold, the request is forwarded. After receiving the response, the component extracts the token usage from the response body and updates the consumption statistics for that identifier.
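The check-before-forwarding and record-after-response flow described above can be modeled as a fixed-window counter keyed by the identifier. The following Python sketch is a simplified illustration of that flow under assumed semantics, not the component's actual implementation:

```python
import time

class TokenRateLimiter:
    """Fixed-window token counter keyed by a header value (e.g. x-account-id)."""

    def __init__(self, limit_tokens, window_seconds):
        self.limit = limit_tokens
        self.window = window_seconds
        self.state = {}  # identifier -> (window_start, tokens_used)

    def allow(self, identifier, now=None):
        """Called before forwarding: True if the request may proceed."""
        now = time.time() if now is None else now
        start, used = self.state.get(identifier, (now, 0))
        if now - start >= self.window:   # window expired: reset the counter
            start, used = now, 0
        self.state[identifier] = (start, used)
        return used < self.limit         # once exhausted, reply HTTP 429

    def record(self, identifier, total_tokens, now=None):
        """Called after the response: add usage.total_tokens to the counter."""
        now = time.time() if now is None else now
        start, used = self.state.get(identifier, (now, 0))
        self.state[identifier] = (start, used + total_tokens)
```

Because consumption is recorded only after the response arrives, the request that crosses the threshold still completes; only subsequent requests within the window are blocked.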
The solution involves these resources:
| Resource | Purpose |
| --- | --- |
| ALB Extensible Edition instance | Provides load balancing and traffic forwarding. |
| AI Service-type server group | Connects to a backend LLM service (Alibaba Cloud Model Studio). |
| HTTPS listener | Receives client requests on port 443. |
| Forwarding rule | Matches requests based on the domain name and triggers the service extension. |
| Service extension | Implements token consumption tracking and rate limiting through the token rate limiting component. |
Prerequisites
You have been granted access to the ALB Extensible Edition public preview.
You have created a virtual private cloud (VPC) in the Singapore region with a vSwitch in both Zone A and Zone B. Internet SNAT is enabled on the vSwitches so the AI Service-type server group can call public LLM endpoints.
You have activated Alibaba Cloud Model Studio and obtained an API key.
You have registered a custom domain name.
You have prepared a server certificate that matches your custom domain name. If the certificate was not purchased from Alibaba Cloud, upload it to Alibaba Cloud Certificate Management Service.
Procedure
Step 1: Create an ALB Extensible Edition instance
Go to the ALB console, select the Singapore region, and click Create ALB.
On the buy page, configure the following settings and click Buy Now.
| Parameter | Value |
| --- | --- |
| Region | Singapore |
| Network Type | Internet |
| VPC and Zone | Select the target VPC. Select Zone A and Zone B, choose the corresponding vSwitches, and select Automatically assign EIP. |
| IP Version | IPv4 |
| Edition | Extensible |
On the Confirm Order page, review the instance configuration and click Activate Now.
Step 2: Create an AI Service-type server group
Create an AI Service-type server group to connect to Alibaba Cloud Model Studio.
Go to the Server Groups page and click Create Server Group.
Set Server Group Type to AI Service, enter a name such as sgp-ai-qwen, and click Create.
In the dialog box that appears, click Add Backend Server.
Configure the following settings and click OK.
| Parameter | Value |
| --- | --- |
| Model provider | Alibaba Cloud Model Studio |
| Endpoint | Auto-populated after you select the provider. |
| API Key | Enter the API key for Alibaba Cloud Model Studio. |
Step 3: Create a listener
In the ALB console, click the target instance ID to go to the instance details page. On the Listener tab, click Create Listener.
In the Configure Listener step, set Listener Protocol to HTTPS and Listener Port to 443, then click Next.
In the Configure SSL Certificate step, select the server certificate that matches your custom domain name and click Next.
In the Select Server Group step, select the AI Service type and the sgp-ai-qwen server group, then click Next.
The selected server group is used for the default forwarding rule, which handles requests that do not match any other forwarding rule. Adjust this setting as needed.
In the Configuration Review step, confirm the settings and click Submit.
Step 4: Create a service extension
Create a service extension and add a token rate limiting component. The component identifies requests by an HTTP header and applies rate limiting.
Go to the Service Extensions page and click Create Service Extension.
In the Service Extension Configuration section, enter an Extension name such as ext-token-rate-limit. Extension Type is set to Plug-in by default.
From the Component name drop-down list, select Token Rate Limiting.
Configure the rate limiting policy and click Create.
| Parameter | Value |
| --- | --- |
| Throttling Condition | Select By HTTP Header. Enter x-account-id as the header name. |
| Throttling Range | Select Total Tokens. Set the limit to 100 tokens per minute. |
| Timeout and Processing policy | This tutorial uses the default values. |
The x-account-id value is extracted from the HTTP request header and used as the rate limiting identifier. Token consumption is tracked separately for each unique value, and rate limits are calculated independently. You can use other HTTP header fields or adjust the condition type and match type as needed.
When a service extension contains multiple rate limiting policies, requests are matched from top to bottom. The first matching policy is applied. Subsequent policies are not evaluated.
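The top-to-bottom, first-match evaluation can be sketched as follows. The policy dictionary structure here is hypothetical, used only to illustrate the matching order:

```python
def select_policy(policies, request_headers):
    """Return the first rate limiting policy whose header condition matches.

    Policies are evaluated in their listed order; once one matches, later
    policies are never consulted. A request that matches no policy is
    forwarded without rate limiting.
    """
    for policy in policies:
        if policy["header"] in request_headers:
            return policy
    return None
```

For example, if both an x-account-id policy and an x-team-id policy are configured in that order, a request carrying both headers is limited only by the x-account-id policy.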
Step 5: Configure a forwarding rule
Create a forwarding rule for the listener. Add a domain name match condition and associate the service extension.
In the ALB console, click the target instance ID to go to the instance details page. Click the Listener tab, click the target listener ID, and go to the Forwarding Rules tab.
Click Add New Rule and configure the following settings, then click OK.
| Parameter | Value |
| --- | --- |
| Add Condition | Select Domain Name. Set the match type to Exact Match. Enter the domain name used to access the ALB instance, such as ai.example.com. |
| Service Extension (Optional) | Select Use Existing Service Extension and choose ext-token-rate-limit from the drop-down list. |
| Action | Select Forward to and choose the AI Service-type server group sgp-ai-qwen. |
After the forwarding rule is created, requests whose Host header is ai.example.com match this rule. The service extension checks the x-account-id header value against the rate limit. If the token consumption is within the threshold, the request is forwarded to the sgp-ai-qwen server group.
For production environments, you can use other types of forwarding conditions as needed.
Step 6: Configure DNS resolution
Create a CNAME record to point your custom domain name to the DNS name of the ALB instance. This allows clients to access the ALB instance through your domain name.
This tutorial uses Alibaba Cloud DNS as an example. If your domain name is not registered with Alibaba Cloud, add the domain name to the Alibaba Cloud DNS console first.
In the ALB console, copy the Domain Name of the target instance.
Go to the Alibaba Cloud DNS console. Find the target domain name, and click Settings in the Actions column. On the Settings tab, click Add Record.
Configure the following settings and click OK.
| Parameter | Value |
| --- | --- |
| Record Type | CNAME |
| Hostname | Enter the domain name prefix, such as ai. If your root domain name is example.com, the full domain name for accessing the ALB instance is ai.example.com. This must match the domain name configured in Step 5. |
| Query Source and TTL | Keep the default values. |
| Record Value | Enter the DNS name of the ALB instance. |
In the confirmation dialog box, review the DNS record and click OK.
Step 7: Verify the configuration
Use curl commands to send requests and verify the token rate limiting feature. Requests must meet the following conditions:
Domain name: Access the ALB instance through ai.example.com (the Host header value must match).
Rate limiting header: Include the x-account-id header. The rate limiting policy identifies requests by this header. Each unique header value has an independent token quota.
OpenAI-compatible protocol: The request path must be /v1/completions, /v1/chat/completions, or /v1/embeddings, and the request body must comply with the protocol format.
Replace ai.example.com in the following commands with the actual domain name configured in Step 6. Make sure the DNS record has taken effect before testing.
Test a normal request
Send a single request to verify that the service works:
curl -v \
-H "x-account-id: id" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-turbo",
"messages": [
{
"role": "user",
"content": "Who are you"
}
]
}' \
https://ai.example.com/v1/chat/completions
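The same request can also be issued from Python with only the standard library. This is a sketch: ai.example.com is the placeholder domain from this tutorial, and the urlopen call is commented out so nothing is sent until DNS and the certificate are in place:

```python
import json
import urllib.request

# Build the same POST request the curl command sends.
req = urllib.request.Request(
    "https://ai.example.com/v1/chat/completions",  # replace with your domain
    data=json.dumps({
        "model": "qwen-turbo",
        "messages": [{"role": "user", "content": "Who are you"}],
    }).encode("utf-8"),
    headers={"x-account-id": "12345", "Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment to send over HTTPS
# body = json.load(response)              # parsed JSON, including "usage"
```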
A successful request returns an HTTP 200 status code and the AI service response:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "Hello! I am Qwen, a large language model..."
},
"finish_reason": "stop",
"index": 0
}
],
"object": "chat.completion",
"usage": {
"prompt_tokens": 14,
"completion_tokens": 53,
"total_tokens": 67
},
"model": "qwen-turbo"
}
The usage.total_tokens field shows the number of tokens consumed by this request.
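If you script the verification, the field can be read from the saved response body. A minimal example using the abridged response shown above:

```python
import json

# Example response body (abridged from the successful request above).
response_body = '''{
  "usage": {"prompt_tokens": 14, "completion_tokens": 53, "total_tokens": 67},
  "model": "qwen-turbo"
}'''

usage = json.loads(response_body)["usage"]
print(usage["total_tokens"])  # prints 67
```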
Test rate limit triggering
Send multiple requests in quick succession to exceed the rate limit threshold (100 tokens per minute):
for i in {1..3}; do
curl -v \
-H "x-account-id: 12345" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-turbo",
"messages": [
{
"role": "user",
"content": "Who are you"
}
]
}' \
https://ai.example.com/v1/chat/completions
sleep 2
done
When the cumulative token consumption exceeds 100, subsequent requests return an HTTP 429 status code.
Because token consumption is recorded only after the response is received, the request that exhausts the quota still succeeds and may consume tokens beyond the quota. Only subsequent requests within the window are rejected.
In streaming output scenarios, token usage is calculated when the last data chunk is returned. Even if a request exceeds the quota for the current window, the full response is returned without truncation. Rate limiting applies only to subsequent requests.
The following example shows a rate-limited response:
Response headers:
HTTP/2 429
x-tokenratelimit-reset: 52
content-length: 17
content-type: text/plain
date: Wed, 21 Jan 2026 07:59:38 GMT
Response body:
Too Many Requests
| Header/Field | Description |
| --- | --- |
| HTTP status code 429 | The request was rejected because the token rate limit was exceeded. |
| x-tokenratelimit-reset | Seconds remaining until the rate limit counter resets (52 seconds in this example). After the counter resets, requests are processed normally. |
| Response body | Too Many Requests: a plain-text message indicating the rejection. |
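A client can use the x-tokenratelimit-reset response header to back off for the remaining window instead of retrying immediately. A minimal client-side sketch; the 60-second fallback default is an assumption, not a documented value:

```python
def seconds_until_retry(headers, default=60):
    """Return how long a client should wait before retrying, based on the
    x-tokenratelimit-reset response header (seconds until the counter
    resets). Falls back to `default` if the header is missing or malformed."""
    try:
        return max(0, int(headers.get("x-tokenratelimit-reset")))
    except (TypeError, ValueError):
        return default
```

For the example response above, the client would wait 52 seconds before retrying.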
Billing
| Item | Description |
| --- | --- |
| ALB Extensible Edition | Currently in public preview. No charges apply. |
| Internet access | Internet NAT gateways incur instance fees and capacity unit (CU) fees. EIPs associated with the ALB instance have separate billing rules. |
| Domain name and DNS resolution | Domain name registration fees from your provider, plus public authoritative DNS resolution fees from Alibaba Cloud. |
| Certificate | Server certificate fees if you purchase or upload certificates through Alibaba Cloud Certificate Management Service. |
| Model Studio | API call fees based on the model and token usage. For more information, see Alibaba Cloud Model Studio pricing. |
Supported regions
For the list of regions that support ALB Extensible Edition, see Regions and zones where ALB Extensible Edition is available.
Apply in production
Set the rate limiting threshold: Monitor normal token consumption patterns first, then set a threshold slightly higher than the observed values. This prevents the policy from being too aggressive and blocking legitimate requests.
Select the time window:
For high-frequency scenarios, use a shorter time window for more precise control.
For low-frequency but high-consumption scenarios, use a longer time window to prevent individual requests from being mistakenly rate-limited.
The time window must be longer than the duration of a single model inference. Response-based rate limiting cannot function if the inference completes after the window expires.
FAQ
How are requests handled if they do not match the rate limiting policy?
If a request does not include the x-account-id header, it matches the forwarding rule but not the rate limiting policy. The request is forwarded directly without rate limiting.
Each unique x-account-id value has its own independent token quota.
I configured token rate limiting, but requests are not being limited. What should I check?
Forwarding rule: Verify that the forwarding condition matches the request, the rule has a high enough priority, and a service extension is associated.
Service extension: Verify that the token rate limiting component is configured correctly, the rate limiting condition matches the request, and the rate limit values are appropriate.
Request format: Verify that the request matches both the forwarding condition and the rate limiting condition.
Token consumption: Check the usage.total_tokens field in the response. The rate limiting policy counts cumulative token consumption within the time window. A single request may not reach the threshold.
If the model inference duration exceeds the rate limiting time window, rate limiting cannot be enforced.