Server Load Balancer: Implement token-based rate limiting with ALB Extensible Edition

Last Updated: Mar 20, 2026

Application Load Balancer (ALB) Extensible Edition supports token-based rate limiting for AI workloads. You can limit token consumption by dimensions such as user or account within a specified time window. This prevents resource abuse and controls costs when calling large language models (LLMs).

Solution architecture

An ALB Extensible Edition instance receives client requests and matches them to forwarding rules based on the requested domain name. The token rate limiting component is associated with the forwarding rule through a service extension and executes before the forwarding action. The component extracts a rate limiting identifier from an HTTP header and queries the token consumption for that identifier within a specified time window. If the consumption exceeds the threshold, the component returns an HTTP 429 response and blocks the request from reaching the backend AI service. If the consumption is within the threshold, the request is forwarded. After receiving the response, the component extracts the token usage from the response body and updates the consumption statistics for that identifier.
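
The check-then-account flow described above can be sketched as a fixed-window counter per identifier. The following Python sketch illustrates the logic only; it is not the actual ALB component, and the class name, methods, and parameters are hypothetical:

```python
import time

# Illustrative fixed-window token accounting, mirroring the flow above:
# check consumption before forwarding, record usage after the response.
class TokenRateLimiter:
    def __init__(self, limit_tokens, window_seconds):
        self.limit = limit_tokens
        self.window = window_seconds
        self.counters = {}  # identifier -> (window_start, tokens_used)

    def _bucket(self, identifier, now):
        start, used = self.counters.get(identifier, (now, 0))
        if now - start >= self.window:  # window expired: reset the counter
            start, used = now, 0
        return start, used

    def allow(self, identifier, now=None):
        """Pre-request check: return False (HTTP 429) once the quota is spent."""
        now = time.time() if now is None else now
        start, used = self._bucket(identifier, now)
        self.counters[identifier] = (start, used)
        return used < self.limit

    def record(self, identifier, total_tokens, now=None):
        """Post-response accounting from the response's usage.total_tokens."""
        now = time.time() if now is None else now
        start, used = self._bucket(identifier, now)
        self.counters[identifier] = (start, used + total_tokens)

limiter = TokenRateLimiter(limit_tokens=100, window_seconds=60)
assert limiter.allow("acct-1", now=0)      # first request passes
limiter.record("acct-1", 67, now=0)        # response consumed 67 tokens
assert limiter.allow("acct-1", now=1)      # 67 < 100: still allowed
limiter.record("acct-1", 53, now=1)        # cumulative 120 exceeds 100
assert not limiter.allow("acct-1", now=2)  # subsequent request rejected
assert limiter.allow("acct-1", now=61)     # new window: counter reset
```

Because accounting happens after the response, the request that crosses the limit still succeeds; only later requests in the same window are rejected.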

The solution involves these resources:

  • ALB Extensible Edition instance: Provides load balancing and traffic forwarding.

  • AI Service-type server group: Connects to a backend LLM service (Alibaba Cloud Model Studio).

  • HTTPS listener: Receives client requests on port 443.

  • Forwarding rule: Matches requests based on the domain name and triggers the service extension.

  • Service extension: Implements token consumption tracking and rate limiting through the token rate limiting component.

Prerequisites

Procedure

Step 1: Create an ALB Extensible Edition instance

  1. Go to the ALB console, select the Singapore region, and click Create ALB.

  2. On the buy page, configure the following settings and click Buy Now.

    • Region: Singapore

    • Network Type: Internet

    • VPC and Zone: Select the target VPC. Select Zone A and Zone B, choose the corresponding vSwitches, and select Automatically assign EIP.

    • IP Version: IPv4

    • Edition: Extensible

  3. On the Confirm Order page, review the instance configuration and click Activate Now.

Step 2: Create an AI Service-type server group

Create an AI Service-type server group to connect to Alibaba Cloud Model Studio.

  1. Go to the Server Groups page and click Create Server Group.

  2. Set Server Group Type to AI Service, enter a name such as sgp-ai-qwen, and click Create.

  3. In the dialog box that appears, click Add Backend Server.

  4. Configure the following settings and click OK.

    • Model provider: Alibaba Cloud Model Studio

    • Endpoint: Auto-populated after you select the provider.

    • API Key: Enter the API key for Alibaba Cloud Model Studio.

Step 3: Create a listener

  1. In the ALB console, click the target instance ID to go to the instance details page. On the Listener tab, click Create Listener.

  2. In the Configure Listener step, set Listener Protocol to HTTPS and Listener Port to 443, then click Next.

  3. In the Configure SSL Certificate step, select the server certificate that matches your custom domain name and click Next.

  4. In the Select Server Group step, select the AI Service type and the sgp-ai-qwen server group, then click Next.

    The selected server group is used for the default forwarding rule, which handles requests that do not match any other forwarding rule. Adjust this setting as needed.

  5. In the Configuration Review step, confirm the settings and click Submit.

Step 4: Create a service extension

Create a service extension and add a token rate limiting component. The component identifies requests by an HTTP header and applies rate limiting.

  1. Go to the Service Extensions page and click Create Service Extension.

  2. In the Service Extension Configuration section, enter an Extension name such as ext-token-rate-limit.

  3. Extension Type is set to Plug-in by default. From the Component name drop-down list, select Token Rate Limiting.

  4. Configure the rate limiting policy and click Create.

  • Throttling Condition: Select By HTTP Header. Enter x-account-id as the parameter. Set the match type to Wildcard Match.

  • Throttling Range: Select Total Tokens. Set the limit to 100 tokens per 1 minute.

  • Timeout and Processing Policy: This tutorial uses the default values, 1000 and Skip. Adjust as needed.

The x-account-id value is extracted from the HTTP request header and used as the rate limiting identifier. Token consumption is tracked separately for each unique value, and rate limits are calculated independently. You can use other HTTP header fields or adjust the condition type and match type as needed.

When a service extension contains multiple rate limiting policies, requests are matched from top to bottom. The first matching policy is applied. Subsequent policies are not evaluated.
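
The first-match behavior can be sketched as follows. The policy structure (header, pattern, limit fields) is hypothetical and stands in for the console settings; wildcard matching mirrors the Wildcard Match option:

```python
import fnmatch

# Sketch of first-match policy selection when a service extension holds
# multiple rate limiting policies. Policy fields are hypothetical.
def select_policy(headers, policies):
    """Return the first policy whose header condition matches, else None."""
    for policy in policies:  # evaluated top to bottom
        value = headers.get(policy["header"])
        if value is not None and fnmatch.fnmatch(value, policy["pattern"]):
            return policy    # later policies are not evaluated
    return None

policies = [
    {"header": "x-account-id", "pattern": "vip-*", "limit": 1000},
    {"header": "x-account-id", "pattern": "*",     "limit": 100},
]
assert select_policy({"x-account-id": "vip-7"}, policies)["limit"] == 1000
assert select_policy({"x-account-id": "12345"}, policies)["limit"] == 100
assert select_policy({}, policies) is None  # no header: no rate limiting
```

Ordering matters: if the catch-all "*" policy were listed first, the more specific "vip-*" policy would never apply.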

Step 5: Configure a forwarding rule

Create a forwarding rule for the listener. Add a domain name match condition and associate the service extension.

  1. In the ALB console, click the target instance ID to go to the instance details page. Click the Listener tab, click the target listener ID, and go to the Forwarding Rules tab.

  2. Click Add New Rule and configure the following settings, then click OK.

    • Add Condition: Select Domain Name. Set the match type to Exact Match. Enter the domain name used to access the ALB instance, such as ai.example.com.

    • Service Extension (Optional): Select Use Existing Service Extension and choose ext-token-rate-limit from the drop-down list.

    • Action: Select Forward to and choose the AI Service-type server group sgp-ai-qwen.

After the forwarding rule is created, requests with Host: ai.example.com in the HTTP header match this rule. The service extension checks the x-account-id header value against the rate limit. If the token consumption is within the threshold, the request is forwarded to the sgp-ai-qwen server group.

For production environments, you can use other types of forwarding conditions as needed.

Step 6: Configure DNS resolution

Create a CNAME record to point your custom domain name to the DNS name of the ALB instance. This allows clients to access the ALB instance through your domain name.

This tutorial uses Alibaba Cloud DNS as an example. If your domain name is not registered with Alibaba Cloud, add the domain name to the Alibaba Cloud DNS console first.

  1. In the ALB console, copy the Domain Name of the target instance.

  2. Go to the Alibaba Cloud DNS console. Find the target domain name, and click Settings in the Actions column. On the Settings tab, click Add Record.

  3. Configure the following settings and click OK.

    • Record Type: CNAME

    • Hostname: Enter the domain name prefix, such as ai. If your root domain name is example.com, the full domain name for accessing the ALB instance is ai.example.com. This must match the domain name configured in Step 5.

    • Query Source and TTL: Keep the default values.

    • Record Value: Enter the DNS name of the ALB instance.

  4. In the confirmation dialog box, review the DNS record and click OK.

Step 7: Verify the configuration

Use curl commands to send requests and verify the token rate limiting feature. Requests must meet the following conditions:

  • Domain name: Access the ALB instance through ai.example.com (the Host header value must match).

  • Rate limiting header: Include an x-account-id header. The rate limiting policy identifies requests by this header. Each unique header value has an independent token quota.

  • OpenAI-compatible protocol: The request path must be /v1/completions, /v1/chat/completions, or /v1/embeddings, and the request body must comply with the protocol format.

Replace ai.example.com in the following commands with the actual domain name configured in Step 6. Make sure the DNS record has taken effect before testing.

Test a normal request

Send a single request to verify that the service works:

curl -v \
  -H "x-account-id: id" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Who are you"
      }
    ]
  }' \
  https://ai.example.com/v1/chat/completions

A successful request returns an HTTP 200 status code and the AI service response:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! I am Qwen, a large language model..."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 53,
    "total_tokens": 67
  },
  "model": "qwen-turbo"
}

The usage.total_tokens field shows the number of tokens consumed by this request.
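
To track consumption on the client side, this field can be read from the response body. A minimal Python sketch, using the example response above:

```python
import json

# Extract the token count a request consumed from an OpenAI-compatible
# chat completion response (the same shape as the example above).
response_body = '''{
  "object": "chat.completion",
  "usage": {"prompt_tokens": 14, "completion_tokens": 53, "total_tokens": 67},
  "model": "qwen-turbo"
}'''

usage = json.loads(response_body).get("usage", {})
total = usage.get("total_tokens", 0)
print(total)  # 67
```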

Test rate limit triggering

Send multiple requests in quick succession to exceed the rate limit threshold (100 tokens per minute):

for i in {1..3}; do
  curl -v \
    -H "x-account-id: 12345" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen-turbo",
      "messages": [
        {
          "role": "user",
          "content": "Who are you"
        }
      ]
    }' \
    https://ai.example.com/v1/chat/completions
  sleep 2
done

When the cumulative token consumption exceeds 100, subsequent requests return an HTTP 429 status code.

  • Requests are rejected only after the quota is already exhausted. The last successful request may therefore consume tokens beyond the quota; only subsequent requests receive HTTP 429.

  • In streaming output scenarios, token usage is calculated when the last data chunk is returned. Even if a request exceeds the quota for the current window, the full response is returned without truncation. Rate limiting applies only to subsequent requests.

  • The following example shows a rate-limited response:

    • Response headers:

      HTTP/2 429
      x-tokenratelimit-reset: 52
      content-length: 17
      content-type: text/plain
      date: Wed, 21 Jan 2026 07:59:38 GMT
    • Response body:

      Too Many Requests

  • HTTP status code 429: The request was rejected because the token rate limit was exceeded.

  • x-tokenratelimit-reset: Seconds remaining until the rate limit counter resets (52 seconds in this example). After the counter resets, requests are processed normally.

  • Response body: Too Many Requests in plain text.
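
A client can use these fields to back off instead of retrying immediately. A minimal sketch, assuming only the status code and the x-tokenratelimit-reset header shown above (the helper name is hypothetical):

```python
# Client-side backoff that honors x-tokenratelimit-reset on HTTP 429.
def seconds_until_retry(status_code, headers):
    """Return how long to wait before retrying, or 0 if no wait is needed."""
    if status_code != 429:
        return 0
    reset = headers.get("x-tokenratelimit-reset")
    try:
        return max(0, int(reset))
    except (TypeError, ValueError):
        return 1  # header missing or malformed: fall back to a short wait

assert seconds_until_retry(200, {}) == 0
assert seconds_until_retry(429, {"x-tokenratelimit-reset": "52"}) == 52
assert seconds_until_retry(429, {}) == 1
```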

Billing

  • ALB Extensible Edition: Currently in public preview. No charges apply.

  • Internet access: Internet NAT gateways incur instance fees and capacity unit (CU) fees. EIPs associated with the ALB instance have separate billing rules.

  • Domain name and DNS resolution: Domain name registration fees from your provider, plus public authoritative DNS resolution fees from Alibaba Cloud.

  • Certificate: Server certificate fees if you purchase or upload certificates through Alibaba Cloud Certificate Management Service.

  • Model Studio: API call fees based on the model and token usage. For more information, see Alibaba Cloud Model Studio pricing.

Supported regions

For the list of regions that support ALB Extensible Edition, see Regions and zones where ALB Extensible Edition is available.

Apply in production

  • Set the rate limiting threshold: Monitor normal token consumption patterns first, then set a threshold slightly higher than the observed values. This prevents the policy from being too aggressive and blocking legitimate requests.

  • Select the time window:

    • For high-frequency scenarios, use a shorter time window for more precise control.

    • For low-frequency but high-consumption scenarios, use a longer time window to prevent individual requests from being mistakenly rate-limited.

    • The time window must be longer than the duration of a single model inference. Response-based rate limiting cannot function if the inference completes after the window expires.
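
The last point can be reduced to a sanity check when choosing a window. A trivial sketch (the helper name and inputs are hypothetical):

```python
# The rate limiting window must exceed the longest expected inference
# duration; otherwise the post-response token accounting lands in an
# already-expired window and cannot throttle subsequent requests.
def window_is_valid(window_seconds, max_inference_seconds):
    return window_seconds > max_inference_seconds

assert window_is_valid(60, 20)      # 1-minute window, 20 s inference: ok
assert not window_is_valid(10, 20)  # window shorter than inference: broken
```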

FAQ

How are requests handled if they do not match the rate limiting policy?

If a request does not include the x-account-id header, it matches the forwarding rule but not the rate limiting policy. The request is forwarded directly without rate limiting.

Each unique x-account-id value has its own independent token quota.

I configured token rate limiting, but requests are not being limited. What should I check?

  1. Forwarding rule: Verify that the forwarding condition matches the request, the rule has a high enough priority, and a service extension is associated.

  2. Service extension: Verify that the token rate limiting component is configured correctly, the rate limiting condition matches the request, and the rate limit values are appropriate.

  3. Request format: Verify that the request matches both the forwarding condition and the rate limiting condition.

  4. Token consumption: Check the usage.total_tokens field in the response. The rate limiting policy counts cumulative token consumption within the time window. A single request may not reach the threshold.

If the model inference duration exceeds the rate limiting time window, rate limiting cannot be enforced.