You can call the UpdateHttpApi API to configure or update the aiTokenRateLimitConfig (AI Token rate limiting) parameter in the deployConfigs of your Model API. This enables multi-dimensional rate limiting based on token consumption.
Scenarios
AI Token rate limiting uses token consumption as its core metric. Unlike traditional request-count-based rate limiting, it precisely reflects the actual compute resource usage by Large Language Models (LLMs). You can set rate limiting rules based on the consumer, request header, query parameter, cookie, client IP address, or model name. This feature also supports global (API-level) rate limiting.
High-concurrency scenarios: During e-commerce sales promotions, you can limit the total tokens consumed per user per time window to prevent malicious high-frequency calls.
AI service calls: You can apply rate limiting to LLM API calls to avoid service degradation caused by traffic spikes.
Multi-tenant systems: You can assign independent rate limiting quotas to different tenants to ensure fairness and resource isolation.
Fine-grained model-level control: You can set different rate limiting thresholds for different models to protect high-cost model resources.
API information
HTTP method: PUT
Action: UpdateHttpApi
Request parameter structure
The AI Token rate limiting configuration is located in the UpdateHttpApiRequest object, under deployConfigs[].policyConfigs[]. You can pass this configuration using the PolicyConfig structure.
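Expressed as a minimal sketch, the nesting looks like this. The gateway ID and field values below are placeholders; only the structure matters:

```python
import json

# Minimal sketch of where aiTokenRateLimitConfig sits inside an
# UpdateHttpApi request body. Values are placeholders, not defaults.
request_body = {
    "deployConfigs": [
        {
            "gatewayId": "gw-xxxxxxxxxxxxx",
            "policyConfigs": [
                {
                    "type": "AiTokenRateLimit",
                    "enable": True,
                    "aiTokenRateLimitConfig": {
                        "rules": [],        # standard (dimension/model) rules
                        "enableGlobalRules": False,
                        "globalRules": [],  # API-level rules
                    },
                }
            ],
        }
    ]
}

print(json.dumps(request_body, indent=2))
```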
PolicyConfig structure
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| type | String | Yes | Policy type. Set this parameter to AiTokenRateLimit. |
| enable | Boolean | Yes | Specifies whether to enable rate limiting. Set this parameter to true to enable rate limiting or false to disable it. |
| aiTokenRateLimitConfig | Object | Yes (if enable is true) | Details of the AI Token rate limiting configuration. |
AiTokenRateLimitConfig structure
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| rules | Array&lt;AiTokenRateLimitRule&gt; | Conditionally required | List of standard rate limiting rules (including dimension-based and model-based rules). Configure at least one of rules or globalRules. |
| enableGlobalRules | Boolean | No | Specifies whether to enable global (API-level) rate limiting rules. Default value: false. |
| globalRules | Array&lt;AiTokenRateLimitRule&gt; | Conditionally required | List of global rate limiting rules. These take effect only when enableGlobalRules is set to true. |
| redisConfig | Object | No | Redis configuration. Required only if you use an external Redis instance to store rate limiting keys. |
AiTokenRateLimitRule structure
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| limitType | String | Yes | Rate limiting dimension type. For valid values, see the limitType enumeration values below. |
| matchKey | String | Conditionally required | Matching key name, such as the header, parameter, or cookie name. Required for the Header, Parameter, and Cookie types. The system sets this parameter automatically for the IP and Model types. |
| matchType | String | Conditionally required | Matching pattern. For valid values, see the matchType enumeration values below. |
| matchValue | String | Conditionally required | Matching value, such as a header value, a CIDR block for the IP type, or a model name for the Model type. Set this parameter to * to match any value when matchType is All. |
| limitMode | String | Yes | Rate limiting mode. For valid values, see the limitMode enumeration values below. |
| limitValue | Integer | Yes | Rate limiting threshold. Must be greater than 0. |
limitType enumeration values
| Value | Description |
| --- | --- |
| Header | Rate limit by request header. |
| Parameter | Rate limit by query parameter. |
| Consumer | Rate limit by consumer (requires consumer authentication to be enabled first). |
| Cookie | Rate limit by cookie. |
| IP | Rate limit by client IP address. |
| Model | Rate limit by model name (set separate quotas for specific models). |
| Request | Rate limit by request count (for requests with a specific key). |
| Concurrency | Rate limit by concurrency (for concurrent requests with a specific key). |
| Global | Global rate limiting at the API level, with no key distinction. Used only in globalRules. |
matchType enumeration values
| Value | Description | Applicable limitType |
| --- | --- | --- |
| Exact | Exact match. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| Prefix | Prefix match. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| Regex | Regular expression matching. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| All | Match any value (set matchValue to *). | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| IP | IP matching. The system sets this value automatically; do not specify it manually. | IP |
Matching priority: Exact > Prefix > Regex > All. Rate limiting is triggered when any rule matches.
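As an illustration only, the priority order can be modeled in a few lines of Python. This sketch mimics the documented behavior for rules that share a key; it is not the gateway's implementation:

```python
import re

# Illustrative model of the documented matchType priority:
# Exact > Prefix > Regex > All. Among the rules that match a request
# value, the highest-priority rule determines the threshold.
PRIORITY = {"Exact": 0, "Prefix": 1, "Regex": 2, "All": 3}

def matches(rule, value):
    match_type, pattern = rule["matchType"], rule["matchValue"]
    if match_type == "Exact":
        return value == pattern
    if match_type == "Prefix":
        return value.startswith(pattern)
    if match_type == "Regex":
        return re.search(pattern, value) is not None
    return match_type == "All"  # "All" matches any value

def pick_rule(rules, value):
    hits = [r for r in rules if matches(r, value)]
    return min(hits, key=lambda r: PRIORITY[r["matchType"]]) if hits else None

rules = [
    {"matchType": "All",   "matchValue": "*",    "limitValue": 1000},
    {"matchType": "Exact", "matchValue": "beta", "limitValue": 100},
]
print(pick_rule(rules, "beta")["limitValue"])   # Exact beats All: 100
print(pick_rule(rules, "alpha")["limitValue"])  # falls back to All: 1000
```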
limitMode enumeration values
| Value | Description | Applicable limitType |
| --- | --- | --- |
| TokenPerSecond | Token-based rate limiting per second. | All types |
| TokenPerMinute | Token-based rate limiting per minute. | All types |
| TokenPerHour | Token-based rate limiting per hour. | All types |
| TokenPerDay | Token-based rate limiting per day. | All types |
| RequestPerSecond | Request-based rate limiting per second. | Model / Global / Request |
| RequestPerMinute | Request-based rate limiting per minute. | Model / Global / Request |
| RequestPerHour | Request-based rate limiting per hour. | Model / Global / Request |
| RequestPerDay | Request-based rate limiting per day. | Model / Global / Request |
| ConcurrencyLimit | Concurrency-based rate limiting. | Model / Global / Concurrency |
redisConfig structure
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| host | String | Yes | Redis endpoint. |
| port | Integer | Yes | Redis service port. |
| username | String | No | Redis username. |
| password | String | No | Redis password. |
| timeout | Integer | No | Connection timeout. |
| databaseNumber | Integer | No | Redis database number. |
AI Gateway instances use an internal key-value store by default, so you do not need to configure redisConfig. Configure redisConfig only if you use an external Redis instance to store rate limiting keys.
Configuration examples
Example 1: Rate limiting by consumer and IP address
Set consumer-based rate limiting for your Model API: Allow up to 1000 tokens per minute for any consumer.
Set IP-based rate limiting for your Model API: Allow up to 500 tokens per minute per client IP address.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 1000
},
{
"limitType": "IP",
"matchValue": "0.0.0.0/0",
"limitMode": "TokenPerMinute",
"limitValue": 500
}
]
}
}
]
}
]
}
For IP-type rules, the system automatically sets matchKey and matchType. Do not set them manually. Set matchValue to 0.0.0.0/0 to match all client IP addresses.
Example 2: Header-based exact-match rate limiting
Limit requests to 100 tokens per minute if the x-user-level header is set to beta.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Header",
"matchKey": "x-user-level",
"matchType": "Exact",
"matchValue": "beta",
"limitMode": "TokenPerMinute",
"limitValue": 100
}
]
}
}
]
}
]
}
Example 3: Model-based rate limiting
Apply different rate limiting thresholds for different models:
qwen-max: 500 tokens per minute.
qwen-plus: 2000 tokens per minute.
qwen-max: a maximum of 10 requests per minute.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Model",
"matchValue": "qwen-max",
"limitMode": "TokenPerMinute",
"limitValue": 500
},
{
"limitType": "Model",
"matchValue": "qwen-plus",
"limitMode": "TokenPerMinute",
"limitValue": 2000
},
{
"limitType": "Model",
"matchValue": "qwen-max",
"limitMode": "RequestPerMinute",
"limitValue": 10
}
]
}
}
]
}
]
}
For the Model type, the system automatically sets matchKey to x-higress-llm-model and matchType to Exact. You do not need to set these parameters manually. Set matchValue to the name of the target model.
Example 4: Enable global rate limiting
Enable API-level global rate limiting: Allow up to 10,000 tokens per minute, 100 requests per minute, and 20 concurrent requests for the entire API.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 1000
}
],
"enableGlobalRules": true,
"globalRules": [
{
"limitType": "Global",
"limitMode": "TokenPerMinute",
"limitValue": 10000
},
{
"limitType": "Global",
"limitMode": "RequestPerMinute",
"limitValue": 100
},
{
"limitType": "Global",
"limitMode": "ConcurrencyLimit",
"limitValue": 20
}
]
}
}
]
}
]
}
The limitType for global rules must be Global. Do not set the matchKey, matchType, or matchValue parameters, because the system clears them automatically. Global rate limiting supports the Token, Request, and Concurrency modes.
Example 5: Configure external Redis
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"gatewayType": "API",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 1000
}
],
"redisConfig": {
"host": "r-bp1xxxxxxxxxxxxx.redis.rds.aliyuncs.com",
"port": 6379,
"username": "",
"password": "your-redis-password",
"databaseNumber": 0
}
}
}
]
}
]
}
Example 6: Disable token rate limiting
Set enable to false to disable rate limiting. After rate limiting is disabled, the configured rules do not take effect, but their configurations are still stored.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": false,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 1000
}
]
}
}
]
}
]
}
Example 7: Update rate limiting rules
To update existing rate limiting rules, you must send the full policyConfigs configuration in your request. This example updates the consumer rate limiting threshold from 1000 to 2000.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 2000
}
]
}
}
]
}
]
}
Configuration update rules
When you update the policy configuration using UpdateHttpApi, the system processes deployConfigs and policyConfigs according to the following rules:
Replace policyConfigs entirely: If the policyConfigs field is not empty, the system replaces all policy configurations for the specified gatewayId. When you update rate limiting rules, you must include other policy configurations, such as AiFallback or AiStatistics, in your request. Otherwise, those configurations are deleted.
Update rate limiting rules entirely: The aiTokenRateLimitConfig.rules and aiTokenRateLimitConfig.globalRules arrays are completely replaced with each update. You must send the complete list of rules in every request.
Before you update the rate limiting configuration, we recommend that you first call GetHttpApi to retrieve the current and complete deployConfigs and policyConfigs. Then, you can modify the rate limiting configurations and submit the update. This prevents you from accidentally overwriting other policy configurations.
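The read-modify-write flow above can be sketched as follows. The policyConfigs sample and the set_token_limit helper are hypothetical; in practice you would obtain the list from a GetHttpApi response and send the modified result back in UpdateHttpApi:

```python
import copy

def set_token_limit(policy_configs, new_limit):
    """Return a full policyConfigs list with the first token rule's
    limitValue updated. All other policies are preserved, because
    policyConfigs is replaced wholesale on each UpdateHttpApi call."""
    updated = copy.deepcopy(policy_configs)
    for policy in updated:
        if policy.get("type") == "AiTokenRateLimit":
            policy["aiTokenRateLimitConfig"]["rules"][0]["limitValue"] = new_limit
    return updated

# policyConfigs as returned by GetHttpApi (sample data for illustration):
current = [
    {"type": "AiFallback", "enable": True},  # unrelated policy to preserve
    {"type": "AiTokenRateLimit", "enable": True,
     "aiTokenRateLimitConfig": {"rules": [
         {"limitType": "Consumer", "matchKey": "", "matchType": "All",
          "matchValue": "*", "limitMode": "TokenPerMinute",
          "limitValue": 1000}]}},
]

# Send new_configs as deployConfigs[].policyConfigs in UpdateHttpApi.
new_configs = set_token_limit(current, 2000)
print(new_configs[1]["aiTokenRateLimitConfig"]["rules"][0]["limitValue"])  # 2000
```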
Configuration validation rules
When you submit the configuration, the system validates aiTokenRateLimitConfig based on the following rules:
| Validation item | Rule |
| --- | --- |
| Number of rules | Configure at least one of rules or globalRules. |
| Valid limitType and matchType combinations | The matchType must be valid for the rule's limitType. See the matchType enumeration values. |
| matchKey | Required for the Header, Parameter, and Cookie types. The system sets it automatically for the IP and Model types. |
| matchValue | Required for dimension-based rules. Set it to * to match any value when matchType is All. |
| limitValue | Must be an integer greater than 0. |
| limitType in globalRules | Must be Global. |
| limitType in rules | Cannot be Global. |
| redisConfig | Optional. If specified, host and port are required. |
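As a convenience, these checks can be mirrored client-side before you submit the request. The following is an illustrative pre-check only; the gateway's own validation is authoritative:

```python
# Client-side pre-check mirroring the documented validation rules.
# It only catches obvious mistakes before the request is sent.
def validate(config):
    errors = []
    rules = config.get("rules") or []
    global_rules = config.get("globalRules") or []
    if not rules and not global_rules:
        errors.append("configure at least one of rules or globalRules")
    for rule in rules:
        if rule.get("limitType") == "Global":
            errors.append("rules must not contain the Global type")
    for rule in global_rules:
        if rule.get("limitType") != "Global":
            errors.append("globalRules entries must use the Global type")
    for rule in rules + global_rules:
        if not isinstance(rule.get("limitValue"), int) or rule["limitValue"] <= 0:
            errors.append("limitValue must be an integer greater than 0")
    redis = config.get("redisConfig")
    if redis is not None and not (redis.get("host") and redis.get("port")):
        errors.append("redisConfig requires host and port")
    return errors

# A config that violates two rules: Global inside rules, limitValue of 0.
bad = {"rules": [{"limitType": "Global", "limitValue": 0}]}
print(validate(bad))
```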
FAQ
Q: Do I need to configure redisConfig for my AI Gateway instance?
A: No. AI Gateway instances use an internal key-value store by default, so you do not need to configure redisConfig. Configure redisConfig only if you want to use an external Redis instance to store rate limiting keys.
Q: What should I consider when configuring consumer-based rate limiting?
A: Before you configure rate limiting per consumer (by setting limitType to Consumer), you must first enable consumer authentication for the Model API. Otherwise, the rate limiting policy cannot identify consumers, and the rule will not take effect.
Q: How do multiple rules interact?
A: Multiple rules have an OR relationship, which means that rate limiting is triggered if any rule is hit. Rules with the same rate limiting dimension, that is, the same limitType and matchKey, are merged into a single rule group for execution.
Q: Can I use global rate limiting and standard rules together?
A: Yes, you can. Global rate limiting (globalRules) applies to the entire API without differentiating between keys, while standard rules (rules) apply to specific dimensions. You can use both types of rules simultaneously. Rate limiting is triggered if the conditions for either a global rule or a standard rule are met.
Q: How long does it take for updated rate limiting configurations to take effect?
A: After you update the configuration, the system pushes the new rules to the gateway data plane. They usually take effect within seconds.
Q: How accurate is rate limiting in a distributed architecture?
A: Because of the nature of distributed systems, rate limiting counters may experience minor drift. The actual number of allowed requests may vary slightly from the configured values, depending on the traffic volume, request rate, and backend latency.