You can call the UpdateHttpApi API to configure or update the aiTokenRateLimitConfig (AI Token rate limiting) parameter in the deployConfigs of your Model API. This enables multi-dimensional rate limiting based on token consumption.
Scenarios
AI Token rate limiting uses token consumption as its core metric. Unlike traditional request-count-based rate limiting, it precisely reflects the actual compute resource usage by Large Language Models (LLMs). You can set rate limiting rules based on the consumer, request header, query parameter, cookie, client IP address, or model name. This feature also supports global (API-level) rate limiting.
High-concurrency scenarios: During e-commerce sales promotions, you can limit the total tokens consumed per user per time window to prevent malicious high-frequency calls.
AI service calls: You can apply rate limiting to LLM API calls to avoid service degradation caused by traffic spikes.
Multi-tenant systems: You can assign independent rate limiting quotas to different tenants to ensure fairness and resource isolation.
Fine-grained model-level control: You can set different rate limiting thresholds for different models to protect high-cost model resources.
API information
HTTP method: PUT
Action: UpdateHttpApi
Request parameter structure
The AI Token rate limiting configuration is located in the UpdateHttpApiRequest object, under deployConfigs[].policyConfigs[]. You can pass this configuration using the PolicyConfig structure.
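Expressed as a minimal sketch, the nesting looks like this. The gateway ID and field values below are placeholders; only the structure matters:

```python
import json

# Minimal sketch of where aiTokenRateLimitConfig sits inside an
# UpdateHttpApi request body. Values are placeholders, not defaults.
request_body = {
    "deployConfigs": [
        {
            "gatewayId": "gw-xxxxxxxxxxxxx",
            "policyConfigs": [
                {
                    "type": "AiTokenRateLimit",
                    "enable": True,
                    "aiTokenRateLimitConfig": {
                        "rules": [],        # standard (dimension/model) rules
                        "enableGlobalRules": False,
                        "globalRules": [],  # API-level rules
                    },
                }
            ],
        }
    ]
}

print(json.dumps(request_body, indent=2))
```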
PolicyConfig structure
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| type | String | Yes | Policy type. Set this parameter to AiTokenRateLimit. |
| enable | Boolean | Yes | Specifies whether to enable rate limiting. Set this parameter to true to enable rate limiting or false to disable it. |
| aiTokenRateLimitConfig | Object | Yes (if enable is true) | Details of the AI Token rate limiting configuration. |
AiTokenRateLimitConfig structure
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| rules | Array&lt;AiTokenRateLimitRule&gt; | Conditionally required | List of standard rate limiting rules (including dimension-based and model-based rules). Configure at least one of rules or globalRules. |
| enableGlobalRules | Boolean | No | Specifies whether to enable global (API-level) rate limiting rules. Default value: false. |
| globalRules | Array&lt;AiTokenRateLimitRule&gt; | Conditionally required | List of global rate limiting rules. These take effect only when enableGlobalRules is set to true. |
| redisConfig | Object | No | Redis configuration. Required only if you use an external Redis instance to store rate limiting keys. |
AiTokenRateLimitRule structure
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| limitType | String | Yes | Rate limiting dimension type. For valid values, see the limitType enumeration values below. |
| matchKey | String | Conditionally required | Matching key name, such as the header, parameter, or cookie name. Required for the Header, Parameter, and Cookie types. The system sets this parameter automatically for the IP and Model types. |
| matchType | String | Conditionally required | Matching pattern. For valid values, see the matchType enumeration values below. |
| matchValue | String | Conditionally required | Matching value, such as a header value, a CIDR block for the IP type, or a model name for the Model type. Set this parameter to * to match any value when matchType is All. |
| limitMode | String | Yes | Rate limiting mode. For valid values, see the limitMode enumeration values below. |
| limitValue | Integer | Yes | Rate limiting threshold. Must be greater than 0. |
limitType enumeration values
| Value | Description |
| --- | --- |
| Header | Rate limit by request header. |
| Parameter | Rate limit by query parameter. |
| Consumer | Rate limit by consumer (requires consumer authentication to be enabled first). |
| Cookie | Rate limit by cookie. |
| IP | Rate limit by client IP address. |
| Model | Rate limit by model name (set separate quotas for specific models). |
| Request | Rate limit by request count (for requests with a specific key). |
| Concurrency | Rate limit by concurrency (for concurrent requests with a specific key). |
| Global | Global rate limiting at the API level, with no key distinction. Used only in globalRules. |
matchType enumeration values
| Value | Description | Applicable limitType |
| --- | --- | --- |
| Exact | Exact match. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| Prefix | Prefix match. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| Regex | Regular expression matching. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| All | Match any value (set matchValue to *). | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| IP | IP matching. The system sets this value automatically; do not specify it manually. | IP |
Matching priority: Exact > Prefix > Regex > All. Rate limiting is triggered when any rule matches.
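As an illustration only, the priority order can be modeled in a few lines of Python. This sketch mimics the documented behavior for rules that share a key; it is not the gateway's implementation:

```python
import re

# Illustrative model of the documented matchType priority:
# Exact > Prefix > Regex > All. Among the rules that match a request
# value, the highest-priority rule determines the threshold.
PRIORITY = {"Exact": 0, "Prefix": 1, "Regex": 2, "All": 3}

def matches(rule, value):
    match_type, pattern = rule["matchType"], rule["matchValue"]
    if match_type == "Exact":
        return value == pattern
    if match_type == "Prefix":
        return value.startswith(pattern)
    if match_type == "Regex":
        return re.search(pattern, value) is not None
    return match_type == "All"  # "All" matches any value

def pick_rule(rules, value):
    hits = [r for r in rules if matches(r, value)]
    return min(hits, key=lambda r: PRIORITY[r["matchType"]]) if hits else None

rules = [
    {"matchType": "All",   "matchValue": "*",    "limitValue": 1000},
    {"matchType": "Exact", "matchValue": "beta", "limitValue": 100},
]
print(pick_rule(rules, "beta")["limitValue"])   # Exact beats All: 100
print(pick_rule(rules, "alpha")["limitValue"])  # falls back to All: 1000
```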
limitMode enumeration values
| Value | Description | Applicable limitType |
| --- | --- | --- |
| TokenPerSecond | Token-based rate limiting per second. | All types |
| TokenPerMinute | Token-based rate limiting per minute. | All types |
| TokenPerHour | Token-based rate limiting per hour. | All types |
| TokenPerDay | Token-based rate limiting per day. | All types |
| RequestPerSecond | Request-based rate limiting per second. | Model / Global / Request |
| RequestPerMinute | Request-based rate limiting per minute. | Model / Global / Request |
| RequestPerHour | Request-based rate limiting per hour. | Model / Global / Request |
| RequestPerDay | Request-based rate limiting per day. | Model / Global / Request |
| ConcurrencyLimit | Concurrency-based rate limiting. | Model / Global / Concurrency |
redisConfig structure
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| host | String | Yes | Redis endpoint. |
| port | Integer | Yes | Redis service port. |
| username | String | No | Redis username. |
| password | String | No | Redis password. |
| timeout | Integer | No | Connection timeout. |
| databaseNumber | Integer | No | Redis database number. |
AI Gateway instances use an internal key-value store by default, so you do not need to configure redisConfig. Configure redisConfig only if you use an external Redis instance to store rate limiting keys.
Configuration examples
Example 1: Rate limiting by consumer and IP address
Set consumer-based rate limiting for your Model API: Allow up to 1000 tokens per minute for any consumer.
Set IP-based rate limiting for your Model API: Allow up to 500 tokens per minute per client IP address.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 1000
},
{
"limitType": "IP",
"matchValue": "0.0.0.0/0",
"limitMode": "TokenPerMinute",
"limitValue": 500
}
]
}
}
]
}
]
}
For IP-type rules, the system automatically sets matchKey and matchType. Do not set them manually. Set matchValue to 0.0.0.0/0 to match all client IP addresses.
Example 2: Header-based exact-match rate limiting
Limit requests to 100 tokens per minute if the x-user-level header is set to beta.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Header",
"matchKey": "x-user-level",
"matchType": "Exact",
"matchValue": "beta",
"limitMode": "TokenPerMinute",
"limitValue": 100
}
]
}
}
]
}
]
}
Example 3: Model-based rate limiting
Apply different rate limiting thresholds for different models:
qwen-max: 500 tokens per minute.
qwen-plus: 2000 tokens per minute.
qwen-max: a maximum of 10 requests per minute.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Model",
"matchValue": "qwen-max",
"limitMode": "TokenPerMinute",
"limitValue": 500
},
{
"limitType": "Model",
"matchValue": "qwen-plus",
"limitMode": "TokenPerMinute",
"limitValue": 2000
},
{
"limitType": "Model",
"matchValue": "qwen-max",
"limitMode": "RequestPerMinute",
"limitValue": 10
}
]
}
}
]
}
]
}
For the Model type, the system automatically sets matchKey to x-higress-llm-model and matchType to Exact. You do not need to set these parameters manually. Set matchValue to the name of the target model.
Example 4: Enable global rate limiting
Enable API-level global rate limiting: Allow up to 10,000 tokens per minute, 100 requests per minute, and 20 concurrent requests for the entire API.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 1000
}
],
"enableGlobalRules": true,
"globalRules": [
{
"limitType": "Global",
"limitMode": "TokenPerMinute",
"limitValue": 10000
},
{
"limitType": "Global",
"limitMode": "RequestPerMinute",
"limitValue": 100
},
{
"limitType": "Global",
"limitMode": "ConcurrencyLimit",
"limitValue": 20
}
]
}
}
]
}
]
}
The limitType for global rules must be Global. Do not set the matchKey, matchType, or matchValue parameters, because the system clears them automatically. Global rate limiting supports the Token, Request, and Concurrency modes.
Example 5: Configure external Redis
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"gatewayType": "API",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 1000
}
],
"redisConfig": {
"host": "r-bp1xxxxxxxxxxxxx.redis.rds.aliyuncs.com",
"port": 6379,
"username": "",
"password": "your-redis-password",
"databaseNumber": 0
}
}
}
]
}
]
}
Example 6: Disable token rate limiting
Set enable to false to disable rate limiting. After rate limiting is disabled, the configured rules do not take effect, but their configurations are still stored.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": false,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 1000
}
]
}
}
]
}
]
}
Example 7: Update rate limiting rules
To update existing rate limiting rules, you must send the full policyConfigs configuration in your request. This example updates the consumer rate limiting threshold from 1000 to 2000.
PUT /v1/http-apis/{httpApiId}
{
"deployConfigs": [
{
"gatewayId": "gw-xxxxxxxxxxxxx",
"policyConfigs": [
{
"type": "AiTokenRateLimit",
"enable": true,
"aiTokenRateLimitConfig": {
"rules": [
{
"limitType": "Consumer",
"matchKey": "",
"matchType": "All",
"matchValue": "*",
"limitMode": "TokenPerMinute",
"limitValue": 2000
}
]
}
}
]
}
]
}
Configuration update rules
When you update the policy configuration using UpdateHttpApi, the system processes deployConfigs and policyConfigs according to the following rules:
Replace policyConfigs entirely: If the policyConfigs field is not empty, the system replaces all policy configurations for the specified gatewayId. When you update rate limiting rules, you must include other policy configurations, such as AiFallback or AiStatistics, in your request. Otherwise, those configurations are deleted.
Update rate limiting rules entirely: The aiTokenRateLimitConfig.rules and aiTokenRateLimitConfig.globalRules arrays are completely replaced with each update. You must send the complete list of rules in every request.
Before you update the rate limiting configuration, we recommend that you first call GetHttpApi to retrieve the current and complete deployConfigs and policyConfigs. Then, you can modify the rate limiting configurations and submit the update. This prevents you from accidentally overwriting other policy configurations.
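The read-modify-write flow above can be sketched as follows. The policyConfigs sample and the set_token_limit helper are hypothetical; in practice you would obtain the list from a GetHttpApi response and send the modified result back in UpdateHttpApi:

```python
import copy

def set_token_limit(policy_configs, new_limit):
    """Return a full policyConfigs list with the first token rule's
    limitValue updated. All other policies are preserved, because
    policyConfigs is replaced wholesale on each UpdateHttpApi call."""
    updated = copy.deepcopy(policy_configs)
    for policy in updated:
        if policy.get("type") == "AiTokenRateLimit":
            policy["aiTokenRateLimitConfig"]["rules"][0]["limitValue"] = new_limit
    return updated

# policyConfigs as returned by GetHttpApi (sample data for illustration):
current = [
    {"type": "AiFallback", "enable": True},  # unrelated policy to preserve
    {"type": "AiTokenRateLimit", "enable": True,
     "aiTokenRateLimitConfig": {"rules": [
         {"limitType": "Consumer", "matchKey": "", "matchType": "All",
          "matchValue": "*", "limitMode": "TokenPerMinute",
          "limitValue": 1000}]}},
]

# Send new_configs as deployConfigs[].policyConfigs in UpdateHttpApi.
new_configs = set_token_limit(current, 2000)
print(new_configs[1]["aiTokenRateLimitConfig"]["rules"][0]["limitValue"])  # 2000
```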
Configuration validation rules
When you submit the configuration, the system validates aiTokenRateLimitConfig based on the following rules:
| Validation item | Rule |
| --- | --- |
| Number of rules | Configure at least one of rules or globalRules. |
| Valid limitType and matchType combinations | The matchType must be valid for the rule's limitType. See the matchType enumeration values. |
| matchKey | Required for the Header, Parameter, and Cookie types. The system sets it automatically for the IP and Model types. |
| matchValue | Required for dimension-based rules. Set it to * to match any value when matchType is All. |
| limitValue | Must be an integer greater than 0. |
| limitType in globalRules | Must be Global. |
| limitType in rules | Cannot be Global. |
| redisConfig | Optional. If specified, host and port are required. |
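As a convenience, these checks can be mirrored client-side before you submit the request. The following is an illustrative pre-check only; the gateway's own validation is authoritative:

```python
# Client-side pre-check mirroring the documented validation rules.
# It only catches obvious mistakes before the request is sent.
def validate(config):
    errors = []
    rules = config.get("rules") or []
    global_rules = config.get("globalRules") or []
    if not rules and not global_rules:
        errors.append("configure at least one of rules or globalRules")
    for rule in rules:
        if rule.get("limitType") == "Global":
            errors.append("rules must not contain the Global type")
    for rule in global_rules:
        if rule.get("limitType") != "Global":
            errors.append("globalRules entries must use the Global type")
    for rule in rules + global_rules:
        if not isinstance(rule.get("limitValue"), int) or rule["limitValue"] <= 0:
            errors.append("limitValue must be an integer greater than 0")
    redis = config.get("redisConfig")
    if redis is not None and not (redis.get("host") and redis.get("port")):
        errors.append("redisConfig requires host and port")
    return errors

# A config that violates two rules: Global inside rules, limitValue of 0.
bad = {"rules": [{"limitType": "Global", "limitValue": 0}]}
print(validate(bad))
```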
FAQ
Q: Do I need to configure redisConfig for my AI Gateway instance?
A: No. AI Gateway instances use an internal key-value store by default, so you do not need to configure redisConfig. Configure redisConfig only if you want to use an external Redis instance to store rate limiting keys.
Q: What should I consider when configuring consumer-based rate limiting?
A: Before you configure rate limiting per consumer (by setting limitType to Consumer), you must first enable consumer authentication for the Model API. Otherwise, the rate limiting policy cannot identify consumers, and the rule will not take effect.
Q: How do multiple rules interact?
A: Multiple rules have an OR relationship, which means that rate limiting is triggered if any rule is hit. Rules with the same rate limiting dimension, that is, the same limitType and matchKey, are merged into a single rule group for execution.
Q: Can I use global rate limiting and standard rules together?
A: Yes, you can. Global rate limiting (globalRules) applies to the entire API without differentiating between keys, while standard rules (rules) apply to specific dimensions. You can use both types of rules simultaneously. Rate limiting is triggered if the conditions for either a global rule or a standard rule are met.
Q: How long does it take for updated rate limiting configurations to take effect?
A: After you update the configuration, the system pushes the new rules to the gateway data plane. They usually take effect within seconds.
Q: How accurate is rate limiting in a distributed architecture?
A: Because of the nature of distributed systems, rate limiting counters may experience minor drift. The actual number of allowed requests may vary slightly from the configured values, depending on the traffic volume, request rate, and backend latency.