API Gateway: Configure Model API rate limiting

Last Updated: Feb 12, 2026

You can call the UpdateHttpApi API to configure or update the aiTokenRateLimitConfig (AI Token rate limiting) parameter in the deployConfigs of your Model API. This enables multi-dimensional rate limiting based on token consumption.

Scenarios

AI Token rate limiting uses token consumption as its core metric. Unlike traditional rate limiting based on request counts, it tracks the compute resources that Large Language Models (LLMs) actually consume. You can set rate limiting rules based on the consumer, request header, query parameter, cookie, client IP address, or model name. This feature also supports global (API-level) rate limiting.

  • High-concurrency scenarios: During e-commerce sales promotions, you can limit the total tokens consumed per user per time window to prevent malicious high-frequency calls.

  • AI service calls: You can apply rate limiting to LLM API calls to avoid service degradation caused by traffic spikes.

  • Multi-tenant systems: You can assign independent rate limiting quotas to different tenants to ensure fairness and resource isolation.

  • Fine-grained model-level control: You can set different rate limiting thresholds for different models to protect high-cost model resources.
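
To make the difference from request counting concrete, here is a toy Python sketch of a fixed-window limiter that charges each call by its token count. It is only an illustration: the class name, the fixed-window scheme, and the explicit now argument are assumptions, not a description of how the gateway counts tokens internally.

```python
from dataclasses import dataclass, field


@dataclass
class FixedWindowTokenLimiter:
    """Toy per-key limiter that counts tokens, not requests (hypothetical helper)."""
    limit_tokens: int
    window_seconds: int = 60
    _usage: dict = field(default_factory=dict)  # (key, window index) -> tokens used

    def allow(self, key: str, tokens: int, now: float) -> bool:
        # Requests in the same fixed window share one token budget per key.
        window = int(now // self.window_seconds)
        used = self._usage.get((key, window), 0)
        if used + tokens > self.limit_tokens:
            return False
        self._usage[(key, window)] = used + tokens
        return True
```

With a 1000-token budget, two 600-token calls in the same minute exceed the limit even though only two requests were made, which a request-count limit would not detect.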

API information

  • HTTP method: PUT

  • Action: UpdateHttpApi

Request parameter structure

The AI Token rate limiting configuration is located in the UpdateHttpApiRequest object, under deployConfigs[].policyConfigs[]. You can pass this configuration using the PolicyConfig structure.

PolicyConfig structure

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| type | String | Yes | Policy type. Set this to "AiTokenRateLimit". |
| enable | Boolean | Yes | Whether to enable rate limiting. Set to true to enable or false to disable. |
| aiTokenRateLimitConfig | Object | Yes (if enable is true) | Details of the AI Token rate limiting configuration. |

AiTokenRateLimitConfig structure

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| rules | Array<AiTokenRateLimitRule> | Conditionally required | List of standard rate limiting rules (including dimension-based and model-based rules). Configure at least one of rules or globalRules. |
| enableGlobalRules | Boolean | No | Whether to enable global (API-level) rate limiting rules. Default is false. |
| globalRules | Array<AiTokenRateLimitRule> | Conditionally required | List of global rate limiting rules. These take effect only when enableGlobalRules is true. The limitType of each rule must be "Global". |
| redisConfig | Object | No | Redis configuration. AI Gateway instances use an internal key-value store by default, so do not configure redisConfig. Configure it only to use an external Redis instance to store rate limiting keys. |
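
The two structures above can be assembled programmatically. The following Python sketch builds a PolicyConfig entry and wraps it in a request body; the helper names are hypothetical, the field names come from the tables above, and the choice to set enableGlobalRules whenever global rules are passed is an assumption of this sketch.

```python
import json


def build_token_rate_limit_policy(rules=None, global_rules=None, enable=True):
    """Build one PolicyConfig entry for AI Token rate limiting (hypothetical helper)."""
    rules = list(rules or [])
    global_rules = list(global_rules or [])
    if not rules and not global_rules:
        raise ValueError("configure at least one of rules or globalRules")
    config = {}
    if rules:
        config["rules"] = rules
    if global_rules:
        # Assumption: global rules are only useful when the switch is on.
        config["enableGlobalRules"] = True
        config["globalRules"] = global_rules
    return {"type": "AiTokenRateLimit", "enable": enable, "aiTokenRateLimitConfig": config}


def build_request_body(gateway_id, policy):
    """Wrap a single policy in the deployConfigs[].policyConfigs[] layout."""
    return {"deployConfigs": [{"gatewayId": gateway_id, "policyConfigs": [policy]}]}


if __name__ == "__main__":
    rule = {"limitType": "Consumer", "matchKey": "", "matchType": "All",
            "matchValue": "*", "limitMode": "TokenPerMinute", "limitValue": 1000}
    body = build_request_body("gw-xxxxxxxxxxxxx", build_token_rate_limit_policy([rule]))
    print(json.dumps(body, indent=2))
```

Because this body contains only one policy, sending it as-is would drop any other policies on the same gateway; merge it with the existing policyConfigs first.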

AiTokenRateLimitRule structure

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| limitType | String | Yes | Rate limiting dimension type. |
| matchKey | String | Conditionally required | Matching key name. For Consumer, this field may be empty. For IP, the system sets this to from-remote-addr. For Model, the system sets this to x-higress-llm-model. For Global, the system clears this field. |
| matchType | String | Conditionally required | Matching pattern. For IP, the system sets this to IP. For Model, the system sets this to Exact. For Global, the system clears this field. |
| matchValue | String | Conditionally required | Matching value. In All mode, the system sets this to *. For IP, provide a valid IP address or CIDR block. For Model, provide the model name. For Global, the system clears this field. |
| limitMode | String | Yes | Rate limiting mode. |
| limitValue | Integer | Yes | Rate limiting threshold. Must be greater than 0. |
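
The automatic field handling described for matchKey, matchType, and matchValue can be mirrored on the client to preview what a stored rule is likely to look like. This Python sketch is an approximation under stated assumptions: normalize_rule is a hypothetical helper, and "clears this field" is modeled as removing the key, which may differ from what the service actually returns.

```python
def normalize_rule(rule: dict) -> dict:
    """Return a copy of the rule with the documented automatic values applied."""
    out = dict(rule)
    limit_type = out.get("limitType")
    if limit_type == "IP":
        out["matchKey"] = "from-remote-addr"
        out["matchType"] = "IP"
    elif limit_type == "Model":
        out["matchKey"] = "x-higress-llm-model"
        out["matchType"] = "Exact"
    elif limit_type == "Global":
        # Assumption: a cleared field is represented by its absence.
        for name in ("matchKey", "matchType", "matchValue"):
            out.pop(name, None)
    elif out.get("matchType") == "All":
        out["matchValue"] = "*"
    return out
```

The input dictionary is left unchanged, so the original and normalized rules can be compared side by side.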

limitType enumeration values

| Value | Description |
| --- | --- |
| Header | Rate limit by request header. |
| Parameter | Rate limit by query parameter. |
| Consumer | Rate limit by consumer (requires consumer authentication to be enabled first). |
| Cookie | Rate limit by cookie. |
| IP | Rate limit by client IP address. |
| Model | Rate limit by model name (set separate quotas for specific models). |
| Request | Rate limit by request count (for requests with a specific key). |
| Concurrency | Rate limit by concurrency (for concurrent requests with a specific key). |
| Global | Global rate limiting at the API level, with no key distinction. Used only in globalRules. |

matchType enumeration values

| Value | Description | Applicable limitType |
| --- | --- | --- |
| Exact | Exact match. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| Prefix | Prefix match. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| Regex | Regular expression matching. | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| All | Match any value (* is set automatically). | Header / Parameter / Consumer / Cookie / Request / Concurrency |
| IP | IP matching. Set automatically; do not specify manually. | IP |

Note

Matching priority: Exact > Prefix > Regex > All. Rate limiting is triggered when any rule matches.
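
One way to read the priority order is that, when several rules on the same key match a value, the most specific match type wins. The Python sketch below illustrates that reading only. The helper names are hypothetical, and the use of re.search for Regex (unanchored matching) is an assumption, since the source does not say how the gateway evaluates regular expressions.

```python
import re

# Lower number means higher priority: Exact > Prefix > Regex > All.
PRIORITY = {"Exact": 0, "Prefix": 1, "Regex": 2, "All": 3}


def rule_matches(rule: dict, value: str) -> bool:
    """Check one rule's matchType/matchValue against a concrete value."""
    match_type = rule["matchType"]
    pattern = rule.get("matchValue", "")
    if match_type == "Exact":
        return value == pattern
    if match_type == "Prefix":
        return value.startswith(pattern)
    if match_type == "Regex":
        return re.search(pattern, value) is not None  # assumption: unanchored
    return match_type == "All"


def highest_priority_match(rules: list, value: str):
    """Return the matching rule with the highest priority, or None."""
    candidates = [r for r in rules if rule_matches(r, value)]
    return min(candidates, key=lambda r: PRIORITY[r["matchType"]], default=None)
```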

limitMode enumeration values

| Value | Description | Applicable limitType |
| --- | --- | --- |
| TokenPerSecond | Token-based rate limiting per second. | All types |
| TokenPerMinute | Token-based rate limiting per minute. | All types |
| TokenPerHour | Token-based rate limiting per hour. | All types |
| TokenPerDay | Token-based rate limiting per day. | All types |
| RequestPerSecond | Request-based rate limiting per second. | Model / Global / Request |
| RequestPerMinute | Request-based rate limiting per minute. | Model / Global / Request |
| RequestPerHour | Request-based rate limiting per hour. | Model / Global / Request |
| RequestPerDay | Request-based rate limiting per day. | Model / Global / Request |
| ConcurrencyLimit | Concurrency-based rate limiting. | Model / Global / Concurrency |
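
The applicability column can be encoded as a small lookup so that an invalid limitType and limitMode pair is caught before a request is sent. This Python sketch is a client-side convenience based only on the table above; the function name is hypothetical and the service's own validation remains authoritative.

```python
TOKEN_MODES = {"TokenPerSecond", "TokenPerMinute", "TokenPerHour", "TokenPerDay"}
REQUEST_MODES = {"RequestPerSecond", "RequestPerMinute", "RequestPerHour", "RequestPerDay"}
LIMIT_TYPES = {"Header", "Parameter", "Consumer", "Cookie", "IP",
               "Model", "Request", "Concurrency", "Global"}


def mode_allowed(limit_type: str, limit_mode: str) -> bool:
    """Return True if the table lists limit_mode as applicable to limit_type."""
    if limit_type not in LIMIT_TYPES:
        return False
    if limit_mode in TOKEN_MODES:
        return True  # token modes apply to all types
    if limit_mode in REQUEST_MODES:
        return limit_type in {"Model", "Global", "Request"}
    if limit_mode == "ConcurrencyLimit":
        return limit_type in {"Model", "Global", "Concurrency"}
    return False
```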

redisConfig structure

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| host | String | Yes | Redis endpoint. |
| port | Integer | Yes | Redis service port. |
| username | String | No | Redis username. |
| password | String | No | Redis password. |
| timeout | Integer | No | Connection timeout. |
| databaseNumber | Integer | No | Redis database number. |

Important
  • AI Gateway instances use an internal key-value store by default. You do not need to configure redisConfig.

  • You only need to configure redisConfig to use an external Redis instance to store rate limiting keys.

Configuration examples

Example 1: Rate limiting by consumer and IP address

  • Set consumer-based rate limiting for your Model API: Allow up to 1000 tokens per minute for any consumer.

  • Set IP-based rate limiting for your Model API: Allow up to 500 tokens per minute per client IP address.

PUT /v1/http-apis/{httpApiId}

{
  "deployConfigs": [
    {
      "gatewayId": "gw-xxxxxxxxxxxxx",
      "policyConfigs": [
        {
          "type": "AiTokenRateLimit",
          "enable": true,
          "aiTokenRateLimitConfig": {
            "rules": [
              {
                "limitType": "Consumer",
                "matchKey": "",
                "matchType": "All",
                "matchValue": "*",
                "limitMode": "TokenPerMinute",
                "limitValue": 1000
              },
              {
                "limitType": "IP",
                "matchValue": "0.0.0.0/0",
                "limitMode": "TokenPerMinute",
                "limitValue": 500
              }
            ]
          }
        }
      ]
    }
  ]
}
Note
  • For the IP type, the system automatically sets matchKey and matchType. Do not set them manually.

  • Set matchValue to 0.0.0.0/0 to match all client IP addresses.
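
Before submitting an IP rule, the matchValue can be checked locally with Python's standard ipaddress module. This is a rough pre-check under assumptions: the function name is hypothetical, and the gateway's own rules (for example, whether it accepts IPv6 or addresses with host bits set) are not stated in the source and may be stricter.

```python
import ipaddress


def is_valid_ip_match_value(value: str) -> bool:
    """Return True if value parses as a single IP address or a CIDR block."""
    try:
        # strict=False also accepts CIDR strings with host bits set, e.g. 10.0.0.1/24.
        ipaddress.ip_network(value, strict=False)
        return True
    except ValueError:
        return False


def covers(cidr: str, client_ip: str) -> bool:
    """Return True if client_ip falls inside the given address or CIDR block."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(cidr, strict=False)
```

For example, covers("0.0.0.0/0", "203.0.113.7") is true, which is why 0.0.0.0/0 matches every IPv4 client.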

Example 2: Header-based exact-match rate limiting

Limit requests whose x-user-level header equals beta to 100 tokens per minute.

PUT /v1/http-apis/{httpApiId}

{
  "deployConfigs": [
    {
      "gatewayId": "gw-xxxxxxxxxxxxx",
      "policyConfigs": [
        {
          "type": "AiTokenRateLimit",
          "enable": true,
          "aiTokenRateLimitConfig": {
            "rules": [
              {
                "limitType": "Header",
                "matchKey": "x-user-level",
                "matchType": "Exact",
                "matchValue": "beta",
                "limitMode": "TokenPerMinute",
                "limitValue": 100
              }
            ]
          }
        }
      ]
    }
  ]
}

Example 3: Model-based rate limiting

Apply different rate limiting thresholds to different models:

  • qwen-max: up to 500 tokens per minute and up to 10 requests per minute.

  • qwen-plus: up to 2000 tokens per minute.

PUT /v1/http-apis/{httpApiId}

{
  "deployConfigs": [
    {
      "gatewayId": "gw-xxxxxxxxxxxxx",
      "policyConfigs": [
        {
          "type": "AiTokenRateLimit",
          "enable": true,
          "aiTokenRateLimitConfig": {
            "rules": [
              {
                "limitType": "Model",
                "matchValue": "qwen-max",
                "limitMode": "TokenPerMinute",
                "limitValue": 500
              },
              {
                "limitType": "Model",
                "matchValue": "qwen-plus",
                "limitMode": "TokenPerMinute",
                "limitValue": 2000
              },
              {
                "limitType": "Model",
                "matchValue": "qwen-max",
                "limitMode": "RequestPerMinute",
                "limitValue": 10
              }
            ]
          }
        }
      ]
    }
  ]
}
Note
  • For the Model type, the system automatically sets matchKey to x-higress-llm-model and matchType to Exact. You do not need to set these parameters manually.

  • Set matchValue to the name of the target model.

Example 4: Enable global rate limiting

Enable API-level global rate limiting: Allow up to 10,000 tokens per minute, 100 requests per minute, and 20 concurrent requests for the entire API.

PUT /v1/http-apis/{httpApiId}

{
  "deployConfigs": [
    {
      "gatewayId": "gw-xxxxxxxxxxxxx",
      "policyConfigs": [
        {
          "type": "AiTokenRateLimit",
          "enable": true,
          "aiTokenRateLimitConfig": {
            "rules": [
              {
                "limitType": "Consumer",
                "matchKey": "",
                "matchType": "All",
                "matchValue": "*",
                "limitMode": "TokenPerMinute",
                "limitValue": 1000
              }
            ],
            "enableGlobalRules": true,
            "globalRules": [
              {
                "limitType": "Global",
                "limitMode": "TokenPerMinute",
                "limitValue": 10000
              },
              {
                "limitType": "Global",
                "limitMode": "RequestPerMinute",
                "limitValue": 100
              },
              {
                "limitType": "Global",
                "limitMode": "ConcurrencyLimit",
                "limitValue": 20
              }
            ]
          }
        }
      ]
    }
  ]
}
Note
  • The limitType for global rules must be "Global". Do not set the matchKey, matchType, or matchValue parameters because the system clears them automatically.

  • Global rate limiting supports the Token, Request, and Concurrency modes.

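Example 5: Configure external Redis

Add redisConfig to store rate limiting keys in an external Redis instance instead of the default internal key-value store.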

PUT /v1/http-apis/{httpApiId}

{
  "deployConfigs": [
    {
      "gatewayId": "gw-xxxxxxxxxxxxx",
      "gatewayType": "API",
      "policyConfigs": [
        {
          "type": "AiTokenRateLimit",
          "enable": true,
          "aiTokenRateLimitConfig": {
            "rules": [
              {
                "limitType": "Consumer",
                "matchKey": "",
                "matchType": "All",
                "matchValue": "*",
                "limitMode": "TokenPerMinute",
                "limitValue": 1000
              }
            ],
            "redisConfig": {
              "host": "r-bp1xxxxxxxxxxxxx.redis.rds.aliyuncs.com",
              "port": 6379,
              "username": "",
              "password": "your-redis-password",
              "databaseNumber": 0
            }
          }
        }
      ]
    }
  ]
}
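
When the request body is generated by a script, the Redis password does not have to be hardcoded. The Python sketch below builds a redisConfig object and reads the password from an environment variable. The helper name and the REDIS_PASSWORD variable name are assumptions made for illustration; only the field names come from the redisConfig structure.

```python
import os


def build_redis_config(host, port=6379, username=None, database_number=None,
                       password_env="REDIS_PASSWORD"):
    """Build a redisConfig dict, omitting optional fields that are not set."""
    config = {"host": host, "port": port}
    if username:
        config["username"] = username
    # The password is read at call time from the environment (hypothetical variable name).
    password = os.environ.get(password_env)
    if password:
        config["password"] = password
    if database_number is not None:
        config["databaseNumber"] = database_number
    return config
```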

Example 6: Disable token rate limiting

Set enable to false to disable rate limiting. After rate limiting is disabled, the configured rules do not take effect, but their configurations are still stored.

PUT /v1/http-apis/{httpApiId}

{
  "deployConfigs": [
    {
      "gatewayId": "gw-xxxxxxxxxxxxx",
      "policyConfigs": [
        {
          "type": "AiTokenRateLimit",
          "enable": false,
          "aiTokenRateLimitConfig": {
            "rules": [
              {
                "limitType": "Consumer",
                "matchKey": "",
                "matchType": "All",
                "matchValue": "*",
                "limitMode": "TokenPerMinute",
                "limitValue": 1000
              }
            ]
          }
        }
      ]
    }
  ]
}

Example 7: Update rate limiting rules

To update existing rate limiting rules, you must send the full policyConfigs configuration in your request. This example updates the consumer rate limiting threshold from 1000 to 2000.

PUT /v1/http-apis/{httpApiId}

{
  "deployConfigs": [
    {
      "gatewayId": "gw-xxxxxxxxxxxxx",
      "policyConfigs": [
        {
          "type": "AiTokenRateLimit",
          "enable": true,
          "aiTokenRateLimitConfig": {
            "rules": [
              {
                "limitType": "Consumer",
                "matchKey": "",
                "matchType": "All",
                "matchValue": "*",
                "limitMode": "TokenPerMinute",
                "limitValue": 2000
              }
            ]
          }
        }
      ]
    }
  ]
}

Configuration update rules

When you update the policy configuration using UpdateHttpApi, the system processes deployConfigs and policyConfigs according to the following rules:

  1. Replace policyConfigs entirely: If the policyConfigs field is not empty, the system replaces all policy configurations for the specified gatewayId. When you update rate limiting rules, you must include other policy configurations, such as AiFallback or AiStatistics, in your request. Otherwise, those configurations will be deleted.

  2. Update rate limiting rules entirely: The aiTokenRateLimitConfig.rules and aiTokenRateLimitConfig.globalRules arrays are completely replaced with each update. You must send the complete list of rules in every request.

Important

Before you update the rate limiting configuration, we recommend that you first call GetHttpApi to retrieve the current and complete deployConfigs and policyConfigs. Then, you can modify the rate limiting configurations and submit the update. This prevents you from accidentally overwriting other policy configurations.
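
The read-modify-write flow recommended above can be sketched as a pure function that takes the policyConfigs list returned by GetHttpApi and swaps in the new rate limiting policy while leaving every other policy untouched. The function name is hypothetical, and the shape of the AiFallback and AiStatistics entries in the usage note is a placeholder, since their fields are not described in this document.

```python
import copy


def upsert_policy(current_policy_configs: list, new_policy: dict) -> list:
    """Return a new policyConfigs list with new_policy replacing the entry of the same type."""
    merged = []
    replaced = False
    for policy in current_policy_configs:
        if policy.get("type") == new_policy["type"]:
            if not replaced:  # keep a single entry per policy type
                merged.append(copy.deepcopy(new_policy))
                replaced = True
        else:
            merged.append(copy.deepcopy(policy))
    if not replaced:
        merged.append(copy.deepcopy(new_policy))
    return merged
```

For instance, given a current list of [{"type": "AiFallback", ...}, {"type": "AiTokenRateLimit", ...}], the result keeps the AiFallback entry as-is and replaces only the AiTokenRateLimit entry, so the full list can be sent back in the UpdateHttpApi request.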

Configuration validation rules

When you submit the configuration, the system validates aiTokenRateLimitConfig based on the following rules:

| Validation item | Rule |
| --- | --- |
| Number of rules | Configure at least one of rules or globalRules. |
| Valid limitType and matchType combinations | Header / Parameter / Consumer / Cookie / Request / Concurrency support Exact / Prefix / Regex / All. IP supports only IP mode. Model is set to Exact automatically. Global does not require matchType. |
| matchKey | Required for Header / Parameter / Cookie / Request / Concurrency. Optional for Consumer. Set automatically for IP / Model / Global. |
| matchValue | Required for Exact / Prefix / Regex modes. Set to * automatically in All mode. For IP, provide a valid IP address or CIDR block. |
| limitValue | Must be greater than 0. For Model type, limitValue must not exceed the maximum Int32 value. |
| limitType in globalRules | Must be "Global". |
| limitType in rules | Cannot be "Global" (Global is allowed only in globalRules). |
| redisConfig | AI Gateway instances use an internal key-value store by default. Do not configure redisConfig. Only configure redisConfig to use an external Redis instance to store rate limiting keys. |
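
Several of these checks can be run locally before calling the API. The Python sketch below covers a subset of the rules listed above (rule count, Global placement, matchKey and matchValue presence, and limitValue bounds). It is a hypothetical pre-flight helper written from this table, not the service's validator, and the error messages are invented for illustration.

```python
INT32_MAX = 2**31 - 1
KEYED_TYPES = {"Header", "Parameter", "Cookie", "Request", "Concurrency"}
VALUE_MATCH_TYPES = {"Exact", "Prefix", "Regex"}


def _check_rule(where, rule):
    """Checks shared by entries in rules and globalRules."""
    errors = []
    limit_type = rule.get("limitType")
    value = rule.get("limitValue")
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        errors.append(f"{where}: limitValue must be an integer greater than 0")
    elif limit_type == "Model" and value > INT32_MAX:
        errors.append(f"{where}: limitValue exceeds the Int32 maximum")
    if limit_type in KEYED_TYPES and not rule.get("matchKey"):
        errors.append(f"{where}: matchKey is required for {limit_type}")
    if rule.get("matchType") in VALUE_MATCH_TYPES and not rule.get("matchValue"):
        errors.append(f"{where}: matchValue is required for {rule['matchType']}")
    return errors


def validate_config(config):
    """Return a list of problems found in an aiTokenRateLimitConfig dict."""
    rules = config.get("rules") or []
    global_rules = config.get("globalRules") or []
    errors = []
    if not rules and not global_rules:
        errors.append("configure at least one of rules or globalRules")
    for i, rule in enumerate(rules):
        if rule.get("limitType") == "Global":
            errors.append(f"rules[{i}]: Global is allowed only in globalRules")
        errors.extend(_check_rule(f"rules[{i}]", rule))
    for i, rule in enumerate(global_rules):
        if rule.get("limitType") != "Global":
            errors.append(f'globalRules[{i}]: limitType must be "Global"')
        errors.extend(_check_rule(f"globalRules[{i}]", rule))
    return errors
```

An empty list means none of the covered checks failed; the service may still reject the configuration for reasons this sketch does not model.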

FAQ

Q: Do I need to configure redisConfig for my AI Gateway instance?

A: No. AI Gateway instances use an internal key-value store by default, so you do not need to configure redisConfig. Configure it only if you want to store rate limiting keys in an external Redis instance.

Q: What should I consider when configuring consumer-based rate limiting?

A: Before you configure rate limiting per consumer (by setting limitType to Consumer), you must first enable consumer authentication for the Model API. Otherwise, the rate limiting policy cannot identify consumers, and the rule will not take effect.

Q: How do multiple rules interact?

A: Multiple rules have an OR relationship, which means that rate limiting is triggered if any rule is hit. Rules with the same rate limiting dimension, that is, the same limitType and matchKey, are merged into a single rule group for execution.
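
The grouping described in this answer can be pictured with a short Python sketch that buckets rules by their (limitType, matchKey) pair. It only illustrates which rules would share a group. The function name is hypothetical, a missing matchKey is treated as an empty string by assumption, and the gateway's actual merge logic is not shown in the source.

```python
from collections import defaultdict


def group_rules(rules: list) -> dict:
    """Group rules that share the same rate limiting dimension."""
    groups = defaultdict(list)
    for rule in rules:
        # Assumption: an omitted matchKey behaves like an empty one for grouping.
        key = (rule["limitType"], rule.get("matchKey", ""))
        groups[key].append(rule)
    return dict(groups)
```

Applied to the three Model rules from Example 3, this yields a single group, because all of them share the Model type and the same (system-assigned) matchKey.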

Q: Can I use global rate limiting and standard rules together?

A: Yes, you can. Global rate limiting (globalRules) applies to the entire API without differentiating between keys, while standard rules (rules) apply to specific dimensions. You can use both types of rules simultaneously. Rate limiting is triggered if the conditions for either a global rule or a standard rule are met.

Q: How long does it take for updated rate limiting configurations to take effect?

A: After you update the configuration, the system pushes the new rules to the gateway data plane. They usually take effect within seconds.

Q: How accurate is rate limiting in a distributed architecture?

A: Because of the nature of distributed systems, rate limiting counters may experience minor drift. The actual number of allowed requests may vary slightly from the configured values, depending on the traffic volume, request rate, and backend latency.