A policy or plug-in of this type dynamically throttles traffic based on token usage instead of the number of requests or the size of request bodies. This makes it particularly suitable for Large Language Model (LLM) services and high-concurrency scenarios. A throttling policy allows you to configure throttling rules for consumers in multiple dimensions, such as identity, request header parameter, query parameter, and client IP address. In addition, it can bill and throttle calls in real time based on the total number of tokens consumed in a single API call. Because this token consumption-centered throttling mode follows the resource consumption characteristics of LLM computing workloads, it can effectively prevent system overload and interface abuse and keep core services running stably in complex scenarios.
Benefits
Prevents system overload: With flexible policy settings, such as throttling by consumer, header, query parameter, cookie, or client IP address, this policy effectively limits high-frequency calls and malicious requests and thus prevents system breakdown or performance deterioration caused by overload. In combination with caching policies, this policy can further improve system performance.
Allows dynamic throttling: You can throttle a consumer at multiple time granularities, such as per second, per minute, per hour, and per day. You can also flexibly adjust throttling rules based on your business requirements to ensure that your system runs stably under high concurrency.
Supports multiple matching rules: Throttling policies support multiple matching rules with a defined order of priority to meet the needs of complex business scenarios.
Prevents attacks: By throttling specific consumers, headers, query parameters, or cookies, you can effectively limit the access of crawlers or automated tools to protect data security.
Scenarios
High-concurrency scenarios: API callers can be subject to throttling based on their token usage in a unit of time in scenarios such as e-commerce promotions. This prevents malicious high-frequency calls and ensures service stability and promotion fairness.
AI service calls: Calls to LLM APIs can be throttled to preempt service quality degradation or system breakdown due to traffic bursts.
Multi-tenant systems: Different throttling quotas can be assigned to different tenants in an open platform or multi-tenant architecture to ensure fairness and resource isolation.
Defense against attacks: Throttling mechanisms can be established against crawler attacks, DDoS attacks, and API abuse.
Prerequisites
An AI API is created. For more information, see Manage AI APIs.
Procedure
You must add the CIDR block of the virtual private cloud (VPC) where your gateway instance resides to the whitelist in the Tair (Redis OSS-compatible) console.
Log on to the Cloud-native API Gateway console.
In the left-side navigation pane, click API. In the top navigation bar, select a region.
Click the AI API tab. In the API list, click the API that you want to manage.
On the Policies and Plug-ins tab, turn on Current limiting, configure the parameters, and then click Save.
Note: Tair is used to store the token usage and time information of a request. This allows Cloud-native API Gateway to calculate the total usage within a time range to determine whether to trigger throttling.
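The accounting described in the note can be pictured with a minimal sketch. This is not the gateway's implementation: it assumes a fixed time window, uses an in-memory dictionary as a stand-in for Tair, and the names TokenWindowLimiter, allow, and record are hypothetical. It also assumes that token usage is recorded after a call completes, because the token count of an LLM response is only known at that point.

```python
import time


class TokenWindowLimiter:
    """Fixed-window token counter. A dict stands in for the Tair store (assumption)."""

    def __init__(self, limit_tokens, window_seconds, clock=time.time):
        self.limit_tokens = limit_tokens
        self.window_seconds = window_seconds
        self.clock = clock  # injectable clock so the sketch can be tested
        self.usage = {}  # (key, window index) -> tokens used in that window

    def _bucket(self, key):
        # All requests in the same window share one counter.
        window_index = int(self.clock() // self.window_seconds)
        return (key, window_index)

    def allow(self, key):
        """Return True if the key still has quota left in the current window."""
        return self.usage.get(self._bucket(key), 0) < self.limit_tokens

    def record(self, key, tokens):
        """Add the tokens consumed by a finished call to the current window."""
        bucket = self._bucket(key)
        self.usage[bucket] = self.usage.get(bucket, 0) + tokens
```

In this sketch, a caller checks allow("consumer-a") before forwarding a request and calls record("consumer-a", tokens) once the token usage of the response is known. A call that pushes the total past the limit is still served, and later calls in the same window are rejected until the window rolls over.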
Parameter | Description
Current limiting | The throttling switch. By default, this switch is turned off.
Redis service URL | The Tair service URL.
Port | The Tair service port.
Access Method | The method in which Tair is accessed. Valid values: Account + password, Password-only, and Password-free.
Database Account | The account that is used to log on to the destination database.
Database Password | The password of the database account.
Database No. | The number of the specified database.
Throttling Policy | A throttling policy provides the following conditions:
By request header: For example, throttle requests with the beta identifier in the header to 100 tokens per minute.
By request query: For example, throttle requests with the user_id=1 query parameter to 100 tokens per minute.
By request cookie: For example, throttle requests with the specified identifier in the cookie to 100 tokens per minute.
By Consumer: For example, throttle all consumers to 1,000 tokens per minute.
Important: To configure throttling by consumer, you must enable Consumer authentication.
By client IP address: For example, throttle each client IP address to 100 tokens per minute.
All condition types support four matching rules: Exact match, Prefix Match, Regex Match, and Random match. The rules are applied in the following order of priority: exact match > prefix match > regex match > random match.
Note: If multiple rules are configured, a request is intercepted when any of the rules is matched.
The throttling range can be Every second, Every minute, Every hour, or Every day.
Note: Throttling is performed based on the number of inbound or outbound tokens of the LLM.
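The priority order of the matching rules can be illustrated with a short sketch. It is an illustration only, not the gateway's code. The Rule class and the select_rule function are hypothetical names, the random match rule is assumed to act as a catch-all that matches any value, and regex rules are assumed to match the whole value.

```python
import re
from dataclasses import dataclass

# Lower number means higher priority, following the order stated above.
PRIORITY = {"exact": 0, "prefix": 1, "regex": 2, "random": 3}


@dataclass
class Rule:
    match_type: str  # "exact", "prefix", "regex", or "random"
    pattern: str
    tokens_per_window: int


def matches(rule, value):
    """Return True if the value satisfies the rule."""
    if rule.match_type == "exact":
        return value == rule.pattern
    if rule.match_type == "prefix":
        return value.startswith(rule.pattern)
    if rule.match_type == "regex":
        # Assumption: the pattern must match the whole value.
        return re.fullmatch(rule.pattern, value) is not None
    if rule.match_type == "random":
        # Assumption: treated as a catch-all for any value.
        return True
    raise ValueError(f"unknown match type: {rule.match_type}")


def select_rule(rules, value):
    """Pick the highest-priority rule that matches the value, or None."""
    candidates = [rule for rule in rules if matches(rule, value)]
    if not candidates:
        return None
    return min(candidates, key=lambda rule: PRIORITY[rule.match_type])
```

For example, with an exact rule for beta-1, a prefix rule for beta, and a regex rule for beta-\d+, the header value beta-1 resolves to the exact rule, while beta-22 resolves to the prefix rule because prefix match ranks above regex match.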