This type of policy (plug-in) dynamically throttles traffic based on token usage rather than the number of requests or the size of request bodies, which makes it particularly suitable for Large Language Model (LLM) services and high-concurrency scenarios. A throttling policy lets you configure rules for consumers across multiple dimensions, such as identity, request header, query parameter, and client IP address. It can also perform real-time billing and throttling based on the total number of tokens consumed in a single API call. Because it accounts for the resource consumption characteristics of LLM workloads, this token-based throttling mode effectively prevents system overload and API abuse and keeps core services running stably in complex scenarios.
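To make the difference from request-count throttling concrete, the following minimal sketch counts the tokens reported by each LLM response against a per-caller budget for a fixed time window. It is illustrative only: names such as TokenWindowLimiter are assumptions, and the gateway's real limiter is distributed and shares counters across nodes.

```python
import time
from collections import defaultdict

class TokenWindowLimiter:
    """Fixed-window limiter that counts LLM tokens instead of requests.

    Illustrative sketch only; not the gateway's actual implementation.
    """

    def __init__(self, limit_tokens: int, window_seconds: int):
        self.limit = limit_tokens
        self.window = window_seconds
        self.usage = defaultdict(int)  # (caller key, window id) -> tokens spent

    def allow(self, key: str) -> bool:
        """Check the caller's remaining budget before forwarding a request."""
        window_id = int(time.time()) // self.window
        return self.usage[(key, window_id)] < self.limit

    def record(self, key: str, prompt_tokens: int, completion_tokens: int) -> None:
        """Charge the token usage reported in the LLM response to the caller."""
        window_id = int(time.time()) // self.window
        self.usage[(key, window_id)] += prompt_tokens + completion_tokens

# Example: limit each consumer to 1,000 tokens per minute.
limiter = TokenWindowLimiter(limit_tokens=1000, window_seconds=60)
if limiter.allow("consumer-a"):
    # ... forward the call, then charge the actual usage from the response ...
    limiter.record("consumer-a", prompt_tokens=120, completion_tokens=380)
```

Because the exact token count is only known after the model responds, a limiter of this kind typically admits a request while budget remains and settles the charge afterward, so a caller may slightly overshoot its quota on the last call of a window.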
Policy introduction
Prevents system overload: Flexible policy settings based on dimensions such as consumer, header, query parameter, cookie, or client IP address (see the sketch after this list) effectively limit high-frequency calls and malicious requests, preventing system breakdowns or performance degradation caused by overload. When combined with caching policies, throttling can further improve system performance.
Allows dynamic throttling: You can throttle a consumer per second, per minute, per hour, or per day, and adjust the throttling rules as needed so that your system runs stably under high concurrency.
Supports multiple matching rules: A throttling policy can combine multiple matching rules with different priorities to cover complex business scenarios.
Prevents attacks: By throttling specific consumers, headers, query parameters, or cookies, you can effectively limit access by crawlers and automated tools and protect data security.
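As referenced in the first item above, here is a minimal sketch of how one configured dimension could be mapped to the key that a limiter counts against. The request shape and the header and cookie names (x-user-tier, session_tag) are hypothetical, not the gateway's API:

```python
# Window sizes for the supported throttling granularities.
WINDOW_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

def throttle_key(dimension: str, request: dict) -> str | None:
    """Map one configured dimension to the counter key for this request."""
    if dimension == "consumer":
        return request.get("consumer_id")       # requires consumer authentication
    if dimension == "header":
        return request["headers"].get("x-user-tier")
    if dimension == "query":
        return request["query"].get("user_id")
    if dimension == "cookie":
        return request["cookies"].get("session_tag")
    if dimension == "client_ip":
        return request.get("client_ip")
    return None

req = {"consumer_id": "tenant-a", "headers": {}, "query": {},
       "cookies": {}, "client_ip": "203.0.113.7"}
print(throttle_key("consumer", req))  # tenant-a
```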
Scenarios
High-concurrency scenarios: In scenarios such as e-commerce promotions, you can throttle API callers based on their token usage per unit of time. This prevents malicious high-frequency calls and ensures service stability and fairness during the promotion.
AI service calls: You can throttle calls to LLM APIs to prevent service quality degradation or system breakdowns caused by traffic bursts.
Multi-tenant systems: In an open platform or multi-tenant architecture, you can assign different throttling quotas to different tenants to ensure fairness and resource isolation.
Defense against attacks: You can establish throttling mechanisms to defend against crawler attacks, DDoS attacks, and API abuse.
Procedure
Go to the AI Gateway console. In the top menu bar, select the region of the target instance. On the Instance page, click the target Instance ID.
In the navigation pane on the left, click LLM API. Then, click the name of the API to go to the API details page.
Click Policies and Plug-ins, enable the Throttling switch, and configure the relevant parameters.
The policy provides the following configuration items:
Throttling: Turns throttling on or off. The switch is turned off by default.
Throttling Policy: A throttling policy includes five types of Judgment Conditions:
By Request Header: For example, throttle requests that carry the beta identifier in a request header to 100 tokens per minute.
By Request Query Parameter: For example, throttle requests that carry the user_id=1 query parameter to 100 tokens per minute.
By Request Cookie: For example, throttle requests that carry a specified identifier in the cookie to 100 tokens per minute.
By Consumer: For example, throttle all consumers to 1,000 tokens per minute.
Important: To configure throttling by consumer, you must enable consumer authentication.
By Client IP Address: For example, throttle each client IP address to 100 tokens per minute.
Each judgment condition supports four Throttling Rules: Exact Match, Prefix Match, Regex Match, and Any Match. The priority order is Exact Match > Prefix Match > Regex Match > Any Match (illustrated in the sketch after this procedure).
Note: If you configure multiple rules, a request is throttled as soon as it exceeds the limit of any rule it matches.
The Throttling Range can be Every Second, Every Minute, Every Hour, or Every Day.
Note: Throttling is based on the number of input or output tokens processed by the LLM.
Confirm the configuration and click Save.
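To illustrate the rule-matching semantics described above, the sketch below picks the highest-priority rule that matches an extracted value. The rule set and patterns are hypothetical; the actual configuration is managed in the console, not in code. Also note that, per the note above, a request that matches several configured rules is throttled as soon as any one of their budgets is exhausted.

```python
import re

# Hypothetical rule set mirroring the console settings described above.
RULES = [
    # (match type, pattern, tokens per minute)
    ("exact",  "beta",      100),
    ("prefix", "beta-",     200),
    ("regex",  r"beta-\d+", 300),
    ("any",    "",          500),
]

# Documented priority: Exact Match > Prefix Match > Regex Match > Any Match.
PRIORITY = {"exact": 0, "prefix": 1, "regex": 2, "any": 3}

def matching_rule(value: str):
    """Return the highest-priority rule that matches the extracted value."""
    hits = []
    for match_type, pattern, limit in RULES:
        if (match_type == "exact" and value == pattern) or \
           (match_type == "prefix" and value.startswith(pattern)) or \
           (match_type == "regex" and re.fullmatch(pattern, value)) or \
           match_type == "any":
            hits.append((match_type, pattern, limit))
    return min(hits, key=lambda r: PRIORITY[r[0]]) if hits else None

print(matching_rule("beta"))     # ('exact', 'beta', 100)
print(matching_rule("beta-42"))  # ('prefix', 'beta-', 200)
print(matching_rule("other"))    # ('any', '', 500)
```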