This type of policy (plug-in) dynamically throttles traffic based on token usage rather than the number of requests or the size of request bodies, which makes it particularly suitable for Large Language Model (LLM) services and high-concurrency scenarios. A throttling policy lets you configure rules for consumers across multiple dimensions, such as identity, request header, query parameter, and client IP address. It can also perform real-time billing and throttling based on the total number of tokens consumed in a single API call. Because it accounts for the resource consumption characteristics of LLM workloads, this token-based throttling mode effectively prevents system overload and API abuse and keeps core services running stably in complex scenarios.
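To make the difference from request-count throttling concrete, the following minimal sketch counts the tokens reported by each LLM response against a per-caller budget for a fixed time window. It is illustrative only: names such as TokenWindowLimiter are assumptions, and the gateway's real limiter is distributed and shares counters across nodes.

```python
import time
from collections import defaultdict

class TokenWindowLimiter:
    """Fixed-window limiter that counts LLM tokens instead of requests.

    Illustrative sketch only; not the gateway's actual implementation.
    """

    def __init__(self, limit_tokens: int, window_seconds: int):
        self.limit = limit_tokens
        self.window = window_seconds
        self.usage = defaultdict(int)  # (caller key, window id) -> tokens spent

    def allow(self, key: str) -> bool:
        """Check the caller's remaining budget before forwarding a request."""
        window_id = int(time.time()) // self.window
        return self.usage[(key, window_id)] < self.limit

    def record(self, key: str, prompt_tokens: int, completion_tokens: int) -> None:
        """Charge the token usage reported in the LLM response to the caller."""
        window_id = int(time.time()) // self.window
        self.usage[(key, window_id)] += prompt_tokens + completion_tokens

# Example: limit each consumer to 1,000 tokens per minute.
limiter = TokenWindowLimiter(limit_tokens=1000, window_seconds=60)
if limiter.allow("consumer-a"):
    # ... forward the call, then charge the actual usage from the response ...
    limiter.record("consumer-a", prompt_tokens=120, completion_tokens=380)
```

Because the exact token count is only known after the model responds, a limiter of this kind typically admits a request while budget remains and settles the charge afterward, so a caller may slightly overshoot its quota on the last call of a window.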
Policy introduction
Prevents system overload: Flexible policy settings based on dimensions such as consumer, header, query parameter, cookie, or client IP address (see the sketch after this list) effectively limit high-frequency calls and malicious requests, preventing system breakdowns or performance degradation caused by overload. When combined with caching policies, throttling can further improve system performance.
Allows dynamic throttling: You can throttle a consumer per second, per minute, per hour, or per day, and adjust the throttling rules as needed so that your system runs stably under high concurrency.
Supports multiple matching rules: A throttling policy can combine multiple matching rules with different priorities to cover complex business scenarios.
Prevents attacks: By throttling specific consumers, headers, query parameters, or cookies, you can effectively limit access by crawlers and automated tools and protect data security.
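As referenced in the first item above, here is a minimal sketch of how one configured dimension could be mapped to the key that a limiter counts against. The request shape and the header and cookie names (x-user-tier, session_tag) are hypothetical, not the gateway's API:

```python
# Window sizes for the supported throttling granularities.
WINDOW_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

def throttle_key(dimension: str, request: dict) -> str | None:
    """Map one configured dimension to the counter key for this request."""
    if dimension == "consumer":
        return request.get("consumer_id")       # requires consumer authentication
    if dimension == "header":
        return request["headers"].get("x-user-tier")
    if dimension == "query":
        return request["query"].get("user_id")
    if dimension == "cookie":
        return request["cookies"].get("session_tag")
    if dimension == "client_ip":
        return request.get("client_ip")
    return None

req = {"consumer_id": "tenant-a", "headers": {}, "query": {},
       "cookies": {}, "client_ip": "203.0.113.7"}
print(throttle_key("consumer", req))  # tenant-a
```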
Scenarios
High-concurrency scenarios: In scenarios such as e-commerce promotions, you can throttle API callers based on their token usage per unit of time. This prevents malicious high-frequency calls and ensures service stability and fairness during the promotion.
AI service calls: You can throttle calls to LLM APIs to prevent service quality degradation or system breakdowns caused by traffic bursts.
Multi-tenant systems: In an open platform or multi-tenant architecture, you can assign different throttling quotas to different tenants to ensure fairness and resource isolation.
Defense against attacks: You can establish throttling mechanisms to defend against crawler attacks, DDoS attacks, and API abuse.
Procedure
Go to the AI Gateway console. In the top menu bar, select the region of the target instance. On the Instance page, click the target Instance ID.
In the navigation pane on the left, click LLM API. Then, click the name of the API to go to the API details page.
Click Policies and Plug-ins, enable the Throttling switch, and configure the relevant parameters.
The policy provides the following configuration items:
Throttling: Turns throttling on or off. The switch is turned off by default.
Throttling Policy: A throttling policy includes five types of Judgment Conditions:
By Request Header: For example, throttle requests that carry the beta identifier in a request header to 100 tokens per minute.
By Request Query Parameter: For example, throttle requests that carry the user_id=1 query parameter to 100 tokens per minute.
By Request Cookie: For example, throttle requests that carry a specified identifier in the cookie to 100 tokens per minute.
By Consumer: For example, throttle all consumers to 1,000 tokens per minute.
Important: To configure throttling by consumer, you must enable consumer authentication.
By Client IP Address: For example, throttle each client IP address to 100 tokens per minute.
Each judgment condition supports four Throttling Rules: Exact Match, Prefix Match, Regex Match, and Any Match. The priority order is Exact Match > Prefix Match > Regex Match > Any Match (illustrated in the sketch after this procedure).
Note: If you configure multiple rules, a request is throttled as soon as it exceeds the limit of any rule it matches.
The Throttling Range can be Every Second, Every Minute, Every Hour, or Every Day.
Note: Throttling is based on the number of input or output tokens processed by the LLM.
Confirm the configuration and click Save.
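To illustrate the rule-matching semantics described above, the sketch below picks the highest-priority rule that matches an extracted value. The rule set and patterns are hypothetical; the actual configuration is managed in the console, not in code. Also note that, per the note above, a request that matches several configured rules is throttled as soon as any one of their budgets is exhausted.

```python
import re

# Hypothetical rule set mirroring the console settings described above.
RULES = [
    # (match type, pattern, tokens per minute)
    ("exact",  "beta",      100),
    ("prefix", "beta-",     200),
    ("regex",  r"beta-\d+", 300),
    ("any",    "",          500),
]

# Documented priority: Exact Match > Prefix Match > Regex Match > Any Match.
PRIORITY = {"exact": 0, "prefix": 1, "regex": 2, "any": 3}

def matching_rule(value: str):
    """Return the highest-priority rule that matches the extracted value."""
    hits = []
    for match_type, pattern, limit in RULES:
        if (match_type == "exact" and value == pattern) or \
           (match_type == "prefix" and value.startswith(pattern)) or \
           (match_type == "regex" and re.fullmatch(pattern, value)) or \
           match_type == "any":
            hits.append((match_type, pattern, limit))
    return min(hits, key=lambda r: PRIORITY[r[0]]) if hits else None

print(matching_rule("beta"))     # ('exact', 'beta', 100)
print(matching_rule("beta-42"))  # ('prefix', 'beta-', 200)
print(matching_rule("other"))    # ('any', '', 500)
```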