Alibaba Cloud Model Studio: Best practices for handling rate limiting

Last Updated: Apr 17, 2026

This topic describes the rate limiting mechanism of Model Studio APIs and provides traffic control strategies for different scenarios to improve throughput and ensure service availability.

Model Studio APIs limit the number of requests, the token usage, and the growth rate of traffic over time; this is called rate limiting. Because large language model (LLM) services have long response latencies and are rate-limited in two dimensions, both request counts and token volumes, the traditional "retry on error" strategy alone is often ineffective, and you need to implement dedicated traffic control measures.

This topic introduces three types of solutions, ordered from the lowest to the highest implementation cost:

  • Platform configuration solutions: increase quota limits, purchase provisioned throughput (PTU), or move offline workloads to the Batch API.

  • Client-side traffic control strategies: basic retry, request rate limiting, traffic shaping, and adaptive congestion control.

  • Architectural fallback solutions: model fallback and peak-load shifting with message queues.

If you are currently troubleshooting a 429 error, go to Error diagnosis and strategy recommendations to identify the cause.

Platform rate limiting mechanism

Rate limiting is calculated independently for each model at the root account level. After rate limiting is triggered, the service typically resumes within one minute. For more information about the specific rate limiting conditions and current usage of each model, see Model Rate Limiting Conditions and Model Usage Monitoring. The Model Studio API includes the following three types of rate limiting rules:

  • Minute-level quota limits (RPM / TPM): The maximum number of requests per minute (RPM) and the maximum token usage per minute (TPM).

  • Instantaneous frequency limits (RPS / TPS): The maximum number of requests per second (RPS) and the maximum token usage per second (TPS). A high density of API calls or token consumption within a second may trigger rate limiting.

  • Growth rate limits (Traffic Burst): A sudden surge in request volume or token usage triggers rate limiting. The threshold is dynamically adjusted based on the service status. You can avoid triggering this limit by gradually increasing the request volume.

Based on these rate limiting mechanisms, the following sections describe solutions at the platform configuration, client-side traffic control, and architectural fallback levels.

Error diagnosis and strategy recommendations

The same error code can be triggered by different rate limiting dimensions. In addition, server saturation under high concurrency can also lead to slower responses or timeouts. You can mitigate this issue using the adaptive congestion control strategy described later in this topic.

The following list maps each error code (DashScope / OpenAI) to its triggering dimension, diagnostic features, and recommended strategy:

  • Throttling.RateQuota / limit_requests

    • Request rate exceeded (RPM exceeded): Intermittent errors; the success rate decreases over time. Recommended strategy: token bucket to control the request quota per unit of time.

    • Request rate exceeded (RPS exceeded): Errors concentrated at startup or during concurrency spikes. Recommended strategy: concurrency semaphore or smoothing rate limiter to increase the interval between requests.

  • Throttling.AllocationQuota / insufficient_quota

    • Token usage exceeded (TPM exceeded): Intermittent errors when processing long texts. Recommended strategy: dual token bucket that limits both the RPM and TPM quotas.

    • Token usage exceeded (TPS exceeded): Instantaneous token consumption is too high during concurrent processing of long texts. Recommended strategy: concurrency semaphore or smoothing rate limiter.

  • Throttling.BurstRate / limit_burst_rate

    • Traffic growth rate exceeded (Traffic Burst): A sudden large volume of requests after startup or recovery from an idle state. Recommended strategy: token bucket with a low initial value, such as initial_tokens=0, to implement a slow start, or a smoothing rate limiter for peak-load shifting.

Platform configuration solutions

The following solutions can help you mitigate or eliminate rate limiting issues through platform-side configurations or resource adjustments.

Increase quota limits

If the default quota is insufficient, you can directly increase the temporary rate limit quota for a model in the Model Studio console. The change takes effect immediately. This feature is currently supported in the China (Beijing) and Singapore regions.

Scenarios: The default RPM/TPM quota is insufficient due to business growth, or a temporary throughput increase is needed for short-term events. For more information, see Rate limits.

Increasing the quota is simple. Evaluate this option before trying client-side traffic control strategies.

Provisioned Throughput Unit (PTU)

The PTU service provides dedicated, reserved computing power. It is the preferred solution for meeting real-time, high-throughput requirements and avoiding contention for computing power in the public resource pool.

This solution is suitable for scenarios where your business has deterministic throughput requirements, such as a Service-Level Agreement (SLA) commitment, or where you want to achieve stable, high throughput without complex client-side traffic control development.

PTUs are reserved resources and are billed continuously, even when not fully utilized. Evaluate the required specifications based on your actual peak business load to avoid resource waste.

Asynchronous batch processing (Batch API)

For tasks that do not have strict real-time requirements, such as data cleaning and batch analytics, you can use the Batch API to submit them for batch processing. These tasks are executed during off-peak hours, provide results asynchronously, and are not subject to real-time online request frequency or traffic limits.

This solution is suitable for offline tasks that can tolerate a result return time of several hours to days, such as data annotation, log analysis, and batch summary generation. The cost of the Batch API is typically lower than that of real-time API calls.

The result return time for the Batch API is not guaranteed. It is not suitable for online services that require immediate responses. After submitting a task, you must retrieve the results through polling or a callback.
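The following sketch shows a typical submit-and-poll flow through the OpenAI-compatible interface. It assumes that your region and account support the Batch API, that batch_requests.jsonl already contains one /v1/chat/completions request per line, and that polling every 60 seconds is acceptable; see the Batch API documentation for the exact requirements.

import time
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

# 1. Upload the JSONL file containing the batch of requests.
input_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")

# 2. Submit the batch task; results are produced asynchronously.
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# 3. Poll until the task reaches a terminal state (a callback can replace polling).
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

# 4. Download the result file when the batch completes.
if batch.status == "completed":
    result = client.files.content(batch.output_file_id)
    print(result.text)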

Client-side traffic control strategies

If platform configuration solutions cannot meet your needs, you must introduce traffic control mechanisms on the client side. The core principle is to distribute requests as evenly as possible within a time window to avoid burst traffic that triggers rate limiting. When a system starts or after a long idle period, you should gradually increase the concurrency level instead of instantly using the maximum level.

The following four strategies are listed in order of increasing engineering complexity. Each strategy includes the capabilities of the previous one and enhances them:

  • Basic retry provides only passive defense.

  • Request rate limiting adds active queuing.

  • Traffic shaping further introduces token-level control and smooth sending.

  • Adaptive congestion control dynamically adjusts the sending rate based on real-time feedback.

Choose the strategy with the lowest implementation cost that meets your business needs.

Throughput performance comparison of each strategy


The following list compares the effective throughput performance of the four client-side traffic control strategies under different loads:

  • Basic retry strategy: Effective under low load. Prone to congestive collapse under high concurrency, causing a sharp drop in throughput.

  • Request rate limiting strategy: Strong protection against collapse. However, under mixed workloads with long texts, throughput shows sawtooth-like fluctuations due to the lack of token control.

  • Traffic shaping strategy: High stability. Achieves smooth output by sacrificing some peak throughput.

  • Adaptive congestion control strategy: Can dynamically converge to a stable, high throughput point under high load, but has cold-start probing overhead.

Basic retry strategy

This strategy is suitable for non-high-concurrency scenarios such as personal testing, local scripts, and low-frequency background tasks. It does not limit the sending rate by default. It only triggers an exponential backoff retry with random jitter upon receiving a 429 or 5xx error.

This strategy has no proactive traffic control. Under multi-threaded concurrency, it can easily trigger rate limiting and cause many requests to back up and fail.

Code example

Using the tenacity library

import openai
from openai import OpenAI
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type
)

RETRYABLE_ERRORS = (
    openai.RateLimitError,
    openai.InternalServerError,
    openai.APIConnectionError,
)

@retry(
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
    retry=retry_if_exception_type(RETRYABLE_ERRORS)
)
def chat_with_retry(client, model, messages, max_tokens):
    return client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,
        messages=messages
    )

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="YOUR_DASHSCOPE_API_KEY"
)

try:
    response = chat_with_retry(
        client=client,
        model="qwen-plus",
        messages=[{"role": "user", "content": "What is exponential backoff retry?"}],
        max_tokens=1024
    )
    print(response.choices[0].message.content)
except Exception as e:
    print(f"Request failed: {e}")

Native implementation (no dependencies)

import time
import random
import openai
from openai import OpenAI

RETRYABLE_ERRORS = (
    openai.RateLimitError,
    openai.InternalServerError,
    openai.APIConnectionError,
)

def chat_with_retry(client, model, messages, max_tokens):
    attempt = 0
    max_retries = 5
    base_delay = 1
    max_delay = 60

    while attempt <= max_retries:
        try:
            return client.chat.completions.create(
                model=model,
                max_tokens=max_tokens,
                messages=messages
            )
        except RETRYABLE_ERRORS as e:
            attempt += 1
            if attempt > max_retries:
                raise e
            backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
            sleep_time = backoff + random.uniform(0, 1)
            print(f"Triggered {type(e).__name__}, retrying after {sleep_time:.2f}s...")
            time.sleep(sleep_time)

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="YOUR_DASHSCOPE_API_KEY"
)

try:
    response = chat_with_retry(
        client=client,
        model="qwen-plus",
        messages=[{"role": "user", "content": "What is exponential backoff retry?"}],
        max_tokens=1024
    )
    print(response.choices[0].message.content)
except Exception as e:
    print(f"Request failed: {e}")

The preceding code uses exponential backoff instead of a fixed-interval retry. Fixed-interval retry, such as retrying all failed requests after 3 seconds, causes all requests to be re-sent at the same time. This can easily trigger rate limiting again and lead to persistent congestion. Exponential backoff with random jitter "spreads out" the retries:

  • Wait time doubles progressively: For example, 1s, 2s, 4s.... This avoids repeated requests in a short period.

  • Add random jitter: Introducing a random value, such as 2s +/- 0.5s, spreads out the retry traffic. This prevents many requests from retrying at the same time, which would create a secondary flood (thundering herd effect).

The system can then recover in a distributed manner, rather than becoming stuck in a vicious cycle of "fail—retry in unison—fail again".

Request rate limiting strategy

Relying solely on passive retries is insufficient for real business traffic. Frequent retries significantly increase response latency. The request rate limiting strategy introduces active traffic control. It performs self-checks and adjustments before sending requests. This organizes a large, unordered influx of requests into a smooth queue that complies with the platform's RPM limit. After rate limiting is triggered, it usually takes some time to recover. Actively smoothing the request rhythm introduces a small, controllable queuing delay. However, this is far less costly than the time spent in a passive "error—wait—retry" loop. In short, you incur a small, predictable cost to avoid a large, unpredictable delay.

This strategy is suitable for online services that are sensitive to time to first token, such as chatbots and other lightweight, request-response interactions.

This strategy implements active queuing on the client side with two levels of control:

  • RPM token bucket: Limits the total number of requests per minute. The bucket capacity is the RPM quota, and tokens are refilled at a constant rate. This method supports borrowing. If tokens are insufficient, a request can borrow from future quotas but must strictly follow First-In, First-Out (FIFO) order.

  • Concurrency semaphore: Limits the number of concurrent requests. An asynchronous semaphore controls in-flight requests, preventing instantaneous high concurrency from triggering RPS limits and avoiding client overload.

These two levels of control must be executed in a strict order: first acquire an RPM token, then acquire a concurrency semaphore. Concurrency slots are scarce resources and should only be allocated to requests that are already cleared to execute. If the order is reversed (occupy a slot first, then wait for a token), head-of-line blocking can easily occur under high load: a request occupies a slot but has no token available, holds the slot for a long time without executing, and eventually all slots are occupied while no requests are actually sent. The core principle is: do not perform potentially long-running waits while holding a scarce resource.

The following code initializes the token bucket to a full state (initial_tokens=rpm_limit). This is suitable for lightweight online services to process requests immediately at startup. If starting with a full bucket triggers a rate limit error, you can lower the initial number of tokens. For example, set it to initial_tokens=0, which is an "empty bucket start". This allows the system to enter its working state at a more gradual pace.

This strategy does not track token usage. It can still trigger rate limiting in long-text tasks by exhausting the TPM quota.

Code example

Core component: Token bucket

import time

class TokenBucket:
    """
    Token bucket implementation to control requests per minute (RPM).
    Supports a debt mechanism to ensure first-in, first-out (FIFO) order under high concurrency.
    """
    def __init__(self, quota_per_minute: float, initial_tokens: float = 0.0):
        self.capacity = quota_per_minute
        self.tokens = initial_tokens
        self.refill_rate = quota_per_minute / 60.0
        self.last_refill = time.monotonic()

    def reserve(self, cost: float = 1.0) -> float:
        """
        Acquires a token.
        If tokens are insufficient, returns the number of seconds to wait (supports debt).
        """
        self._refill()

        # 1. Sufficient tokens: Deduct directly
        if self.tokens >= cost:
            self.tokens -= cost
            return 0.0

        # 2. Insufficient tokens: Calculate wait time and incur debt
        # "Reserves" future tokens for the current request to ensure FIFO order
        deficit = cost - self.tokens
        wait_seconds = deficit / self.refill_rate
        self.tokens -= cost
        return wait_seconds

    def _refill(self):
        """Refills tokens based on elapsed time."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        if elapsed > 0:
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.last_refill = now

Client logic

import asyncio
import openai
from openai import AsyncOpenAI
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_exception_type

class RateLimitedClient:
    def __init__(
        self,
        api_key: str,
        base_url: str = "https://dashscope.aliyuncs.com/compatible-mode/v1",
        rpm_limit: float = 600.0,
        max_concurrency: int = 20
    ):
        self.client = AsyncOpenAI(api_key=api_key, base_url=base_url)
        # Component 1: RPM token bucket (controls total volume)
        self.rpm_bucket = TokenBucket(
            quota_per_minute=rpm_limit,
            initial_tokens=rpm_limit  # Start with a full bucket, suitable for lightweight online services
        )
        # Component 2: Concurrency semaphore (controls instantaneous concurrency)
        self.semaphore = asyncio.Semaphore(max_concurrency)

    async def _execute_request(self, model, messages, max_tokens):
        """Executes a single request, passing through RPM check and concurrency limit in order."""
        # 1. RPM check (acquire token first)
        wait_seconds = self.rpm_bucket.reserve(1.0)
        if wait_seconds > 0:
            await asyncio.sleep(wait_seconds)
        # 2. Concurrency check (acquire semaphore next)
        async with self.semaphore:
            # 3. Make the API call
            return await self.client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens
            )

    @retry(
        wait=wait_random_exponential(min=1, max=60),
        stop=stop_after_attempt(5),
        retry=retry_if_exception_type((
            openai.RateLimitError,
            openai.InternalServerError,
            openai.APIConnectionError
        ))
    )
    async def chat_with_limit(self, model, messages, max_tokens=1024):
        # Design consideration: Why do retries also need to re-acquire a token?
        # Ans: For safety. Without re-acquisition, the traffic pulse from retries
        # could instantly exceed the RPM limit.
        return await self._execute_request(model, messages, max_tokens)
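
The following usage sketch (the qwen-plus model name and the DASHSCOPE_API_KEY environment variable are assumptions) fans a batch of prompts out through the rate-limited client with asyncio.gather:

import asyncio
import os

async def main():
    client = RateLimitedClient(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        rpm_limit=600,
        max_concurrency=20
    )
    prompts = [f"Summarize item {i}" for i in range(50)]
    tasks = [
        client.chat_with_limit(
            model="qwen-plus",
            messages=[{"role": "user", "content": p}]
        )
        for p in prompts
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for result in results:
        if isinstance(result, Exception):
            print(f"Request failed: {result}")
        else:
            print(result.choices[0].message.content[:80])

if __name__ == "__main__":
    asyncio.run(main())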

Traffic shaping strategy

In batch processing scenarios that require high, stable throughput, such as real-time RAG ingestion and bulk analysis of long documents, the request rate limiting strategy has a significant TPM blind spot. To address this, the traffic shaping strategy provides dual resource awareness (RPM & TPM). It also introduces a shaping mechanism on the sending end to perform peak-load shifting on bursty traffic, converting it into a smooth flow.

This strategy enhances the original request rate limiting strategy with the following capabilities:

  • Dual resource control (RPM & TPM): Maintains both RPM and TPM token buckets. All requests must pass quota checks for both dimensions before being sent.

  • Pre-deduction for input, post-settlement for output: The length of the model's output is unknown before the request. The TPM token bucket only pre-deducts input tokens when sending. After the request is completed, the actual output tokens are settled. Even if the quota is insufficient at settlement (negative tokens), subsequent requests will wait for the token count to become positive, naturally smoothing the flow rate.

  • Continuous warm-up: During a cold start, the token issuance rate increases linearly over time, eliminating the risk of initial bursts.

  • Smoothing rate limiter (Pacing): Smooths the sending rate by enforcing a minimum interval between requests (pacing), reducing the risk of triggering rate limits.

Alternative solution reference: If your business is not sensitive to minor queuing delays at startup, you can reuse the standard token bucket logic (set initial_tokens=0) to achieve a safe start while reducing client complexity. In addition, the Python token bucket implementation in this topic is for demonstrating the design concept. In a production environment, use mature rate limiting components from your language's ecosystem, such as Guava's SmoothRateLimiter in Java.

In the code example, the smoothing wait is placed inside the concurrency lock. If the wait were performed before acquiring the lock, multiple requests could finish their waits at the same time, pile up at the semaphore, and be released in a burst, so the smoothed traffic would become congested again at the exit. Performing the wait inside the lock slightly reduces concurrency efficiency but ensures precise control over the sending interval.

The complete traffic shaping pipeline is: Estimate input tokens → Dual admission (RPM & TPM) → Concurrency lock → Traffic shaping → Send → Settle output tokens.


This strategy sacrifices some theoretical maximum concurrency due to its conservative smoothing mechanism. It is not suitable for online services that require extremely low latency.

Code example

Advanced token bucket

import time

class TokenBucket:
    """Advanced token bucket that supports a continuous warm-up mechanism."""
    def __init__(self, quota_per_minute: float, warmup_seconds: float = 0.0):
        self.capacity = quota_per_minute
        self.tokens = 0.0
        self.target_refill_rate = quota_per_minute / 60.0
        self.warmup_seconds = warmup_seconds
        self.start_time = time.monotonic()
        self.last_update_time = self.start_time
        self.cumulative_generated = 0.0

    def _get_cumulative_tokens(self, t: float) -> float:
        if t <= 0:
            return 0.0
        R = self.target_refill_rate
        T = self.warmup_seconds
        if T <= 0:
            return R * t
        if t <= T:
            return (R / (2 * T)) * (t ** 2)
        else:
            warmup_total = (R * T) / 2.0
            return warmup_total + R * (t - T)

    def _get_time_for_cumulative_tokens(self, target_cumulative: float) -> float:
        if target_cumulative <= 0:
            return 0.0
        R = self.target_refill_rate
        T = self.warmup_seconds
        if T <= 0:
            return target_cumulative / R
        warmup_total = (R * T) / 2.0
        if target_cumulative <= warmup_total:
            return ((2 * T * target_cumulative) / R) ** 0.5
        else:
            return (target_cumulative - warmup_total) / R + T

    def reserve(self, cost: float = 1.0) -> float:
        now = time.monotonic()
        relative_now = now - self.start_time
        current_cumulative = self._get_cumulative_tokens(relative_now)
        new_tokens = current_cumulative - self.cumulative_generated
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.cumulative_generated = current_cumulative
        self.last_update_time = now
        if self.tokens >= cost:
            self.tokens -= cost
            return 0.0
        deficit = cost - self.tokens
        self.tokens -= cost
        target_cumulative = self.cumulative_generated + deficit
        target_time = self._get_time_for_cumulative_tokens(target_cumulative)
        wait_seconds = target_time - relative_now
        return max(0.0, wait_seconds)

    def adjust(self, amount: float):
        self.tokens = min(self.capacity, self.tokens + amount)

Smoothing rate limiter

import time

class SmoothRateLimiter:
    def __init__(self, rate_per_minute: float):
        self._min_interval = 60.0 / rate_per_minute
        self._last_operation = time.monotonic()

    def reserve(self) -> float:
        now = time.monotonic()
        elapsed = now - self._last_operation
        wait_time = max(0.0, self._min_interval - elapsed)
        self._last_operation = now + wait_time
        return wait_time

Client logic

import asyncio
from openai import AsyncOpenAI

class TrafficShapingClient:
    def __init__(
        self,
        api_key: str,
        base_url: str = "https://dashscope.aliyuncs.com/compatible-mode/v1"
    ):
        self._client = AsyncOpenAI(api_key=api_key, base_url=base_url)
        # 60-second linear warm-up on cold start (example value)
        self._rpm_bucket = TokenBucket(quota_per_minute=600, warmup_seconds=60)
        self._tpm_bucket = TokenBucket(quota_per_minute=1_000_000, warmup_seconds=60)
        self._smooth_limiter = SmoothRateLimiter(rate_per_minute=600)
        self._concurrency_semaphore = asyncio.Semaphore(20)

    async def _send_chat_request(self, model, prompt, max_tokens):
        """Sends the actual API request and returns the content and usage statistics."""
        response = await self._client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens
        )
        return response.choices[0].message.content, response.usage

    async def _execute_throttled_request(self, model, prompt, max_tokens, input_tokens):
        # [Step 1] Dual admission control
        # Check both RPM and TPM, and take the longer wait time
        wait_rpm = self._rpm_bucket.reserve(1.0)
        # The TPM check only requests a quota for input tokens
        wait_tpm = self._tpm_bucket.reserve(input_tokens)
        admission_wait = max(wait_rpm, wait_tpm)
        if admission_wait > 0:
            await asyncio.sleep(admission_wait)

        # [Step 2] Acquire concurrency lock
        async with self._concurrency_semaphore:
            # [Step 3] Traffic shaping
            # Key: Perform smoothing wait inside the lock
            # Sacrifices some concurrency efficiency for precise control over send intervals
            smooth_wait = self._smooth_limiter.reserve()
            if smooth_wait > 0:
                await asyncio.sleep(smooth_wait)

            # [Step 4] Send request
            content, actual_usage = await self._send_chat_request(model, prompt, max_tokens)

            # [Step 5] Settle output tokens
            output_tokens = actual_usage.completion_tokens
            if output_tokens > 0:
                self._tpm_bucket.adjust(-output_tokens)
            return content
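
The following usage sketch drives the internal throttled method directly. The DASHSCOPE_API_KEY environment variable, the qwen-plus model, and the rough four-characters-per-token input estimate are assumptions; a production wrapper would also add the retry layer from the request rate limiting strategy and use the target model's tokenizer for accurate estimates.

import asyncio
import os

def estimate_input_tokens(prompt: str) -> int:
    # Rough heuristic only: assume about 4 characters per token.
    return max(1, len(prompt) // 4)

async def main():
    client = TrafficShapingClient(api_key=os.getenv("DASHSCOPE_API_KEY"))
    documents = ["<long document text>"] * 100
    tasks = [
        client._execute_throttled_request(
            model="qwen-plus",
            prompt=doc,
            max_tokens=1024,
            input_tokens=estimate_input_tokens(doc)
        )
        for doc in documents
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    succeeded = sum(1 for r in results if not isinstance(r, Exception))
    print(f"Completed: {succeeded}/{len(results)}")

if __name__ == "__main__":
    asyncio.run(main())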

Adaptive congestion control strategy

This strategy is suitable for large-scale, dynamic, mixed-load scenarios such as API gateways, complex proxies, and multi-tenant systems.

Note

Selection tip: This strategy is not a universal solution

The core value of the adaptive congestion control strategy is to handle highly uncertain and volatile business environments. It is not a one-size-fits-all choice:

  • Performance paradox: If the business load is predictable and relatively stable, such as in quantitative batch processing, directly setting optimal static parameters based on experience usually performs better than dynamic probing that requires "trial and convergence".

  • Probing overhead: To find the boundary, dynamic algorithms inevitably involve a cold-start ramp-up and exploratory fluctuations. In known scenarios, this "exploration cost" is an unnecessary performance loss.

  • Maintenance cost: Introducing a closed-loop feedback mechanism significantly increases system complexity and troubleshooting difficulty.

Unless your business has a very large scale, complex load, and significant volatility, choose one of the simpler first three strategies.

The request rate limiting and traffic shaping strategies are classic defensive strategies based on static quotas. They are fully applicable in scenarios with stable and predictable loads. However, in complex gateway-level scenarios, businesses face dynamic changes from two sides. The downstream load is complex and variable, with a mix of high-concurrency short requests and long-running deep inference tasks. The platform's rate limiting thresholds fluctuate dynamically because second-level rate limits and growth rate detection thresholds are adjusted based on service status. Static strategies struggle to balance efficiency and stability.

This strategy is inspired by BBR (Bottleneck Bandwidth and Round-trip propagation time) and establishes a closed-loop control system based on EBP (Elastic Bandwidth Probing). Using the RPM/TPM quotas as a guiding upper limit, the system dynamically calculates the optimal sending rate to maximize throughput based on real-time feedback, such as latency changes and whether rate limiting is active.

  • Elastic Bandwidth Probing (EBP): This method stores the historical highest successful watermark and calculates the probing gain by simulating spring tension based on the distance between the current concurrency level and that watermark: the farther the distance, the faster the acceleration; the closer the distance, the gentler the growth. A small linear thrust is added to ensure continuous boundary exploration even in highly saturated intervals.

  • TPT congestion awareness: The generation time of a large model response is roughly proportional to its output length, so high latency on long texts does not necessarily indicate congestion. TPT (Time Per Token), the processing time per token, is used as the metric to filter out the noise from content length. Congestion is determined to be computational saturation only when TPT significantly deteriorates.

  • Anti-burst rate governor: Regardless of the target concurrency level calculated by EBP, the rate governor forcibly limits the acceleration of concurrency growth. This ensures that traffic increases smoothly, avoiding step changes that could trigger growth rate limits.


Compared to native BBR, this strategy includes the following key modifications for large models:

  • Guided probing: Introduces known RPM/TPM quotas as a "guiding upper limit" to avoid frequent collisions caused by blind probing.

  • Signal source modification (RTT → TPT): Native BBR relies on RTT. However, in large model scenarios, the latency difference caused by content length is much greater than network jitter. TPT is used instead to eliminate interference from content length.

  • Response mechanism enhancement (ProbeRTT → Hold): In the face of latency fluctuations, it chooses to maintain the current concurrency level rather than proactively backing off and reducing throughput.

  • Hard rate limit response (Packet Loss → 429 Drain): Once a 429 error is triggered, it enters an aggressive Drain state and performs a fast recovery after a cooldown period.

This strategy has the following limitations:

  • Congestion signal noise (TPT Noise): The current TPT is roughly estimated as "total latency / total tokens". Total latency includes network round-trip time, queuing time, and time to first token. It is susceptible to being inflated by network jitter or long inputs, which can mistakenly trigger the Hold state.

  • Large request starvation (Starvation Risk): To achieve maximum scheduling performance, this strategy uses a non-strict FIFO wakeup mechanism. When quotas are scarce, short-token requests may "jump the queue" and preempt resources, causing long-token requests to wait for an extended period.

  • Cold start problem: This strategy requires a warm-up period to build a statistical model. In low-load or short-lived tasks, throughput may be lower than the first three strategies because it probes from scratch.

Code example

Control entrypoint

class ElasticCongestionController:
    async def acquire(self, request_tokens):
        """[Admission phase] Check before initiating a request"""
        # 1. SSR slow-start restart: If idle for too long, proactively decay the limit
        #    to prevent burst traffic caused by an outdated watermark.
        if self.is_idle_too_long():
            self.perform_slow_start_restart()

        # 2. Circuit breaker check: If in DRAIN (cooldown) state, force wait.
        if self.state == CongestionState.DRAIN:
            await self.wait_for_cooldown()

        # 3. Dual budget check: Check both concurrency slots and token budget.
        await self.wait_for_budget(request_tokens)

    async def release(self, latency, actual_tokens, error):
        """[Feedback phase] Decision after the request finishes"""
        if error:
            # [Fault response] On rate limit error (429/503): Immediately drain + multiplicative backoff
            self.state = CongestionState.DRAIN
            self.concurrency_limit *= self.backoff_factor  # e.g. 0.7
            return

        # [Normal response] Calculate TPT (Time-Per-Token)
        current_tpt = latency / actual_tokens

        # [Congestion awareness] TPT spike (generation slows down): Enter HOLD to observe
        # Maintain concurrency level, neither backing off nor increasing
        if current_tpt > self.metrics.ema_tpt * 2.0:
            self.state = CongestionState.HOLD
        else:
            # [Steady-state probing] Network is healthy: Perform EBP elastic probing
            self.state = CongestionState.PROBING
            self.update_limit_via_ebp()

EBP probing

def probe_next_limit(self, current_limit, max_known_capacity):
    """
    Calculate the next concurrency limit
    Core formula: Next = Max(Spring Tension, Additive Thrust) + Governor Smoothing
    """
    # 1. Calculate physical limit (Little's Law)
    # Theoretical limit = Throughput * Latency * Buffer Factor
    dynamic_ceiling = self.metrics.tps * self.metrics.avg_latency * 1.2

    # 2. Spring logic (Spring Tension)
    # The farther from the historical high watermark, the greater the tension (accelerate); the closer, the smaller (decelerate)
    tension = 1.0 - (current_limit / max_known_capacity)
    spring_target = current_limit * (1.0 + tension * self.probe_gain)  # probe_gain: tunable probing gain

    # 3. Additive thrust
    # Solves the "Zeno's Paradox": When tension approaches 0, forcibly add a small linear increment
    # to ensure the system can break out of local maxima and continue exploring the boundary.
    linear_target = current_limit + self.min_additive_step

    raw_target = max(spring_target, linear_target)

    # 4. Anti-burst rate governor
    # Limits the acceleration of concurrency growth to prevent step changes.
    final_limit = self.governor.smooth(raw_target)

    return min(final_limit, dynamic_ceiling)

Metrics tracking

class CongestionMetrics:
    def update_stats(self, latency, token_count):
        """
        [Sensor] Update statistical metrics in real time
        Use EMA (exponential moving average) to filter out noise from long-tail requests
        """
        alpha = 0.2  # Smoothing factor

        # 1. Estimate single request size (Token Size)
        self.ema_tokens = (1 - alpha) * self.ema_tokens + alpha * token_count

        # 2. Estimate TPT (Time Per Token)
        # Use TPT instead of Latency to eliminate errors caused by different LLM generation lengths
        instant_tpt = latency / token_count
        self.ema_tpt = (1 - alpha) * self.ema_tpt + alpha * instant_tpt

    def track_inflight(self, estimated_tokens):
        """
        [Blind spot filling] Correct the lag of "counting only after response"
        Pre-deduct the quota the moment a request is initiated
        """
        self.inflight_tokens += estimated_tokens

Architectural fallback solutions

When platform configurations and client-side traffic control still cannot meet the business requirements for availability or peak throughput, you can introduce fallback mechanisms at the system architecture level.

Model fallback

When the primary model cannot respond due to rate limiting or service exceptions, automatically fall back to an alternative model with a more generous quota to ensure the main process continues to respond.

Fallback path design principles

  • Choose models from different series: In Model Studio, rate limiting is calculated independently for each model. When a model is rate-limited, you can choose a different model as a fallback. For example, you can fall back from qwen3.6-plus to qwen3.6-flash.

  • Trigger fallback only on rate limit errors: Fallback should be triggered for 429 rate limit errors, not all exceptions. Switching models will not solve issues such as network timeouts or parameter errors.

  • Validate the fallback model in advance: Ensure that the fallback model supports the features required by your business, such as Function Calling and structured output, to avoid functional exceptions after fallback.

Code example

The following example demonstrates model fallback logic based on the 429 error code. When a request to the primary model triggers rate limiting, it automatically switches to the fallback model for a retry.

import os
import asyncio
from openai import AsyncOpenAI, APIStatusError

# Primary and fallback models (different series, independent quotas)
PRIMARY_MODEL = "qwen3.6-plus"
FALLBACK_MODEL = "qwen3.6-flash"

client = AsyncOpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

async def chat_with_fallback(messages: list) -> str:
    """Request with fallback: Automatically switches to the fallback model when the primary model is rate-limited."""
    for model in [PRIMARY_MODEL, FALLBACK_MODEL]:
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages
            )
            return response.choices[0].message.content
        except APIStatusError as e:
            if e.status_code == 429 and model == PRIMARY_MODEL:
                print(f"[Rate Limit Triggered] {model}, falling back to {FALLBACK_MODEL}")
                continue
            raise
    raise RuntimeError("All models are unavailable")

async def main():
    result = await chat_with_fallback(
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Model fallback can be combined with client-side traffic control strategies. For example, you can integrate fallback logic into the retry mechanism of the request rate limiting strategy. When retries are exhausted and rate limiting is still triggered, switch to the fallback model.
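
For example, the following sketch (reusing RateLimitedClient and chat_with_limit from the request rate limiting strategy, and PRIMARY_MODEL / FALLBACK_MODEL from the example above) falls back only after tenacity has exhausted its retries on rate limit errors:

import openai
from tenacity import RetryError

async def chat_with_limit_and_fallback(client: "RateLimitedClient", messages: list) -> str:
    """Tries the primary model with rate-limited retries; falls back when retries are exhausted."""
    try:
        response = await client.chat_with_limit(model=PRIMARY_MODEL, messages=messages)
    except RetryError as e:
        # tenacity wraps the last exception once retries are exhausted;
        # only fall back if that exception was a rate limit error.
        if not isinstance(e.last_attempt.exception(), openai.RateLimitError):
            raise
        print(f"[Rate Limit Triggered] {PRIMARY_MODEL}, falling back to {FALLBACK_MODEL}")
        response = await client.chat_with_limit(model=FALLBACK_MODEL, messages=messages)
    return response.choices[0].message.content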

Peak-load shifting using message queues (MQ)

For backend services that do not require immediate responses, you can introduce message middleware, such as RabbitMQ or Kafka, for peak-load shifting. Burst traffic is first written to the MQ, and the consumer side pulls and processes it at a steady rate according to the rate limit quota. This architecture decouples frontend peaks from backend calls and can fundamentally prevent rate limiting errors.

Scenarios: Businesses where users can accept asynchronous notification of results after submitting a task, such as ticket processing, content moderation, and batch data annotation. The MQ acts as a buffer layer, absorbing traffic spikes from the frontend, while the consumer side sends requests to the Model Studio API at a stable rate.

Key architectural design points:

  • Consumer rate control: The consumer side should use the request rate limiting or traffic shaping strategy to consume messages at a steady rate based on the RPM/TPM quota, rather than pulling messages without limits, as shown in the sketch after this list.

  • Dead-letter handling: For messages that fail after multiple retries, move them to a dead-letter queue and trigger an alert. This prevents infinite retries from blocking consumption.

  • Back-pressure propagation: When the MQ backlog exceeds a threshold, propagate pressure back to the upstream, for example, by returning a queuing status. This prevents the queue from growing indefinitely.
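
The following consumer sketch illustrates these design points. It is a minimal, in-process approximation: asyncio.Queue stands in for the real middleware, the attempt limit, queue depths, and model name are example values, and the RateLimitedClient from the request rate limiting strategy paces consumption. A real deployment would use your RabbitMQ or Kafka client and alerting pipeline instead.

import asyncio

# In-process stand-ins for the message middleware and the dead-letter queue.
task_queue: asyncio.Queue = asyncio.Queue(maxsize=10_000)
dead_letter_queue: asyncio.Queue = asyncio.Queue()

MAX_ATTEMPTS = 3

async def consumer_worker(client: "RateLimitedClient"):
    """Pulls tasks and processes them at the pace allowed by the rate-limited client."""
    while True:
        message = await task_queue.get()
        try:
            # The rate-limited client enforces the RPM quota and concurrency limit,
            # so the consumer never sends faster than the quota allows.
            response = await client.chat_with_limit(
                model="qwen-plus",
                messages=[{"role": "user", "content": message["prompt"]}]
            )
            message["result"] = response.choices[0].message.content
        except Exception:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= MAX_ATTEMPTS:
                # Dead-letter handling: stop retrying and hand off for alerting.
                await dead_letter_queue.put(message)
            else:
                await task_queue.put(message)  # Re-queue for another attempt
        finally:
            task_queue.task_done()

def should_apply_backpressure() -> bool:
    """Back-pressure signal: the upstream can return a queuing status when the backlog is deep."""
    return task_queue.qsize() > 8_000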

Production environment considerations

The preceding code examples are based on a Python asyncio single-threaded loop and are intended to demonstrate the core algorithms. Before applying them to large-scale production, consider the following issues.

  • Adapting to non-text models

    The preceding strategies use text models as examples, but the core control principles also apply to multimodal model services, such as image generation and speech synthesis. Apart from different units of measurement, the essence is the same: limiting the submission rate and processing capacity.

    • Models such as speech recognition are typically constrained by both the number of requests per unit of time (such as RPM) and usage (such as audio duration). The strategies are basically the same as for text models.

    • Models for images and videos are typically constrained by the task submission rate and the number of concurrent tasks. You can use the same approach as the request rate limiting strategy: limit the task submission rate and use a semaphore to control concurrency.

    Regardless of how the rate limiting metrics change, the principle of client-side throttling remains the same. You only need to replace the counter (such as an RPM token bucket) or the probing metric (such as TPT) with the metric for the corresponding modality. For specific rate limiting rules and metric definitions for a model, see Model Rate Limiting Conditions.

  • Atomicity in concurrent models

    Example implementation: Because asyncio uses single-threaded cooperative scheduling, the state modification operations in the example code are inherently atomic and do not require additional concurrency protection within a single process.

    Production recommendation: When implementing in a multi-threaded or multi-process environment, ensure the concurrency safety of the token bucket and statistical window to guarantee the correctness of state updates. Otherwise, a race condition will cause the traffic control to fail.
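
    For example, in a multi-threaded deployment, one minimal approach (a sketch mirroring the basic token bucket above, not a full port of the warm-up version) is to guard every state update with a lock:

    import threading
    import time

    class ThreadSafeTokenBucket:
        """Token bucket whose refill and deduction are performed atomically under a lock."""
        def __init__(self, quota_per_minute: float, initial_tokens: float = 0.0):
            self._lock = threading.Lock()
            self.capacity = quota_per_minute
            self.tokens = initial_tokens
            self.refill_rate = quota_per_minute / 60.0
            self.last_refill = time.monotonic()

        def reserve(self, cost: float = 1.0) -> float:
            with self._lock:  # Prevents concurrent threads from interleaving refill and deduction
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
                self.last_refill = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return 0.0
                deficit = cost - self.tokens
                self.tokens -= cost
                return deficit / self.refill_rate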

  • Distributed rate limiting

    Example implementation: The traffic control components in the example code are all in-memory implementations.

    Production recommendation: In a multi-instance distributed deployment, each instance performs local traffic control independently. The actual total usage may exceed the limit and trigger global rate limiting. Use a centralized counter, such as Redis, to uniformly manage the usage of all nodes.
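
    A common pattern (a sketch only; the Redis endpoint, key naming, and quota values are assumptions) is a fixed-window counter shared by all instances, shown here with the redis-py client:

    import time
    import redis

    # All instances share these counters, so the cluster-wide usage stays within the quota.
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def try_acquire(model: str, input_tokens: int, rpm_limit: int = 600, tpm_limit: int = 1_000_000) -> bool:
        """Fixed-window check: returns True if this request still fits the current minute's quota."""
        window = int(time.time() // 60)
        rpm_key = f"ratelimit:{model}:rpm:{window}"
        tpm_key = f"ratelimit:{model}:tpm:{window}"

        pipe = r.pipeline()
        pipe.incr(rpm_key)
        pipe.expire(rpm_key, 120)          # Old windows expire automatically
        pipe.incrby(tpm_key, input_tokens)
        pipe.expire(tpm_key, 120)
        rpm_used, _, tpm_used, _ = pipe.execute()

        if rpm_used > rpm_limit or tpm_used > tpm_limit:
            # Over quota: return what was just taken and let the caller wait for the next window.
            rollback = r.pipeline()
            rollback.decr(rpm_key)
            rollback.decrby(tpm_key, input_tokens)
            rollback.execute()
            return False
        return True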

  • Priority queues and starvation prevention

    Example implementation: None of the example code implements priority differentiation. The adaptive congestion control strategy, in particular, uses a non-strict FIFO wakeup mechanism to achieve maximum scheduling performance.

    Production recommendation: When your business has high and low-priority requests, implement a weighted priority queue to guarantee bandwidth for high-priority requests. Also, introduce a starvation prevention mechanism to reserve a minimum quota for the low-priority queue, preventing it from being completely unschedulable during continuous high load.
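
    One simple approach (a sketch; the one-in-five reservation ratio is an arbitrary example) is to keep separate high- and low-priority queues and dedicate every Nth dispatch slot to the low-priority queue:

    import asyncio

    class PriorityScheduler:
        """Serves high-priority requests first, but reserves a minimum share of
        dispatch slots for low-priority requests to prevent starvation."""
        def __init__(self, low_priority_share: int = 5):
            self.high: asyncio.Queue = asyncio.Queue()
            self.low: asyncio.Queue = asyncio.Queue()
            self.low_priority_share = low_priority_share  # One of every N slots goes to low priority
            self._dispatched = 0

        async def next_task(self):
            self._dispatched += 1
            # Reserved slot: guarantee the low-priority queue is served even under sustained high load.
            if self._dispatched % self.low_priority_share == 0 and not self.low.empty():
                return await self.low.get()
            if not self.high.empty():
                return await self.high.get()
            if not self.low.empty():
                return await self.low.get()
            return await self.high.get()  # Block until new work arrives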