To ensure fair use, Alibaba Cloud Model Studio applies basic rate limits. These limits are model-specific and linked to your Alibaba Cloud account. The limit is calculated based on the total calls to a model from all RAM users, workspaces, and API keys under your account. If you exceed the limit, API requests will fail. You must wait for the limit to reset before making another call.
Rules
Account-level limits: Rate limits apply at the Alibaba Cloud account level. They are calculated based on the total calls from all RAM users, workspaces, and API keys under the account.
Model-specific limits: Each model has an independent rate limit. See the tables below for details.
FAQ
Why is rate limiting triggered?
Check the error message:
Requests rate limit exceeded or You exceeded your current requests list: This error indicates that the call frequency limit was triggered.
Allocated quota exceeded or You exceeded your current quota: This error indicates that the token consumption limit was triggered.
Request rate increased too quickly: This error indicates that a sudden surge in call frequency triggered the system's stability protection, even if the Requests Per Minute (RPM) or Tokens Per Minute (TPM) limits were not reached.
For other errors, see Error messages to identify the cause.
Note: In addition to RPM and TPM, rate limits may also be enforced at the per-second level. These limits are Requests per Second (RPS), calculated as RPM/60, and Requests per Second (RPS), calculated as TPM/60. A burst of requests in a short period can trigger rate limiting, even if the total number of calls is below the per-minute limit.
How to view model call usage?
One hour after you call a model, go to the Monitoring (Singapore or Beijing) page. Set the query conditions, such as the time range and workspace. Then, in the Models area, find the target model and click Monitor in the Actions column to view the model's call statistics. For more information, see the Monitoring document.
Data is updated hourly. During peak periods, there may be an hour-level latency.

How long does it take to recover after a rate limit is triggered?
The limit typically resets within one minute. If other errors occur, see Error messages for solutions.
How to avoid rate limiting?
Choose a model with a higher rate limit: Stable or latest versions have higher rate limits than older snapshot versions.
Optimize your calling strategy
Adjust the call frequency: If you receive a "Requests rate limit exceeded" or "You exceeded your current requests list" error, reduce the call frequency.
Reduce token consumption: If you receive an "Allocated quota exceeded" or "You exceeded your current quota" error, shorten the input or output length.
Smooth the request rate: If a sudden increase in call frequency triggers system stability protection, you may receive a "Request rate increased too quickly" error. In this case, optimize your client-side calling logic. You can adopt a request smoothing strategy, such as uniform scheduling, exponential backoff, or a request queue buffer. This strategy distributes requests evenly over the time window and avoids instantaneous peaks.
Add a backup model
If you encounter a rate limit error, switch to a backup model to continue generation. This improves concurrency and reduces the failure rate. The following code shows an example of retrying a request with
qwen-plus-2025-07-14after a rate limit is triggered forqwen-plus-2025-07-28.Split tasks: Processing long conversations or large documents can consume many tokens quickly. Split large batch tasks into smaller batches and submit them at different times.
Use batch inference: If you do not need real-time results, use batch inference (Batch API). It is not subject to real-time rate limits, but you must consider queuing and processing time.
Text generation - Qwen
Qwen language models
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-max | 600 | 1,000,000 |
qwen3-max-2025-09-23 | 60 | 100,000 |
qwen3-max-preview | 600 | 1,000,000 |
qwen-plus | 15,000 | 5,000,000 |
qwen-plus-2025-12-01 | 60 | 1,000,000 |
qwen-plus-2025-09-11 | ||
qwen-plus-2025-07-28 | ||
qwen-flash | 15,000 | 10,000,000 |
qwen-flash-2025-07-28 | 60 | 1,000,000 |
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-max | 600 | 1,000,000 |
qwen3-max-2026-01-23 | ||
qwen3-max-2025-09-23 | 60 | 100,000 |
qwen3-max-preview | 600 | 1,000,000 |
qwen-max | 120 | 100,000 |
qwen-max-latest | 600 | 1,000,000 |
qwen-max-2025-01-25 (qwen-max-0125) | 60 | 100,000 |
qwen-plus | 600 | 1,000,000 |
qwen-plus-latest | 600 | 1,000,000 |
qwen-plus-2025-12-01 | 120 | 1,000,000 |
qwen-plus-2025-09-11 | 120 | |
qwen-plus-2025-07-28 | 60 | 100,000 |
qwen-plus-2025-07-14 (qwen-plus-0714) | ||
qwen-plus-2025-04-28 (qwen-plus-0428) | 1,000,000 | |
qwen-plus-2025-01-25 (qwen-plus-0125) | 100,000 | |
qwen-flash | 600 | 5,000,000 |
qwen-flash-2025-07-28 | 600 | 5,000,000 |
qwq-plus | 60 | 100,000 |
qwen-turbo | 240 | 100,000 |
qwen-turbo-latest | 600 | 5,000,000 |
qwen-turbo-2025-04-28 (qwen-turbo-0428) | 60 | 1,000,000 |
qwen-turbo-2024-11-01 (qwen-turbo-1101) | 5,000,000 | |
US
In the US deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are limited to the United States.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-plus-us | 600 | 1,000,000 |
qwen-plus-2025-12-01-us | 60 | |
qwen-flash-us | 600 | 5,000,000 |
qwen-flash-2025-07-28-us | ||
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-max | 30,000 | 5,000,000 |
qwen3-max-2026-01-23 | 600 | 1,000,000 |
qwen3-max-2025-09-23 | 60 | 100,000 |
qwen3-max-preview | 600 | 1,000,000 |
qwen-max | 1,200 | 1,000,000 |
qwen-max-latest | ||
qwen-max-2025-01-25 (qwen-max-0125) | 60 | 100,000 |
qwen-max-2024-09-19 (qwen-max-0919) | ||
qwen-plus | 30,000 | 5,000,000 |
qwen-plus-latest | 15,000 | 1,200,000 |
qwen-plus-2025-12-01 | 120 | 1,000,000 |
qwen-plus-2025-09-11 | 60 | |
qwen-plus-2025-07-28 (qwen-plus-0728) | ||
qwen-plus-2025-07-14 (qwen-plus-0714) | 100,000 | |
qwen-plus-2025-04-28 (qwen-plus-0428) | 1,000,000 | |
qwen-plus-2025-01-25 (qwen-plus-0125) | 150,000 | |
qwen-plus-2025-01-12 (qwen-plus-0112) | ||
qwen-plus-2024-12-20 (qwen-plus-1220) | ||
qwen-flash | 30,000 | 10,000,000 |
qwen-flash-2025-07-28 | 60 | 1,000,000 |
qwq-plus | 600 | 1,000,000 |
qwq-plus-latest | ||
qwq-plus-2025-03-05 | 60 | 100,000 |
qwen-turbo | 1,200 | 5,000,000 |
qwen-turbo-latest | ||
qwen-turbo-2025-04-28 (qwen-turbo-0428) | 60 | 1,000,000 |
qwen-turbo-2025-02-11 (qwen-turbo-0211) | 5,000,000 | |
qwen-turbo-2024-11-01 (qwen-turbo-1101) | ||
qwen-long-latest | 1,200 | 60,000 |
qwen-long-2025-01-25 (qwen-long-0125) | 3 | 7,500 |
Qwen-VL (visual understanding / image-to-text)
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-vl-plus | 1,200 | 1,000,000 |
qwen3-vl-plus-2025-09-23 | 60 | 100,000 |
qwen3-vl-flash | 1,200 | 1,000,000 |
qwen3-vl-flash-2025-10-15 | 60 | 100,000 |
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-vl-plus | 1,200 | 1,000,000 |
qwen3-vl-plus-2025-12-19 | 60 | 100,000 |
qwen3-vl-plus-2025-09-23 | 120 | 1,000,000 |
qwen3-vl-flash | 1,200 | 1,000,000 |
qwen3-vl-flash-2026-01-22 | 60 | 100,000 |
qwen3-vl-flash-2025-10-15 | 120 | 1,000,000 |
qwen-vl-max | 1,200 | 1,000,000 |
qwen-vl-max-latest | ||
qwen-vl-max-2025-08-13 (qwen-vl-max-0813) | 60 | 100,000 |
qwen-vl-max-2025-04-08 (qwen-vl-max-0408) | 1,200 | 1,000,000 |
qwen-vl-plus | ||
qwen-vl-plus-latest | ||
qwen-vl-plus-2025-08-15 (qwen-vl-plus-0815) | 120 | 1,000,000 |
qwen-vl-plus-2025-05-07 (qwen-vl-plus-0507) | ||
qwen-vl-plus-2025-01-25 (qwen-vl-plus-0125) | 1,200 | |
qvq-max | 60 | 100,000 |
qvq-max-latest | ||
qvq-max-2025-03-25 (qvq-max-0325) | ||
US
In the US deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are limited to the United States.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-vl-flash-us | 1,200 | 1,000,000 |
qwen3-vl-flash-2025-10-15-us | 120 | 1,000,000 |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-vl-plus | 3,000 | 5,000,000 |
qwen3-vl-plus-2025-12-19 | 60 | 100,000 |
qwen3-vl-plus-2025-09-23 | ||
qwen3-vl-flash | 3,000 | 5,000,000 |
qwen3-vl-flash-2026-01-22 | 60 | 100,000 |
qwen3-vl-flash-2025-10-15 | ||
qwen-vl-max | 1,200 | 1,000,000 |
qwen-vl-max-latest | ||
qwen-vl-max-2025-08-13 (qwen-vl-max-0813) | 60 | 100,000 |
qwen-vl-max-2025-04-08 (qwen-vl-max-0408) | ||
qwen-vl-max-2025-04-02 (qwen-vl-max-0402) | ||
qwen-vl-max-2025-01-25 (qwen-vl-max-0125) | ||
qwen-vl-max-2024-12-30 (qwen-vl-max-1230) | ||
qwen-vl-max-2024-11-19 (qwen-vl-max-1119) | ||
qwen-vl-plus | 1,200 | 1,000,000 |
qwen-vl-plus-latest | ||
qwen-vl-plus-2025-08-15 (qwen-vl-plus-0815) | 60 | 100,000 |
qwen-vl-plus-2025-07-10 (qwen-vl-plus-0710) | ||
qwen-vl-plus-2025-05-07 (qwen-vl-plus-0507) | ||
qwen-vl-plus-2025-01-25 (qwen-vl-plus-0125) | ||
qwen-vl-plus-2025-01-02 (qwen-vl-plus-0102) | ||
qvq-max | ||
qvq-max-latest | ||
qvq-max-2025-05-15 (qvq-max-0515) | ||
qvq-max-2025-03-25 (qvq-max-0325) | ||
qvq-plus | ||
qvq-plus-latest | ||
qvq-plus-2025-05-15 (qvq-plus-0515) | ||
Qwen-Omni (omni-modality)
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-omni-flash | 60 | 100,000 |
qwen3-omni-flash-2025-12-01 | ||
qwen3-omni-flash-2025-09-15 | ||
qwen-omni-turbo | ||
qwen-omni-turbo-latest | ||
qwen-omni-turbo-2025-03-26 | ||
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-omni-flash | 60 | 100,000 |
qwen3-omni-flash-2025-12-01 | ||
qwen3-omni-flash-2025-09-15 | ||
qwen-omni-turbo | ||
qwen-omni-turbo-latest | ||
qwen-omni-turbo-2025-03-26 (qwen-omni-turbo-0326) | ||
qwen-omni-turbo-2025-01-19 (qwen-omni-turbo-0119) | ||
Qwen-Omni-Realtime (real-time omni-modality)
International
In the US deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are limited to the United States.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-omni-flash-realtime | 60 | 100,000 |
qwen3-omni-flash-realtime-2025-12-01 | ||
qwen3-omni-flash-realtime-2025-09-15 | ||
qwen-omni-turbo-realtime | 10,000 | |
qwen-omni-turbo-realtime-latest | ||
qwen-omni-turbo-realtime-2025-05-08 | ||
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-omni-flash-realtime | 60 | 100,000 |
qwen3-omni-flash-realtime-2025-12-01 | ||
qwen3-omni-flash-realtime-2025-09-15 | ||
qwen-omni-turbo-realtime | ||
qwen-omni-turbo-realtime-latest | ||
qwen-omni-turbo-realtime-2025-05-08 | ||
Qwen-OCR (text extraction)
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-vl-ocr | 600 | 6,000,000 |
qwen-vl-ocr-2025-11-20 | 1,200 | |
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-vl-ocr | 600 | 6,000,000 |
qwen-vl-ocr-2025-11-20 | 1,200 | |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-vl-ocr | 600 | 6,000,000 |
qwen-vl-ocr-latest | 1,200 | |
qwen-vl-ocr-2025-11-20 | ||
qwen-vl-ocr-2025-04-13 | 600 | |
qwen-vl-ocr-2024-10-28 | ||
Qwen-Math
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-math-plus | 1,200 | 1,000,000 |
qwen-math-plus-latest | ||
qwen-math-plus-2024-09-19 (qwen-math-plus-0919) | 60 | 100,000 |
qwen-math-plus-2024-08-16 (qwen-math-plus-0816) | 10 | 20,000 |
qwen-math-turbo | 1200 | 1,000,000 |
qwen-math-turbo-latest | ||
qwen-math-turbo-2024-09-19 (qwen-math-turbo-0919) | 60 | 100,000 |
Qwen-Coder
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-coder-plus | 2,400 | 2,000,000 |
qwen3-coder-plus-2025-09-23 | 60 | 1,000,000 |
qwen3-coder-plus-2025-07-22 | ||
qwen3-coder-flash | 1,200 | |
qwen3-coder-flash-2025-07-28 | 60 | |
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-coder-plus | 2,400 | 2,000,000 |
qwen3-coder-plus-2025-09-23 | 600 | 1,000,000 |
qwen3-coder-plus-2025-07-22 | 60 | 1,000,000 |
qwen3-coder-flash | 600 | 5,000,000 |
qwen3-coder-flash-2025-07-28 | 600 | 5,000,000 |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-coder-plus | 5,000 | 5,000,000 |
qwen3-coder-plus-2025-09-23 | 60 | 1,000,000 |
qwen3-coder-plus-2025-07-22 | ||
qwen3-coder-flash | 5,000 | 5,000,000 |
qwen3-coder-flash-2025-07-28 | 60 | 1,000,000 |
qwen-coder-plus | 1,200 | |
qwen-coder-plus-latest | ||
qwen-coder-plus-2024-11-06 (qwen-coder-plus-1106) | 120 | 200,000 |
qwen-coder-turbo | 1,200 | 1,000,000 |
qwen-coder-turbo-latest | ||
qwen-coder-turbo-2024-09-19 (qwen-coder-turbo-0919) | 60 | 100,000 |
Qwen-MT
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-mt-plus | 60 | 25,000 |
qwen-mt-flash | 35,000 | |
qwen-mt-lite | 100,000 | |
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-mt-plus | 60 | 100,000 |
qwen-mt-flash | ||
qwen-mt-lite | ||
qwen-mt-turbo | ||
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-mt-plus | 60 | 25,000 |
qwen-mt-flash | 35,000 | |
qwen-mt-lite | 100,000 | |
qwen-mt-turbo | 35,000 | |
Qwen-Doc
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-doc-turbo | 600 | 3,000,000 |
Qwen-Deep-Research
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-deep-research | 120 | 1,200,000 |
Text generation - Qwen - Open source
Open-source Qwen language models
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-next-80b-a3b-thinking | 600 | 1,000,000 |
qwen3-next-80b-a3b-instruct | ||
qwen3-235b-a22b-thinking-2507 | ||
qwen3-235b-a22b-instruct-2507 | ||
qwen3-30b-a3b-thinking-2507 | ||
qwen3-30b-a3b-instruct-2507 | ||
qwen3-235b-a22b | ||
qwen3-30b-a3b | ||
qwen3-32b | ||
qwen3-14b | ||
qwen3-8b | ||
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-next-80b-a3b-thinking | 600 | 1,000,000 |
qwen3-next-80b-a3b-instruct | ||
qwen3-235b-a22b-thinking-2507 | ||
qwen3-235b-a22b-instruct-2507 | ||
qwen3-30b-a3b-thinking-2507 | 5,000,000 | |
qwen3-30b-a3b-instruct-2507 | ||
qwen3-235b-a22b | 1,000,000 | |
qwen3-32b | ||
qwen3-30b-a3b | ||
qwen3-14b | ||
qwen3-8b | ||
qwen3-4b | ||
qwen3-1.7b | ||
qwen3-0.6b | ||
qwen2.5-14b-instruct-1m | 1,200 | 5,000,000 |
qwen2.5-7b-instruct-1m | ||
qwen2.5-72b-instruct | 60 | 150,000 |
qwen2.5-32b-instruct | ||
qwen2.5-14b-instruct | ||
qwen2.5-7b-instruct | ||
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-next-80b-a3b-thinking | 600 | 1,000,000 |
qwen3-next-80b-a3b-instruct | ||
qwen3-235b-a22b-thinking-2507 | ||
qwen3-235b-a22b-instruct-2507 | ||
qwen3-30b-a3b-thinking-2507 | ||
qwen3-30b-a3b-instruct-2507 | ||
qwen3-235b-a22b | ||
qwen3-30b-a3b | ||
qwen3-32b | 2400 | |
qwen3-14b | 600 | |
qwen3-8b | ||
qwen3-4b | ||
qwen3-1.7b | ||
qwen3-0.6b | ||
qwq-32b | ||
qwq-32b-preview | 1,200 | |
qwen2.5-72b-instruct | ||
qwen2.5-32b-instruct | ||
qwen2.5-14b-instruct | ||
qwen2.5-14b-instruct-1m | 5,000,000 | |
qwen2.5-7b-instruct | 1,000,000 | |
qwen2.5-7b-instruct-1m | 5,000,000 | |
qwen2.5-3b-instruct | 2,000,000 | |
qwen2.5-1.5b-instruct | ||
qwen2.5-0.5b-instruct | ||
Qwen-VL (visual understanding / image-to-text)
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-vl-32b-thinking | 60 | 100,000 |
qwen3-vl-32b-instruct | ||
qwen3-vl-30b-a3b-thinking | ||
qwen3-vl-30b-a3b-instruct | ||
qwen3-vl-8b-thinking | ||
qwen3-vl-8b-instruct | ||
qwen3-vl-235b-a22b-thinking | ||
qwen3-vl-235b-a22b-instruct | ||
qwen2.5-vl-72b-instruct | ||
qwen2.5-vl-32b-instruct | ||
qwen2.5-vl-7b-instruct | 1,200 | 1,000,000 |
qwen2.5-vl-3b-instruct | ||
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-vl-32b-thinking | 600 | 1,000,000 |
qwen3-vl-32b-instruct | ||
qwen3-vl-30b-a3b-thinking | ||
qwen3-vl-30b-a3b-instruct | ||
qwen3-vl-8b-thinking | ||
qwen3-vl-8b-instruct | ||
qwen3-vl-235b-a22b-thinking | 60 | 100,000 |
qwen3-vl-235b-a22b-instruct | ||
qwen2.5-vl-72b-instruct | ||
qwen2.5-vl-32b-instruct | ||
qwen2.5-vl-7b-instruct | 1,200 | 1,000,000 |
qwen2.5-vl-3b-instruct | ||
qwen2-vl-72b-instruct | ||
qwen2-vl-7b-instruct | ||
qwen2-vl-2b-instruct | ||
qvq-72b-preview | 60 | 100,000 |
Qwen3-Omni
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen2.5-omni-7b | 60 | 100,000 |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen2.5-omni-7b | 60 | 100,000 |
Qwen3-Omni-Captioner
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-omni-30b-a3b-captioner | 60 | 100,000 |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-omni-30b-a3b-captioner | 60 | 100,000 |
Qwen-Math
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen2.5-math-72b-instruct | 1,200 | 1,000,000 |
qwen2.5-math-7b-instruct | ||
qwen2.5-math-1.5b-instruct | ||
Qwen-Coder
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-coder-480b-a35b-instruct | 600 | 1,000,000 |
qwen3-coder-30b-a3b-instruct | ||
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-coder-480b-a35b-instruct | 1,000,000 | 600 |
qwen3-coder-30b-a3b-instruct | ||
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen3-coder-480b-a35b-instruct | 600 | 1,000,000 |
qwen3-coder-30b-a3b-instruct | ||
qwen2.5-coder-32b-instruct | 1,200 | |
qwen2.5-coder-14b-instruct | ||
qwen2.5-coder-7b-instruct | ||
qwen2.5-coder-3b-instruct | 2,000,000 | |
qwen2.5-coder-1.5b-instruct | ||
qwen2.5-coder-0.5b-instruct | ||
Text generation - Third-party
DeepSeek
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
deepseek-v3.2 | 15,000 | 1,200,000 |
deepseek-v3.2-exp | 15,000 | 1,200,000 |
deepseek-v3.1 | 15,000 | 1,200,000 |
deepseek-r1-0528 | 60 | 100,000 |
deepseek-r1 | 15,000 | 1,200,000 |
deepseek-v3 | ||
deepseek-r1-distill-qwen-7b | ||
deepseek-r1-distill-qwen-14b | ||
deepseek-r1-distill-qwen-32b | ||
deepseek-r1-distill-qwen-1.5b | 60 | 100,000 |
deepseek-r1-distill-llama-8b | ||
deepseek-r1-distill-llama-70b | ||
Kimi
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
kimi-k2.5 | 60 | 100,000 |
kimi-k2-thinking | 60 | 100,000 |
Moonshot-Kimi-K2-Instruct | 60 | 100,000 |
GLM
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
glm-4.7 | 1,000,000 | 60 |
glm-4.6 | ||
Image generation
Qwen-Image
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission limit | Concurrent tasks | ||
Text-to-image | qwen-image-max | 2 per minute | No limit for sync API |
qwen-image-max-2025-12-30 | 2 per minute | No limit for sync API | |
qwen-image-plus | 2 per second | No limit for sync API / Async API: 2 | |
qwen-image-plus-2026-01-09 | 2 per second | No limit for sync API | |
qwen-image | 2 per second | No limit for sync API / Async API: 2 | |
Image editing | qwen-image-edit-max | 2 per minute | No limit for sync API |
qwen-image-edit-max-2026-01-16 | 2 per minute | No limit for sync API | |
qwen-image-edit-plus | 2 per second | No limit for sync API | |
qwen-image-edit-plus-2025-12-15 | 2 per second | No limit for sync API | |
qwen-image-edit-plus-2025-10-30 | 2 per second | No limit for sync API | |
qwen-image-edit | 2 per second | No limit for sync API | |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission limit | Concurrent tasks | ||
Text-to-image | qwen-image-max | 2 per minute | No limit for sync API |
qwen-image-max-2025-12-30 | 2 per minute | No limit for sync API | |
qwen-image-plus | 2 per second | No limit for sync API / Async API: 2 | |
qwen-image-plus-2026-01-09 | 2 per second | No limit for sync API | |
qwen-image | 2 per second | No limit for sync API / Async API: 2 | |
Image editing | qwen-image-edit-max | 2 per minute | No limit for sync API |
qwen-image-edit-max-2026-01-16 | 2 per minute | No limit for sync API | |
qwen-image-edit-plus | 2 per second | No limit for sync API | |
qwen-image-edit-plus-2025-12-15 | 2 per second | No limit for sync API | |
qwen-image-edit-plus-2025-10-30 | 2 per second | No limit for sync API | |
qwen-image-edit | 2 per second | No limit for sync API | |
Image translation | qwen-mt-image | 1 per second | 2 |
Text-to-image - Z-Image
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | |
z-image-turbo | 2 | No limit for sync API |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | |
z-image-turbo | 2 | No limit for sync API |
Wan
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | ||
Text-to-image | wan2.6-t2i | 5 | 5 |
Image generation | wan2.6-image | 5 | 5 |
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | ||
Text-to-image | wan2.6-t2i | 5 | 5 |
wan2.5-t2i-preview | |||
wan2.2-t2i-flash | 2 | 2 | |
wan2.2-t2i-plus | |||
wan2.1-t2i-turbo | |||
wan2.1-t2i-plus | |||
Image editing | wan2.5-i2i-preview | 5 | 5 |
Image generation | wan2.6-image | 5 | 5 |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | ||
Text-to-image | wan2.6-t2i | 1 | 5 |
wan2.5-t2i-preview | 5 | ||
wanx2.0-t2i-turbo | 2 | 2 | |
wanx2.1-t2i-turbo | |||
wanx2.1-t2i-plus | |||
wan2.2-t2i-flash | |||
wan2.2-t2i-plus | |||
General image editing | wan2.5-i2i-preview | 5 | 5 |
wanx2.1-imageedit | 2 | 2 | |
Image generation | wan2.6-image | 5 | 5 |
AI试衣OutfitAnyone
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | |
aitryon-plus | 10 | 5 |
aitryon-parsing-v1 | 10 | No limit for sync API |
Video generation
Wan
Global
In the Global deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are dynamically scheduled worldwide.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | ||
Text-to-video | wan2.6-t2v | 5 | 5 |
Image-to-video - first frame | wan2.6-i2v | ||
Referece-to-video | wan2.6-r2v | ||
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | ||
Text-to-image | wan2.6-t2v | 5 | 5 |
wan2.5-t2v-preview | |||
wan2.2-t2v-plus | 2 | 2 | |
wan2.1-t2v-turbo | |||
wan2.1-t2v-plus | |||
Image-to-video - first frame | wan2.6-i2v-flash | 5 | 5 |
wan2.6-i2v | |||
wan2.5-i2v-preview | |||
wan2.2-i2v-flash | 2 | 2 | |
wan2.1-i2v-plus | |||
wan2.1-i2v-turbo | |||
wan2.2-i2v-plus | |||
Image-to-video - first and last frames | wan2.2-kf2v-flash | ||
wan2.1-kf2v-plus | 1 | ||
General video editing | wan2.1-vace-plus | 2 | |
Referece-to-video | wan2.6-r2v-flash | 5 | 5 |
wan2.6-r2v | 5 | 5 | |
Animate image | wan2.2-animate-move | 5 | 1 |
Video character swap | wan2.2-animate-mix | 5 | 1 |
US
In the US deployment mode, the endpoint and data storage are located in the US (Virginia) region. Inference computing resources are limited to the United States.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | ||
Text-to-video | wan2.6-t2v-us | 5 | 5 |
Image-to-video - first frame | wan2.6-i2v-us | ||
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Service | Model | Rate limit (triggered if any value is exceeded) | |
Task submission RPS limit | Concurrent tasks | ||
Text-to-video | wan2.6-t2v | 5 | 5 |
wan2.5-t2v-preview | |||
wan2.2-t2v-plus | 2 | 2 | |
wanx2.1-t2v-turbo | |||
wanx2.1-t2v-plus | |||
Image-to-video - first frame | wan2.6-i2v-flash | 5 | 5 |
wan2.6-i2v | |||
wan2.5-i2v-preview | |||
wan2.2-i2v-plus | 2 | 2 | |
wanx2.1-i2v-turbo | |||
wanx2.1-i2v-plus | |||
Image-to-video - first and last frames | wan2.2-kf2v-flash | ||
wanx2.1-kf2v-plus | |||
General video editing | wanx2.1-vace-plus | ||
Referece-to-video | wan2.6-r2v-flash | 5 | 5 |
wan2.6-r2v | 5 | 5 | |
Digital human | wan2.2-s2v-detect | 5 | No limit for sync API |
wan2.2-s2v | 1 | ||
Animate image | wan2.2-animate-move | 5 | 1 |
Video character swap | wan2.2-animate-mix | 5 | 1 |
AnimateAnyone
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Task submission RPS limit | Concurrent tasks |
animate-anyone-detect-gen2 | 5 | No limit for sync API |
animate-anyone-template-gen2 | 1 At a time, only one task is running. Other tasks are pending. | |
animate-anyone-gen2 |
EMO
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Task submission RPS limit | Concurrent tasks |
emo-detect-v1 | 5 | No limit for sync API |
emo-v1 | 1 At a time, only one task is running. Other tasks are pending. |
LivePortrait
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Task submission RPS limit | Concurrent tasks |
liveportrait-detect | 5 | No limit for sync API |
liveportrait | 1 At a time, only one task is running. Other tasks are pending. |
VideoRetalk
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Task submission RPS limit | Concurrent tasks |
videoretalk | 1 | 1 At a time, only one task is running. Other tasks are pending. |
Emoji
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Task submission RPS limit | Concurrent tasks |
emoji-detect-v1 | 1 | No limit for sync API |
emoji-v1 | 1 At a time, only one task is running. Other tasks are pending. |
Video style transform
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Task submission RPS limit | Concurrent tasks |
video-style-transform | 2 | 1 At a time, only one task is running. Other tasks are pending. |
Speech synthesis (text-to-speech)
Qwen speech synthesis
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Qwen3-TTS-Instruct-Flash
Model | RPM |
qwen3-tts-instruct-flash | 180 |
qwen3-tts-instruct-flash-2026-01-26 | 180 |
Qwen3-TTS-VD
Model | RPM |
qwen3-tts-vd-2026-01-26 | 180 |
Qwen3-TTS-VC
Model | RPM |
qwen3-tts-vc-2026-01-22 | 180 |
Qwen3-TTS-Flash
Model | RPM |
qwen3-tts-flash | 180 |
qwen3-tts-flash-2025-11-27 | 180 |
qwen3-tts-flash-2025-09-18 | 10 |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Qwen3-TTS-Instruct-Flash
Model | RPM |
qwen3-tts-instruct-flash | 180 |
qwen3-tts-instruct-flash-2026-01-26 | 180 |
Qwen3-TTS-VD
Model | RPM |
qwen3-tts-vd-2026-01-26 | 180 |
Qwen3-TTS-VC
Model | RPM |
qwen3-tts-vc-2026-01-22 | 180 |
Qwen3-TTS-Flash
Model | RPM |
qwen3-tts-flash | 180 |
qwen3-tts-flash-2025-11-27 | 180 |
qwen3-tts-flash-2025-09-18 | 10 |
Qwen-TTS
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60). | |
RPM | TPM Includes input and output tokens | |
qwen-tts | 10 | 100,000 |
qwen-tts-latest | ||
qwen-tts-2025-05-22 | ||
qwen-tts-2025-04-10 | ||
Qwen real-time speech synthesis
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Qwen3-TTS-Instruct-Flash-Realtime
Model | RPM |
qwen3-tts-instruct-flash-realtime | 180 |
qwen3-tts-instruct-flash-realtime-2026-01-22 | 180 |
Qwen3-TTS-VD-Realtime
Model | RPM |
qwen3-tts-vd-realtime-2026-01-15 | 180 |
qwen3-tts-vd-realtime-2025-12-16 |
Qwen3-TTS-VC-Realtime
Model | RPM |
qwen3-tts-vc-realtime-2026-01-15 | 180 |
qwen3-tts-vc-realtime-2025-11-27 |
Qwen3-TTS-Flash-Realtime
Model | RPM |
qwen3-tts-flash-realtime | 180 |
qwen3-tts-flash-realtime-2025-11-27 | 180 |
qwen3-tts-flash-realtime-2025-09-18 | 10 |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Qwen3-TTS-Instruct-Flash-Realtime
Model | RPM |
qwen3-tts-instruct-flash-realtime | 180 |
qwen3-tts-instruct-flash-realtime-2026-01-22 | 180 |
Qwen3-TTS-VD-Realtime
Model | RPM |
qwen3-tts-vd-realtime-2026-01-15 | 180 |
qwen3-tts-vd-realtime-2025-12-16 |
Qwen3-TTS-VC-Realtime
Model | RPM |
qwen3-tts-vc-realtime-2026-01-15 | 180 |
qwen3-tts-vc-realtime-2025-11-27 |
Qwen3-TTS-Flash-Realtime
Model | RPM |
qwen3-tts-flash-realtime | 180 |
qwen3-tts-flash-realtime-2025-11-27 | 180 |
qwen3-tts-flash-realtime-2025-09-18 | 10 |
Qwen-TTS-Realtime
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60). | |
RPM | TPM Includes input and output tokens | |
qwen-tts-realtime | 10 | 100,000 |
qwen-tts-realtime-latest | ||
qwen-tts-realtime-2025-07-15 | ||
Qwen voice cloning
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Model | RPM |
qwen-voice-enrollment | 180 |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Model | RPM |
qwen-voice-enrollment | 180 |
Qwen voice design
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Model | RPM |
qwen-voice-design | 180 |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Model | RPM |
qwen-voice-design | 180 |
CosyVoice speech synthesis
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Speech synthesis
Model | Task submission RPS limit |
cosyvoice-v3-plus | 3 |
cosyvoice-v3-flash |
Voice cloning
Model | Task submission RPS limit |
cosyvoice-v3-plus | 10 The voice cloning feature has a total concurrent request limit of 10 RPS. This limit applies whether you call a single model version or multiple versions at once. This means:
|
cosyvoice-v3-flash |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Speech synthesis
Model | Task submission RPS limit |
cosyvoice-v3-plus | 3 |
cosyvoice-v3-flash | |
cosyvoice-v2 |
Voice cloning
Model | Task submission RPS limit |
cosyvoice-v3-plus | 10 The voice cloning feature has a total concurrent request limit of 10 RPS. This limit applies whether you call a single model version or multiple versions at once. This means:
|
cosyvoice-v3-flash | |
cosyvoice-v2 |
Speech recognition (speech-to-text) and translation (speech-to-translation)
Qwen3-LiveTranslate-Flash
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60). | |
RPM | TPM Includes input and output tokens | |
qwen3-livetranslate-flash | 100 | 100,000 |
qwen3-livetranslate-flash-2025-12-01 | ||
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60). | |
RPM | TPM Includes input and output tokens | |
qwen3-livetranslate-flash | 100 | 100,000 |
qwen3-livetranslate-flash-2025-12-01 | ||
Qwen3-LiveTranslate-Flash-Realtime
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60). | |
RPM | TPM Includes input and output tokens | |
qwen3-livetranslate-flash-realtime | 10 | 100,000 |
qwen3-livetranslate-flash-realtime-2025-09-22 | ||
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60). | |
RPM | TPM Includes input and output tokens | |
qwen3-livetranslate-flash-realtime | 10 | 100,000 |
qwen3-livetranslate-flash-realtime-2025-09-22 | ||
Qwen audio file recognition
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Qwen3-ASR-Flash-Filetrans
Model | RPM |
qwen3-asr-flash-filetrans | 100 |
qwen3-asr-flash-filetrans-2025-11-17 |
Qwen3-ASR-Flash
Model | RPM |
qwen3-asr-flash | 100 |
qwen3-asr-flash-2025-09-08 |
United States
In the US deployment mode, endpoints and data storage are in the US (Virginia) region. Model inference compute resources are limited to the United States.
Model | RPM |
qwen3-asr-flash-us | 100 |
qwen3-asr-flash-2025-09-08-us |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Qwen3-ASR-Flash-Filetrans
Model | RPM |
qwen3-asr-flash-filetrans | 100 |
qwen3-asr-flash-filetrans-2025-11-17 |
Qwen3-ASR-Flash
Model | RPM |
qwen3-asr-flash | 100 |
qwen3-asr-flash-2025-09-08 |
Qwen real-time speech recognition
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Model | RPS |
qwen3-asr-flash-realtime | 20 |
qwen3-asr-flash-realtime-2025-10-27 |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Model | RPS |
qwen3-asr-flash-realtime | 20 |
qwen3-asr-flash-realtime-2025-10-27 |
Paraformer speech recognition
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Task submission RPS limit |
paraformer-realtime-v2 | 20 |
paraformer-realtime-8k-v2 |
Model | Task submission RPS limit | Task query RPS limit |
paraformer-v2 | 20 | 20 |
paraformer-8k-v2 | 20 |
Fun-ASR audio file recognition
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Task submission RPS limit | Task query RPS limit |
fun-asr | 10 | 20 |
fun-asr-2025-11-07 | ||
fun-asr-2025-08-25 |
Model | Submit job API RPM limit | Task query RPS limit |
fun-asr-mtl | 100 | 20 |
fun-asr-mtl-2025-08-25 |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Model | Task submission RPS limit | Task query RPS limit |
fun-asr | 10 | 20 |
fun-asr-2025-11-07 | ||
fun-asr-2025-08-25 | ||
fun-asr-mtl | ||
fun-asr-mtl-2025-08-25 |
Fun-ASR real-time speech recognition
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Task submission RPS limit |
fun-asr-realtime | 20 |
fun-asr-realtime-2025-11-07 |
Mainland China
In the Mainland China deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Mainland China.
Model | Task submission RPS limit |
fun-asr-realtime | 20 |
fun-asr-realtime-2025-11-07 | |
fun-asr-realtime-2025-09-15 | |
fun-asr-flash-8k-realtime | |
fun-asr-flash-8k-realtime-2026-01-28 |
Text embedding
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM/Number of tasks Includes input and output tokens | |
text-embedding-v4 | 1,800 | 1,000,000 |
text-embedding-v3 | 6,000 | 24,000,000 |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) | |
RPS | TPM/Number of tasks Includes input and output tokens | |
text-embedding-v4 | 30 | 1,200,000 |
Multimodal embedding
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Input tokens only | |
qwen3-vl-embedding | 1,200 | 600,000 |
multimodal-embedding-v1 | 120 | 200,000 |
文本排序
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
gte-rerank-v2 | 5,040 | 4,980,000,000 |
Domain specific
Intent recognition
Only the Mainland China deployment mode is supported. In this mode, the endpoint and data storage are located in the Beijing region, and inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
tongyi-intent-detect-v3 | 1,200 | 1,000,000 |
Role playing
International
In the International deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-plus-character-ja | 120 | 500,000 |
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to Mainland China.
Model | Rate limit (triggered if any value is exceeded) The following are per-minute limits. The service may also enforce limits based on RPS (RPM/60) and TPS (TPM/60) | |
RPM | TPM Includes input and output tokens | |
qwen-plus-character | 120 | 500,000 |
已下线模型
See Model unpublishing mechanism.
Retired on January 30, 2026
Category | Model | Rate limit (triggered if any value is exceeded) | |
RPM | TPM Includes input and output tokens | ||
Qwen-Plus | qwen-plus-2024-11-27 | 0 | 0 |
qwen-plus-2024-11-25 | |||
qwen-plus-2024-09-19 | |||
qwen-plus-2024-08-06 | |||
Qwen-Turbo | qwen-turbo-2024-09-19 | ||
Qwen-VL | qwen-vl-max-2024-10-30 | ||
qwen-vl-max-2024-08-09 | |||
qwen-vl-plus-2024-08-09 | |||
Retired on August 20, 2025
Category | Model | Rate limit (triggered if any value is exceeded) | |
RPM | TPM Includes input and output tokens | ||
Text generation - Qwen | qwen2-72b-instruct | 0 | 0 |
qwen2-57b-a14b-instruct | |||
qwen2-7b-instruct | |||
qwen1.5-110b-chat | |||
qwen1.5-72b-chat | |||
qwen1.5-32b-chat | |||
qwen1.5-14b-chat | |||
qwen1.5-7b-chat | |||