
Billing for models

Last Updated: Nov 25, 2025

Pricing overview

Activating Alibaba Cloud Model Studio does not incur any fees. Fees for model inference (calling) are generated when you call models to perform tasks such as text generation, image generation, and speech synthesis.

To view bills, go to the Bill Details and Cost Analysis pages. To view usage statistics, go to the Model Observation (Singapore or Beijing) page.

Billable items

Model inference (calling)

Method

By model call volume

Formula

Fee = Usage × Unit price

Description

Free quota: A free quota is available only in the Singapore region. Real-time calls within the free quota are not charged. The remaining quota data is updated hourly and may be delayed by up to one hour during peak periods.

Unit price: View prices

Model inference (calling)

Billing overview & free quota

For model call prices, see Models. For limits such as requests per minute (RPM) and tokens per minute (TPM), see Rate limits.

Note

A free quota is available only in the Singapore region. For more information about how to claim a free quota and view the remaining free quota, see New user free quota.

On the Model Observation (Singapore or Beijing) page, view the number of calls and tokens consumed for a specific model.

Subscription (savings plan)

You can purchase one or more savings plans to offset inference fees incurred after your free quota is used up. After the savings plan is exhausted, the system will start deducting fees from your account balance.

Large language model

Purchase method

Click here to purchase an LLM inference savings plan.

Tiers

Alibaba Cloud Model Studio offers the following purchase tiers: $10, $50, $100, $500, $1,000, $5,000, and $50,000.

Validity period

  • For the $10, $50, and $100 tiers, the validity period is three months.

  • For the $500, $1,000, $5,000, and $50,000 tiers, the validity period is six months.

Applicable models

All text generation models in the Singapore and Beijing regions, including Qwen commercial editions, Qwen open source editions, DeepSeek, and Kimi. Go to Models to view these models and their call prices.

Usage instructions

When using Model Studio, the quota of the savings plan is consumed first. If you have purchased multiple savings plans, they will be deducted in order of their expiration dates. If the expiration dates are the same, the first purchased savings plan will be deducted first.
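The deduction order described above can be sketched as a simple sort. This is an illustration only; the plan fields and names below are assumptions, not an actual Model Studio API:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SavingsPlan:
    # Hypothetical fields for illustration only.
    name: str
    expires_on: date
    purchased_on: date
    remaining: float  # remaining quota in USD

def deduction_order(plans):
    """Plans are consumed earliest-expiring first; ties are broken by
    earliest purchase date, per the rule described above."""
    return sorted(plans, key=lambda p: (p.expires_on, p.purchased_on))

plans = [
    SavingsPlan("B", date(2025, 6, 1), date(2025, 3, 1), 50.0),
    SavingsPlan("A", date(2025, 6, 1), date(2025, 1, 1), 100.0),
    SavingsPlan("C", date(2025, 9, 1), date(2025, 2, 1), 500.0),
]
# A and B expire on the same day, but A was purchased first, so A is
# deducted before B; C expires latest and is deducted last.
print([p.name for p in deduction_order(plans)])
```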

Query savings plan bills

For more information, see How to query savings plan bills.

Wan model

Purchase method

Click here to purchase a Wan savings plan.

Purchase instructions

Alibaba Cloud Model Studio offers five purchase tiers:

  • $10: No discount

  • $50: No discount

  • $100: No discount

  • $500: 2% off

  • $1,000: 5% off

  • $5,000: 10% off

Discount example: Take the $500 tier as an example. If generating a video costs $1, the actual amount deducted from the savings plan will be $1 × 0.98 = $0.98.
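The tier discounts above can be expressed as a simple lookup. This is a sketch for checking the arithmetic, not an official API; the `deducted_amount` helper is hypothetical:

```python
# Discount per purchase tier (USD), mirroring the table above.
WAN_TIER_DISCOUNT = {
    10: 0.00, 50: 0.00, 100: 0.00,   # no discount
    500: 0.02, 1000: 0.05, 5000: 0.10,
}

def deducted_amount(tier: int, list_price: float) -> float:
    """Amount actually deducted from the savings plan for one call."""
    return round(list_price * (1 - WAN_TIER_DISCOUNT[tier]), 4)

print(deducted_amount(500, 1.0))   # $1 video on the $500 tier -> 0.98
print(deducted_amount(5000, 1.0))  # $1 video on the $5,000 tier -> 0.9
```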

Validity period

  • For the $10, $50, and $100 tiers, the validity period is three months.

  • For the $500, $1,000, and $5,000 tiers, the validity period is six months.

Usage instructions

When using Model Studio, the quota of the savings plan is consumed first. If you have purchased multiple savings plans, they will be deducted in order of their expiration dates.

Query bills

See How to query savings plan bills.

Applicable models

Image generation: wan2.5-t2i-preview, wan2.5-i2i-preview, wan2.2-t2i-plus, wan2.2-t2i-flash, wanx2.1-imageedit, wan2.1-t2i-plus, wan2.1-t2i-turbo, wanx2.0-t2i-turbo

Video generation: wan2.5-t2v-preview, wan2.5-i2v-preview, wan2.2-i2v-flash, wan2.2-i2v-plus, wan2.2-t2v-plus, wan2.1-vace-plus, wan2.1-kf2v-plus, wan2.1-i2v-plus, wan2.1-i2v-turbo, wan2.1-t2v-plus, wan2.1-t2v-turbo

Go to Models to view all models and their call prices.

Batch discounts (Singapore region only)

The Batch Inference (Batch API) service asynchronously processes large datasets at 50% of the cost of real-time calls.

You can submit files through the console or the API to create batch tasks. The system processes data during off-peak hours and returns the results when the task is complete or the maximum wait time is reached.
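As a quick sketch, the 50% batch discount applies directly to the real-time unit price. The price below is illustrative, not an actual list price:

```python
def batch_fee(tokens: int, realtime_price_per_million: float) -> float:
    """Batch API usage is billed at 50% of the real-time unit price."""
    return tokens / 1_000_000 * realtime_price_per_million * 0.5

# Example: 2 million tokens at a hypothetical real-time price of $0.80
# per million tokens cost $1.60 in real time, but $0.80 via the Batch API.
print(batch_fee(2_000_000, 0.80))
```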

Supported models

Text generation models: qwen-max, qwen-plus, qwen-turbo

Limits

Batch inference does not support services or discounts such as subscription (savings plan), free quota, or Context Cache.

Context cache discounts

Includes implicit cache and explicit cache:

  • Implicit cache

    There is no extra charge to enable implicit cache mode.

    You can retrieve the number of cached tokens from the cached_tokens attribute in the response.

    OpenAI-compatible batch methods are not eligible for cache discounts.
  • Explicit cache

    Includes the following fees:

    • Create cache: The fee for tokens used to create a cache is calculated at 125% of the standard input unit price. If an existing cache is a prefix of the new cache, only the new content (new cache block token count - existing cache block token count) is billed.

      Suppose there is an existing cache block A with 1,200 tokens. When a new request needs to cache the 1,500-token content AB, the 1,200 tokens are billed at the hit price (10% of the standard input price), and the new 300 tokens are billed at the creation price (125% of the standard input price).

      The number of tokens used to create a cache can be viewed through the cache_creation_input_tokens parameter.
    • Hit cache: The unit price is 10% of the standard input token price.

      The number of tokens that hit the cache can be viewed through the cached_tokens parameter.
    • Other tokens: Tokens that do not hit the cache and are not used to create a cache are billed at the original price.
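The explicit-cache arithmetic above can be checked with a small helper. The function and the $0.80-per-million input price are hypothetical; only the 10% and 125% multipliers come from the rules above:

```python
def explicit_cache_fee(total_tokens, cached_prefix_tokens, input_price_per_million):
    """Fee for an explicit-cache request per the rules above:
    tokens already cached (prefix hit) bill at 10% of the input price,
    newly created cache tokens bill at 125% of the input price."""
    hit = cached_prefix_tokens
    created = total_tokens - cached_prefix_tokens
    per_token = input_price_per_million / 1_000_000
    return hit * per_token * 0.10 + created * per_token * 1.25

# The A/AB example: block A has 1,200 tokens, the new cache AB has 1,500,
# so 1,200 tokens hit and 300 are newly created.
fee = explicit_cache_fee(1500, 1200, 0.80)
print(f"${fee:.8f}")
```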

FAQ

General

How to pay or top up my account?

Model calling fees are automatically deducted. Bills are generated hourly. For more information, see Introduction to payment methods.

Subscription method: To purchase a savings plan for model inference, click here to purchase an LLM inference savings plan.

How to renew my service?

After March 15, 2024, Model Studio upgraded its commercial services. All subscription services were changed to pay-as-you-go services. Therefore, you do not need to manually renew your services. The pay-as-you-go billing method is used automatically.

How to stop billing?

  • Model inference and model training

    You will no longer incur fees after you stop using the related features. For model inference, you can delete the API key (Singapore or Beijing) to prevent further charges from accidental calls.

You can set a monthly spending alert. Set the alert threshold to a low value. Alibaba Cloud will notify you when unexpected charges occur to help you avoid further losses.

How to view the number of calls and tokens consumed?

You can view the number of calls and token consumption for a specific model on the Model Observation (Singapore or Beijing) page.

How are tokens calculated?

Tokens are the basic units that a model uses to represent text. You can think of them as characters or words.

  • In Chinese, one token is usually one character or word. For example, the text "你好,我是通义千问" (Hello, I am Qwen) is converted to ['你好', ',', '我是', '通', '义', '千', '问'].

  • For English text, one token usually corresponds to three to four letters or one word. For example, "Nice to meet you." is converted to ['Nice', ' to', ' meet', ' you', '.'].

Different LLMs may chunk text into tokens differently. You can use an SDK to view how a Qwen model chunks your text into tokens on your local machine.

View the token data chunked by a Qwen model (Python):

# Make sure that the DashScope Python SDK is installed.
from dashscope import get_tokenizer

# Get the tokenizer object. Currently, only Qwen series models are supported.
tokenizer = get_tokenizer('qwen-turbo')

input_str = 'Qwen has powerful capabilities.'

# Chunk the string into tokens and convert them to token IDs.
tokens = tokenizer.encode(input_str)
print(f"The token IDs after chunking are: {tokens}.")
print(f"There are {len(tokens)} tokens after chunking.")

# Convert the token IDs to strings and print them.
for i in range(len(tokens)):
    print(f"The string corresponding to token ID {tokens[i]} is: {tokenizer.decode(tokens[i])}")
The same in Java:

// Copyright (c) Alibaba, Inc. and its affiliates.
// dashscope SDK version >= 2.13.0
import java.util.List;
import com.alibaba.dashscope.exception.NoSpecialTokenExists;
import com.alibaba.dashscope.exception.UnSupportedSpecialTokenMode;
import com.alibaba.dashscope.tokenizers.Tokenizer;
import com.alibaba.dashscope.tokenizers.TokenizerFactory;

public class Main {
  public static void testEncodeOrdinary(){
    Tokenizer tokenizer = TokenizerFactory.qwen();
    String prompt ="If you had to walk a very long distance, how long would it take to arrive? ";
    // encode string with no special tokens
    List<Integer> ids = tokenizer.encodeOrdinary(prompt);
    System.out.println(ids);
    String decodedString = tokenizer.decode(ids);
    assert decodedString.equals(prompt);
  }

  public static void testEncode() throws NoSpecialTokenExists, UnSupportedSpecialTokenMode{
    Tokenizer tokenizer = TokenizerFactory.qwen();
    String prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nSan Francisco is a<|im_end|>\n<|im_start|>assistant\n";
    // encode string with special tokens <|im_start|> and <|im_end|>
    List<Integer> ids = tokenizer.encode(prompt, "all");
    // 24 tokens [151644, 8948, 198, 7771, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 23729, 80328, 9464, 374, 264, 151645, 198, 151644, 77091, 198]
    String decodedString = tokenizer.decode(ids);
    System.out.println(ids);
    assert decodedString.equals(prompt);

  }

  public static void main(String[] args) {
      try {
        testEncodeOrdinary();
        testEncode();
      } catch (NoSpecialTokenExists | UnSupportedSpecialTokenMode e) {
        e.printStackTrace();
      }
  }
}

The local tokenizer helps estimate the number of tokens in your text. However, the result is for reference only and may not match the server-side count exactly. For more information about the Qwen tokenizer, see the tokenizer reference.

What to do if a model call fails?

Refer to the Error messages document for the corresponding solution.

Billing rules

Why does my free quota not decrease after I call a model? (Singapore only)

The free quota data is updated hourly. During peak hours, there may be a delay of up to one hour. Therefore, you need to view the remaining quota one hour after the model call is complete.

How are tokens that exceed the free quota billed? (Singapore only)

You are billed based on the actual number of tokens consumed. Because the unit price (input or output cost) is per 1 million tokens, the formula is:

Fee = Actual number of tokens consumed / 1,000,000 × Unit price.

For example, the input cost of qwen-vl-max is $0.80 per 1 million tokens, and the remaining free quota is 50,000 tokens. In a call where the input is 50,400 tokens, the fee for tokens that exceed the free quota is 400 / 1,000,000 × $0.80.
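The formula and example above can be checked with a short helper (the function name is ours; the $0.80 price and token counts are taken from the example):

```python
def overage_fee(tokens_used, free_quota_remaining, price_per_million):
    """Fee for tokens exceeding the free quota:
    Fee = billable tokens / 1,000,000 x unit price."""
    billable = max(tokens_used - free_quota_remaining, 0)
    return billable / 1_000_000 * price_per_million

# The qwen-vl-max example: 50,400 input tokens, 50,000 free tokens
# remaining, $0.80 per million input tokens -> only 400 tokens billed.
print(overage_fee(50_400, 50_000, 0.80))
```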

How are multi-turn conversations billed?

In a multi-turn conversation, the input and output from historical conversations are billed as input tokens for the new turn.
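In other words, each new turn re-sends the whole conversation history, and all of it counts as input tokens. A rough sketch (the word-count tokenizer below is a crude stand-in for a real one, for illustration only):

```python
def count_tokens(text: str) -> int:
    """Crude word-count stand-in for a real tokenizer."""
    return len(text.split())

def input_tokens_for_turn(history, new_user_message):
    """history: list of (role, text) pairs from previous turns.
    All historical input and output is billed as input for the new turn."""
    total = sum(count_tokens(text) for _, text in history)
    return total + count_tokens(new_user_message)

history = [
    ("user", "What is the capital of France"),
    ("assistant", "The capital of France is Paris"),
]
# 6 + 6 historical "tokens" plus 3 for the new message.
print(input_tokens_for_turn(history, "And of Japan"))
```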

Are model applications charged?

You are not charged for creating an application. However, when the application calls a model to answer a question, you are charged the model calling fee for the model that is called.

Why is my LLM inference savings plan not used for deduction?

If the free quota is not used up, no bill is generated and no fee is incurred. In this case, the savings plan is not used for deduction. The savings plan is used for deduction after the free quota is used up and a bill is generated.

Overdue payments

What are the impacts of an overdue payment?

If your account has an overdue payment, you cannot make model calls even if you have a free quota (Singapore only) or resource plans. You can go to the Recharge page to top up your account.

API call error: How to quickly resolve issues with service activation or overdue payments?

1. Service not activated

Use your Alibaba Cloud account to go to the Model Studio console (Singapore or Beijing) and activate the model service of Model Studio.

2. Insufficient account balance

  • Check balance: Log on to the Expenses and Costs page to check whether your balance is sufficient.

  • Top up: Click Top-up & Remittance, enter the required amount, and complete the payment.

3. Set a spending alert to prevent repeated errors

Bills

After model inference, why can't I find the relevant bills on the Bill Details page?

Possible reasons are:

  • The billing system is updated on an hourly basis, and bills may be further delayed during peak hours. For example, charges that are incurred between 16:00 and 17:00 may not be billed until 19:30.

  • Free models and model inference within the free quota (Singapore only) do not generate bills. Only usage that exceeds the free quota generates bills.

How to view the costs of all Model Studio services?

On the Cost Analysis page, set Cost Type to Pretax Amount, set Time Unit to Month, select a time range, and set Product Name to Alibaba Cloud Model Studio. You can then view the costs of Model Studio within the selected time range.

How to view the costs of the model inference service?

On the Cost Analysis page, set Cost Type to Pretax Amount, set Time Granularity to Month, select a time range, and set Product Detail to Model Studio Foundation Model Inference. You can then view the total cost of model inference within the selected time range.

How to view the inference cost of a specific model?

Take qwen-max as an example. On the Bill Details page, select a Billing Month. Set Commodity Name to Model Studio Foundation Model Inference and click Search.

In the Instance ID column, find all instances related to qwen-max. Sum the pretax amounts for these instances to get the total inference fee for the qwen-max model in the selected billing cycle.

How to export and view the number of consumed tokens in a detailed bill?

On the Bill Details page, set Statistics Item to Billable Item and export the bill. You can view the token usage in the bill.

How to reconcile detailed bills for models?

Bills for model inference, deployment, and training that are generated after September 7, 2024 can be reconciled based on the ApiKeyID, workspace ID, model name, input/output type, calling channel, and tags of instances.

On the Bill Details page, select a Billing Month. Set Commodity Name to Model Studio Foundation Model Inference and click Search. Download the search results to your local machine and reconcile the bills based on the content in the Instance ID column.

A complete Asset/Resource Instance ID, such as 12xxx;llm-xxx;qwen-max;output_token;app, represents ApiKeyID;Workspace ID;Model Name;Input/Output Type;Calling Channel respectively. If your Asset/Resource Instance ID does not contain an ApiKeyID, the charge item was generated by a call from the console.

A complete Instance ID, such as text_token;llm-xxx;qwen-max;output_token;app, represents Billing Type;Workspace ID;Model Name;Input/Output Type;Calling Channel respectively.

A complete instance tag, such as key:test value:test, represents Tag Key (key) and Tag Value (value) respectively. If an instance has two or more tags, the tag key-value pairs are listed sequentially and separated by semicolons, such as key:test1 value:test1; key:test2 value:test2.

  • Go to the Model Studio API Key Management page and confirm the API key that corresponds to the ApiKeyID to reconcile bills by API key.

  • Go to the Workspace Management (Singapore or Beijing) page and confirm the workspace that corresponds to the workspace ID to reconcile bills by workspace.

  • Calling channels include app, bmp, and assistant-api: app indicates that the model is called through an application, bmp indicates a call through the Playground (Singapore or Beijing), and assistant-api indicates a call through the assistant API.
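When reconciling exported bills programmatically, the semicolon-separated Instance ID can be split into its named fields. The helper and key names below are ours, not part of any official tooling; only the field order comes from the format described above:

```python
def parse_instance_id(instance_id: str) -> dict:
    """Split a bill Instance ID such as
    '12xxx;llm-xxx;qwen-max;output_token;app' into its fields.
    The first field is the ApiKeyID, or a billing type such as
    text_token when the call came from the console."""
    keys = ["api_key_id_or_billing_type", "workspace_id",
            "model_name", "io_type", "calling_channel"]
    return dict(zip(keys, instance_id.split(";")))

rec = parse_instance_id("12xxx;llm-xxx;qwen-max;output_token;app")
print(rec["model_name"], rec["calling_channel"])
```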

How are pay-as-you-go bills settled?

Pay-as-you-go cloud resource bills are not settled in real time. Instead, the system first freezes the consumed but not yet settled amount from the account's available credit. At the beginning of the next month, after the final monthly bill is issued, the previous month's bill is actually deducted.

Cost control

How to set an alert for high spending?

You can set a monthly spending alert in the Expenses and Costs center.

How to limit the usage of model calls?

  • Stop charges after your free quota is used up

    To avoid extra costs, Model Studio provides a Free quota only feature.

  • Limit the number of model calls or tokens consumed per unit of time

    Set rate limits for a sub-workspace. Go to the Workspaces page, find the target sub-workspace, and click Authorization & Throttling Settings. Adjust the Request Number Limit and Token Limit for each model.

  • Set an alert for token consumption

    Set an alert rule for model overhead. For more information, see Usage and performance observation.

    • If the Advanced monitoring service has not been activated, use the Alibaba Cloud account to switch to the target workspace and manually enable or disable the service on the Model Observation page. To use a RAM user, the Alibaba Cloud account must first grant the necessary permissions to the RAM user.

    • Go to the Model Alerting page and follow the instructions to activate the CloudMonitor service.

    • Click Create Alert Rule to configure the rule. When a specified metric becomes abnormal, the system notifies you or your O&M team.

    Model alerts only trigger notifications and do not stop model calls.