Alibaba Cloud Model Studio: Model monitoring

Last Updated: Jun 23, 2026

Model monitoring lets you:

  • View call records.

  • Monitor metrics and set alerts for time to first token, call duration, requests per minute (RPM), tokens per minute (TPM), and failure rate.

  • Track token consumption.

Supported models

  • Basic monitoring: Supports all models in the model list.

  • Advanced monitoring: Supports all models in the China (Beijing), Singapore, and US (Virginia) regions.

  • Alerting feature: Supports all models in the China (Beijing) and Singapore regions.

Monitor model runs

The system automatically collects model call data from all workspaces in your Alibaba Cloud account. When a direct or indirect model call occurs, the system synchronizes this data to the Monitoring list in the target workspace.

List records are generated for each model and workspace. A new model appears in the list after its first data synchronization. Basic monitoring has an hourly data latency. For minute-level data insights, use advanced monitoring.

At the top of the list, an overview dashboard summarizes key metrics on cards, including Total Models, Total Calls, Total Failures, Average call duration, and Average time to first token.

The Monitoring table lists each model's Model Code, Workspace, Total Calls, Total Failures, failure rate, Average call duration, and Average time to first token. All columns except Model Code and Workspace are sortable. The Actions column provides access to the Monitor and Log pages.

Members of the default workspace can view model calls across all workspaces. Members of a sub-workspace can only view data for their current workspace and cannot switch to other workspaces.

Find the target model in the list and click Monitor in the Actions column to view the following four metric categories:

  • Security: Identifies policy violations in conversations, such as Content Moderation Error Count.

  • Cost: Evaluates the cost-effectiveness of the model, with metrics like Average Usage per Request.

  • Performance: Observes changes in model performance, with metrics such as call duration and time to first token.

  • Error: Assesses the model's stability, with metrics like Failures and failure rate.

You can create alerts based on these metrics to detect and address anomalies promptly.

Clicking Monitor in the Actions column opens the model details page, which contains the Monitoring and Log tabs. The Monitoring tab is divided into two sections: call statistics and performance metrics.

Call statistics

This tab displays metrics related to Security, Cost, and Error, such as call count and failure count. You can filter the data by API key, inference type, time range, and time granularity (by minute or by hour).

  • Rate Limiting Error Count: Indicates call failures caused by a 429 status code.

  • Content Moderation Error Count: Indicates that the input or output was blocked by the Content Moderation Service because it contained suspected sensitive or high-risk content, such as profanity, political content, or advertisements.

In the Failures chart on the call statistics tab, click Failure details to view a breakdown of failures and diagnose their root cause.
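The relationship between call counts, failures, and rate-limiting errors can be illustrated with a small calculation. This is a minimal sketch, not the console's implementation: the function name `summarize_failures`, the input shape (call counts keyed by HTTP status code), and the assumption that only status 200 counts as a success are all hypothetical.

```python
from typing import Mapping


def summarize_failures(calls_by_status: Mapping[str, int]) -> dict:
    """Summarize call counts keyed by HTTP status code (given as strings)."""
    total = sum(calls_by_status.values())
    # Assumption for this sketch: only status 200 is treated as a success.
    failures = total - calls_by_status.get("200", 0)
    return {
        "total_calls": total,
        "failures": failures,
        # Failures caused by a 429 status code (rate limiting).
        "rate_limited": calls_by_status.get("429", 0),
        "failure_rate": failures / total if total else 0.0,
    }


# Illustrative numbers only.
summary = summarize_failures({"200": 970, "429": 20, "500": 10})
print(summary)
```

Keeping the 429 count separate from the overall failure count mirrors the way the Rate Limiting Error Count is reported alongside Failures.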

Performance metrics

This tab displays Performance-related metrics, such as requests per minute (RPM), tokens per minute (TPM), call duration, time to first token, and subsequent token latency.
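As a rough illustration of what RPM and TPM measure, the sketch below buckets individual calls by minute and counts requests and tokens in each bucket. The function name `per_minute_rates` and the input format (a timestamp and a token count per call) are assumptions for this example; the platform's own aggregation may differ.

```python
from collections import defaultdict
from datetime import datetime, timezone


def per_minute_rates(calls):
    """Return {minute_start: (requests, tokens)} for (timestamp, tokens) pairs."""
    buckets = defaultdict(lambda: [0, 0])
    for timestamp, tokens in calls:
        # Truncate each timestamp to the start of its minute.
        minute = timestamp.replace(second=0, microsecond=0)
        buckets[minute][0] += 1       # requests in this minute (RPM)
        buckets[minute][1] += tokens  # tokens in this minute (TPM)
    return {minute: tuple(counts) for minute, counts in sorted(buckets.items())}


# Illustrative call records only.
sample_calls = [
    (datetime(2025, 11, 20, 10, 0, 5, tzinfo=timezone.utc), 500),
    (datetime(2025, 11, 20, 10, 0, 30, tzinfo=timezone.utc), 700),
    (datetime(2025, 11, 20, 10, 0, 59, tzinfo=timezone.utc), 300),
    (datetime(2025, 11, 20, 10, 1, 2, tzinfo=timezone.utc), 250),
]
rates = per_minute_rates(sample_calls)
for minute, (rpm, tpm) in rates.items():
    print(minute.isoformat(), rpm, tpm)
```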

View token consumption

Adjusting model parameters and system prompts affects a model's token consumption. To help you track and manage costs, model monitoring provides the following features:

  • Summary: Aggregates historical token consumption by workspace. You can further filter the data by time range and API key.

  • Tracking: Records the token consumption for each model call.

  • Alerting: Sets token consumption thresholds and sends an alert when a model's consumption is abnormal.

View historical model token consumption

  • To view token consumption for the last 30 days:

    1. On the Monitoring page for your target workspace, find the model and click Monitor in the Actions column.

    2. On the Call Statistics tab, view the token consumption data in the Calls section.

  • To view earlier usage, go to the Expenses and Costs page.

View token consumption for a specific call

This feature is currently available only for some models in the China (Beijing) region.

  1. Log on with your Alibaba Cloud account (or as a RAM user with sufficient permissions). In your target workspace, go to the Monitoring (Beijing) page and click Monitoring Configuration in the upper-right corner. Follow the instructions to enable the audit log and inference log.

    Once enabled, the system records the input and output of every model call in the workspace. Logs can take several minutes to appear after a call.

  2. Find the target model in the model monitoring list and click Logs in the Actions column.

  3. The Logs tab displays the real-time inference call records for the model. The Usage field shows the token consumption for the call.

Create an alert for abnormal consumption

Conversation history (model logs)

Important

This feature is currently limited to some models in the China (Beijing) region.

Model monitoring records the input, output, and latency for each model call, providing crucial data for troubleshooting and content auditing.

Step 1: Enable logging

Log on with an Alibaba Cloud account (or a RAM user with sufficient permissions). On the model monitoring page of the target workspace, click Monitoring Configuration in the upper-right corner, and follow the instructions to enable audit logs and inference logs.

After you enable this feature, the system starts recording the input and output of each model call within the workspace. Logs may take several minutes to appear after a call.
To stop recording, simply disable inference logs in the Monitoring Configuration.

Step 2: View conversation history

  1. In the model monitoring list, find the target model and click Logs in the Actions column.

  2. The Logs tab displays the real-time inference call records for the model. The Request and Response fields show the input and output for each call.

Models that support request and response

  • Qwen Max

    • qwen3-max, qwen3-max-preview, qwen3-max-2025-09-23 and later snapshot versions

    • qwen-max

  • Qwen Plus

    • qwen3.7-plus, qwen3.7-plus-2026-05-26 and later snapshot versions

    • qwen3.6-plus, qwen3.6-plus-2026-04-02 and later snapshot versions

    • qwen3.5-plus, qwen3.5-plus-2026-02-15 and later snapshot versions

    • qwen-plus, qwen-plus-latest, qwen-plus-2025-12-01 and later snapshot versions

  • Qwen Flash

    • qwen3.5-flash, qwen3.5-flash-2026-02-23

    • qwen-flash, qwen-flash-2025-07-28

  • Qwen Turbo: qwen-turbo

  • Qwen Coder: qwen3-coder-flash, qwen3-coder-flash-2025-07-28, qwen3-coder-plus, qwen3-coder-plus-2025-07-22, qwen3-coder-plus-2025-09-23

  • Open-source models: qwen3-235b-a22b, qwen3-235b-a22b-instruct-2507, qwen3-235b-a22b-thinking-2507, qwen3-30b-a3b, qwen3-30b-a3b-instruct-2507, qwen3-30b-a3b-thinking-2507, qwen3-next-80b-a3b-instruct, qwen3-next-80b-a3b-thinking, qwen3-coder-480b-a35b-instruct

  • Third-party models: deepseek-v3.1, deepseek-v3.2, deepseek-v3.2-exp

Set up proactive alerts

Important

Model monitoring is available in the Singapore and China (Beijing) regions, but alert rules can currently be created only in the China (Beijing) region.

Silent failures, such as timeouts or sudden spikes in token consumption, are often difficult to detect with traditional application logs. Model monitoring enables you to set up alerts for key metrics like cost, failure rate, and response latency. When a metric becomes abnormal, the system immediately sends an alert.

Step 1: Enable advanced monitoring

  1. Log on with your Alibaba Cloud account (or a RAM user with sufficient permissions). In your target workspace, go to the Model Monitoring (Singapore or China (Beijing)) page and click Monitoring Configuration in the upper-right corner.

  2. In the advanced monitoring section, enable Performance and usage metrics monitoring.

Step 2: Create an alert rule

  1. On the Model Alerts (Singapore or China (Beijing)) page, click Create Alert Rule in the upper-right corner.

  2. In the dialog box, select the model and monitoring template, and after confirming the settings, click Create. When the specified monitoring metrics (such as call statistics or performance metrics) become abnormal, the system will notify your team.

    • Notification methods: SMS, email, phone calls, DingTalk group robot, WeCom Robot, and webhooks.

    • Alert level: There are four predefined, unmodifiable alert levels: General, Warning, Error, and Urgent. Each level uses specific notification channels:

      • Urgent (CRITICAL): Phone call, SMS, and email

      • Error (ERROR): SMS and email

      • Warning (WARNING): SMS and email

      • General (INFO): Email
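If you route alerts in your own tooling, the level-to-channel list above can be captured as a lookup table. This is a minimal sketch: the names `ALERT_CHANNELS` and `channels_for` and the channel strings are hypothetical, and the table covers only the per-level channels listed above (not DingTalk, WeCom, or webhooks).

```python
# Level identifiers follow the names in parentheses in the list above.
ALERT_CHANNELS = {
    "CRITICAL": ("phone", "sms", "email"),  # Urgent
    "ERROR": ("sms", "email"),              # Error
    "WARNING": ("sms", "email"),            # Warning
    "INFO": ("email",),                     # General
}


def channels_for(level: str) -> tuple:
    """Return the notification channels for an alert level (case-insensitive)."""
    return ALERT_CHANNELS[level.upper()]


print(channels_for("critical"))
```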

Integrate with Grafana and custom applications

Model monitoring metrics are stored in your private Prometheus instance, which supports the standard Prometheus HTTP API for connecting to Grafana or your custom application to perform visual analytics.

Step 1: Get the HTTP API address

  1. Ensure that advanced monitoring is enabled.

  2. On the Model Monitoring (Asia Pacific SE 1 (Singapore)), Model Monitoring (US East 1 (Virginia)), or Model Monitoring (China (Beijing)) page, click Monitoring Configuration in the upper-right corner. To the right of the CloudMonitor Prometheus instance, click View Details.

  3. On the Settings page, copy the HTTP API address for your client's network environment (public network or VPC access).

Step 2: Connect to Grafana or a custom application

Connect to custom applications

The following examples show how to retrieve metric data using the Prometheus HTTP API. For complete usage details, see the Prometheus HTTP API documentation.

  • Example 1: Query the token consumption for all models across all workspaces in an Alibaba Cloud account on November 20, 2025 (UTC), using the model_usage metric with a 60s step size.

    Request:

    GET {HTTP API}/api/v1/query_range?query=model_usage&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s

    Accept: application/json
    Content-Type: application/json
    Authorization: Basic base64Encode(AccessKey:AccessKeySecret)

    Description:

    • query: You can set query to any metric from the Metrics table below.

      Metrics

      | Type | Metric name | Description |
      | --- | --- | --- |
      | Call count | model_call_count | Total number of model calls. |
      | Call duration | model_call_duration_total | Total model call duration. |
      | | model_call_duration | Average model call duration. |
      | | model_call_duration_p50 | p50 model call duration. |
      | | model_call_duration_p99 | p99 model call duration. |
      | | model_first_token_duration_total | Total time to first token. |
      | | model_first_token_duration | Average time to first token. |
      | | model_first_token_duration_p50 | p50 time to first token. |
      | | model_first_token_duration_p99 | p99 time to first token. |
      | Generation time per token | model_generation_duration_per_token_total | Total generation time per token. |
      | | model_generation_duration_per_token | Average generation time per token. |
      | | model_generation_duration_per_token_p50 | p50 generation time per token. |
      | | model_generation_duration_per_token_p99 | p99 generation time per token. |
      | Usage | model_usage | Total model usage. |

    • HTTP API: Replace {HTTP API} with the HTTP API address from Step 1.

    • Authorization: Concatenate the AccessKey and AccessKey Secret of your Alibaba Cloud account in the format AccessKey:AccessKeySecret, encode the resulting string in Base64, and provide it in the format Basic encoded_string.

      Example value: Basic TFRBSTV3OWlid0U4XXXXU0xb1dZMFVodmRsNw==
      Important: The AccessKey and AccessKey Secret must belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.
  • Example 2: Building on Example 1, this example adds filters to retrieve the token consumption for a specific model (model=qwen-plus) in a specific workspace (workspace_id=llm-nymssti2mzww****).

    Request:

    GET {HTTP API}/api/v1/query_range?query=model_usage{workspace_id="llm-nymssti2mzww****",model="qwen-plus"}&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s

    Accept: application/json
    Content-Type: application/json
    Authorization: Basic base64Encode(AccessKey:AccessKeySecret)

    Description:

    • query: Enclose multiple filter conditions in braces ({}) and separate them with commas. For example: {workspace_id="value1",model="value2"}. The following table lists the supported filter conditions (LabelKey).

      Supported filter conditions

      | Label key | Description |
      | --- | --- |
      | user_id | The Alibaba Cloud account ID. For a RAM user, this is the user ID (UID). |
      | apikey_id | The API key ID, which is different from the API key. You can obtain it on the Key Management page in the Singapore, US (Virginia), or China (Beijing) console. Note: A value of -1 for apikey_id indicates that the call originated from the Alibaba Cloud Model Studio console, not from an API call. |
      | workspace_id | The workspace ID. |
      | model | The model. |
      | protocol | The protocol type. Valid values: HTTP (non-streaming HTTP), SSE (streaming HTTP), WS (WebSocket protocol). |
      | sub_protocol | The sub-protocol. |
      | status_code | The HTTP status code. This LabelKey applies only to the model_call_count metric. |
      | error_code | The error code. This LabelKey applies only to the model_call_count metric. |
      | usage_type | The usage type. This LabelKey applies only to the model_usage metric. Valid values: total_tokens, input_tokens, output_tokens, cache_tokens, image_tokens, audio_tokens, video_tokens, image_count, audio_count, video_count, duration, characters, audio_tts, times. |
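The two request examples above can be assembled programmatically. The following is a minimal sketch using only the Python standard library: it builds the label selector, URL-encodes the query parameters (braces and quotes in a selector are not safe to send unencoded), and produces the Basic authorization header. The helper names, the placeholder base URL, the workspace ID, and the credentials are hypothetical; substitute the HTTP API address from Step 1 and your own AccessKey pair.

```python
import base64
from urllib.parse import urlencode


def build_selector(metric: str, labels: dict) -> str:
    """Build a selector such as model_usage{model="qwen-plus"}."""
    if not labels:
        return metric

    def escape(value: str) -> str:
        # Escape backslashes and double quotes inside label values.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    pairs = ",".join(f'{key}="{escape(str(value))}"' for key, value in labels.items())
    return f"{metric}{{{pairs}}}"


def build_query_range_url(base_url: str, selector: str, start: str, end: str,
                          step: str = "60s") -> str:
    """Return the query_range URL with all parameters percent-encoded."""
    params = urlencode({"query": selector, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def basic_auth_headers(access_key: str, access_key_secret: str) -> dict:
    """Encode AccessKey:AccessKeySecret in Base64 for the Authorization header."""
    raw = f"{access_key}:{access_key_secret}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}", "Accept": "application/json"}


# Placeholder values for illustration only.
selector = build_selector("model_usage", {"workspace_id": "llm-example", "model": "qwen-plus"})
url = build_query_range_url(
    "https://prometheus.example.invalid/instance",
    selector,
    "2025-11-20T00:00:00Z",
    "2025-11-20T23:59:59Z",
)
headers = basic_auth_headers("example-access-key", "example-secret")
print(url)
```

The resulting `url` and `headers` can be passed to any HTTP client, for example `urllib.request.Request(url, headers=headers)`.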

Connect to Grafana

Add the model monitoring data source to your Grafana environment (either self-managed or the Alibaba Cloud Grafana service). This guide uses the English version of Grafana 10.x for demonstration; the procedure is similar for other versions. For details, see the official Grafana documentation.

  1. Add the data source:

    1. Log in to Grafana with an administrator account. In the upper-left corner of the page, click the menu icon and select Administration > Data sources. Click + Add new data source and select Prometheus as the data source type.

    2. On the Settings tab, configure the data source:

      • Name: Enter a custom name.

      • Prometheus server URL: Paste the HTTP API address that you obtained in Step 1.

      • Auth: Enable Basic auth. For User, enter the AccessKey for your Alibaba Cloud account. For Password, enter the AccessKey secret for your Alibaba Cloud account.

        Ensure the AccessKey and AccessKey secret belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.

    3. At the bottom of the tab, click Save & Test.

  2. Query the metrics:

    1. In the upper-left corner of the Grafana page, click the menu icon. In the left navigation pane, click Dashboards.

    2. On the Dashboards page, click New > New dashboard in the upper-right corner.

    3. Click + Add visualization and select the data source that you just created.

    4. On the Edit Panel page, click the Query tab. In the A section, select __name__ and the metric name in the Label filters field. For example, to query model token consumption, use the model_usage metric.

      In this example, the value for __name__ is model_usage. You can replace it with any metric name from the list of monitoring metrics below.

      Metrics

      | Type | Metric name | Description |
      | --- | --- | --- |
      | Call count | model_call_count | Total number of model calls. |
      | Call duration | model_call_duration_total | Total model call duration. |
      | | model_call_duration | Average model call duration. |
      | | model_call_duration_p50 | p50 model call duration. |
      | | model_call_duration_p99 | p99 model call duration. |
      | | model_first_token_duration_total | Total time to first token. |
      | | model_first_token_duration | Average time to first token. |
      | | model_first_token_duration_p50 | p50 time to first token. |
      | | model_first_token_duration_p99 | p99 time to first token. |
      | Generation time per token | model_generation_duration_per_token_total | Total generation time per token. |
      | | model_generation_duration_per_token | Average generation time per token. |
      | | model_generation_duration_per_token_p50 | p50 generation time per token. |
      | | model_generation_duration_per_token_p99 | p99 generation time per token. |
      | Usage | model_usage | Total model usage. |

      Use the following labels to further refine your query:

      Supported filter conditions

      | Label key | Description |
      | --- | --- |
      | user_id | The Alibaba Cloud account ID. For a RAM user, this is the user ID (UID). |
      | apikey_id | The API key ID, which is different from the API key. You can obtain it on the Key Management page in the Singapore, US (Virginia), or China (Beijing) console. Note: A value of -1 for apikey_id indicates that the call originated from the Alibaba Cloud Model Studio console, not from an API call. |
      | workspace_id | The workspace ID. |
      | model | The model. |
      | protocol | The protocol type. Valid values: HTTP (non-streaming HTTP), SSE (streaming HTTP), WS (WebSocket protocol). |
      | sub_protocol | The sub-protocol. |
      | status_code | The HTTP status code. This LabelKey applies only to the model_call_count metric. |
      | error_code | The error code. This LabelKey applies only to the model_call_count metric. |
      | usage_type | The usage type. This LabelKey applies only to the model_usage metric. Valid values: total_tokens, input_tokens, output_tokens, cache_tokens, image_tokens, audio_tokens, video_tokens, image_count, audio_count, video_count, duration, characters, audio_tts, times. |

    5. Click Run queries.

      If data appears in the chart, the configuration is successful. If no data appears, verify the following:

      1. The HTTP API address, AccessKey, and AccessKey secret are correct.

      2. The Prometheus instance from Step 1 contains monitoring data.
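When a panel stays empty, it can help to inspect the raw JSON that the Prometheus HTTP API returns for the same query. The sketch below assumes the standard Prometheus response envelope (a `status` field, and for range queries a `data.result` list whose entries carry `values`); the function name `count_series_with_data` and the sample payload are hypothetical.

```python
import json


def count_series_with_data(response_text: str) -> int:
    """Return how many series in a query_range response contain samples."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        # Error responses carry errorType and error fields.
        raise ValueError(f"query failed: {body.get('errorType')}: {body.get('error')}")
    result = body.get("data", {}).get("result", [])
    return sum(1 for series in result if series.get("values"))


# Illustrative payload in the standard range-query (matrix) shape.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"__name__": "model_usage", "model": "qwen-plus"},
                "values": [[1763596800, "120"], [1763596860, "95"]],
            }
        ],
    },
})
print(count_series_with_data(sample_response))
```

A result of 0 with a `success` status suggests the credentials and address are valid but no monitoring data matches the query and time range.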

Comparison of monitoring modes

Model monitoring provides two monitoring modes: Basic Monitoring and Advanced Monitoring.

Basic Monitoring: A basic service that is automatically enabled when you activate Model Studio and cannot be disabled.
Advanced Monitoring: This service must be manually enabled by an Alibaba Cloud account (or a RAM user with sufficient permissions) on the Model Monitoring (Asia Pacific SE 1 (Singapore)), Model Monitoring (US East 1 (Virginia)), or Model Monitoring (China (Beijing)) page of the target workspace. You can disable this service at any time. It records call data only after it is enabled.

| Item | Basic Monitoring | Advanced Monitoring |
| --- | --- | --- |
| Data latency | Hourly | Minute-level |
| Call statistics | Supported | Supported |
| Failed call details | Not supported | Supported |
| Performance metrics | Supported | Supported |
| Applicable scope | All workspaces under the Alibaba Cloud account | Applies only to the workspace where it is enabled |
| Billing | Free | Paid |

Quotas and limits

  • Data retention period: Basic and advanced monitoring data is retained for 30 days by default. To query usage information from earlier periods, use the Expenses and Costs page.

  • Alert template limit: You can create up to 100 alert templates for each workspace.

  • API limit: You can query monitoring metrics using the Prometheus HTTP API.

    • Alternative: You can get the token consumption for a single call from the usage field in the response. This field has the following structure. For more information, see the Qwen API reference.

      {
        "prompt_tokens": 3019,
        "completion_tokens": 104,
        "total_tokens": 3123,
        "prompt_tokens_details": {
          "cached_tokens": 2048
        }
      }
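As an illustration of reading the usage field, the sketch below derives the uncached prompt tokens and the cached share of the prompt from the structure shown above. The function name `summarize_usage` and the derived field names are hypothetical, and the code assumes `prompt_tokens_details` may be absent for some calls.

```python
def summarize_usage(usage: dict) -> dict:
    """Derive a few per-call figures from a usage object."""
    prompt = usage.get("prompt_tokens", 0)
    # Treat a missing or empty details object as zero cached tokens.
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    return {
        "total_tokens": usage.get("total_tokens", 0),
        "uncached_prompt_tokens": prompt - cached,
        "cached_fraction": cached / prompt if prompt else 0.0,
    }


# The usage object from the example above.
usage = {
    "prompt_tokens": 3019,
    "completion_tokens": 104,
    "total_tokens": 3123,
    "prompt_tokens_details": {"cached_tokens": 2048},
}
usage_summary = summarize_usage(usage)
print(usage_summary)
```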

Billing

FAQ

Why can't I see the call count and token consumption in Model Monitoring after calling a model?

To troubleshoot, check the following:

  1. Data latency: Wait for data to synchronize. Basic monitoring data is delayed by one hour, while advanced monitoring is delayed by several minutes.

  2. Workspace: If you are in a sub-workspace, you can only view data for that specific workspace. Switch to the default workspace to view all data.

What can cause timeouts when calling a large model?

Common causes include:

  • Excessively long output: If the model generates a large volume of content, the response time may exceed your client's timeout setting. We recommend using streaming output to receive the first token faster.

  • Network issues: Ensure your client has a stable network connection to Alibaba Cloud services.
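To see why streaming helps with client-side timeouts, the sketch below measures time to first token and total duration while consuming a stream of text chunks. It is a minimal illustration that works with any iterable of strings: `consume_stream` is a hypothetical helper, and `fake_stream` stands in for a real streaming response, which this example does not call.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def consume_stream(chunks: Iterable[str]) -> Tuple[str, Optional[float], float]:
    """Return (full_text, seconds_to_first_chunk, total_seconds)."""
    start = time.monotonic()
    first_chunk_after = None
    parts = []
    for chunk in chunks:
        # Record the delay until the first non-empty chunk arrives.
        if first_chunk_after is None and chunk:
            first_chunk_after = time.monotonic() - start
        parts.append(chunk)
    return "".join(parts), first_chunk_after, time.monotonic() - start


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming model response (illustrative only)."""
    for piece in ("Hello", ", ", "world"):
        time.sleep(0.01)
        yield piece


text, ttft, total = consume_stream(fake_stream())
print(text, ttft, total)
```

With streaming, a client can apply its timeout to the gap between chunks rather than to the whole response, so long outputs are less likely to be cut off.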

How do I configure permissions for a RAM user to enable advanced monitoring?

Procedure:

  1. Grant the AliyunBailianFullAccess global management permission for Model Studio to the RAM user.

  2. Grant the Model Monitoring - Operations (or Administrator) page permission to the RAM user. This grants the RAM user write access to the Model Monitoring page.

  3. Attach the AliyunCloudMonitorFullAccess system policy to the RAM user.

  4. Create and attach a custom policy that grants permission to create a service-linked role for the RAM user.

    1. Log on to the RAM console. In the left-side navigation pane, choose Permissions > Policies, and then click Create Policy.

    2. Click the JSON tab. Paste the following policy document into the editor and click OK.

      {
          "Version": "1",
          "Statement": [
              {
                  "Action": "ram:CreateServiceLinkedRole",
                  "Resource": "*",
                  "Effect": "Allow"
              }
          ]
      }
    3. Enter CreateServiceLinkedRole for the policy name and click OK.

    4. In the left-side navigation pane, choose Identities > Users. Find the target RAM user in the list, and click Add Permission in the Actions column.

    5. In the Policies list, select the CreateServiceLinkedRole policy that you created and click Grant permissions. The RAM user can now create a service-linked role.

  5. After configuring the required permissions, return to the Model Monitoring (Asia Pacific SE 1 (Singapore)), Model Monitoring (US (Virginia)), or Model Monitoring (China (Beijing)) page and retry enabling Advanced Monitoring as the RAM user.

How do I configure permissions for a RAM user to enable Inference Logs?

Procedure:

  1. Grant the AliyunBailianFullAccess global management permission for Model Studio to the RAM user.

  2. Grant the Model Monitoring - Operations (or Administrator) page permission to the RAM user. This grants the RAM user write access to the Model Monitoring page.

  3. Attach the AliyunLogFullAccess system policy to the RAM user.

  4. Create and attach a custom policy that grants permission to create a service-linked role for the RAM user.

    1. Log on to the RAM console. In the left-side navigation pane, choose Permissions > Policies, and then click Create Policy.

    2. Click the JSON tab. Paste the following policy document into the editor and click OK.

      {
          "Version": "1",
          "Statement": [
              {
                  "Action": "ram:CreateServiceLinkedRole",
                  "Resource": "*",
                  "Effect": "Allow"
              }
          ]
      }
    3. Enter CreateServiceLinkedRole for the policy name and click OK.

    4. In the left-side navigation pane, choose Identities > Users. Find the target RAM user in the list, and click Add Permission in the Actions column.

    5. In the Policies list, select the CreateServiceLinkedRole policy that you created and click Grant permissions. The RAM user can now create a service-linked role.

  5. After configuring the required permissions, return to the Model Monitoring (China (Beijing)) page and retry enabling Inference Logs as the RAM user.

Appendix

Glossary

Term

Description

Real-time

Refers to all direct and indirect calls made to a model, including:

  • API calls made through the DashScope SDK or an OpenAI-compatible interface

  • Playground

  • Test and published states of Model Studio applications (such as agents, workflows, and agent orchestration applications) and their model-invoking nodes (such as LLM nodes, intent classification nodes, and agent group nodes)

  • Calls to the Assistant API (Deprecated)

  • Application calls

Batches

Large-scale, offline data processing using the OpenAI-compatible Batch (file input) API for use cases that do not require real-time responses.