
Alibaba Cloud Model Studio: Model observation

Last Updated: Nov 27, 2025

Use the model observation feature to perform the following operations:

  • View call records

  • Monitor and create alerts for metrics, such as token latency, call duration, requests per minute (RPM), tokens per minute (TPM), and failure rate

  • Track token consumption

Model availability

All models in the Model list are supported.

Monitor model operations

After you activate the model inference service, Alibaba Cloud Model Studio automatically adds the following four categories of monitoring metrics to the Model Observation dashboard:

  • Security: Identifies non-compliant content in conversations, such as Content Moderation errors.

  • Cost: Evaluates the cost-effectiveness of the model, such as average tokens per request.

  • Performance: Observes changes in model performance, such as call duration and time to first token.

  • Error: Determines the stability of the model, such as failures and failure rate.

You can create alerts based on the preceding metrics to promptly detect and handle anomalies.

Step 1: Activate the model inference service

  1. If a message prompting you to activate the service is displayed at the top of the Model Studio console, use your Alibaba Cloud account to activate the service and obtain a free call quota. If no such message is displayed, the service is already activated.

  2. After activation, the system automatically collects model call data from all workspaces under your Alibaba Cloud account. When a direct or indirect model call occurs, the system automatically collects and syncs the relevant data to the Model Observation (Singapore or Beijing) list.

    List records are generated by model and workspace. New models are automatically added to the list after the first data synchronization is complete. The latency for Basic Monitoring is typically at the hour level. For minute-level data insights, use Advanced Monitoring.
    Members of the default workspace can view model call details for all workspaces. Members of a sub-workspace can only view data for the current workspace and cannot switch to view data from other workspaces.

Step 2: View monitoring metrics

  1. When the model appears in the list, click Monitor in the Actions column to view Call Statistics, such as the number of calls and the number of failures. You can filter the statistics by API key, Inference type, and time range.

    • Rate limit errors: Refers to failures with the 429 status code.

    • Content Moderation errors: Refers to calls intercepted by the Content Moderation service because the input or output contains suspected sensitive or high-risk content, such as pornography, political content, or advertisements.

  2. On the Performance Metrics tab, you can view metrics such as RPM, TPM, call duration, and time to first token.

View token consumption

In practice, adjusting model parameters, system prompts, and other operations can change the model's token consumption. To calculate and manage costs with fine-grained control, Model Observation provides the following cost monitoring features:

  • Summary: Summarizes the historical token consumption of models by workspace. You can further filter by time range and API key.

  • Alerting: Allows you to set token consumption thresholds. The system immediately sends an alert when a specified model shows abnormal consumption.

Step 1: Activate the model inference service

Ensure that you have activated the model inference service.

Step 2: View token consumption or create an alert

  • View the historical token consumption of a model:

    • View token consumption for the last 30 days:

      1. When the model appears in the Model Observation (Singapore or Beijing) list, click Monitor in the Actions column.

      2. On the Call Statistics tab, view token consumption data in the Calls section.

    • To view older usage data, query the data on the Expenses and Costs page.

  • Create an alert for abnormal consumption: See Create proactive alerts below.

Create proactive alerts

Silent model failures, such as timeouts or sudden increases in token consumption, are difficult to detect with traditional application logs. Model Observation lets you set alerts for monitoring metrics such as cost, failure rate, and response latency. If a metric becomes abnormal, the system immediately sends an alert.

Step 1: Enable Advanced Monitoring

  1. Ensure that you have activated the model inference service.

  2. Log on with an Alibaba Cloud account (or a RAM user with sufficient permissions). On the Model Observation (Singapore or Beijing) page for the target workspace, click Model Observation Configurations in the upper-right corner.

  3. In the Advanced Monitoring area, you can manually enable Performance and Usage Metrics Monitoring.

Step 2: Create an alert rule

  1. On the Model Alert (Singapore or Beijing) page, click Create Alert Rule in the upper-right corner.

  2. In the dialog box, select the model and monitoring template, and then click Create. If the specified monitoring metrics (such as call statistics or performance metrics) become abnormal, the system notifies your team.

    • Notification methods: Supported methods include text message, email, phone call, DingTalk group robot, WeCom robot, and Webhook.

    • Alert level: The available levels are General, Warning, Error, and Urgent. These levels are predefined and cannot be modified. The notification method is the same for all levels. We recommend establishing consistent handling procedures within your team.

Connect to Grafana or a custom application

The monitoring metric data from Model Observation is stored in your private Prometheus instance. The instance supports the standard Prometheus HTTP API, which you can use to connect Grafana or your custom applications for visual analytics.

Step 1: Obtain the data source HTTP API address

  1. Ensure that you have enabled Advanced Monitoring.

  2. On the Model Observation (Singapore or Beijing) page, click Model Observation Configurations in the upper-right corner, and then click View Details to the right of the CloudMonitor Prometheus instance.

  3. On the Settings tab, copy the HTTP API URL that matches your client's network environment: Internet or Internal Network (VPC).

Step 2: Connect to Grafana or a custom application

Connect to a custom application

The following example shows how to retrieve monitoring data using the Prometheus HTTP API. For complete API usage details, see the Prometheus HTTP API reference.

  • Example 1: Query the token consumption (query=model_usage) for all models in all workspaces under your Alibaba Cloud account within a specified time range (all day on November 20, 2025, UTC), with a step size of step=60s.

    Example:

    GET {HTTP API}/api/v1/query_range?query=model_usage&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s

    Accept: application/json
    Content-Type: application/json
    Authorization: Basic base64Encode(AccessKey:AccessKeySecret)

    Parameter description:

    • query: The value of query can be replaced with any metric name from the following Monitoring metrics list.

      View monitoring metrics

      Number of calls:
      • model_call_count: Total number of model calls

      Call duration:
      • model_call_duration_total: Total duration of model calls
      • model_call_duration: Average duration of model calls
      • model_call_duration_p50: p50 latency of model calls
      • model_call_duration_p99: p99 latency of model calls

      Time to first token:
      • model_first_token_duration_total: Total time to first token
      • model_first_token_duration: Average time to first token
      • model_first_token_duration_p50: p50 time to first token
      • model_first_token_duration_p99: p99 time to first token

      Time per non-first token:
      • model_generation_duration_per_token_total: Total time per non-first token
      • model_generation_duration_per_token: Average time per non-first token
      • model_generation_duration_per_token_p50: p50 time per non-first token
      • model_generation_duration_per_token_p99: p99 time per non-first token

      Usage:
      • model_usage: Total model usage

    • HTTP API: Replace {HTTP API} with the HTTP API address that you obtained in Step 1.

    • Authorization: Concatenate your Alibaba Cloud account's AccessKey:AccessKeySecret, Base64-encode the resulting string, and provide it in the format Basic <encoded-string>.

      Example value: Basic TFRBSTV3OWlid0U4XXXXU0xb1dZMFVodmRsNw==
      Note: AccessKey and AccessKey secret must belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.
  • Example 2: Building on Example 1, add filters to retrieve token consumption only for a specific model (model=qwen-plus) in a specific workspace (workspace_id=llm-nymssti2mzww****).

    Example:

    GET {HTTP API}/api/v1/query_range?query=model_usage{workspace_id="llm-nymssti2mzww****",model="qwen-plus"}&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s

    Accept: application/json
    Content-Type: application/json
    Authorization: Basic base64Encode(AccessKey:AccessKeySecret)

    Parameter description:

    • query: Enclose multiple filter conditions in {} and separate them with commas. For example: {workspace_id="value1",model="value2"}. The following are the supported filter conditions (LabelKey).

      View supported filter conditions

      • user_id: The ID of the Alibaba Cloud account. For a RAM user, this is the UID of the primary account. For more information, see how to obtain the ID.

      • apikey_id: The ID of the API key, not the API key itself. You can obtain this ID from the Key Management (International Edition | Mainland China Edition) page.
        Note: A value of -1 for apikey_id indicates that the call was made from the Model Studio console, not through an API call.

      • workspace_id: The ID of the workspace. Learn how to get the ID.

      • model: The model name, such as qwen-plus.

      • protocol: The protocol type. Valid values:
        • HTTP: Non-streaming HTTP.
        • SSE: Streaming HTTP.
        • WS: The WebSocket protocol.

      • sub_protocol: The sub-protocol. Valid values:
        • DEFAULT: A synchronous call.
        • ASYNC: An asynchronous call. This is common for image generation models. For more information, see Text-to-image generation.

      • status_code: The HTTP status code. This LabelKey is supported only by the model_call_count metric.

      • error_code: The error code. This LabelKey is supported only by the model_call_count metric.

      • usage_type: The usage type. This LabelKey is supported only by the model_usage metric. Possible values: total_tokens, input_tokens, output_tokens, cache_tokens, image_tokens, audio_tokens, video_tokens, image_count, audio_count, video_count, duration, characters, audio_tts, times.

      A runnable sketch of this request follows below.
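    The following minimal Python sketch sends the request from Example 2 and reads the standard Prometheus response. It is a sketch under assumptions: the endpoint URL, AccessKey pair, workspace ID, and model name are placeholders to replace with your own values.

    import base64
    import requests

    # Placeholders: the HTTP API address from Step 1 and your credentials.
    HTTP_API = "https://<your-prometheus-endpoint>"
    ACCESS_KEY = "<AccessKey>"
    ACCESS_KEY_SECRET = "<AccessKeySecret>"

    # Build the "Authorization: Basic base64(AccessKey:AccessKeySecret)" value.
    token = base64.b64encode(
        f"{ACCESS_KEY}:{ACCESS_KEY_SECRET}".encode()
    ).decode()

    resp = requests.get(
        f"{HTTP_API}/api/v1/query_range",
        params={
            "query": 'model_usage{workspace_id="llm-nymssti2mzww****",model="qwen-plus"}',
            "start": "2025-11-20T00:00:00Z",
            "end": "2025-11-20T23:59:59Z",
            "step": "60s",
        },
        headers={
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
        timeout=30,
    )
    resp.raise_for_status()

    # Standard Prometheus response: data.result is a list of time series,
    # each with a label set and a list of [timestamp, value] samples.
    for series in resp.json()["data"]["result"]:
        print(series["metric"].get("usage_type"), "->", len(series["values"]), "samples")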

Connect to Grafana

Add a model observation data source in Grafana (self-hosted or the Alibaba Cloud Grafana service). This topic uses Grafana 10.x (English version) as an example. The steps for other versions are similar. For more information, see the official Grafana documentation.

  1. Add the data source:

    1. Log on to Grafana using an administrator account. Click the menu icon in the upper-left corner of the page and choose Administration > Data sources. Click + Add new data source. For the data source type, select Prometheus.

    2. On the Settings tab, configure the data source information:

      • Name: Enter a custom name.

      • Prometheus server URL: Enter the HTTP API address that you obtained in Step 1.

      • Auth: Enable Basic auth, and set User (your Alibaba Cloud account's AccessKey) and Password (your Alibaba Cloud account's AccessKey secret).

        The AccessKey and AccessKey secret must belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.


    3. Click Save & Test at the bottom of the tab.

  2. Query metrics:

    1. Click the menu icon in the upper-left corner of the Grafana page and, in the navigation pane on the left, click Dashboards.

    2. Click New > New dashboard on the right side of the Dashboards page to create a new dashboard.

    3. Click + Add visualization and select the data source that you just created.

    4. On the Edit Panel page, click the Query tab. In query A, use the Label filters field to select __name__ and a metric name. For example, to query the model token consumption, select model_usage.

      The value of __name__ (model_usage) can be replaced with any metric name from the Monitoring metrics list in the Connect to a custom application section above. You can also add Label filters to further refine the query; for the supported filter conditions (LabelKey), see the View supported filter conditions table in the same section.


    5. Click Run queries.

      If data is successfully rendered in the chart, the configuration is successful. Otherwise, check the following: 1) The HTTP API address, AccessKey, and AccessKey secret are correct. 2) The Prometheus instance from Step 1 contains monitoring data.

Compare monitoring modes

Model observation offers two monitoring modes: Basic Monitoring and Advanced Monitoring.

  • Basic Monitoring: This basic service is automatically enabled when the model service is activated and cannot be disabled.

  • Advanced Monitoring: This service must be manually enabled by the Alibaba Cloud account (or a RAM user with sufficient permissions) on the Model Observation (Singapore or Beijing) page of the target workspace. It can be disabled. Only call data generated after the feature is enabled is recorded.

Item | Basic Monitoring (Default) | Advanced Monitoring (Manual activation required)
Data latency | Hourly | Minute-level
View call statistics | Supported | Supported
View failure details | Not supported | Supported
View performance metrics | Supported | Supported
Applicable scope | All workspaces under the Alibaba Cloud account | Only the workspace where it is enabled
Billing | Free | Charged

Quotas and limits

  • Data retention period: By default, data for both Basic and Advanced Monitoring is retained for 30 days. To query usage information that is older than 30 days, go to the Expenses and Costs page.

  • Alert template limit: You can create up to 100 alert templates in each workspace.

  • API limits: Model Observation does not provide an API for querying the token consumption of a single call. You can query only aggregated monitoring metric data through the Prometheus HTTP API.

    • Workaround: To retrieve token consumption for a single call through an API, extract the current call data from the usage field in the response from each model call. This field has the following structure (for more information, see the Qwen API reference); a sketch of reading it follows the example:

      {
        "prompt_tokens": 3019,
        "completion_tokens": 104,
        "total_tokens": 3123,
        "prompt_tokens_details": {
          "cached_tokens": 2048
        }
      }
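      The following minimal Python sketch shows this workaround through the OpenAI-compatible interface. The api_key, base_url (the international endpoint is shown; adjust for your region), and model name are assumptions to adapt to your setup.

      from openai import OpenAI

      # Placeholders: your API key and the OpenAI-compatible endpoint for your region.
      client = OpenAI(
          api_key="<your-api-key>",
          base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
      )

      completion = client.chat.completions.create(
          model="qwen-plus",
          messages=[{"role": "user", "content": "Hello"}],
      )

      # The usage field carries the token consumption of this single call.
      usage = completion.usage
      print("input tokens: ", usage.prompt_tokens)
      print("output tokens:", usage.completion_tokens)
      print("total tokens: ", usage.total_tokens)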

Billing

  • Basic Monitoring: Free of charge.

  • Advanced Monitoring: After you enable this feature, minute-level monitoring data is written to the CloudMonitor (CMS) service, which incurs additional fees. For more information about the billing method, see Billing overview of CloudMonitor.

FAQ

Why can't I find the call count and token consumption in Model Observation after I call a model?

You can troubleshoot the issue as follows:

  1. Data latency: Confirm that you have waited long enough for data synchronization. Data is synchronized hourly for Basic Monitoring and every minute for Advanced Monitoring.

  2. Workspace: If you are in a sub-workspace, you can view data only for that workspace. Switch to the default workspace to view all data.

What are the possible reasons for a timeout when I call a large language model?

Common reasons include the following:

  • Long output: The model generates too much content, which causes the total time to exceed the client's wait limit. Use the streaming output method to receive the first token sooner (see the sketch after this list).

  • Network issues: Check whether the network connectivity between the client and the Alibaba Cloud service is stable.
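A minimal streaming sketch, assuming the same OpenAI-compatible endpoint as in the earlier example (api_key, base_url, and model are placeholders to adapt):

  from openai import OpenAI

  # Placeholders: adjust api_key, base_url, and model to your setup.
  client = OpenAI(
      api_key="<your-api-key>",
      base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
  )

  stream = client.chat.completions.create(
      model="qwen-plus",
      messages=[{"role": "user", "content": "Write a long story."}],
      stream=True,
  )

  # Chunks arrive as soon as tokens are generated, so the client starts
  # receiving output long before the full response completes.
  for chunk in stream:
      if chunk.choices and chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="", flush=True)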

How do I configure permissions for a RAM user to enable Advanced Monitoring?

Follow these steps:

  1. Grant the AliyunBailianFullAccess global management permission to the RAM user.

  2. Assign the ModelObservation-FullAccess (or Administrator) page permission to the RAM user to allow write operations on the Model Observation page.

  3. Grant the AliyunCloudMonitorFullAccess system policy to the RAM user.

  4. Create and grant a system policy that allows the RAM user to create service-linked roles.

    1. Log on to the RAM console. In the navigation pane on the left, choose Permissions > Policies. Then, click Create Policy.

    2. Click JSON, paste the following content into the policy editor, and click OK.

      {
          "Version": "1",
          "Statement": [
              {
                  "Action": "ram:CreateServiceLinkedRole",
                  "Resource": "*",
                  "Effect": "Allow"
              }
          ]
      }
    3. Enter CreateServiceLinkedRole as the access policy name and click OK.

    4. In the navigation pane on the left, choose Identities > Users. Find the RAM user that you want to authorize and click Add Permissions in the Actions column.

    5. From the access policy list, select the access policy that you just created (CreateServiceLinkedRole) and click Grant permissions. The RAM user now has the permission to create service-linked roles.

  5. After you complete all the preceding permission configurations, return to the Model Observation (Singapore or Beijing) page and use the RAM user to try enabling Advanced Monitoring again.

Appendix

Glossary

  • Real-time Inference: All direct and indirect calls to a model. This includes calls made in the following scenarios:

    • API calls through the DashScope SDK or OpenAI-compatible interfaces

    • Playground

    • Model Studio applications in a test or published state, such as agents, workflows, and agent orchestration applications. This also includes any node within these applications that makes a model call, such as LLM nodes, intent classification nodes, and agent group nodes.

    • Assistant API calls

    • Application calls

  • Batch Inference: Large-scale, offline data processing using the OpenAI-compatible Batch interface for scenarios that do not require real-time responses.