Model monitoring lets you:
-
View call records.
-
Monitor metrics and set alerts for time to first token, call duration, requests per minute (RPM), tokens per minute (TPM), and failure rate.
-
Track token consumption.
Supported models
-
Basic monitoring supports all models in the model list. Advanced monitoring supports all models in the China (Beijing), Singapore, and US (Virginia) regions.
-
Alerting feature: Supports all models in the China (Beijing) and Singapore regions.
Monitor model runs
The system automatically collects model call data from all workspaces in your Alibaba Cloud account. When a direct or indirect model call occurs, the system synchronizes this data to the Monitoring list in the target workspace.
List records are generated for each model and workspace. A new model appears in the list after its first data synchronization. Basic monitoring has an hourly data latency. For minute-level data insights, use advanced monitoring.
At the top of the list, an overview dashboard summarizes key metrics on cards, including Total Models, Total Calls, Total Failures, Average call duration, and Average time to first token.
The Monitoring table lists each model's Model Code, Workspace, Total Calls, Total Failures, failure rate, Average call duration, and Average time to first token. All columns except Model Code and Workspace are sortable. The Actions column provides access to the Monitor and Log pages.
Members of the default workspace can view model calls across all workspaces. Members of a sub-workspace can only view data for their current workspace and cannot switch to other workspaces.
Find the target model in the list and click Monitor in the Actions column to view the following 4 metric categories:
-
Security: Identifies policy violations in conversations, with metrics such as Content Moderation Error Count.
-
Cost: Evaluates the cost-effectiveness of the model, with metrics such as Average Usage per Request.
-
Performance: Observes changes in model performance, with metrics such as call duration and time to first token.
-
Error: Assesses the model's stability, with metrics such as Failures and failure rate.
You can create alerts based on these metrics to detect and address anomalies promptly.
Clicking Monitor in the Actions column opens the model details page, which contains the Monitoring and Log tabs. The Monitoring tab is divided into two sections: call statistics and performance metrics.
Call statistics
This tab displays metrics related to Security, Cost, and Error, such as call count and failure count. You can filter the data by API key, inference type, time range, and time granularity (by minute or by hour).
-
Rate Limiting Error Count: Indicates call failures caused by a 429 status code.
-
Content Moderation Error Count: Indicates that the input or output was blocked by the Content Moderation Service because it contained suspected sensitive or high-risk content, such as profanity, political content, or advertisements.
In the Failures chart on the call statistics tab, click Failure details to view a breakdown of failures and diagnose their root cause.
Performance metrics
This tab displays Performance-related metrics, such as requests per minute (RPM), tokens per minute (TPM), call duration, time to first token, and subsequent token latency.
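To make the RPM and TPM definitions concrete, the following minimal sketch groups call records by minute and counts requests and tokens in each bucket. The `per_minute_rates` helper and the sample records are hypothetical and only illustrate the metric definitions; they do not describe how the service computes its own values.

```python
from collections import defaultdict
from datetime import datetime, timezone


def per_minute_rates(calls):
    """Group (timestamp, total_tokens) records by minute.

    Returns {minute_start: (requests, tokens)}, that is, RPM and TPM
    for every minute that saw at least one call.
    """
    buckets = defaultdict(lambda: [0, 0])
    for ts, tokens in calls:
        minute = ts.replace(second=0, microsecond=0)
        buckets[minute][0] += 1        # one more request in this minute
        buckets[minute][1] += tokens   # tokens consumed in this minute
    return {minute: (rpm, tpm) for minute, (rpm, tpm) in sorted(buckets.items())}


# Hypothetical call records: (time of the call, tokens consumed by the call).
calls = [
    (datetime(2025, 11, 20, 0, 0, 5, tzinfo=timezone.utc), 120),
    (datetime(2025, 11, 20, 0, 0, 40, tzinfo=timezone.utc), 300),
    (datetime(2025, 11, 20, 0, 1, 10, tzinfo=timezone.utc), 80),
]

rates = per_minute_rates(calls)
for minute, (rpm, tpm) in rates.items():
    print(minute.isoformat(), "RPM:", rpm, "TPM:", tpm)
```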
View token consumption
Adjusting model parameters and system prompts affects a model's token consumption. To help you track and manage costs, model monitoring provides the following features:
-
Summary: Aggregates historical token consumption by workspace. You can further filter the data by time range and API key.
-
Tracking: Records the token consumption for each model call.
-
Alerting: Sets token consumption thresholds and sends an alert when a model's consumption is abnormal.
View historical model token consumption
-
To view token consumption for the last 30 days:
-
On the Monitoring page for your target workspace, find the model and click Monitor in the Actions column.
-
On the Call Statistics tab, view the token consumption data in the Calls section.
-
To view earlier usage, go to the Expenses and Costs page.
View token consumption for a specific call
This feature is currently available only for some models in the China (Beijing) region.
-
Log on to your Alibaba Cloud account (or as a RAM user with sufficient permissions). In your target workspace, go to the Monitoring (Beijing) page and click Monitoring Configuration in the upper-right corner. Follow the instructions to enable the audit log and inference log.
Once enabled, the system records the input and output of every model call in the workspace. Logs can take several minutes to appear after a call.
-
Find the target model in the model monitoring list and click Logs in the Actions column.
-
The Logs tab displays the real-time inference call records for the model. The Usage field shows the token consumption for the call.
Create an alert for abnormal consumption
Conversation history (model logs)
This feature is currently limited to some models in the China (Beijing) region.
Model monitoring records the input, output, and latency for each model call, providing crucial data for troubleshooting and content auditing.
Step 1: Enable logging
Log on with an Alibaba Cloud account (or a RAM user with sufficient permissions). On the model monitoring page of the target workspace, click Monitoring Configuration in the upper-right corner, and follow the instructions to enable audit logs and inference logs.
After you enable this feature, the system starts recording the input and output of each model call within the workspace. Logs may take several minutes to appear after a call.
To stop recording, simply disable inference logs in the Monitoring Configuration.
Step 2: View conversation history
-
In the model monitoring list, find the target model and click Logs in the Actions column.
-
The Logs tab displays the real-time inference call records for the model. The Request and Response fields show the input and output for each call.
Set up proactive alerts
Model monitoring is available in the Singapore and China (Beijing) regions, but alert rules can currently be created only in the China (Beijing) region.
Silent failures, such as timeouts or sudden spikes in token consumption, are often difficult to detect with traditional application logs. Model monitoring enables you to set up alerts for key metrics like cost, failure rate, and response latency. When a metric becomes abnormal, the system immediately sends an alert.
Step 1: Enable advanced monitoring
-
Log on with your Alibaba Cloud account (or a RAM user with sufficient permissions). In your target workspace, go to the Model Monitoring (Singapore or China (Beijing)) page and click Monitoring Configuration in the upper-right corner.
-
In the advanced monitoring section, enable Performance and usage metrics monitoring.
Step 2: Create an alert rule
-
On the Model Alerts (Singapore or China (Beijing)) page, click Create Alert Rule in the upper-right corner.
-
In the dialog box, select the model and monitoring template, and after confirming the settings, click Create. When the specified monitoring metrics (such as call statistics or performance metrics) become abnormal, the system will notify your team.
-
Notification methods: SMS, email, phone calls, DingTalk group robot, WeCom Robot, and webhooks.
-
Alert level: There are four predefined, unmodifiable alert levels: General, Warning, Error, and Urgent. Each level uses specific notification channels:
-
Urgent (CRITICAL): Phone call, SMS, and email
-
Error (ERROR): SMS and email
-
Warning (WARNING): SMS and email
-
General (INFO): Email
Integrate with Grafana and custom applications
Model monitoring metrics are stored in your private Prometheus instance, which supports the standard Prometheus HTTP API for connecting to Grafana or your custom application to perform visual analytics.
Step 1: Get the HTTP API address
-
Ensure that advanced monitoring is enabled.
-
On the Model Monitoring (Asia Pacific SE 1 (Singapore)), Model Monitoring (US East 1 (Virginia)), or Model Monitoring (China (Beijing)) page, click Monitoring Configuration in the upper-right corner. To the right of the CloudMonitor Prometheus instance, click View Details.
-
On the Settings page, copy the HTTP API address for your client's network environment (public network or VPC access).

Step 2: Connect to Grafana or a custom application
Connect to custom applications
The following examples show how to retrieve metric data using the Prometheus HTTP API. For complete usage details, see the Prometheus HTTP API documentation.
-
Example 1: Query the token consumption for all models across all workspaces in an Alibaba Cloud account on November 20, 2025 (UTC), using the model_usage metric with a 60s step size.
Example:
GET {HTTP API}/api/v1/query_range?query=model_usage&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s
Accept: application/json
Content-Type: application/json
Authorization: Basic base64Encode(AccessKey:AccessKeySecret)
Description:
-
query: You can set query to any metric from the Metrics table below.
-
HTTP API: Replace {HTTP API} with the HTTP API address from Step 1.
-
Authorization: Concatenate the AccessKey and AccessKey Secret of your Alibaba Cloud account in the format AccessKey:AccessKeySecret, encode the resulting string in Base64, and provide it in the format Basic encoded_string.
Example value: Basic TFRBSTV3OWlid0U4XXXXU0xb1dZMFVodmRsNw==
Important: The AccessKey and AccessKey Secret must belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.
-
Example 2: Building on Example 1, this example adds filters to retrieve the token consumption for a specific model (model=qwen-plus) in a specific workspace (workspace_id=llm-nymssti2mzww****).
Example:
GET {HTTP API}/api/v1/query_range?query=model_usage{workspace_id="llm-nymssti2mzww****",model="qwen-plus"}&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s
Accept: application/json
Content-Type: application/json
Authorization: Basic base64Encode(AccessKey:AccessKeySecret)
Description:
-
query: Enclose multiple filter conditions in braces ({}) and separate them with commas. For example: {workspace_id="value1",model="value2"}. The following table lists the supported filter conditions (LabelKey).
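As a rough sketch of the two requests above, the following Python code builds the label selector, the Basic authorization header, and the `query_range` URL using only the standard library. The helper names, the placeholder endpoint, and the placeholder credentials are hypothetical; substitute the HTTP API address from Step 1 and your own AccessKey pair. The response handling assumes the standard Prometheus HTTP API envelope (`status`, then `data.result`).

```python
import base64
import json
import urllib.parse
import urllib.request


def build_selector(metric, labels=None):
    """Return a selector such as model_usage{model="qwen-plus"}."""
    if not labels:
        return metric
    inner = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{inner}}}"


def basic_auth_header(access_key, access_key_secret):
    """Base64-encode 'AccessKey:AccessKeySecret' and prefix it with 'Basic '."""
    raw = f"{access_key}:{access_key_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(http_api, selector, start, end, step, access_key, access_key_secret):
    """Build a GET request for the query_range endpoint (nothing is sent yet)."""
    params = urllib.parse.urlencode(
        {"query": selector, "start": start, "end": end, "step": step}
    )
    url = f"{http_api.rstrip('/')}/api/v1/query_range?{params}"
    headers = {
        "Accept": "application/json",
        "Authorization": basic_auth_header(access_key, access_key_secret),
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_series(request, timeout=30):
    """Send the request and return the list of series from a successful response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


# Placeholder values (assumptions): replace them before calling fetch_series().
HTTP_API = "https://example.invalid/prometheus-http-api"
selector = build_selector(
    "model_usage", {"workspace_id": "llm-nymssti2mzww****", "model": "qwen-plus"}
)
request = build_request(
    HTTP_API, selector,
    "2025-11-20T00:00:00Z", "2025-11-20T23:59:59Z", "60s",
    "YOUR_ACCESS_KEY", "YOUR_ACCESS_KEY_SECRET",
)
print(request.full_url)
```

Calling `build_selector("model_usage")` without labels reproduces the query from Example 1. Passing the request to `fetch_series(request)` performs the actual call once real values are in place.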
Connect to Grafana
Add the model monitoring data source to your Grafana environment (either self-managed or the Alibaba Cloud Grafana service). This guide uses the English version of Grafana 10.x for demonstration; the procedure is similar for other versions. For details, see the official Grafana documentation.
-
Add the data source:
-
Log in to Grafana with an administrator account. In the upper-left corner of the page, click the menu icon and go to the data sources page. Click + Add new data source and select Prometheus as the data source type.
-
On the Settings tab, configure the data source:
-
Name: Enter a custom name.
-
Prometheus server URL: Paste the HTTP API address that you obtained in Step 1.
-
Auth: Enable Basic auth. For User, enter the AccessKey for your Alibaba Cloud account. For Password, enter the AccessKey secret for your Alibaba Cloud account.
Ensure the AccessKey and AccessKey secret belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.

-
-
At the bottom of the tab, click Save & Test.
-
-
Query the metrics:
-
In the upper-left corner of the Grafana page, click the menu icon. In the left navigation pane, click Dashboards.
-
On the Dashboards page, click in the upper-right corner.
-
Click + Add visualization and select the data source that you just created.
-
On the Edit Panel page, click the Query tab. In the A section, select __name__ and the metric name in the Label filters field. For example, to query model token consumption, use the model_usage metric.
In this example, the value for __name__ is model_usage. You can replace it with any metric name from the list of monitoring metrics below. Use the following labels to further refine your query:
-
Click Run queries.
If data appears in the chart, the configuration is successful. If no data appears, verify the following: 1. Verify that the HTTP API address, AccessKey, and AccessKey secret are correct. 2. Ensure the Prometheus instance from Step 1 contains monitoring data.
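If you prefer to script the data source setup instead of clicking through the UI, the settings above can be expressed as a JSON payload. The sketch below only builds and prints that payload. The field names (`access`, `basicAuth`, `basicAuthUser`, `secureJsonData.basicAuthPassword`) follow Grafana's data source schema as an assumption to verify against your Grafana version, and the helper name and placeholder values are hypothetical.

```python
import json


def prometheus_datasource(name, http_api, access_key, access_key_secret):
    """Build a Grafana data source definition mirroring the UI settings above."""
    return {
        "name": name,                     # Name: any custom name
        "type": "prometheus",             # data source type
        "access": "proxy",                # Grafana server issues the queries
        "url": http_api,                  # Prometheus server URL from Step 1
        "basicAuth": True,                # Auth: enable Basic auth
        "basicAuthUser": access_key,      # User: AccessKey
        "secureJsonData": {
            "basicAuthPassword": access_key_secret,  # Password: AccessKey secret
        },
    }


# Placeholder values (assumptions): replace with your own before use.
payload = prometheus_datasource(
    "model-monitoring",
    "https://example.invalid/prometheus-http-api",
    "YOUR_ACCESS_KEY",
    "YOUR_ACCESS_KEY_SECRET",
)
print(json.dumps(payload, indent=2))
```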
Comparison of monitoring modes
Model monitoring provides two monitoring modes: Basic Monitoring and Advanced Monitoring.
Basic Monitoring: A basic service that is automatically enabled when you activate Model Studio and cannot be disabled.
Advanced Monitoring: This service must be manually enabled by an Alibaba Cloud account (or a RAM user with sufficient permissions) on the Model Monitoring (Asia Pacific SE 1 (Singapore)), Model Monitoring (US East 1 (Virginia)), or Model Monitoring (China (Beijing)) page of the target workspace. You can disable this service at any time. It records call data only after it is enabled.
| Item | Basic Monitoring | Advanced Monitoring |
| --- | --- | --- |
| Data latency | Hourly | Minute-level |
| Call statistics | Supported | Supported |
| Failed call details | Not supported | Supported |
| Performance metrics | Supported | Supported |
| Applicable scope | All workspaces under the Alibaba Cloud account | Applies only to the workspace where it is enabled |
| Billing | Free | Paid |
Quotas and limits
-
Data retention period: Basic and advanced monitoring data is retained for 30 days by default. To query usage information from earlier periods, use the Expenses and Costs page.
-
Alert template limit: You can create up to 100 alert templates for each workspace.
-
API limit: You can query monitoring metrics using the Prometheus HTTP API.
-
Alternative: You can get the token consumption for a single call from the usage field in the response. This field has the following structure. For more information, see the Qwen API reference.
{ "prompt_tokens": 3019, "completion_tokens": 104, "total_tokens": 3123, "prompt_tokens_details": { "cached_tokens": 2048 } }
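A minimal sketch of reading that structure in client code follows. It assumes, as in OpenAI-compatible responses, that `cached_tokens` is a subset of `prompt_tokens`; the `summarize_usage` helper is hypothetical.

```python
import json

# The usage object shown above, as it would appear in a response body.
usage = json.loads(
    '{ "prompt_tokens": 3019, "completion_tokens": 104, "total_tokens": 3123, '
    '"prompt_tokens_details": { "cached_tokens": 2048 } }'
)


def summarize_usage(usage):
    """Split prompt tokens into cached and uncached parts.

    Assumption: cached_tokens counts tokens already included in prompt_tokens.
    """
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return {
        "total_tokens": usage["total_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "cached_prompt_tokens": cached,
        "uncached_prompt_tokens": usage["prompt_tokens"] - cached,
    }


summary = summarize_usage(usage)
print(summary)
```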
Billing
-
Basic monitoring: Free.
-
Advanced monitoring: When enabled, minute-level monitoring data is written to Cloud Monitor, and charges apply. For billing details, see Cloud Monitor billing overview.
-
Inference logs: When enabled, minute-level log data is written to Log Service, and charges apply. For billing details, see Log Service billing overview.
FAQ
Why can't I see the call count and token consumption in Model Monitoring after calling a model?
To troubleshoot, check the following:
-
Data latency: Wait for data to synchronize. Basic monitoring data is delayed by one hour, while advanced monitoring is delayed by several minutes.
-
Workspace: If you are in a sub-workspace, you can only view data for that specific workspace. Switch to the default workspace to view all data.
What can cause timeouts when calling a large model?
Common causes include:
-
Excessively long output: If the model generates a large volume of content, the response time may exceed your client's timeout setting. We recommend using streaming output to receive the first token faster.
-
Network issues: Ensure your client has a stable network connection to Alibaba Cloud services.
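To see why streaming helps with client-side timeouts, it can be useful to measure time to first token and total call duration yourself. The sketch below does this for any iterator of text chunks. `fake_stream` is a hypothetical stand-in for a real streaming response, and the sleep values are arbitrary.

```python
import time


def measure_stream(chunks):
    """Consume an iterator of text chunks.

    Returns (full_text, time_to_first_token, total_duration) in seconds.
    time_to_first_token is None if no non-empty chunk arrives.
    """
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if first_token_at is None and chunk:
            first_token_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    ttft = None if first_token_at is None else first_token_at - start
    return "".join(parts), ttft, end - start


def fake_stream():
    """Hypothetical stand-in for a streaming model response."""
    for piece in ["Hello", ", ", "world"]:
        time.sleep(0.01)  # simulated generation delay per chunk
        yield piece


text, ttft, duration = measure_stream(fake_stream())
print(f"text={text!r} ttft={ttft:.3f}s duration={duration:.3f}s")
```

With streaming, a client can apply its timeout to the gap between chunks rather than to the whole response, so long outputs are less likely to trip it.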
How do I configure permissions for a RAM user to enable advanced monitoring?
Procedure:
-
Grant the AliyunBailianFullAccess global management permission for Model Studio to the RAM user.
-
Grant the Model Monitoring - Operations (or Administrator) page permission to the RAM user. This grants the RAM user write access to the Model Monitoring page.
-
Attach the AliyunCloudMonitorFullAccess system policy to the RAM user.
-
Create and attach a custom policy that grants permission to create a service-linked role for the RAM user.
-
Log on to the RAM console. In the left-side navigation pane, choose , and then click Create Policy.
-
Click the JSON tab. Paste the following policy document into the editor and click OK.
{ "Version": "1", "Statement": [ { "Action": "ram:CreateServiceLinkedRole", "Resource": "*", "Effect": "Allow" } ] }
-
Enter CreateServiceLinkedRole for the policy name and click OK.
-
In the left-side navigation pane, choose . Find the target RAM user in the list, and click Add Permission in the Actions column.
-
In the Policies list, select the CreateServiceLinkedRole policy that you created and click Grant permissions. The RAM user can now create a service-linked role.
-
After configuring the required permissions, return to the Model Monitoring (Asia Pacific SE 1 (Singapore)), Model Monitoring (US East 1 (Virginia)), or Model Monitoring (China (Beijing)) page and retry enabling Advanced Monitoring as the RAM user.
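Before pasting the custom policy into the RAM console, you may want to sanity-check the document locally. The following sketch parses the policy shown in the procedure and confirms that it contains an Allow statement for `ram:CreateServiceLinkedRole`. The `allows_action` helper is hypothetical and deliberately simple: it ignores wildcards, resources, conditions, and Deny statements, so it is not a substitute for RAM's own policy evaluation.

```python
import json

# The custom policy document from the procedure above.
POLICY = json.loads(
    '{ "Version": "1", "Statement": [ { "Action": "ram:CreateServiceLinkedRole", '
    '"Resource": "*", "Effect": "Allow" } ] }'
)


def allows_action(policy, action):
    """Return True if any Allow statement lists the exact action name."""
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]  # a single action may be given as a string
        if statement.get("Effect") == "Allow" and action in actions:
            return True
    return False


print(allows_action(POLICY, "ram:CreateServiceLinkedRole"))
```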
How do I configure permissions for a RAM user to enable Inference Logs?
Procedure:
-
Grant the AliyunBailianFullAccess global management permission for Model Studio to the RAM user.
-
Grant the Model Monitoring - Operations (or Administrator) page permission to the RAM user. This grants the RAM user write access to the Model Monitoring page.
-
Attach the AliyunLogFullAccess system policy to the RAM user.
-
Create and attach a custom policy that grants permission to create a service-linked role for the RAM user.
-
Log on to the RAM console. In the left-side navigation pane, choose , and then click Create Policy.
-
Click the JSON tab. Paste the following policy document into the editor and click OK.
{ "Version": "1", "Statement": [ { "Action": "ram:CreateServiceLinkedRole", "Resource": "*", "Effect": "Allow" } ] }
-
Enter CreateServiceLinkedRole for the policy name and click OK.
-
In the left-side navigation pane, choose . Find the target RAM user in the list, and click Add Permission in the Actions column.
-
In the Policies list, select the CreateServiceLinkedRole policy that you created and click Grant permissions. The RAM user can now create a service-linked role.
-
After configuring the required permissions, return to the Model Monitoring (China (Beijing)) page and retry enabling Inference Logs as the RAM user.
Appendix
Glossary
| Term | Description |
| --- | --- |
| Real-time | Refers to all direct and indirect calls made to a model. |
| Batches | Large-scale, offline data processing using the OpenAI-compatible Batch (file input) API for use cases that do not require real-time responses. |
