Use the model observation feature to perform the following operations:
View call records
Monitor and create alerts for metrics, such as token latency, call duration, requests per minute (RPM), tokens per minute (TPM), and failure rate
Track token consumption
Check model availability
All models in the Model list are supported.
Monitor model operations
After you activate the model inference service, Alibaba Cloud Model Studio automatically adds the following four categories of monitoring metrics to the Model Observation dashboard:
Security: Identifies non-compliant content in conversations, such as Content Moderation errors.
Cost: Evaluates the cost-effectiveness of the model, such as average tokens per request.
Performance: Observes changes in model performance, such as call duration and time to first token.
Error: Determines the stability of the model, such as failures and failure rate.
You can create alerts based on the preceding metrics to promptly detect and handle anomalies.
Step 1: Activate the model inference service
If the following message is displayed at the top of the Model Studio console, use your Alibaba Cloud account to activate the service and obtain a free call quota. If the message is not displayed, the service is already activated.

After activation, the system automatically collects model call data from all workspaces under your Alibaba Cloud account. When a direct or indirect model call occurs, the system automatically collects and syncs the relevant data to the Model Observation (Singapore or Beijing) list.
List records are generated by model and workspace. New models are automatically added to the list after the first data synchronization is complete. The latency for Basic Monitoring is typically at the hour level. For minute-level data insights, use Advanced Monitoring.
Members of the default workspace can view model call details for all workspaces. Members of a sub-workspace can only view data for the current workspace and cannot switch to view data from other workspaces.
Step 2: View monitoring metrics
When the model appears in the list, click Monitor in the Actions column to view Call Statistics, such as the number of calls and the number of failures. Filter the statistics by API key, Inference type, and time range.
Rate limit errors: Refers to failures with the 429 status code.
Content Moderation errors: Refers to calls intercepted by the Content Moderation service because the input or output contains suspected sensitive or high-risk content, such as pornography, political content, or advertisements.
On the Performance Metrics tab, you can view metrics such as RPM, TPM, call duration, and time to first token.
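As a rough illustration of how RPM and TPM relate, RPM counts requests in each minute while TPM sums the tokens across those requests. The following is a minimal sketch using hypothetical call timestamps and token counts; it is not part of the Model Observation API:

```python
from collections import defaultdict

def per_minute_rates(calls):
    """calls: list of (epoch_seconds, tokens) pairs.
    Returns {minute_index: (rpm, tpm)} for each minute that has traffic."""
    buckets = defaultdict(lambda: [0, 0])
    for ts, tokens in calls:
        minute = int(ts // 60)
        buckets[minute][0] += 1       # one more request in this minute
        buckets[minute][1] += tokens  # tokens consumed in this minute
    return {m: tuple(v) for m, v in buckets.items()}

# Hypothetical traffic: two calls in minute 0, one call in minute 1.
calls = [(0, 100), (30, 200), (65, 50)]
print(per_minute_rates(calls))  # {0: (2, 300), 1: (1, 50)}
```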
View token consumption
In practice, adjusting model parameters, system prompts, and other operations can change the model's token consumption. To calculate and manage costs with fine-grained control, Model Observation provides the following cost monitoring features:
Summary: Summarizes the historical token consumption of models by workspace. You can further filter by time range and API key.
Alerting: Allows you to set token consumption thresholds. The system immediately sends an alert when a specified model shows abnormal consumption.
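The summary and alerting features above can be mimicked client-side. The sketch below sums per-call token counts by model and flags models that exceed a threshold; the record shape is illustrative (modeled on the `usage` field of model responses), not a Model Studio API:

```python
# Hypothetical per-call usage records, e.g. collected from the `usage`
# field of each model response.
records = [
    {"model": "qwen-plus", "total_tokens": 3123},
    {"model": "qwen-plus", "total_tokens": 2890},
    {"model": "qwen-turbo", "total_tokens": 512},
]

def summarize_tokens(records):
    """Sum total_tokens per model, mirroring the Summary view."""
    totals = {}
    for r in records:
        totals[r["model"]] = totals.get(r["model"], 0) + r["total_tokens"]
    return totals

def check_threshold(totals, threshold):
    """Return models whose consumption exceeds the alert threshold."""
    return [model for model, total in totals.items() if total > threshold]

totals = summarize_tokens(records)
print(totals)                          # {'qwen-plus': 6013, 'qwen-turbo': 512}
print(check_threshold(totals, 5000))   # ['qwen-plus']
```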
Step 1: Activate the model inference service
Ensure that you have activated the model service.
Step 2: View token consumption or create an alert
View the historical token consumption of a model:
View token consumption for the last 30 days:
To view older usage data, query the data on the Expenses and Costs page.
Create an alert for abnormal consumption:
Create proactive alerts
Silent model failures, such as timeouts or sudden increases in token consumption, are difficult to detect with traditional application logs. Model Observation lets you set alerts for monitoring metrics such as cost, failure rate, and response latency. If a metric becomes abnormal, the system immediately sends an alert.
Step 1: Enable Advanced Monitoring
Ensure that you have activated the model inference service.
Log on with an Alibaba Cloud account (or a RAM user with sufficient permissions). On the Model Observation (Singapore or Beijing) page for the target workspace, click Model Observation Configurations in the upper-right corner.
In the Advanced Monitoring area, you can manually enable Performance and Usage Metrics Monitoring.
Step 2: Create an alert rule
On the Model Alert (Singapore or Beijing) page, click Create Alert Rule in the upper-right corner.
In the dialog box, select the model and monitoring template, and then click Create. If the specified monitoring metrics (such as call statistics or performance metrics) become abnormal, the system notifies your team.
Notification methods: Supported methods include text message, email, phone call, DingTalk group robot, WeCom robot, and Webhook.
Alert level: The available levels are General, Warning, Error, and Urgent. These levels are predefined and cannot be modified. The notification method is the same for all levels. We recommend establishing consistent handling procedures within your team.
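If you choose the Webhook notification method, your endpoint receives alert notifications as HTTP POST requests. The following is a minimal receiver sketch; the payload schema (the `level` field) is an assumption for illustration, so adapt the parsing to the actual notification body you receive:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_alert(raw: bytes) -> str:
    """Extract the alert level from an assumed JSON payload.
    Falls back to "General" when the field or body is missing."""
    payload = json.loads(raw or b"{}")
    return payload.get("level", "General")

class AlertHandler(BaseHTTPRequestHandler):
    """Accepts webhook POSTs and acknowledges with 200."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        level = parse_alert(self.rfile.read(length))
        print(f"[{level}] alert received")  # route to your team's tooling here
        self.send_response(200)
        self.end_headers()

# To run the receiver (blocks forever):
# HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```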
Connect to Grafana or custom application
The monitoring metric data from Model Observation is stored in your private Prometheus instance. It supports the standard Prometheus HTTP API, which you can use to connect to Grafana or your custom applications for visual analytics.
Step 1: Obtain the data source HTTP API address
Ensure that you have enabled Advanced Monitoring.
On the Model Observation (Singapore or Beijing) page, click Model Observation Configurations in the upper-right corner, and then click View Details to the right of the CloudMonitor Prometheus instance.
On the Settings tab, copy the HTTP API URL that matches your client's network environment, Internet or Internal Network (VPC).

Step 2: Connect to Grafana or a custom application
Connect to a custom application
The following example shows how to retrieve monitoring data using the Prometheus HTTP API. For complete API usage details, see the Prometheus HTTP API reference.
Example 1: Query the token consumption (query=model_usage) for all models in all workspaces under your Alibaba Cloud account within a specified time range (all day on November 20, 2025, UTC), with a step size of step=60s.

```
GET {HTTP API}/api/v1/query_range?query=model_usage&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s
Accept: application/json
Content-Type: application/json
Authorization: Basic base64Encode(AccessKey:AccessKeySecret)
```

Parameter description:
query: The value of query can be replaced with any metric name from the Monitoring metrics list below.
HTTP API: Replace {HTTP API} with the HTTP API address that you obtained in Step 1.
Authorization: Concatenate your Alibaba Cloud account's AccessKey:AccessKeySecret, Base64-encode the resulting string, and provide it in the format Basic <encoded-string>. Example value: Basic TFRBSTV3OWlid0U4XXXXU0xb1dZMFVodmRsNw==
Note: The AccessKey and AccessKey secret must belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.
Example 2: Building on Example 1, add filters to retrieve token consumption only for a specific model (model=qwen-plus) in a specific workspace (workspace_id=llm-nymssti2mzww****).

```
GET {HTTP API}/api/v1/query_range?query=model_usage{workspace_id="llm-nymssti2mzww****",model="qwen-plus"}&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s
Accept: application/json
Content-Type: application/json
Authorization: Basic base64Encode(AccessKey:AccessKeySecret)
```

Parameter description:
query: Enclose multiple filter conditions in {} and separate them with commas. For example: {workspace_id="value1",model="value2"}. The following are the supported filter conditions (LabelKey).
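The two requests above can be sketched in Python. This helper only builds the request URL and Basic-auth header for the standard Prometheus `query_range` endpoint; the endpoint address and credentials are placeholders that you replace with the values from Step 1:

```python
import base64
from urllib.parse import urlencode

def build_query_range(api_base, metric, start, end, step,
                      labels=None, access_key="", access_key_secret=""):
    """Build the URL and headers for a Prometheus range query.
    api_base is the HTTP API address from Step 1; the credentials are
    your Alibaba Cloud AccessKey pair."""
    if labels:  # e.g. {"workspace_id": "llm-xxx", "model": "qwen-plus"}
        selector = ",".join(f'{k}="{v}"' for k, v in labels.items())
        metric = f"{metric}{{{selector}}}"
    params = urlencode({"query": metric, "start": start, "end": end, "step": step})
    token = base64.b64encode(f"{access_key}:{access_key_secret}".encode()).decode()
    url = f"{api_base}/api/v1/query_range?{params}"
    headers = {"Accept": "application/json",
               "Authorization": f"Basic {token}"}
    return url, headers

# Example 2 above: token usage for qwen-plus in one workspace.
url, headers = build_query_range(
    "https://example-prometheus-endpoint",  # placeholder for the Step 1 address
    "model_usage",
    "2025-11-20T00:00:00Z", "2025-11-20T23:59:59Z", "60s",
    labels={"workspace_id": "llm-nymssti2mzww****", "model": "qwen-plus"},
    access_key="AK", access_key_secret="SK",
)
# Send with, for example: requests.get(url, headers=headers).json()
```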
Connect to Grafana
Add a model observation data source in Grafana (self-hosted or the Alibaba Cloud Grafana service). This topic uses Grafana 10.x (English version) as an example. The steps for other versions are similar. For more information, see the official Grafana documentation.
Add the data source:
Log on to Grafana using an administrator account. Click the icon in the upper-left corner of the page and open the data source management page. Click + Add new data source. For the data source type, select Prometheus.
On the Settings tab, configure the data source information:
Name: Enter a custom name.
Prometheus server URL: Enter the HTTP API address you got in Step 1.
Auth: Enable Basic auth, and set User (your Alibaba Cloud account's AccessKey) and Password (your Alibaba Cloud account's AccessKey secret).
The AccessKey and AccessKey secret must belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.

Click Save & Test at the bottom of the tab.
Query metrics:
Click the icon in the upper-left corner of the Grafana page and, in the navigation pane on the left, click Dashboards.
On the Dashboards page, create a new dashboard.
Click + Add visualization and select the data source that you just created.
On the Edit Panel page, click the Query tab. In the A area, select _name_ and the metric name in the Label filters field. For example, to query the model token consumption model_usage:

In this example, the value of _name_ (model_usage) can be replaced with any metric name from the Monitoring metrics list. You can add the following Label filters to further refine the query:
Click Run queries.
If data is successfully rendered in the chart, the configuration is successful. Otherwise, check the following: 1) The HTTP API address, AccessKey, and AccessKey secret are correct. 2) The Prometheus instance from Step 1 contains monitoring data.
Compare monitoring modes
Model observation offers two monitoring modes: Basic Monitoring and Advanced Monitoring.
Basic Monitoring: This basic service is automatically enabled when the model service is activated and cannot be disabled.
Advanced Monitoring: Must be manually enabled by the Alibaba Cloud account (or a RAM user with sufficient permissions) on the Model Observation (Singapore or Beijing) page of the target workspace. It can be disabled. Only call data generated after this feature is enabled is recorded.
Quotas and limits
Data retention period: By default, data for both Basic and Advanced Monitoring is retained for 30 days. To query usage information that is older than 30 days, go to the Expenses and Costs page.
Alert template limit: You can create up to 100 alert templates in each workspace.
API limits: You can query monitoring metric data for Model Observation through the Prometheus HTTP API.
Workaround: To retrieve token consumption for a single call through an API, you can extract the current call data from the usage field in the response from each model call. This field has the following structure. For more information, see Qwen API reference:

```json
{
    "prompt_tokens": 3019,
    "completion_tokens": 104,
    "total_tokens": 3123,
    "prompt_tokens_details": {
        "cached_tokens": 2048
    }
}
```
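A small helper can pull these counts out of each response for per-call cost tracking. This sketch assumes the response has been parsed into a Python dict with the usage structure shown above:

```python
def extract_usage(response: dict) -> dict:
    """Pull token counts from the `usage` field of a model response.
    Missing fields default to 0 so partial responses do not raise."""
    usage = response.get("usage", {})
    return {
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
        "cached_tokens": usage.get("prompt_tokens_details", {}).get("cached_tokens", 0),
    }

# The structure shown in the documentation above:
response = {"usage": {"prompt_tokens": 3019, "completion_tokens": 104,
                      "total_tokens": 3123,
                      "prompt_tokens_details": {"cached_tokens": 2048}}}
print(extract_usage(response))
```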
Billing
Basic Monitoring: Free of charge.
Advanced Monitoring: After you enable this feature, minute-level monitoring data is written to the CloudMonitor (CMS) service, which incurs additional fees. For more information about the billing method, see Billing overview of CloudMonitor.
FAQ
Why can't I find the call count and token consumption in Model Observation after I call a model?
You can troubleshoot the issue as follows:
Data latency: Confirm that you have waited long enough for data synchronization. Data is synchronized hourly for Basic Monitoring and every minute for Advanced Monitoring.
Workspace: If you are in a sub-workspace, you can view data only for that workspace. Switch to the default workspace to view all data.
What are the possible reasons for a timeout when I call a large language model?
Common reasons include the following:
Long output: The model generates too much content, which causes the total time to exceed the client's wait limit. You can use the streaming output method to retrieve the first token more quickly.
Network issues: Check whether the network connectivity between the client and the Alibaba Cloud service is stable.
How do I configure permissions for a RAM user to enable Advanced Monitoring?
Follow these steps:
1. Grant the AliyunBailianFullAccess global management permission to the RAM user.
2. Assign the ModelObservation-FullAccess (or Administrator) page permission to the RAM user to allow write operations on the Model Observation page.
3. Grant the AliyunCloudMonitorFullAccess system policy to the RAM user.
4. Create and grant a custom policy that allows the RAM user to create service-linked roles:
   a. Log on to the RAM console. In the navigation pane on the left, open the policy management page. Then, click Create Policy.
   b. Click JSON, paste the following content into the policy editor, and click OK.

```json
{
    "Version": "1",
    "Statement": [
        {
            "Action": "ram:CreateServiceLinkedRole",
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
```

   c. Enter CreateServiceLinkedRole as the policy name and click OK.
   d. In the navigation pane on the left, open the user management page. Find the RAM user that you want to authorize and click Add Permissions in the Actions column.
   e. From the policy list, select the policy that you just created (CreateServiceLinkedRole) and click Grant permissions. The RAM user now has the permission to create service-linked roles.
After you complete all the preceding permission configurations, return to the Model Observation (Singapore or Beijing) page and use the RAM user to try enabling Advanced Monitoring again.
Appendix
Glossary
| Term | Description |
| --- | --- |
| Real-time Inference | All direct and indirect calls to a model, including calls made in a variety of scenarios. |
| Batch Inference | Large-scale, offline data processing using the OpenAI-compatible Batch interface for scenarios that do not require real-time responses. |
