The model observation feature lets you view:
- Call records and inference logs. The logs contain the complete content of each interaction, including the messages sent to the model and the responses from the model. The inference log feature is available only for the China Mainland Version (Beijing).
- Token consumption
- Performance metrics, such as token latency, call duration, requests per minute (RPM), tokens per minute (TPM), and failure rate
Scenarios
| Call statistics | Performance metrics |
| --- | --- |
| View model usage over a specific period. | View various common performance metrics for your models. |
| View trends and fluctuations in call counts and volume. View failure counts and failure rates to promptly detect anomalies. | Analyze RPM and TPM to inform future capacity planning. View call duration and token latency to track model performance changes. |
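To make these metrics concrete, the following is a small illustrative sketch, not a Model Studio API: the `CallRecord` structure is an assumption for demonstration, showing how RPM, TPM, and failure rate relate to a one-minute window of call records.

```python
# Illustrative sketch only: derive RPM, TPM, and failure rate from a
# one-minute window of call records. The record structure is an assumption
# for demonstration, not a Model Studio data format.
from dataclasses import dataclass

@dataclass
class CallRecord:
    total_tokens: int   # prompt + completion tokens for the call
    succeeded: bool     # whether the call returned successfully

def window_metrics(records: list[CallRecord]) -> dict:
    """Compute per-minute metrics for calls in a one-minute window."""
    rpm = len(records)                           # requests per minute
    tpm = sum(r.total_tokens for r in records)   # tokens per minute
    failures = sum(1 for r in records if not r.succeeded)
    failure_rate = failures / rpm if rpm else 0.0
    return {"rpm": rpm, "tpm": tpm, "failure_rate": failure_rate}

print(window_metrics([CallRecord(3123, True), CallRecord(980, False)]))
```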
Supported models
All models listed in Models and pricing are supported.
Inference logs are supported only for the following models: qwen-max, qwen-max-latest, qwen-plus, qwen-plus-latest, qwen-flash, qwen-turbo, qwen-turbo-latest, qwen3-max-preview, qwen3-max, qwen3-235b-a22b, qwen3-235b-a22b-instruct-2507, qwen3-235b-a22b-thinking-2507, qwen3-30b-a3b-thinking-2507, and qwen3-30b-a3b-instruct-2507.
Get started
Prerequisites
Follow the instructions on the Model Observation (Singapore or Beijing) page to activate the service. After activation, model monitoring (Basic Monitoring mode) is automatically enabled and cannot be disabled. Use your Alibaba Cloud account to activate the service. Activation usually takes effect within minutes, but there may be a slight delay during peak hours. To use a RAM user, the Alibaba Cloud account must grant the RAM user the required permissions. For more information, see the FAQ in this topic.
Monitoring modes
Model observation offers two monitoring modes: Basic Monitoring and Advanced Monitoring.
- Basic Monitoring: Provided as a basic service. It is automatically enabled when you activate the model inference service and cannot be disabled.
- Advanced Monitoring: Must be manually enabled by an Alibaba Cloud account (or a RAM user with sufficient permissions) on the Model Monitoring (Singapore or Beijing) page of the target workspace.
View monitoring data
After you activate the model inference service, the system automatically collects model call data from all workspaces under your Alibaba Cloud account. Each time a user makes a direct or indirect request to a model, the system collects and syncs the relevant data to the Model Observation (Singapore or Beijing) list. List records are generated by model. New models are automatically added to the list after the first data synchronization is complete. The data latency for Basic Monitoring is typically one hour, so allow time for the data to appear. Members of the default workspace can view model call details for all workspaces. Members of a sub-workspace can only view model call details for the current workspace and cannot filter by workspace.
After a model appears in the list, click Monitor in the Actions column to view details for Call Statistics (such as call count and token consumption) and Performance Metrics (such as RPM, TPM, call duration, and first token latency). You can filter the data by API-KEY, inference type, and time range. Filter by API-KEY: In the default workspace, you can filter by all API-KEYs. In a sub-workspace, you can only filter by the API-KEYs of the current workspace. The Others option in the filter criteria refers to calls initiated from the Model Studio console, including both direct and indirect calls.
Click Log in the Actions column on the right to view the complete interaction content for real-time inference (the messages that you send to the model and the replies that the model returns). The logging feature must be enabled manually: In the upper-right corner of the Model Observation page, click Model Observation Configuration, and then click Create And Enable Audit Log and Create And Enable Inference Log. To disable this feature, turn off Inference Log in Model Observation Configuration. The system will no longer record model call logs for the workspace.
Configure model alerts
Set alerts for key metrics to receive timely notifications and intervene when business anomalies occur, such as sudden cost increases or frequent call failures. This feature requires Advanced Monitoring. Procedure:
The system provides four preset alert levels: Normal, Warning, Error, and Critical. Define clear and consistent criteria for each level to help your team quickly identify and respond to issues. Alert levels cannot be modified or added. Currently, Model Studio does not differentiate notification methods based on the alert level.
Quotas and limits
Data retention period: By default, data for both Basic and Advanced Monitoring is retained for 30 days. To query usage information older than 30 days, go to the Expenses and Costs page.
Alert template limit: You can create a maximum of 100 alert templates in each workspace.
API limits: Model observation does not currently provide an API operation.
Workaround: To retrieve token consumption information through an API, you can extract the current call data from the `usage` field in the response each time you call the model. Historical or summary queries are not currently supported. The following is an example of the field structure (for more information, see Qwen API reference):

```json
{
    "prompt_tokens": 3019,
    "completion_tokens": 104,
    "total_tokens": 3123,
    "prompt_tokens_details": {
        "cached_tokens": 2048
    }
}
```
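For example, the following is a minimal sketch of reading the `usage` field through the OpenAI-compatible endpoint. The `base_url`, model name, and API key placeholder are assumptions; substitute the values for your own account and region.

```python
# Minimal sketch (assumptions: OpenAI-compatible endpoint and the qwen-plus
# model; replace base_url and api_key with the values for your account/region).
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",  # your Model Studio API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Per-call token consumption is returned in the `usage` field.
usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```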
Billing description
Basic Monitoring: Free of charge.
Advanced Monitoring: After you enable this feature, minute-level monitoring data is written to the Cloud Monitor (CMS) service, which incurs fees. For more information about the billing method, see Cloud Monitor (CMS) Billing Overview.
Inference Log: When this feature is enabled, minute-level log data is written to Simple Log Service, which incurs fees. For more information about the billing method, see Simple Log Service Billing Overview.
FAQ
Why can't I find the call count and token consumption in model observation after calling a model?
Troubleshoot the issue as follows:
Check for data latency: Confirm that you have waited long enough for data synchronization. Data is synchronized hourly for Basic Monitoring and every minute for Advanced Monitoring.
Check the workspace: If you are in a sub-workspace, you can only see data for that workspace. Switch to the default workspace to view all data.
How do I configure permissions for a RAM user to enable Advanced Monitoring?
Procedure:
1. Grant the `AliyunBailianFullAccess` global management permission to the RAM user.
2. Assign the `Model Observation - Operations` (or `Administrator`) page permission to the RAM user to allow write operations on the model observation page.
3. Attach the `AliyunCloudMonitorFullAccess` system policy to the RAM user.
4. Attach a custom policy to the RAM user that grants permission to create a service-linked role:
    1. Log on to the Resource Access Management (RAM) console. In the navigation pane on the left, choose Permissions > Policies. Then, click Create Policy.
    2. Click Script Editor. Paste the following content into the policy input box and click OK.

        ```json
        {
            "Version": "1",
            "Statement": [
                {
                    "Action": "ram:CreateServiceLinkedRole",
                    "Resource": "*",
                    "Effect": "Allow"
                }
            ]
        }
        ```

    3. Enter `CreateServiceLinkedRole` as the policy name and click OK.
    4. In the navigation pane on the left, choose Identities > Users. Find the RAM user that you want to authorize and click Add Permissions in the Actions column.
    5. From the Policy list, select the policy that you just created (`CreateServiceLinkedRole`) and click Grant permissions. The RAM user now has the permission to create service-linked roles.

After completing steps 1 through 4, return to the Model Observation (Singapore or Beijing) page and try to enable Advanced Monitoring again using the RAM user.
What are the possible reasons for a timeout when calling a large language model?
Model observation does not provide logs that identify the specific cause of a timeout. Common reasons include the following:
Long output: The model generates too much content, causing the total time to exceed the client's wait limit. Use streaming output to receive the first token sooner and improve the user experience (see the sketch after this list).
Network issues: Check if the network connectivity between the client and the Alibaba Cloud service is stable.
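The following is a minimal streaming sketch under the same assumptions as the earlier example (OpenAI-compatible endpoint; illustrative `base_url` and model name):

```python
# Minimal streaming sketch (assumptions: OpenAI-compatible endpoint and the
# qwen-plus model; replace base_url and api_key with your own values).
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# stream=True returns chunks as they are generated, so the first token
# arrives long before the full completion and client timeouts are less likely.
stream = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Write a short poem."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```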
Appendix
Glossary
| Term | Description |
| --- | --- |
| Real-time Inference | Refers to all direct and indirect calls to a model, such as calls made through the API or SDK and calls initiated from the Model Studio console. |
| Batch Inference | Large-scale, offline data processing using the OpenAI-compatible Batch interface for scenarios that do not require real-time responses. |
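As an illustration of the Batch interface, here is a minimal sketch using the OpenAI-compatible client; the input file name and endpoint values are assumptions for demonstration.

```python
# Minimal Batch sketch (assumptions: OpenAI-compatible endpoint; requests.jsonl
# is a hypothetical file of request lines prepared in the Batch input format).
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Upload the request file, then submit it as an offline batch job.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) for completion
```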



