Monitor your MaxCompute resources, including subscription resources and pay-as-you-go job consumption, to understand their operational status so that you can promptly upgrade resources or plan jobs. You can also configure alert rules. When a metric meets the specified conditions, Cloud Monitor automatically sends a notification, which helps you quickly detect and handle anomalies.
Monitoring and alert solutions
MaxCompute supports monitoring and alerts through the following methods:
Use Cloud Monitor to monitor metrics for subscription resources and real-time job consumption.
Use a dashboard to view monitoring charts in real time and track changes for each metric.
Create custom alert rules and add alert contacts. When a metric reaches or exceeds a specified threshold, Cloud Monitor automatically sends a notification to the designated contacts. Supported notification methods include phone calls, SMS messages, emails, and DingTalk chatbots.
Log on to the MaxCompute console. On the Overview page, you can view the number of alerts for each metric in the Alert and Risk Warnings section.
Use the MaxCompute client to monitor the consumption of individual SQL jobs. For more information about SQL consumption monitoring, see Single SQL consumption limit.
Metrics
The following table lists the metric types and metrics supported by MaxCompute.
Metric type | Metric category | Metric | Description |
MaxCompute-Subscription Compute Quota | level1 | Level 1 quota CPU utilization | The CPU usage of a level 1 quota as a percentage of its total CUs (reserved CUs + flexible reserved CUs). Unit: %. Data is collected every minute. |
MaxCompute-Subscription Compute Quota | level1 | Level 1 quota CPU usage | The total CPU usage of a level 1 quota. Unit: core. Data is collected every minute. |
MaxCompute-Subscription Compute Quota | level1 | Level 1 quota MEM utilization | The memory usage of a level 1 quota as a percentage of its total memory (reserved + flexible reserved). Unit: %. Data is collected every minute. |
MaxCompute-Subscription Compute Quota | level1 | Level 1 quota MEM usage | The memory usage of a level 1 quota. Unit: MB. Data is collected every minute. |
MaxCompute-Subscription Compute Quota | level2 | Level 2 quota CPU utilization | The CPU usage of a level 2 quota as a percentage of its total CUs (reserved Min CUs + flexible reserved CUs). Unit: %. Data is collected every minute. |
MaxCompute-Subscription Compute Quota | level2 | Level 2 quota CPU usage | The total CPU usage of a level 2 quota. Unit: core. Data is collected every minute. |
MaxCompute-Subscription Compute Quota | level2 | Level 2 quota MEM utilization | The memory usage of a level 2 quota as a percentage of its total memory (reserved Min + flexible reserved). Unit: %. Data is collected every minute. |
MaxCompute-Subscription Compute Quota | level2 | Level 2 quota MEM usage | The memory usage of a level 2 quota. Unit: MB. Data is collected every minute. |
MaxCompute-Subscription Compute Quota | level2 | Level 2 quota waiting jobs | The number of jobs waiting in a level 2 quota. Unit: count. Data is collected every minute. |
MaxCompute-General | Tunnel | Project-level Tunnel download traffic | The real-time download traffic of the project. You can set a maximum download traffic threshold (unit: bytes/min). An alert is triggered if this threshold is reached or exceeded. |
MaxCompute-General | Tunnel | Project-level Tunnel upload traffic | The real-time upload traffic of the project. You can set a maximum upload traffic threshold (unit: bytes/min). An alert is triggered if this threshold is reached or exceeded. |
MaxCompute-General | Tunnel | Project-level Tunnel cumulative daily download volume | The cumulative daily download volume of the project. You can set a maximum data volume threshold (unit: MB). An alert is triggered if this threshold is reached or exceeded. |
MaxCompute-General | Tunnel | Project-level Tunnel cumulative daily upload volume | The cumulative daily upload volume of the project. You can set a maximum data volume threshold (unit: MB). An alert is triggered if this threshold is reached or exceeded. |
MaxCompute-General | Tunnel | Project-level current Tunnel concurrency (slots) | The number of concurrent Tunnel slots in use by the selected project. An alert is triggered if the threshold is reached or exceeded. |
MaxCompute-General | Tunnel | Tenant-level current Tunnel concurrency (slots) | The number of concurrent Tunnel slots in use by the selected tenant. An alert is triggered if the threshold is reached or exceeded. |
MaxCompute-General | Job | Job runtime | This metric monitors all jobs in a MaxCompute project. If the runtime (including the wait time) of a job exceeds the specified threshold, the system sends an alert notification to the alert contacts based on the configured alert rule. Important: Jobs with a runtime of less than 1 minute cannot be monitored. |
MaxCompute-General | Job | Job runtime (SQL type) | This metric monitors all SQL jobs in a MaxCompute project. If the runtime (including the wait time) of an SQL job exceeds the specified threshold, the system sends an alert notification to the alert contacts based on the configured alert rule. Important: Jobs with a runtime of less than 1 minute cannot be monitored. |
MaxCompute-General | Job | Job runtime (SQL type, by submitter) | This metric monitors the runtime (including the wait time) of all SQL jobs in a MaxCompute project. When the runtime of an SQL job exceeds the specified threshold, the system sends an alert notification to the alert contacts based on the configured alert rule. The alert includes the job submitter's information to help recipients identify the job owner. Important: Jobs with a runtime of less than 1 minute cannot be monitored. |
MaxCompute-General | Cost | Daily Storage API read consumption | This metric measures the cumulative daily data read consumption of Storage APIs at the project level. Unit: GiB. An alert is triggered if the threshold is reached or exceeded. Note: Each tenant is entitled to a free monthly quota of 1 TB for data reads and writes through Storage APIs. Monitoring starts after the consumption exceeds 1 TB. |
MaxCompute-General | Cost | Monthly Storage API read consumption | This metric measures the cumulative monthly data read consumption of Storage APIs at the project level. Unit: GiB. An alert is triggered if the threshold is reached or exceeded. Note: Each tenant is entitled to a free monthly quota of 1 TB for data reads and writes through Storage APIs. Monitoring starts after the consumption exceeds 1 TB. |
MaxCompute-General | Cost | Daily Storage API write consumption | This metric measures the cumulative daily data write consumption of Storage APIs at the project level. Unit: GiB. An alert is triggered if the threshold is reached or exceeded. Note: Each tenant is entitled to a free monthly quota of 1 TB for data reads and writes through Storage APIs. Monitoring starts after the consumption exceeds 1 TB. |
MaxCompute-General | Cost | Monthly Storage API write consumption | This metric measures the cumulative monthly data write consumption of Storage APIs at the project level. Unit: GiB. An alert is triggered if the threshold is reached or exceeded. Note: Each tenant is entitled to a free monthly quota of 1 TB for data reads and writes through Storage APIs. Monitoring starts after the consumption exceeds 1 TB. |
MaxCompute-General | Cost | Daily consumption of pay-as-you-go jobs (USD) | This metric measures the cumulative daily cost of SQL and MapReduce jobs at the project level. You can set a maximum daily cost threshold (USD). An alert is triggered if this threshold is reached or exceeded. |
MaxCompute-General | Cost | Monthly consumption of pay-as-you-go jobs (USD) | This metric measures the cumulative monthly cost of SQL and MapReduce jobs at the project level. You can set a maximum monthly cost threshold (USD). An alert is triggered if this threshold is reached or exceeded. |
MaxCompute-General | Storage | Project-level standard storage size | The total standard storage used by the project. Unit: GB. Data is collected every hour. |
MaxCompute-General | Storage | Project-level IA storage size | The total IA (Infrequent Access) storage used by the project. Unit: GB. Data is collected every hour. |
MaxCompute-General | Storage | Project-level IA storage access percentage in the last 30 days | The value is calculated using the following formula: |
MaxCompute-General | Storage | Project-level archive storage size | The total archive storage used by the project. Unit: GB. Data is collected every hour. |
MaxCompute-General | Storage | Project-level archive storage access percentage in the last 180 days | The value is calculated using the following formula: |
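The job runtime metrics above count wait time as part of a job's runtime and skip jobs that run for less than 1 minute. The following is a minimal sketch of which jobs such a rule would flag, assuming a hypothetical `Job` record with submission and end timestamps in seconds. `Job`, `runtime_seconds`, and `jobs_to_alert` are illustrative names, not a MaxCompute or Cloud Monitor API, and the sketch does not reproduce how Cloud Monitor evaluates the metric internally.

```python
from dataclasses import dataclass
from typing import List, Optional

# Jobs that run for less than 1 minute are not monitored.
MIN_MONITORED_SECONDS = 60


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    instance_id: str
    submitter: str
    submit_time: float          # seconds; wait time starts counting here
    end_time: Optional[float]   # seconds; None while the job is still running


def runtime_seconds(job: Job, now: float) -> float:
    """Runtime including wait time: measured from submission, not execution start."""
    end = job.end_time if job.end_time is not None else now
    return end - job.submit_time


def jobs_to_alert(jobs: List[Job], threshold_seconds: float, now: float) -> List[Job]:
    """Return monitored jobs whose runtime exceeds the threshold."""
    flagged = []
    for job in jobs:
        runtime = runtime_seconds(job, now)
        if runtime < MIN_MONITORED_SECONDS:
            continue  # too short to be monitored
        if runtime > threshold_seconds:
            flagged.append(job)
    return flagged


jobs = [
    Job("job_a", "alice", submit_time=0, end_time=30),      # 30 s: not monitored
    Job("job_b", "bob", submit_time=0, end_time=4000),      # finished after 4,000 s
    Job("job_c", "carol", submit_time=100, end_time=None),  # still running
]
for job in jobs_to_alert(jobs, threshold_seconds=1800, now=2000):
    # Printing the submitter mirrors the "by submitter" variant of the metric.
    print(job.instance_id, job.submitter)  # prints "job_b bob", then "job_c carol"
```

Measuring from submission rather than from execution start is what makes a job that sits in a queue count toward the threshold, which matches the "including the wait time" wording in the table.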
You can configure dashboards or alert rules for these metrics. For more information, see Configure a dashboard or Configure an alert rule.
Configure a dashboard
Log on to the Cloud Monitor console.
In the left navigation pane, choose .
On the Custom Dashboards page, click Create Dashboard. In the Create Dashboard dialog box, enter a Board Name, select a Folder, and then click OK.
Click the name of the newly created dashboard. On the page that appears, click Add Visualization Widget.
In the upper-right corner of the page, you can select a chart type, such as a line chart, bar chart, statistical chart, gauge, meter, pie chart, table, facet chart, stream chart, or histogram.
In the Query Analysis area, set Data Source Plugin to Cloud Service Monitoring, and then configure the metrics that you want to display.
For more information about how to manage monitoring charts, see Manage monitoring charts in a custom dashboard.
Configure an alert rule
You can set alert rules for any of the metrics described in the Metrics section.
The following example shows how to configure an alert rule for a resource group (quota group). The goal is to trigger an alert when the CPU or memory utilization of a subscription MaxCompute quota group exceeds a specified value. Assume that the monitored resource group is configured with 150 CUs. One fully used core represents 100% utilization, so the maximum utilization of the resource group is 150 × 100% = 15,000%. You can configure the rule to alert when utilization is greater than 12,000%, which is 80% of the maximum. If you receive an alert, the resource group is approaching full capacity and subsequent jobs may be queued, so you can upgrade the resource group or reschedule jobs as needed.
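The threshold arithmetic in this example can be sketched as follows. The function names are hypothetical, and the 80% target is an assumption that simply reproduces the 12,000% figure (12,000 / 15,000 = 80%). Integer arithmetic is used so that the percentages stay exact.

```python
def max_utilization_percent(cus: int) -> int:
    """Maximum utilization when every CU is fully used (one full core = 100%)."""
    return cus * 100


def alert_threshold_percent(cus: int, target_percent_of_capacity: int) -> int:
    """Utilization value to alert on, expressed as a share of the maximum."""
    return max_utilization_percent(cus) * target_percent_of_capacity // 100


def should_alert(utilization_percent: float, threshold_percent: int) -> bool:
    """The example rule fires when utilization is greater than the threshold."""
    return utilization_percent > threshold_percent


threshold = alert_threshold_percent(cus=150, target_percent_of_capacity=80)
print(max_utilization_percent(150), threshold)  # 15000 12000
print(should_alert(12500, threshold))           # True
```

To adapt the rule to a quota group of a different size, change `cus` and keep the same target share.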
Log on to the Cloud Monitor console.
In the left navigation pane, click .
On the Alert Rules page, click Create Alert Rule.
On the Create Alert Rule page, configure the parameters for the alert rule based on the scenario. For more information about the parameters, see Create an alert rule. For more information about how to configure alert contacts, see Create an alert contact or an alert contact group.
The following table describes the key parameters for this scenario.
Parameter | Description |
Product | From the drop-down list, select MaxCompute_Subscription. |
Resource Range | From the drop-down list, select Instances. |
Associated Resources | Click Add Instance. On the Add Instance page, select the subscription quota group in the region where your MaxCompute project is located, and then click OK. For more information about quota groups, see Quota management for computing resources. |
Rule Description | Click , and in the Configure Rule Description panel, configure the following parameters: Alert Rule: Enter a name for the alert rule. Metric Type: Select Simple Metric. Metric: From the drop-down list, select the corresponding CPU usage metric. Note: If the added instance is a level 1 quota group, select . If the added instance is a level 2 quota group, select . |
You can also monitor the number of waiting jobs by using the Level 2 quota waiting jobs metric. If CPU usage remains high and many jobs remain waiting for several consecutive statistical periods, you may need to adjust resource allocation.
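The "several consecutive statistical periods" check can be sketched as a streak counter over per-period samples. This is a minimal sketch, assuming that each sample is a hypothetical (CPU metric value, waiting jobs) pair for one period and that "high" means at or above a threshold you choose. The thresholds and sample values below are illustrative, not recommended settings, and the code does not describe how Cloud Monitor evaluates alert rules.

```python
from typing import Iterable, Tuple


def sustained_pressure(
    samples: Iterable[Tuple[float, int]],
    cpu_threshold: float,
    waiting_threshold: int,
    consecutive_periods: int,
) -> bool:
    """Return True if CPU usage and waiting jobs both stay at or above their
    thresholds for `consecutive_periods` statistical periods in a row."""
    streak = 0
    for cpu_value, waiting_jobs in samples:
        if cpu_value >= cpu_threshold and waiting_jobs >= waiting_threshold:
            streak += 1
            if streak >= consecutive_periods:
                return True
        else:
            streak = 0  # one calm period breaks the streak
    return False


# Hypothetical per-minute samples: (CPU metric value, waiting jobs)
history = [(90, 5), (95, 8), (50, 0), (92, 6), (93, 7), (96, 9)]
print(sustained_pressure(history, cpu_threshold=85, waiting_threshold=5, consecutive_periods=3))  # True
```

Requiring both conditions over consecutive periods avoids reacting to a single busy minute. Short spikes in CPU usage without a growing queue usually do not justify resizing a quota.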
Click Confirm to complete the alert rule configuration.
Related documents
To set consumption limits and alerts for pay-as-you-go computing jobs, see Consumption control.