MaxCompute lets you monitor job runtimes by configuring threshold-based alert rules. If a job's runtime exceeds the specified threshold, the system sends an alert notification to the designated alert contact. This helps you quickly identify abnormal jobs and improves O&M efficiency. This topic describes the monitoring metrics for job timeout alerts, how to configure them, and how to handle the alerts.
Monitoring metrics
The following metrics are used to monitor job runtimes.
Job runtime
This metric monitors all jobs within a MaxCompute project. If a job's total runtime, which includes the wait time, exceeds the specified threshold, the system sends an alert notification to the specified alert contact.
Scenario: This metric is suitable for MaxCompute projects that analysts use to retrieve data, where jobs typically have short runtimes. You can configure this metric in advance. If a job runs for too long, you can promptly check for issues, such as resource contention or an excessive computational load.
Job runtime_SQL type
This metric monitors all SQL jobs within a MaxCompute project. If an SQL job's total runtime, which includes the wait time, exceeds the specified threshold, the system sends an alert notification to the specified alert contact.
Scenario: This metric is suitable for production projects. You can configure this metric in advance. If a job runs for too long, you can handle the timeout issue promptly to prevent business delays.
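The threshold check that both metrics describe can be sketched as follows. This is an illustrative model only, not how CloudMonitor evaluates rules internally. The `Job` class, its field names, the sample instance IDs, and the `exceeds_threshold` helper are all hypothetical. The only facts taken from this topic are that the total runtime includes the wait time and that the second metric considers SQL jobs only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions for illustration."""
    instance_id: str
    job_type: str         # "SQL" for SQL jobs; other values are placeholders
    wait_seconds: float   # time spent waiting before execution
    run_seconds: float    # time spent executing

    @property
    def total_runtime(self) -> float:
        # Total runtime includes the wait time, as both metrics specify.
        return self.wait_seconds + self.run_seconds


def exceeds_threshold(job: Job, threshold_seconds: float, sql_only: bool = False) -> bool:
    """Return True if the job's total runtime is above the threshold.

    sql_only=False models "Job runtime" (all jobs in the project).
    sql_only=True models "Job runtime_SQL type" (SQL jobs only).
    """
    if sql_only and job.job_type != "SQL":
        return False
    return job.total_runtime > threshold_seconds


# Made-up sample data.
jobs = [
    Job("inst-a", "SQL", wait_seconds=500, run_seconds=200),
    Job("inst-b", "OTHER", wait_seconds=10, run_seconds=900),
    Job("inst-c", "SQL", wait_seconds=5, run_seconds=30),
]

all_alerts = [j.instance_id for j in jobs if exceeds_threshold(j, 600)]
sql_alerts = [j.instance_id for j in jobs if exceeds_threshold(j, 600, sql_only=True)]
print(all_alerts)  # ['inst-a', 'inst-b']
print(sql_alerts)  # ['inst-a']
```

In this sample, `inst-a` spends only 200 seconds executing but still crosses a 600-second threshold, because its 500 seconds of wait time count toward the total.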
Regions and permissions
Supported regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Ulanqab), China (Chengdu), China (Hong Kong), US (Silicon Valley), US (Virginia), Malaysia (Kuala Lumpur), Japan (Tokyo), Germany (Frankfurt), Indonesia (Jakarta), UK (London), and Singapore.
Permission configuration: If a Resource Access Management (RAM) user needs to configure monitoring and alerting, you must grant the AliyunCloudMonitorFullAccess and AliyunDataWorksFullAccess policies to the RAM user in the RAM console. These policies are required in addition to the standard permissions for CloudMonitor. For more information, see Grant permissions to a RAM user.
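As a quick sanity check before a RAM user starts configuring alerts, the policy requirement above can be expressed as a set comparison. This is a minimal sketch: the `missing_policies` helper is hypothetical, and it assumes you have already obtained the list of policy names attached to the RAM user by some other means (for example, by reading them from the RAM console).

```python
# The two policies named in this topic.
REQUIRED_POLICIES = {"AliyunCloudMonitorFullAccess", "AliyunDataWorksFullAccess"}


def missing_policies(attached):
    """Return the required policy names that are not in `attached`, sorted."""
    return sorted(REQUIRED_POLICIES - set(attached))


# Example with a made-up list of attached policies.
print(missing_policies(["AliyunCloudMonitorFullAccess"]))
# ['AliyunDataWorksFullAccess']
```

An empty result means that both policies are present.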
Configure an alert rule
Activate the Alibaba Cloud CloudMonitor service.
Log on to the CloudMonitor console.
Create an alert contact.
In the navigation pane on the left, go to the Alert Contacts page.
On the Alert Contacts page, click the Alert Contacts tab.
Click Create Alert Contact. In the Set Alert Contact window, enter the required information.
For more information about how to create an alert contact, see Create an alert contact or alert contact group.
Create an alert rule.
In the navigation pane on the left, go to the Alert Rules page.
On the Alert Rules page, click Create Alert Rule.
In the Create Alert Rule dialog box, for Product, select MaxCompute_Common.
For more information about other parameter settings for alert rules, see Metric description.
Handle an alert
If a job's runtime exceeds the specified threshold, an alert is triggered and the alert contact receives an alert notification. The alert contact can then handle the alert as follows:
Log on to the MaxCompute console and select a region in the top-left corner.
In the navigation pane on the left, go to the Job O&M page.
Use the InstanceID from the alert notification to find the job that timed out.
(Optional) If the job is still running, determine whether it needs to continue. If necessary, you can stop the job. For more information, see Job O&M.
If the job was submitted through a DataWorks node (that is, the ExtPlantFrom value for the instance is DataWorks):
Go to the DataWorks Operation Center, view the job details, and handle the timeout issue as needed. For more information, see Manage auto triggered tasks.
If the job was not submitted through a DataWorks node:
On the Job O&M page, in the Instance list area, click LogView in the Actions column to view detailed information about the job and troubleshoot the timeout issue. For more information, see Using Logview 2.0 to view job runtime information.
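The branching in the handling steps above can be summarized in a small routing function. This is a sketch for illustration only: the `next_steps` function, the dictionary layout, and the `is_running` key are hypothetical, and the only real field name used is `ExtPlantFrom`, as referenced in this topic. The function does not call any MaxCompute or DataWorks API; it only returns the actions that an operator would take.

```python
from typing import Dict, List

STOP_CHECK = "Decide whether the job needs to continue; stop it if necessary."
DATAWORKS = "Go to the DataWorks Operation Center and view the job details."
LOGVIEW = "On the Job O&M page, click LogView in the Actions column."


def next_steps(instance: Dict[str, object]) -> List[str]:
    """Map a timed-out instance to the handling steps described above.

    `instance` is a hypothetical dict, for example:
    {"InstanceID": "...", "ExtPlantFrom": "DataWorks", "is_running": True}
    """
    steps = []
    # Optional step: applies only while the job is still running.
    if instance.get("is_running"):
        steps.append(STOP_CHECK)
    # Route by submission source.
    if instance.get("ExtPlantFrom") == "DataWorks":
        steps.append(DATAWORKS)
    else:
        steps.append(LOGVIEW)
    return steps


# Made-up examples.
print(next_steps({"ExtPlantFrom": "DataWorks", "is_running": True}))
print(next_steps({"ExtPlantFrom": None, "is_running": False}))
```

The first call returns the stop check followed by the DataWorks step. The second returns only the LogView step, because the job has already finished and was not submitted through a DataWorks node.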