Automated operations and maintenance (O&M) is an advanced feature provided by DataWorks to ensure the continuous and stable operation of the system. You can encode your experience in handling data failures as automated O&M rules. When the conditions specified in an automated O&M rule are met, the system automatically performs the O&M operation. This improves service stability and O&M efficiency and reduces the frequency of night-time maintenance.
Background information
In DataWorks, the automated O&M feature consists of automated termination of running node instances and automated rerun.
Automated termination of running node instances
If a node that runs on an exclusive resource group for scheduling triggers a custom alert rule about resource groups, the system uses the specified automated O&M rule to terminate specific instances generated for the node. For example, if the resource utilization rate of an exclusive resource group for scheduling reaches 80% and persists for 10 minutes, the system automatically terminates the execution of non-auto triggered node instances with priorities 1 and 3 on the exclusive resource group for scheduling.
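The example condition above (utilization at or above 80% that persists for 10 minutes) can be illustrated with a minimal sketch. This is not DataWorks code: the function names, the sample format, and the `auto_triggered` and `priority` fields are hypothetical and serve only to show the logic of a sustained-threshold check followed by instance selection.

```python
from datetime import datetime, timedelta


def threshold_sustained(samples, threshold=0.80, duration=timedelta(minutes=10)):
    """Return True if utilization stayed at or above `threshold` for `duration`.

    `samples` is a time-ordered list of (timestamp, utilization) pairs.
    The sample format is an assumption made for this illustration.
    """
    streak_start = None
    for ts, utilization in samples:
        if utilization >= threshold:
            if streak_start is None:
                streak_start = ts  # first sample of a new above-threshold streak
            if ts - streak_start >= duration:
                return True
        else:
            streak_start = None  # any dip below the threshold resets the streak
    return False


def select_instances_to_terminate(instances, priorities=(1, 3)):
    """Pick non-auto triggered instances whose priority is in `priorities`.

    Each instance is a dict with hypothetical keys `auto_triggered` and `priority`.
    """
    return [
        inst for inst in instances
        if not inst["auto_triggered"] and inst["priority"] in priorities
    ]
```

In this sketch a single sample below the threshold resets the 10-minute window. How DataWorks evaluates persistence internally is not described here, so treat the reset behavior as an assumption.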
Automated rerun
A node is automatically rerun based on the automated rerun rule in the following scenarios:
1. The status of the node is Failed and the automatic rerun property is not configured for the node.
2. The node fails because it times out.
Limits
Limits on permissions: Only Alibaba Cloud accounts, RAM users to which the AliyunDataWorksFullAccess policy is attached, and workspace administrators can manage automated O&M rules.
Limits on resource groups:
Automated O&M rules for automated termination of running node instances take effect only for nodes that run on an exclusive resource group for scheduling and for which an alert rule about the resource utilization rate of that exclusive resource group for scheduling is configured.
Automated rerun rules take effect only for nodes that run on a serverless resource group.
Limits on features:
You can associate multiple automated O&M rules about automated termination of running node instances with the same alert rule.
You can create only one automated rerun rule in each workspace.
You can view the execution records that are generated for automated O&M rules within the previous 30 days.
Go to the Automatic page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.
In the left-side navigation pane, choose .
Create a rule
On the page, you can create automated O&M rules for Terminating Running Instances and Automatic Rerun. DataWorks performs the O&M operation only on nodes that meet the trigger condition and filter conditions specified in an automated O&M rule. You can configure a blacklist to exclude nodes on which you do not want to perform the O&M operation. The logic of how an automated O&M rule takes effect depends on the constraints that are specified in the rule. You can create and enable different automated O&M rules based on your O&M requirements.
Create a rule for terminating running instances
In Automatic O&M, you can configure automated O&M operations to terminate instances that meet custom rules. Supported instances include recurring instances, data backfill instances, test instances, one-time task instances, and manually triggered workflow instances. The main configuration parameters for rules to terminate running instances are as follows:
Section | Parameter | Description |
Trigger Condition | Associated Monitoring Rule | The alert rule that you want to associate with the automated O&M rule. If the alert rule is triggered, the node instance is automatically terminated. |
Filter Conditions | Workspace | The name of the workspace to which the automated O&M rule is applied. |
Filter Conditions | Instance Type | The type of the node instance to which the automated O&M rule is applied. |
Filter Conditions | Scheduling Cycle | The scheduling frequency of the node instances to which the automated O&M rule is applied. If you set Instance Type to Recurring Instance or Data Backfill Instance, you must configure the Scheduling Cycle parameter. |
Filter Conditions | Priority | The priority of the node instances to which the automated O&M rule is applied. A larger value indicates a higher priority. |
Filter Conditions | Status | The status of the node instances to which the automated O&M rule is applied. |
Filter Conditions | Blacklist | The nodes that meet the conditions specified in the automated O&M rule but on which you do not want to perform the O&M operation. To add a node to the blacklist, enter the name or ID of the node in the search box. |
Constraints On Rule | Effective Period | The time range within which the automated O&M rule is effective. The automated O&M operation is performed only when the conditions specified in the automated O&M rule are met and the rule is triggered during the effective time period. If an automated O&M rule is triggered beyond the effective time period, the automated O&M operation is not performed even if the conditions specified in the rule are met. |
Constraints On Rule | Maximum Effective Times | The maximum number of times that the automated O&M rule can be triggered, which is the maximum number of times that the rule can be executed. Note: Each time before an automated O&M rule is executed, the system checks whether the trigger condition is met. If the trigger condition is not met, the automated O&M rule is not executed. |
Constraints On Rule | Minimum Effective Interval | The minimum interval at which the automated O&M rule can be triggered. |
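The way the three Constraints On Rule parameters combine can be sketched as a simple gate that is evaluated before each execution. This is an illustrative model only: the class and field names are hypothetical, the effective period is assumed to be a time-of-day window that does not cross midnight, and when (or whether) the execution counter resets is not specified in this document.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class RuleConstraints:
    """Hypothetical model of the Constraints On Rule parameters."""
    effective_start: time               # start of Effective Period (assumed daily window)
    effective_end: time                 # end of Effective Period
    max_effective_times: int            # Maximum Effective Times
    min_effective_interval: timedelta   # Minimum Effective Interval
    times_executed: int = 0
    last_executed: Optional[datetime] = None

    def may_execute(self, now: datetime, trigger_met: bool) -> bool:
        # The trigger condition is re-checked before every execution.
        if not trigger_met:
            return False
        # Outside the effective period nothing runs, even if the conditions are met.
        if not (self.effective_start <= now.time() <= self.effective_end):
            return False
        # Stop once the rule has run the maximum number of times.
        if self.times_executed >= self.max_effective_times:
            return False
        # Enforce the minimum spacing between two executions.
        if (self.last_executed is not None
                and now - self.last_executed < self.min_effective_interval):
            return False
        return True

    def record_execution(self, now: datetime) -> None:
        self.times_executed += 1
        self.last_executed = now
```

For example, a rule limited to the 00:00 to 06:00 window, two executions, and a 30-minute interval would be modeled as `RuleConstraints(time(0, 0), time(6, 0), 2, timedelta(minutes=30))`.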
Create an automated rerun rule
In Automatic O&M, you can configure Automatic Rerun for tasks that meet Trigger Conditions. Instances that will be automatically rerun include recurring instances, data backfill instances, test instances, one-time task instances, and manually triggered workflow instances.
When the instance is a recurring instance, automatic rerun only checks instances with a data timestamp of yesterday.
For example, if the current date is June 5, 2025, only recurring instances with a data timestamp of June 4, 2025 will be automatically rerun after meeting the automatic rerun trigger conditions.
When the instance is a data backfill instance, test instance, one-time task instance, or manually triggered workflow instance, automatic rerun checks instances created today, yesterday, and the day before yesterday.
For example, if the current date is June 5, 2025, then data backfill instances, test instances, one-time task instances, and manually triggered workflow instances created on June 5, June 4, and June 3 will be automatically rerun after meeting the automatic rerun trigger conditions.
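The two date windows described above can be expressed as a short check. This is a sketch under stated assumptions: the function name, the `"recurring"` type label, and the parameter names are hypothetical, and dates are compared as calendar dates without time zone handling.

```python
from datetime import date, timedelta

RECURRING = "recurring"  # hypothetical label for recurring instances


def is_in_rerun_window(instance_type, today, data_timestamp=None, created_on=None):
    """Return True if an instance falls inside the automatic rerun check window.

    Recurring instances: the data timestamp must be yesterday.
    All other supported instance types: the creation date must be today,
    yesterday, or the day before yesterday.
    """
    if instance_type == RECURRING:
        return data_timestamp == today - timedelta(days=1)
    if created_on is None:
        return False
    return today - timedelta(days=2) <= created_on <= today
```

With `today = date(2025, 6, 5)`, a recurring instance qualifies only when its data timestamp is June 4, 2025, and a data backfill instance qualifies when it was created on June 3, June 4, or June 5, which matches the examples in the text.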
The main configuration parameters for automatic rerun rules are as follows:
Section | Parameter | Description |
Trigger Condition | Running Status | The specified instance will be automatically rerun when it meets the following conditions and fails to run: |
Filter Conditions | Workspace | The name of the workspace to which the automated O&M rule is applied. |
Filter Conditions | Instance Type | The type of the node instance to which the automated O&M rule is applied. |
Filter Conditions | Scheduling Cycle | The scheduling frequency of the node instances to which the automated O&M rule is applied. If you set Instance Type to Recurring Instance or Data Backfill Instance, you can configure the Scheduling Cycle parameter. |
Filter Conditions | Priority | The priority of the node instances to which the automated O&M rule is applied. A larger value indicates a higher priority. |
Filter Conditions | Logs Contain Keywords | The keyword that you want to identify in the operation logs of the node instance. If the operation logs of the node instance contain the keyword, the automated rerun rule is automatically triggered. The valid values are Note The automated rerun rule can be triggered for a node whose operation logs contain the |
Blacklist | Blacklist | The nodes that meet the conditions specified in the automated O&M rule but on which you do not want to perform the O&M operation. To add a node to the blacklist, enter the name or ID of the node in the search box. |
Rerun | Preparation | If your node is a computing node that is run on a serverless resource group, select Add CUs For Computing Tasks. Note: Specify the number of CUs added for each rerun to prevent the running of other nodes from being blocked due to competition for resources. |
Rerun | CUs To Add | In addition to the CUs consumed by the original node instance, add the specified CUs for the rerun instance. The added CUs are used only for the rerun of the instance. |
Rerun | Rerun Times | The maximum number of times that an automated rerun can be triggered. Valid values: 1 to 10. Unit: times. |
Rerun | Rerun Interval | The interval between reruns. Valid values: 3 to 30. Unit: minutes. |
Constraints On Rule | Effective Period | The time range within which the automated O&M rule is effective. The automated O&M operation is performed only when the conditions specified in the automated O&M rule are met and the rule is triggered during the effective time period. If an automated O&M rule is triggered beyond the effective time period, the automated O&M operation is not performed even if the conditions specified in the rule are met. |
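If you keep rule definitions in your own tooling before entering them in the console, the documented ranges for Rerun Times (1 to 10) and Rerun Interval (3 to 30 minutes) can be checked up front. The sketch below is a hypothetical local validator, not a DataWorks API; the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerunSettings:
    """Hypothetical container for the Rerun Times and Rerun Interval parameters."""
    rerun_times: int             # maximum number of automated reruns
    rerun_interval_minutes: int  # interval between reruns, in minutes

    def __post_init__(self):
        # Ranges taken from the parameter table: 1 to 10 times, 3 to 30 minutes.
        if not 1 <= self.rerun_times <= 10:
            raise ValueError("Rerun Times must be between 1 and 10")
        if not 3 <= self.rerun_interval_minutes <= 30:
            raise ValueError("Rerun Interval must be between 3 and 30 minutes")
```

Constructing `RerunSettings(3, 5)` succeeds, whereas `RerunSettings(0, 5)` or `RerunSettings(3, 31)` raises `ValueError`.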
Enable or disable a rule
By default, an automated O&M rule takes effect immediately after the rule is created. To disable the rule, click the icon in the Actions column of the rule.
More operations
Manage rules
If you want to view the information about an automated O&M rule, find the desired rule in the automated O&M rule list of the Rules Management tab and click View in the Actions column.
If you want to modify the definition of an automated O&M rule, click Modify at the bottom of the View Rule dialog box.
If you want to delete an automated O&M rule, find the desired rule in the automated O&M rule list and click Delete in the Actions column. In the dialog box that appears, click OK.
In the search box in the upper-left corner of the Rules Management page, you can enter the name of an automated O&M rule to search for the rule.
View the execution records of a rule
The Execution Records page displays the execution information about automated O&M rules, including the time when the rules are executed, the rule owners, and the number of node instances to which the rules are applied. If you want to view the detailed execution information about a rule, click View Details in the Actions column of the rule.
When the conditions specified in an automated O&M rule are met, the O&M operation is performed under the identity of the rule owner. You can view the O&M operation in the operation logs of the node instance that triggers the automated O&M rule.
The execution records of an automated O&M rule about automated termination of running node instances include the following information:
Instances Waiting For Resources/Resource Usage: This section provides a chart that displays the number of node instances that are waiting for resources and the resource usage of the desired resource group. You can move the pointer over a point in the chart to view the number of node instances that are waiting for resources and the resource usage of the desired resource group at the related point in time.
Terminated Node Instances: This section displays all the node instances whose running is terminated.
The execution records of an automated O&M rule about automated rerun of node instances include the following information:
Instances That Are Automatically Rerun: This section displays the number of node instances that are automatically rerun, and the Node Name, Data Timestamp, Instance Type, Node Type, Owner, and other information of each instance.
Monitor resource groups
After you create an automated O&M rule, the system automatically monitors the resource usage of the resource group specified in the automated O&M rule. For more information about resource group monitoring, see Resource O&M.