DataWorks:Automated O&M

Last Updated:Jun 13, 2025

Automated operations and maintenance (O&M) is an advanced DataWorks feature that helps keep the system running continuously and stably. You can capture your experience in handling data failures as automated O&M rules. When the conditions specified in an automated O&M rule are met, the system automatically performs the O&M operation. This improves service stability and O&M efficiency and reduces the need for night-time maintenance.

Background information

In DataWorks, the automated O&M feature consists of automated termination of running node instances and automated rerun.

  • Automated termination of running node instances

    If a node that runs on an exclusive resource group for scheduling triggers a custom alert rule about resource groups, the system uses the specified automated O&M rule to terminate specific instances generated for the node. For example, if the resource utilization rate of an exclusive resource group for scheduling reaches 80% and persists for 10 minutes, the system automatically terminates the execution of non-auto triggered node instances with priorities 1 and 3 on the exclusive resource group for scheduling.

  • Automated rerun

    A node is automatically rerun based on the automated rerun rule in either of the following scenarios:

    1. The status of the node is Failed and the automatic rerun property is not configured for the node.

    2. The node fails because its run times out.
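The termination example above (utilization at 80% for 10 minutes, then terminate non-auto-triggered instances with priorities 1 and 3) can be sketched as follows. This is only an illustration of the described logic, not DataWorks code: the `Sample` and `Instance` types, their field names, and the way utilization samples are collected are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    """One utilization reading of a resource group (hypothetical shape)."""
    at: datetime
    utilization: float  # fraction between 0.0 and 1.0


@dataclass
class Instance:
    """A running node instance (hypothetical shape)."""
    name: str
    priority: int
    auto_triggered: bool


def sustained_above(samples, threshold, duration, now):
    """Return True if utilization stayed at or above `threshold` for `duration`."""
    if not samples:
        return False
    # Without at least `duration` of history, persistence cannot be confirmed.
    if now - min(s.at for s in samples) < duration:
        return False
    recent = [s for s in samples if s.at >= now - duration]
    return bool(recent) and all(s.utilization >= threshold for s in recent)


def instances_to_terminate(instances, priorities):
    """Select non-auto-triggered instances whose priority is in `priorities`."""
    return [i for i in instances
            if not i.auto_triggered and i.priority in priorities]
```

With one sample per minute over the last 10 minutes, all at 85%, `sustained_above(samples, 0.8, timedelta(minutes=10), now)` returns `True`, and `instances_to_terminate(instances, {1, 3})` then yields the instances that the rule in the example would stop.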

Limits

Go to the Automatic page

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

  2. In the left-side navigation pane, choose O&M Assistant > Automatic.

Create a rule

On the Automatic O&M > Rules Management page, you can create automated O&M rules for Terminating Running Instances and Automatic Rerun. DataWorks performs the O&M operation only on nodes that meet the trigger condition and filter conditions specified in an automated O&M rule. You can configure a blacklist to exclude nodes on which you do not want to perform the O&M operation. The logic of how an automated O&M rule takes effect depends on the constraints that are specified in the rule. You can create and enable different automated O&M rules based on your O&M requirements.
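How the trigger condition, filter conditions, and blacklist combine can be sketched as a single predicate. The sketch below is an assumption-laden illustration rather than the actual DataWorks evaluation logic: the `Rule` and `NodeInstance` types and their fields are hypothetical, and only a subset of the filter conditions is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A simplified automated O&M rule (hypothetical shape)."""
    workspace: str
    instance_types: set
    priorities: set
    blacklist: set = field(default_factory=set)  # node names or IDs to skip


@dataclass
class NodeInstance:
    """A node instance to evaluate against a rule (hypothetical shape)."""
    node_id: str
    node_name: str
    workspace: str
    instance_type: str
    priority: int


def rule_applies(rule, instance, trigger_met):
    """Return True only if the trigger fired, the filters match,
    and the node is not blacklisted."""
    if not trigger_met:
        return False
    # A node can be blacklisted by either its name or its ID.
    if instance.node_id in rule.blacklist or instance.node_name in rule.blacklist:
        return False
    return (
        instance.workspace == rule.workspace
        and instance.instance_type in rule.instance_types
        and instance.priority in rule.priorities
    )
```

Checking the blacklist before the filters keeps the exclusion explicit: a blacklisted node is skipped even if it matches every filter condition.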

Create a rule for terminating running instances

In Automatic O&M, you can configure automated O&M operations to terminate instances that meet custom rules. Supported instances include recurring instances, data backfill instances, test instances, one-time task instances, and manually triggered workflow instances. The main configuration parameters for rules to terminate running instances are as follows:

| Section | Parameter | Description |
| --- | --- | --- |
| Trigger Condition | Associated Monitoring Rule | The alert rule that you want to associate with the automated O&M rule. If the alert rule is triggered, the node instance is automatically terminated. Note: For information about how to create a monitoring rule, see the corresponding DataWorks documentation. You can associate an automated O&M rule only with an alert rule whose Object Type is set to Schedule Resource and Trigger Condition is set to Resource Group Usage. |
| Filter Conditions | Workspace | The name of the workspace to which the automated O&M rule is applied. |
| Filter Conditions | Instance Type | The type of the node instance to which the automated O&M rule is applied. |
| Filter Conditions | Scheduling Cycle | The scheduling frequency of the node instances to which the automated O&M rule is applied. If you set Instance Type to Recurring Instance or Data Backfill Instance, you must configure the Scheduling Cycle parameter. |
| Filter Conditions | Priority | The priority of the node instances to which the automated O&M rule is applied. A larger value indicates a higher priority. |
| Filter Conditions | Status | The status of the node instances to which the automated O&M rule is applied. |
| Blacklist | Blacklist | The nodes that meet the conditions specified in the automated O&M rule but on which you do not want to perform the O&M operation. To add a node to the blacklist, enter the name or ID of the node in the search box. |
| Constraints On Rule | Effective Period | The time range within which the automated O&M rule is effective. The O&M operation is performed only if the conditions specified in the rule are met and the rule is triggered within the effective period. If the rule is triggered outside the effective period, the operation is not performed even if the conditions are met. |
| Constraints On Rule | Maximum Effective Times | The maximum number of times that the automated O&M rule can be triggered, which is the maximum number of times that the rule can be executed. Note: Each time before an automated O&M rule is executed, the system checks whether the trigger condition is met. If the trigger condition is not met, the rule is not executed. |
| Constraints On Rule | Minimum Effective Interval | The minimum interval between two consecutive triggers of the automated O&M rule. |
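The three constraints on a rule can be read as gates that must all pass before the rule is executed. The following sketch illustrates that reading under several assumptions that the source does not state: the effective period is modeled as a daily time-of-day window that does not cross midnight, the execution count is a simple running total, and all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class Constraints:
    """Constraints on a rule (hypothetical shape)."""
    period_start: time                  # start of the effective period
    period_end: time                    # end of the effective period
    max_effective_times: int            # upper bound on executions
    min_effective_interval: timedelta   # minimum gap between executions


def may_execute(c: Constraints, now: datetime, times_executed: int,
                last_executed: Optional[datetime], trigger_met: bool) -> bool:
    """Return True if the rule may be executed at `now`."""
    # The trigger condition is re-checked before every execution.
    if not trigger_met:
        return False
    # Outside the effective period, nothing is executed.
    if not (c.period_start <= now.time() <= c.period_end):
        return False
    # The execution count must stay below the configured maximum.
    if times_executed >= c.max_effective_times:
        return False
    # Two executions must be at least the minimum interval apart.
    if last_executed is not None and now - last_executed < c.min_effective_interval:
        return False
    return True
```

For example, with an effective period of 00:00 to 08:00, a maximum of 3 executions, and a minimum interval of 30 minutes, a rule that last ran at 01:50 is not executed again at 02:00 even if the alert is still firing.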

Create an automated rerun rule

In Automatic O&M, you can configure Automatic Rerun for tasks that meet Trigger Conditions. Instances that will be automatically rerun include recurring instances, data backfill instances, test instances, one-time task instances, and manually triggered workflow instances.

  • When the instance is a recurring instance, automatic rerun only checks instances with a data timestamp of yesterday.

    For example, if the current date is June 5, 2025, only recurring instances with a data timestamp of June 4, 2025 will be automatically rerun after meeting the automatic rerun trigger conditions.

  • When the instance is a data backfill instance, test instance, one-time task instance, or manually triggered workflow instance, automatic rerun checks instances created today, yesterday, and the day before yesterday.

    For example, if the current date is June 5, 2025, then data backfill instances, test instances, one-time task instances, and manually triggered workflow instances created on June 5, June 4, and June 3 will be automatically rerun after meeting the automatic rerun trigger conditions.
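The two date windows above can be expressed as a small check. This is a sketch of the stated rules only. The function name, the instance type labels, and the parameter names are hypothetical and do not correspond to a DataWorks API.

```python
from datetime import date, timedelta

RECURRING = "recurring"  # hypothetical label for recurring instances


def in_rerun_window(instance_type, today, data_timestamp=None, created_on=None):
    """Return True if the instance falls in the window that automatic rerun checks."""
    if instance_type == RECURRING:
        # Recurring instances: only a data timestamp of yesterday qualifies.
        return data_timestamp == today - timedelta(days=1)
    # Other instance types: created today, yesterday, or the day before yesterday.
    if created_on is None:
        return False
    return today - timedelta(days=2) <= created_on <= today
```

With `today = date(2025, 6, 5)`, a recurring instance whose data timestamp is June 4 is in the window, while one with a data timestamp of June 3 is not. A data backfill instance created on June 3 is still in the window, but one created on June 2 is not.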

The main configuration parameters for automatic rerun rules are as follows:

| Section | Parameter | Description |
| --- | --- | --- |
| Trigger Condition | Running Status | An instance is automatically rerun if it fails to run in either of the following cases: (1) The run of the node instance times out. (2) The node instance fails to run and the automatic rerun property is not configured on the Properties tab of the node that generates the instance. |
| Filter Conditions | Workspace | The name of the workspace to which the automated O&M rule is applied. |
| Filter Conditions | Instance Type | The type of the node instance to which the automated O&M rule is applied. |
| Filter Conditions | Scheduling Cycle | The scheduling frequency of the node instances to which the automated O&M rule is applied. If you set Instance Type to Recurring Instance or Data Backfill Instance, you can configure the Scheduling Cycle parameter. |
| Filter Conditions | Priority | The priority of the node instances to which the automated O&M rule is applied. A larger value indicates a higher priority. |
| Filter Conditions | Logs Contain Keywords | The keyword to look for in the operation logs of the node instance. If the operation logs contain the keyword, the automated rerun rule is triggered. Valid values: abnormal exit (the node process fails to start or unexpectedly exits) and out of memory (the node fails to run and exits due to insufficient memory). Note: The out of memory keyword can trigger the automated rerun rule only if the node is run on a serverless resource group. |
| Blacklist | Blacklist | The nodes that meet the conditions specified in the automated O&M rule but on which you do not want to perform the O&M operation. To add a node to the blacklist, enter the name or ID of the node in the search box. |
| Rerun | Preparation | If your node is a computing node that is run on a serverless resource group, select Add CUs For Computing Tasks. Note: Specify the number of CUs added for each rerun to prevent the running of other nodes from being blocked due to competition for resources. |
| Rerun | CUs To Add | The number of CUs added for the rerun instance on top of the CUs consumed by the original node instance. The added CUs are used only for the rerun of the instance. |
| Rerun | Rerun Times | The maximum number of times that an automated rerun can be triggered. Valid values: 1 to 10. |
| Rerun | Rerun Interval | The interval between reruns. Valid values: 3 to 30. Unit: minutes. |
| Constraints on Rule | Effective Period | The time range within which the automated O&M rule is effective. The O&M operation is performed only if the conditions specified in the rule are met and the rule is triggered within the effective period. If the rule is triggered outside the effective period, the operation is not performed even if the conditions are met. |
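The value ranges and the keyword restriction in the table can be checked before a rule is saved. The sketch below is a hypothetical client-side validation, not part of DataWorks: the `RerunSettings` type, the function names, and the non-negative check on added CUs are assumptions, while the ranges (1 to 10 reruns, 3 to 30 minutes) and the serverless restriction for the out of memory keyword come from the table.

```python
from dataclasses import dataclass


@dataclass
class RerunSettings:
    """Rerun section of an automated rerun rule (hypothetical shape)."""
    rerun_times: int
    rerun_interval_minutes: int
    cus_to_add: float = 0.0


def validate(settings):
    """Return a list of problems; an empty list means the settings are in range."""
    errors = []
    if not 1 <= settings.rerun_times <= 10:
        errors.append("Rerun Times must be between 1 and 10")
    if not 3 <= settings.rerun_interval_minutes <= 30:
        errors.append("Rerun Interval must be between 3 and 30 minutes")
    if settings.cus_to_add < 0:  # assumption: added CUs cannot be negative
        errors.append("CUs To Add must not be negative")
    return errors


def keyword_can_trigger(keyword, on_serverless_group):
    """Apply the keyword rules: out of memory counts only on serverless groups."""
    if keyword == "out of memory":
        return on_serverless_group
    return keyword == "abnormal exit"


def rerun_cus(original_cus, settings):
    """CUs for a rerun: the original consumption plus the added CUs."""
    return original_cus + settings.cus_to_add
```

For instance, `validate(RerunSettings(rerun_times=3, rerun_interval_minutes=5))` returns an empty list, whereas a rerun count of 0 or an interval of 60 minutes is reported as out of range.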

Enable or disable a rule

By default, an automated O&M rule takes effect immediately after it is created. To disable or re-enable the rule, click the corresponding icon in the Actions column of the rule.

More operations

Manage rules

  • If you want to view the information about an automated O&M rule, find the desired rule in the automated O&M rule list of the Rules Management tab and click View in the Actions column.

  • If you want to modify the definition of an automated O&M rule, click Modify at the bottom of the View Rule dialog box.

  • If you want to delete an automated O&M rule, find the desired rule in the automated O&M rule list and click Delete in the Actions column. In the dialog box that appears, click OK.

  • In the search box in the upper-left corner of the Rules Management page, you can enter the name of an automated O&M rule to search for the rule.

View the execution records of a rule

The Execution Records page displays the execution information about automated O&M rules, including the time when the rules are executed, the rule owners, and the number of node instances to which the rules are applied. If you want to view the detailed execution information about a rule, click View Details in the Actions column of the rule.

Note

When the conditions specified in an automated O&M rule are met, the O&M operation is performed under the identity of the rule owner. You can view the O&M operation in the operation logs of the node instance that triggers the automated O&M rule.

  • The execution records of an automated O&M rule about automated termination of running node instances include the following information:

    • Instances Waiting For Resources/Resource Usage: This section provides a chart that displays the number of node instances that are waiting for resources and the resource usage of the desired resource group. You can move the pointer over a point in the chart to view the number of node instances that are waiting for resources and the resource usage of the desired resource group at the related point in time.

    • Terminated Node Instances: This section displays all the node instances whose running is terminated.

  • The execution records of an automated O&M rule about automated rerun of node instances include the following information:

    • Instances That Are Automatically Rerun: This section displays the number of node instances that are automatically rerun, and the Node Name, Data Timestamp, Instance Type, Node Type, Owner, and other information of each instance.

Monitor resource groups

After you create an automated O&M rule, the system automatically monitors the resource usage of the resource group specified in the automated O&M rule. For more information about resource group monitoring, see Resource O&M.