DataWorks: Automated O&M

Last Updated: Mar 26, 2026

Automated operations and maintenance (O&M) lets you encode your incident response playbook as rules that execute automatically. When a configured condition is met — such as a resource group hitting its utilization threshold or a node instance failing — DataWorks acts on your behalf without requiring manual intervention. This reduces after-hours pages and improves pipeline reliability.

How it works

DataWorks automated O&M covers two scenarios:

  • Terminating running instances: When a custom alert rule fires on an exclusive resource group for scheduling, DataWorks terminates the matching node instances. For example, if resource utilization reaches 80% and stays there for 10 minutes, the system automatically stops non-auto-triggered instances with priority 1 or 3 on that resource group.

  • Automated rerun: When a node instance fails without a configured automatic rerun property, or times out, DataWorks reruns it based on the automated rerun rule. The rerun rule applies only to nodes running on a serverless resource group.

These two rule types are independent — termination rules respond to resource pressure, while rerun rules respond to individual instance failures. If you want both behaviors, configure both rule types separately.
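
The trigger in the termination example (utilization reaches 80% and stays there for 10 minutes) is a sustained-threshold check. The Python sketch below models only that idea. It is not DataWorks code: the function name `sustained_above` and the `(timestamp, utilization)` sample format are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def sustained_above(samples, threshold, duration):
    """Return True if utilization stayed at or above `threshold` for the
    whole trailing `duration` window.

    `samples` is a list of (timestamp, utilization) tuples sorted by time.
    This input format is hypothetical, chosen only for the sketch.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - duration
    if samples[0][0] > window_start:
        return False  # not enough history to cover the full window
    recent = [u for t, u in samples if t >= window_start]
    return all(u >= threshold for u in recent)


# Eleven one-minute samples at 85% utilization cover a 10-minute window.
t0 = datetime(2025, 6, 5, 12, 0)
steady = [(t0 + timedelta(minutes=m), 0.85) for m in range(11)]
print(sustained_above(steady, 0.80, timedelta(minutes=10)))
```

A single dip below the threshold inside the window makes the check fail, which is the behavior the phrase "stays there for 10 minutes" suggests.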

Limitations

Permissions: Only Alibaba Cloud accounts, RAM users with the AliyunDataWorksFullAccess policy attached, and workspace administrators can manage automated O&M rules.

Resource group constraints:

  • Termination rules apply only to nodes running on an exclusive resource group for scheduling that has a resource utilization alert rule configured.

  • Automated rerun rules apply only to nodes running on a serverless resource group.

Feature constraints:

  • Multiple termination rules can be associated with the same alert rule.

  • Only one automated rerun rule can be created per workspace.

  • Execution records are available for the previous 30 days.

Go to the Automatic page

  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Development and O&M > Operation Center. Select your workspace from the drop-down list and click Go to Operation Center.

  2. In the left-side navigation pane, choose O&M Assistant > Automatic.

Create a rule

On the Automatic O&M > Rules Management page, you can create two types of automated O&M rules:

| Rule type | Trigger | Applies to |
| --- | --- | --- |
| Terminating running instances | Alert rule fires on resource group usage | Nodes on an exclusive resource group for scheduling |
| Automatic rerun | Node instance fails or times out | Nodes on a serverless resource group |

Each rule has a trigger condition that determines when it fires, filter conditions that scope which instances it affects, and constraints that limit how often it can execute. Nodes in the blacklist are excluded even if they match all other conditions.
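
The way filter conditions and the blacklist interact can be sketched in Python: an instance is affected only if it passes every filter and its node is not blacklisted. The `Instance` and `RuleFilter` classes and their field names are hypothetical and do not reflect the DataWorks data model or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Instance:
    # Hypothetical fields for illustration only.
    node_id: int
    node_name: str
    instance_type: str
    priority: int


@dataclass
class RuleFilter:
    # Hypothetical container for a rule's filter conditions and blacklist.
    instance_types: set
    priorities: set
    blacklist: set = field(default_factory=set)  # node names or IDs as strings

    def matches(self, inst: Instance) -> bool:
        # The blacklist wins: excluded even if every filter matches.
        if inst.node_name in self.blacklist or str(inst.node_id) in self.blacklist:
            return False
        return (inst.instance_type in self.instance_types
                and inst.priority in self.priorities)


rule = RuleFilter({"recurring"}, {1, 3}, blacklist={"critical_etl", "1002"})
print(rule.matches(Instance(1001, "daily_report", "recurring", 1)))
print(rule.matches(Instance(1003, "critical_etl", "recurring", 1)))
```

Checking the blacklist first keeps the exclusion independent of how the filters are configured.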

Create a termination rule

Termination rules stop instances that match custom alert rules when those rules fire. Supported instance types: recurring instances, data backfill instances, test instances, one-time task instances, and manually triggered workflow instances.

| Section | Parameter | Description |
| --- | --- | --- |
| Trigger condition | Associated monitoring rule | The alert rule that triggers this O&M rule. Only alert rules with Object Type set to Schedule Resource and Trigger Condition set to Resource Group Usage can be associated. See how to create a monitoring rule. |
| Filter conditions | Workspace | The workspace where this rule applies. |
| Filter conditions | Instance type | The instance type to act on. |
| Filter conditions | Scheduling cycle | The scheduling frequency to match. Required when Instance Type is Recurring Instance or Data Backfill Instance. |
| Filter conditions | Priority | The priority of instances to act on. Higher values indicate higher priority. |
| Filter conditions | Status | The status of instances to act on. |
| Blacklist | Blacklist | Nodes that match all conditions but should be excluded. Enter the node name or ID to add it. |
| Constraints on rule | Effective period | The time window during which the rule can execute. Instances outside this window are not affected even if all conditions are met. |
| Constraints on rule | Maximum effective times | The maximum number of times the rule can execute. Each execution is checked against the trigger condition before running. |
| Constraints on rule | Minimum effective interval | The minimum time between consecutive executions. |
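
The three rule constraints act as successive gates before each execution. The sketch below is one possible reading, with loudly stated assumptions: the effective period does not cross midnight, and the execution count never resets. The source does not specify either point. `RuleConstraints` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class RuleConstraints:
    # Hypothetical model of the "Constraints on rule" parameters.
    period_start: time
    period_end: time
    max_times: int
    min_interval: timedelta
    executions: int = 0
    last_run: Optional[datetime] = None

    def may_execute(self, now: datetime) -> bool:
        # Assumption: the effective period does not cross midnight.
        if not (self.period_start <= now.time() <= self.period_end):
            return False
        # Assumption: the counter is never reset in this sketch.
        if self.executions >= self.max_times:
            return False
        if self.last_run is not None and now - self.last_run < self.min_interval:
            return False
        return True

    def record_execution(self, now: datetime) -> None:
        self.executions += 1
        self.last_run = now


c = RuleConstraints(time(0, 0), time(8, 0), max_times=2,
                    min_interval=timedelta(minutes=30))
first = datetime(2025, 6, 5, 1, 0)
c.record_execution(first)
print(c.may_execute(first + timedelta(minutes=10)))  # inside the minimum interval
```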

Create an automated rerun rule

Automated rerun rules retry failed instances automatically. The rule fires when:

  • A node instance fails and the automatic rerun property is not configured on the node's Properties tab.

  • A node instance times out.

Supported instance types: recurring instances, data backfill instances, test instances, one-time task instances, and manually triggered workflow instances.

Scope of instances checked:

  • Recurring instances: Only instances with a data timestamp of yesterday are checked. For example, if today is June 5, 2025, the rule checks instances with a data timestamp of June 4, 2025.

  • Other instance types (data backfill, test, one-time task, manually triggered workflow): Instances created today, yesterday, and the day before yesterday are checked. For example, if today is June 5, 2025, instances created on June 3, 4, and 5 are eligible.
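
The two date windows above can be expressed as a small Python check. This is an illustrative sketch, not DataWorks code: the function name `in_rerun_scope` and the instance type strings are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def in_rerun_scope(instance_type: str, today: date,
                   data_timestamp: Optional[date] = None,
                   created_on: Optional[date] = None) -> bool:
    """Return True if an instance falls inside the window the rule checks."""
    if instance_type == "recurring":
        # Recurring instances: data timestamp must be yesterday.
        return data_timestamp == today - timedelta(days=1)
    if created_on is None:
        return False
    # Other types: created today, yesterday, or the day before yesterday.
    return today - timedelta(days=2) <= created_on <= today


today = date(2025, 6, 5)
print(in_rerun_scope("recurring", today, data_timestamp=date(2025, 6, 4)))
print(in_rerun_scope("data_backfill", today, created_on=date(2025, 6, 2)))
```

With today set to June 5, 2025, the first call falls inside the window and the second falls outside it, matching the examples in the list above.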

| Section | Parameter | Description |
| --- | --- | --- |
| Trigger condition | Running status | Fires when a node instance fails without an automatic rerun property configured, or when the instance times out. |
| Filter conditions | Workspace | The workspace where this rule applies. |
| Filter conditions | Instance type | The instance type to act on. |
| Filter conditions | Scheduling cycle | The scheduling frequency to match. Available when Instance Type is Recurring Instance or Data Backfill Instance. |
| Filter conditions | Priority | The priority of instances to act on. Higher values indicate higher priority. |
| Filter conditions | Logs contain keywords | Triggers the rerun when operation logs contain a specific keyword. Valid values: abnormal exit (node process fails to start or exits unexpectedly) and out of memory (node exits due to insufficient memory). The out of memory keyword is supported only for nodes on a serverless resource group. |
| Blacklist | Blacklist | Nodes that match all conditions but should be excluded. Enter the node name or ID to add it. |
| Rerun | Preparation | If the node is a computing node on a serverless resource group, select Add CUs For Computing Tasks to allocate extra compute capacity for the rerun. |
| Rerun | CUs to add | The number of Computing Units (CUs) to add on top of the original instance's allocation. The added CUs are used only for the rerun instance. Set this to prevent reruns from competing for resources with other running nodes. |
| Rerun | Rerun times | The maximum number of retries. Valid values: 1–10. |
| Rerun | Rerun interval | The wait time between retries. Valid values: 3–30 minutes. |
| Constraints on rule | Effective period | The time window during which the rule can execute. Instances outside this window are not retried even if all conditions are met. |
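
The trigger condition and the documented value ranges for Rerun times and Rerun interval can be captured in a short Python sketch. `RerunSettings` and `should_rerun` are hypothetical names, not part of any DataWorks SDK. The sketch also assumes that a timeout triggers the rule regardless of the node's own rerun property, which is how the two trigger bullets read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerunSettings:
    # Hypothetical holder for the "Rerun" parameters of the rule.
    rerun_times: int
    rerun_interval_minutes: int

    def __post_init__(self):
        # Ranges taken from the parameter table: 1-10 retries, 3-30 minutes.
        if not 1 <= self.rerun_times <= 10:
            raise ValueError("rerun_times must be between 1 and 10")
        if not 3 <= self.rerun_interval_minutes <= 30:
            raise ValueError("rerun_interval_minutes must be between 3 and 30")


def should_rerun(failed: bool, timed_out: bool, node_has_auto_rerun: bool) -> bool:
    """Fire on a timeout, or on a failure when the node has no automatic
    rerun property of its own (assumed reading of the trigger condition)."""
    return timed_out or (failed and not node_has_auto_rerun)


settings = RerunSettings(rerun_times=3, rerun_interval_minutes=10)
print(should_rerun(failed=True, timed_out=False, node_has_auto_rerun=True))
```

Validating the ranges when the settings object is built surfaces an out-of-range value early instead of at rerun time.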

Enable or disable a rule

Rules take effect immediately after creation. To disable a rule, click the enable/disable icon for that rule in the Actions column. Click the same icon again to re-enable the rule.

More operations

Manage rules

  • To view a rule, find it in the Rules Management tab and click View in the Actions column.

  • To edit a rule, open it with View and then click Modify at the bottom of the View Rule dialog box.

  • To delete a rule, click Delete in the Actions column and confirm the deletion.

  • To search for a rule by name, use the search box in the upper-left corner of the Rules Management page.

View execution records

The Execution Records tab shows when each rule ran, the rule owner, and how many node instances were affected. Click View Details in the Actions column to see the full execution log.

Note

O&M operations run under the identity of the rule owner. You can trace each automated action in the operation logs of the node instance that triggered the rule.

Termination rule execution records include:

  • Instances waiting for resources/resource usage: A chart showing the number of instances waiting for resources alongside resource group utilization over time. Hover over any point to see the values at that moment.

  • Terminated node instances: The full list of instances whose execution was stopped.

Automated rerun execution records include:

  • Instances that are automatically rerun: A list with Node name, Data timestamp, Instance type, Node type, Owner, and other details for each rerun instance.

Monitor resource groups

After you create an automated O&M rule, DataWorks automatically monitors the resource usage of the associated resource group. For details, see Resource O&M.