
DataWorks: Node scheduling

Last Updated:May 06, 2025

In most cases, nodes and workflows in the Workspace Directories section of the DATA STUDIO pane in the DataWorks console need to be periodically scheduled. To enable the system to periodically schedule a node or workflow, you must configure scheduling properties for the node or workflow on the Properties tab. The scheduling properties include the scheduling cycle, scheduling dependencies, and scheduling parameters. This topic provides an overview of the configuration of scheduling properties.

Prerequisites

  • A node is created. Data development in DataWorks is based on nodes. Tasks of different types of compute engines are encapsulated into different types of nodes in DataWorks. You can select a specific type of node for data development based on your business requirements. For more information, see the topics in the Node development directory.

  • The Periodic Scheduling switch is turned on. A node can be automatically scheduled based on its scheduling properties only if Periodic Scheduling is turned on for the workspace to which the node belongs. You can turn on the switch on the Scheduling Settings tab of the Settings page in Data Studio. For more information, see System Settings of Data Studio.

Precautions

  • Scheduling configurations define the scheduling properties used to run a node. The node can be scheduled based on the scheduling properties only after the node is deployed to the production environment.

  • The scheduling time specified for a node in Data Studio is the expected running time of an instance that is generated for the node. The actual running time of the instance is also affected by the running status of the instances generated for the ancestor nodes of the node. For information about the conditions that must be met before a node starts to run, see Use the Intelligent Diagnosis feature.

  • DataWorks allows you to configure scheduling dependencies between nodes that have different scheduling frequencies. Before you do so, we recommend that you read the Principles and samples of scheduling configurations in complex dependency scenarios topic.

  • In DataWorks, an auto triggered node generates instances based on the scheduling frequency and the number of scheduling cycles of the node. For example, the number of instances generated for a node scheduled by hour every day is the same as the number of scheduling cycles of the node every day. The node is run as an instance.

  • If you configure scheduling parameters, the input parameters in the code of an auto triggered node in each scheduling cycle are determined by the scheduling time of the node in the specific scheduling cycle and the expressions of the scheduling parameters. For information about the replacement relationship between input parameters in node code and configurations of scheduling parameters, see Supported formats of scheduling parameters.

  • A workflow contains a workflow node and inner nodes, and the dependencies between these nodes can be complex. This topic describes only how to configure dependencies and other scheduling settings for a single node. For information about dependencies between workflows, see Auto triggered workflow.

Go to the Properties tab

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. Go to the Properties tab.

    1. In the Workspace Directories section of the DATA STUDIO pane, find the desired node and click the node name to go to the configuration tab of the node.

    2. In the right-side navigation pane of the configuration tab of the node, click Properties.

Configure scheduling properties for the node

On the Properties tab, you can configure parameters in the Scheduling Parameters, Scheduling Policies, Scheduling Time, Scheduling Dependencies, and Node Output Parameters sections.

(Optional) Scheduling Parameters

If you define a variable when you edit node code, you must assign a value to the variable in the Scheduling Parameters section.

Scheduling parameters are automatically replaced with specific values based on the data timestamps of nodes and value formats of scheduling parameters. This enables dynamic parameter configuration for node scheduling.

Configure parameters in the Scheduling Parameters section

You can use one of the following methods to define scheduling parameters.


Add a parameter

You can click Add Parameter to configure multiple scheduling parameters for a node.

  • You can manually assign a value to an added scheduling parameter. For more information, see Supported formats of scheduling parameters.

  • You can also click the icon in the Actions column of an added scheduling parameter to associate the scheduling parameter with an output parameter of the ancestor node of the current node.


Load parameters in node code

After you click Load Parameters in Code, DataWorks identifies the names of the variables defined in the code of the current node and adds the identified variable names to the Parameters section.

Note

In most cases, custom variables are defined in the format of ${Custom variable name} in the code.

For PyODPS nodes or common Shell nodes, variables are defined by using a different method from that for other types of nodes. For more information about how to define custom variables for different types of nodes, see Configure scheduling parameters for different types of nodes.

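As a rough illustration of the variable replacement described above, the following sketch mimics how a `${variable}` placeholder in node code is filled with the value assigned to the matching scheduling parameter. The `render_sql` helper, the sample query, and the `bizdate` variable are hypothetical; the actual replacement is performed by the DataWorks scheduling system before the node runs.

```python
# Illustrative sketch of scheduling-parameter substitution.
# `render_sql` is hypothetical; it only mimics the replacement that the
# scheduling system performs on node code before a run.

def render_sql(template, params):
    """Replace each ${name} placeholder with its assigned value."""
    for name, value in params.items():
        template = template.replace("${%s}" % name, value)
    return template

sql = "SELECT * FROM sales WHERE ds = '${bizdate}';"
print(render_sql(sql, {"bizdate": "20250505"}))
# SELECT * FROM sales WHERE ds = '20250505';
```

In a PyODPS or Shell node, the same parameter reaches the code through a different mechanism (see Configure scheduling parameters for different types of nodes), but the substitution idea is the same.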

Supported formats of scheduling parameters

For more information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters.

View the configurations of scheduling parameters in Operation Center in the production environment

To prevent unexpected configurations of scheduling parameters from affecting the running of an auto triggered node, we recommend that you check the configurations of the scheduling parameters for the auto triggered node on the Auto Triggered Nodes page in Operation Center in the production environment after the auto triggered node is deployed. For information about how to view an auto triggered node, see View and manage auto triggered nodes.

Scheduling Policies

In the Scheduling Policies section, you can define various information about an auto triggered node, such as the instance generation mode, scheduling type, computing resource, and resource group.


Instance Generation Mode

After an auto triggered node is committed and deployed to the scheduling system in the production environment, DataWorks generates auto triggered node instances for the node. The auto triggered instances are scheduled based on the value of the Instance Generation Mode parameter. Valid values of the Instance Generation Mode parameter:

  • Next Day: Instances generated for a node are automatically scheduled on the next day after you deploy the node to the production environment. You can view the status of the instances on the Auto Triggered Instances page in Operation Center. If you want to run the node on the day when you deploy the node, you can backfill data for the node. If you select the previous day as the data timestamp when you configure settings related to data backfill for the node, data backfill instances generated for the node are run in the same manner as the instances that are scheduled to run on the current day.

  • Immediately After Deployment: Instances generated for a node are automatically scheduled on the day you deploy the node to the production environment. You can view the status of the instances on the Auto Triggered Instances page in Operation Center. If you select this value when you create a node, whether the node generates instances on the current day and whether the instances are dry run depends on the scheduling time and deployment time of the node. If you change the scheduling frequency of a node that is deployed to the production environment, DataWorks automatically replaces the instances that are generated but not yet run based on the latest scheduling configurations of the node and retains the expired instances.

Scheduling Type

  • Normally Scheduled

    • Use scenario: You want a node and the instances that are generated for this node to be run as expected.

    • Impact: If you set the Scheduling Type parameter to Normally Scheduled, the node is run and generates data based on the settings of the scheduling cycle and scheduling time.

      After the node is run as expected, the descendant nodes of the node are also triggered and run. By default, the Scheduling Type parameter is set to Normally Scheduled.

  • Suspend Scheduling

    • Use scenario: You want to freeze a node and the instances generated for the node. In this case, the current node and its descendant nodes cannot be run.

      If you do not need to run a workflow within a specified period of time, you can set the Scheduling Type parameter to Suspend Scheduling for the workflow to freeze the root node of the workflow in that period of time based on your business requirements. You can also unfreeze the root node to resume the workflow based on your business requirements. For information about how to unfreeze a node, see Node freezing and unfreezing.

    • Impact: If you set the Scheduling Type parameter to Suspend Scheduling, the node is scheduled based on the settings of the scheduling cycle and scheduling time. However, the status of the node changes to frozen and the node generates no data.

      When the node is scheduled, the system directly returns a failure response and the descendant nodes of the node cannot be run.

  • Dry-run

    • Use scenario: You want to suspend a node for a certain period of time and require that its descendant nodes be run as expected.

    • Impact: If you set the Scheduling Type parameter to Dry-run, the node is scheduled based on the settings of the scheduling cycle and scheduling time. However, the node performs a dry run and generates no data.

      When the node is scheduled, the scheduling system returns a success response. However, the running duration of the node is 0 seconds, and no run logs are generated for the node. The dry-run node does not affect the running of its descendant nodes and occupies no resources.

Timeout Period

You can configure the Timeout Period parameter to specify a timeout period for a node. If the period of time for which the node is run exceeds the specified timeout period, the node fails. Take note of the following items when you configure this parameter:

  • The timeout period applies to auto triggered instances, data backfill instances, and test instances.

  • The default timeout period ranges from 72 hours to 168 hours. The system adjusts the default timeout period for a node based on the system load.

  • You can customize a timeout period that does not exceed 168 hours.
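The timeout behavior can be sketched as a simple check (a hypothetical helper; elapsed time in seconds is an assumption made for the example):

```python
# Sketch: a run that exceeds the configured timeout period is marked as
# failed, mirroring the Timeout Period behavior described above.

def check_timeout(elapsed_seconds, timeout_hours):
    return "failed" if elapsed_seconds > timeout_hours * 3600 else "running"

print(check_timeout(169 * 3600, 168))  # failed
print(check_timeout(10 * 3600, 168))   # running
```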

Support For Rerun

You can configure this parameter to define the rerun property of a node.

This parameter is required. Valid values of this parameter:

  • Allow Regardless of Running Status: If you want the same result to be returned each time a node is rerun, regardless of whether the last run is successful, you can set the parameter to this value.

  • Allow Upon Failure Only: If you want the same result to be returned after a failed node is rerun and different results to be returned after a successful node is rerun, you can set the parameter to this value.

  • Disallow Regardless of Running Status: If you want different results to be returned each time a data synchronization node or another type of node is rerun, regardless of whether the last run is successful, you can set the parameter to this value.

    Note
    • If you set the Support For Rerun parameter to Disallow Regardless of Running Status for a node and an exception occurs in the system, the system does not automatically rerun the node after the system recovers from the exception.

    • The Auto Rerun Upon Failure parameter is not displayed if you set the Support For Rerun parameter to Disallow Regardless of Running Status.

Auto Rerun Upon Failure

If you enable this feature, the scheduling system automatically reruns the related node based on the number of reruns and the rerun interval after the node fails. Take note that the scheduling system does not automatically rerun a node if you manually terminate the node.

  • Rerun Times: The default number of times that an auto triggered node is rerun after it fails to be run as scheduled.

    Valid values: 1 to 10. The value 1 indicates that the node is rerun once after it fails to run as expected. The value 10 indicates that the node is rerun ten times after it fails to run as expected. You can configure this parameter based on your business requirements.

  • Rerun Interval: The interval at which a node is rerun after it fails to be run as scheduled. You can configure this parameter based on your requirements. Valid values: 1 to 30. Default value: 30. Unit: minutes.

Note
  • You can specify the default number of reruns and default rerun interval for the nodes in a workspace on the Scheduling Settings tab. For more information, see System Settings of Data Studio.

  • The automatic rerun feature does not take effect if a node fails because the timeout period is exceeded.
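The rerun behavior can be sketched as a retry loop (illustrative only; `run_with_reruns` and the flaky task are hypothetical, and the real scheduler waits for the rerun interval between attempts):

```python
# Sketch: rerun a failed task up to `rerun_times` additional attempts,
# mirroring the Auto Rerun Upon Failure settings described above.

def run_with_reruns(task, rerun_times):
    attempts = 1 + rerun_times              # initial run plus reruns
    for attempt in range(1, attempts + 1):
        if task(attempt):
            return "success on attempt %d" % attempt
        # The real scheduler waits for the rerun interval here.
    return "failed after %d attempts" % attempts

flaky = lambda attempt: attempt >= 3        # fails twice, then succeeds
print(run_with_reruns(flaky, rerun_times=3))
# success on attempt 3
```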

Computing Resource

The computing resource that you want to use to run the node. You can configure this parameter based on your business requirements.

Resource Group For Scheduling

The resource group for scheduling that you want to use to run the node. You can configure this parameter based on your business requirements.

Scheduling Time

You can configure information such as the scheduling cycle and scheduling time for an auto triggered node in the Scheduling Time section.

Note

If your auto triggered node is stored in a workflow, you must configure the parameters in the Scheduling Time section for the node on the Properties tab of the workflow. If your auto triggered node is not stored in a workflow, you must configure the parameters in the Scheduling Time section for the node on the Properties tab of the node.

Precautions

  • The scheduling frequency of a node is unrelated to the scheduling frequency of its ancestor node.

    The interval at which a node is scheduled is determined only by its own scheduling frequency, not by the scheduling frequency of its ancestor node.

  • DataWorks allows you to configure scheduling dependencies between nodes whose scheduling frequencies are different.

    DataWorks generates instances for an auto triggered node based on the scheduling frequency and the number of scheduling cycles of the node. For example, the number of instances generated for a node scheduled by hour every day is the same as the number of scheduling cycles of the node every day. The node is run as an instance. In essence, dependencies between auto triggered nodes are dependencies between instances that are generated for the nodes. The number of instances generated for ancestor and descendant auto triggered nodes and dependencies between the instances vary based on the scheduling frequencies of the ancestor and descendant nodes. For information about how to configure scheduling dependencies between nodes whose scheduling frequencies are different, see Select a scheduling dependency type (cross-cycle scheduling dependency).

  • Dry-run instances are generated for a node on the days when the node is not scheduled to run.

    For a node that is not scheduled to run every day, such as a node scheduled by week or month, DataWorks generates dry-run instances for the node on the days when it is not scheduled to run. The dry-run instances return success results when the scheduling time of the node arrives on these days. This way, if a node scheduled by day depends on the node scheduled by week or month, the node scheduled by day can be run as expected. In this case, the node scheduled by week or month is dry run, but the node scheduled by day is run as scheduled.

  • Time when a node is run

    You can specify the time when you want to schedule a node. The actual time when the node is run is affected by various factors, such as the scheduling time of the ancestor nodes of the node, the resources required to run the node, and the conditions for running the node. For more information, see What are the conditions that are required for a node to successfully run?
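The dry-run behavior for a node that is not scheduled every day can be sketched as follows (a hypothetical helper; the weekday numbering is an assumption made for the example):

```python
# Sketch: a node scheduled only on Monday gets a normal instance on
# Monday and dry-run instances (instant success, no data) on other days,
# so a daily descendant node can still run every day.

def instance_type(weekday, scheduled_weekday=0):  # 0 = Monday
    return "normal" if weekday == scheduled_weekday else "dry-run"

print([instance_type(d) for d in range(7)])
# ['normal', 'dry-run', 'dry-run', 'dry-run', 'dry-run', 'dry-run', 'dry-run']
```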

Configure parameters in the Scheduling Time section


Scheduling Cycle

The scheduling frequency defines the interval at which the code of a node is executed by the scheduling system in the production environment. DataWorks generates instances for the node based on the scheduling frequency and the number of scheduling cycles of the node. For example, the number of instances generated every day for a node scheduled by hour is the same as the number of scheduling cycles of the node every day. The node is run as an instance.

  • Minute: The node is automatically run once every N minutes within a specific period every day. The minimum interval for running a node that is scheduled by minute is 1 minute.

  • Hour: The node is automatically run once every N hours within a specific period every day.

  • Day: The node is automatically run at a specified point in time every day. If you create an auto triggered node that is scheduled by day, the node is scheduled to run at 00:00 every day by default. You can change the scheduling time of the node based on your business requirements.

  • Week: The node is automatically run at a specified point in time on specific days every week.

  • Month: The node is automatically run at a specified point in time on specific days every month.

  • Year: The node is automatically run at a specified point in time on specific days every year.

Important

For a node that is scheduled by week, month, or year, DataWorks generates dry-run instances for the node on the days when it is not scheduled to run. The dry-run instances return success results but do not generate data.
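The relationship between scheduling frequency and instance count described above can be sketched as follows (an illustrative helper; the scheduling window and interval values are example numbers):

```python
# Sketch: one instance is generated per scheduling cycle that falls
# inside the daily scheduling window.

def instances_per_day(start_hour, end_hour, every_n_hours):
    return len(range(start_hour, end_hour + 1, every_n_hours))

print(instances_per_day(0, 23, 1))  # 24 instances for an hourly node
print(instances_per_day(0, 23, 6))  # 4 instances: 00:00, 06:00, 12:00, 18:00
```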

Effective Period

You can specify a validity period during which a node is automatically run as scheduled. The node is not automatically run outside the specified time range. A node whose validity period has expired becomes an expired node. You can view the number of expired nodes on the O&M Dashboard page of Operation Center and undeploy the nodes based on your requirements.
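The validity-period check can be sketched as follows (a hypothetical helper that compares dates only):

```python
# Sketch: a node is automatically scheduled only when the run date
# falls inside its configured effective period.
from datetime import date

def in_effective_period(run_date, start, end):
    return start <= run_date <= end

print(in_effective_period(date(2025, 5, 6), date(2025, 1, 1), date(2025, 12, 31)))
# True
```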

Cron Expression

A cron expression is automatically generated based on the configurations of time properties.
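As a rough sketch of that generation, a daily scheduling time could map to a cron expression as follows. The six-field "second minute hour day month week" layout is an assumption about the displayed format, and `daily_cron` is a hypothetical helper; the real expression is generated by DataWorks from the time properties.

```python
# Sketch (not the DataWorks generator): map a daily scheduling time to a
# Quartz-style cron expression of the form "second minute hour day month week".

def daily_cron(hour, minute):
    return "00 %02d %02d * * ?" % (minute, hour)

print(daily_cron(0, 30))  # 00 30 00 * * ?
```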

Scheduling Dependencies

Scheduling dependencies in DataWorks define the relationships between nodes in scheduling scenarios. After you configure scheduling dependencies for a node, the node can start to run only after its ancestor nodes are successfully run. Scheduling dependencies help ensure that a node can obtain the data it requires from its ancestor nodes: a successful run of an ancestor node indicates that the latest data has been generated, and the node can then consume that data. This prevents the node from failing to obtain the required data.

Precautions

  • After you configure scheduling dependencies for a node, the node can start to run only after its ancestor node is successfully run. Otherwise, data quality issues may occur when the node obtains data from its ancestor nodes.

  • The time at which a node is run is determined by the scheduling time, which is the expected running time of the node in scheduling scenarios, and the time at which its ancestor node finishes running. This indicates that the actual running time of the node also depends on the scheduling time of its ancestor node. If the ancestor node does not finish running, the node cannot start to run at its scheduling time even if the scheduling time of the node is earlier than that of its ancestor node. For information about the running conditions of a node, see Use the Intelligent Diagnosis feature.

Configure parameters in the Scheduling Dependencies section

Scheduling dependencies between nodes in DataWorks ensure that descendant nodes can obtain valid data from ancestor nodes. In essence, scheduling dependencies between nodes are dependencies between the data lineages of ancestor and descendant tables. You can determine whether to configure scheduling dependencies for a node based on the lineages between the tables generated by the node and those generated by its ancestor nodes. The following figure shows the procedure for configuring scheduling dependencies for a node.

(Figure: procedure for configuring scheduling dependencies; the steps are described below.)

If you configure scheduling dependencies for a node based on the table data lineage, a strong lineage is assumed to exist between the table data. This means that the table data generated by the node depends on the table data generated by its ancestor nodes. Before you configure scheduling dependencies for the node based on the table lineage, check whether a strong lineage exists between the table data generated by the node and that generated by its ancestor node: if the ancestor node fails to generate data and the node cannot obtain valid data as a result, a strong lineage exists.


Step 1

To ensure that the current node can be run at its scheduling time, you can check whether a strong lineage exists between the table data that is generated by the current node and its ancestor node, and determine whether to configure scheduling dependencies for the current node based on the lineage between the table data.

Step 2

Check whether the table data on which the current node depends is generated by an auto triggered node. DataWorks determines whether the table data of an auto triggered node is generated based on the status of the node. However, DataWorks cannot determine whether the table data of non-auto triggered nodes is generated based on the status of the nodes. If you use tables that are not generated by auto triggered nodes, you cannot specify the tables when you configure scheduling dependencies.

Tables whose data is not generated based on periodic scheduling in DataWorks include but are not limited to the following tables:

  • Tables generated by real-time synchronization nodes

  • Tables uploaded from on-premises machines to DataWorks

  • Dimension tables

  • Tables generated by manually triggered nodes

  • Tables whose data is periodically updated but are not generated by auto triggered nodes in DataWorks

Steps 3 and 4

You can determine whether to configure same-cycle or cross-cycle scheduling dependencies between the current node and its ancestor node based on the following conditions: whether the current node needs to depend on the data generated by its ancestor node on the current day or the previous day, and, if the current node is scheduled by hour or minute, whether the instance generated for the current node in the current cycle needs to depend on the instance generated in the previous cycle.

  • Same-cycle scheduling dependencies: A descendant node depends on the table data generated by an ancestor node on the current day.

  • Cross-cycle scheduling dependencies:

    • A descendant node depends on the table data generated by an ancestor node on the previous day.

    • Special dependency scenarios for nodes scheduled by hour and minute:

      • The instance generated for the node scheduled by hour or minute in the current cycle depends on the instance generated for the same node in the previous cycle.

      • Node A scheduled by hour depends on Node B scheduled by hour, and the scheduling time of the nodes is the same. You can configure cross-cycle scheduling dependencies for Node A to allow the instance generated for Node A at 02:00 to depend on the instance generated for Node B at 01:00. The same logic applies to a node that is scheduled by minute and depends on another node scheduled by minute.

Note

For information about how to select a scheduling dependency type and make the related configurations when you configure scheduling dependencies for a node based on lineages, see Select a scheduling dependency type (same-cycle scheduling dependency).

Steps 5, 6, and 7

After the scheduling dependencies are configured and deployed to the production environment, you can check whether the scheduling dependencies meet your expectations on the Auto Triggered Nodes page in Operation Center.
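The hourly cross-cycle rule described above can be sketched as follows (a hypothetical helper; for simplicity, hours wrap within a single day):

```python
# Sketch: with same-cycle dependencies, the 02:00 instance of Node A
# depends on the 02:00 instance of Node B; with cross-cycle
# dependencies, it depends on Node B's previous-cycle (01:00) instance.

def upstream_hour(instance_hour, cross_cycle):
    return (instance_hour - 1) % 24 if cross_cycle else instance_hour

print(upstream_hour(2, cross_cycle=False))  # 2
print(upstream_hour(2, cross_cycle=True))   # 1
```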

Configure custom scheduling dependencies for a node

You can configure custom scheduling dependencies for a node in the following scenarios based on your business requirements:

  • Scenario 1: No strong lineage exists between the node and its ancestor node. For example, the node does not strongly depend on the data in a specific partition in the output tables of its ancestor node, but depends only on the data in the partition that has the largest partition key value.

  • Scenario 2: The node depends on table data that is not generated by an auto triggered node. For example, the node depends on table data that is uploaded from your on-premises machine.

You can use the following methods to configure scheduling dependencies for a node:

  • Specify the root node of a workspace as the ancestor node

    For example, you can use the root node of a workspace as the ancestor node if a data synchronization node depends on data in other business databases or an SQL node processes the table data generated by a real-time synchronization node.

  • Specify a zero load node as the ancestor node

    In a workspace, if a workflow contains a large number of nodes or has complex relationships among nodes, you can use a zero load node to manage the nodes in the workflow in a centralized manner. You can specify the zero load node as the ancestor node of the nodes that you want to manage. This way, data forwarding paths in the workspace are clearer. For example, you can use a zero load node to determine the scheduling time for nodes in a workflow, and schedule or freeze nodes in a workflow in a centralized manner.

Node Output Parameters

After you define an output parameter and its value for a node, you can define an input parameter for the descendant node of the node and configure the descendant node to reference the value of the output parameter in the input parameter.

Precautions

  • An output parameter defined for a node can be used only as an input parameter of the descendant nodes of the node. Output parameters cannot be used to directly pass the query results of a node to its descendant nodes. To configure an output parameter of a node as an input parameter of a descendant node, add a scheduling parameter for the descendant node in the Scheduling Parameters section of the Properties tab of the descendant node, and click the icon in the Actions column to associate the output parameter with the scheduling parameter. If you want a node to use the query results of its ancestor node, use an assignment node to pass the query results.

  • You can configure output parameters for the following types of nodes: EMR Hive, EMR Spark SQL, ODPS Script, Hologres SQL, AnalyticDB for PostgreSQL, and MySQL.

Configure parameters in the Node Output Parameters section

You can define output parameters in the Node Output Parameters section. Values of the constant and variable types are supported for output parameters.

After you configure output parameters for a node and deploy the node, you can associate the output parameters with scheduling parameters configured for the descendant node of the node. This way, the output parameters are used as the input parameters of the descendant node.


  • Parameter Name: the name of an output parameter.

  • Parameter Value: the value of the output parameter. The value type can be constant or variable.

    • If the value type is constant, the value is a fixed string.

    • If the value type is variable, the value can be a global variable, built-in parameter, or custom parameter.
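Passing an output parameter to a descendant node can be sketched as follows (the dict stands in for the scheduling system's parameter store, and all names are hypothetical):

```python
# Sketch: node A defines an output parameter; node B references it as an
# input parameter through an associated scheduling parameter.

outputs = {}

def node_a():
    outputs["region"] = "cn-shanghai"   # constant-type output parameter

def node_b():
    region = outputs["region"]          # referenced as an input parameter
    return "processing data for %s" % region

node_a()
print(node_b())  # processing data for cn-shanghai
```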

References