DataWorks: Scheduling dependency configuration guide

Last Updated: Aug 03, 2023

Scheduling dependencies in DataWorks are the dependency relationships between ancestor and descendant auto triggered nodes. Nodes are scheduled to run in the order that these dependencies define: a descendant node starts to run only after its ancestor nodes finish running. This ensures that valid business data is generated at the earliest opportunity. This topic describes how to configure scheduling dependencies for a node so that incorrect configurations do not cause data exceptions. We recommend that you read this topic before you configure scheduling dependencies for a node.

Background information

Scheduling dependencies in DataWorks define the relationships between nodes in scheduling scenarios. After you configure scheduling dependencies for a node, the node can start to run only after its ancestor nodes run successfully. This helps ensure that the node can obtain the data that it requires from its ancestor nodes. DataWorks uses the status of the ancestor nodes to detect whether they have generated the latest data: if the ancestor nodes run successfully, DataWorks considers their data generated, and the node can then obtain it. This prevents the node from failing to obtain the required data.

Precautions

  • After you configure scheduling dependencies for a node, the node can start to run only after its ancestor nodes run successfully. Without this constraint, data quality issues may occur when the node obtains data from its ancestor nodes.

  • The time at which a node actually runs is determined by both its scheduling time, which is the expected running time of the node in scheduling scenarios, and the time at which its ancestor nodes finish running. The actual running time of the node therefore also depends on the scheduling time of its ancestor nodes. If the ancestor nodes have not finished running, the node cannot start at its scheduling time, even if that time is earlier than the scheduling time of its ancestor nodes. For information about the running conditions of a node, see Use the Intelligent Diagnosis feature.
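
The timing rule in the second precaution can be sketched in a few lines of Python. This is only an illustration, not DataWorks code: the function name earliest_start is hypothetical, and the sketch assumes that all ancestor nodes succeed and that nothing else (such as waiting for resources) delays the node.

```python
from datetime import datetime

def earliest_start(scheduled_time, ancestor_finish_times):
    """Return the earliest moment a node can start to run.

    The node needs its own scheduling time to arrive AND every ancestor
    to have finished, so the answer is the latest of those moments.
    """
    return max([scheduled_time, *ancestor_finish_times])

# Hypothetical node scheduled at 00:30 whose ancestors finish later.
node_scheduled = datetime(2023, 8, 3, 0, 30)
ancestors_done = [datetime(2023, 8, 3, 1, 15), datetime(2023, 8, 3, 2, 5)]
print(earliest_start(node_scheduled, ancestors_done))  # 2023-08-03 02:05:00
```

Even though the node is scheduled for 00:30, it cannot start before 02:05, when the last ancestor finishes.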

Configure scheduling dependencies

Scheduling dependencies between nodes in DataWorks ensure that descendant nodes can obtain valid data from ancestor nodes. A scheduling dependency therefore implies that a strong lineage exists between the tables that are generated by the ancestor and descendant nodes. Based on your business requirements and on the lineage between the tables generated by a node and its ancestor nodes, you can determine whether to configure scheduling dependencies for the node.

Method 1: Configure custom scheduling dependencies

You can configure custom scheduling dependencies for a node in the following scenarios based on your business requirements:

  • Scenario 1: No strong lineage exists between the node and its ancestor nodes. For example, the node does not strongly depend on the data in a specific partition in the output tables of its ancestor nodes but depends only on the data in the partition that has the largest partition key value.

  • Scenario 2: The node depends on table data that is not generated by an auto triggered node. For example, the node depends on table data that is uploaded from your on-premises machine.

You can use the following methods to configure scheduling dependencies for a node in these scenarios:

  • Specify the root node of a workspace as the ancestor node

    For example, you can use the root node of a workspace as the ancestor node if a data synchronization node depends on data in other business databases or an SQL node processes the table data generated by a real-time synchronization node.

  • Specify a zero load node as the ancestor node

    In a workspace, if a workflow contains a large number of nodes or has complex relationships among nodes, you can use a zero load node to manage the nodes in the workflow in a centralized manner. You can specify the zero load node as the ancestor node of the nodes that you want to manage. This way, data forwarding paths in the workspace are clearer. For example, you can use a zero load node to determine the scheduling time for nodes in a workflow, and schedule or freeze nodes in a workflow in a centralized manner.

Method 2: Configure scheduling dependencies for a node based on the table data lineage

If you configure scheduling dependencies for a node based on the table data lineage, the system treats the table data as having a strong lineage, that is, the table data generated by the node depends on the table data generated by its ancestor nodes. Before you configure scheduling dependencies in this way, check whether a strong lineage actually exists between the table data generated by the node and the table data generated by its ancestor nodes. To do so, consider whether the node can still obtain valid data if the ancestor nodes fail to generate data. If the node fails to obtain valid data, a strong lineage exists.

No.: 1 and 2

Goal: Check whether a strong lineage exists between the table data that is generated by the current node and its ancestor nodes.

Description: To ensure that the current node can run at its scheduling time, check whether a strong lineage exists between the table data that is generated by the current node and its ancestor nodes, and then determine whether to configure scheduling dependencies for the current node based on that lineage.

No.: 3

Goal: Check whether the table data on which the current node depends is generated by an auto triggered node.

Description: DataWorks determines whether the table data of an auto triggered node is generated based on the status of the node. However, DataWorks cannot determine whether the table data of non-auto triggered nodes is generated based on the status of the nodes. If you use tables that are not generated by auto triggered nodes, scheduling dependency configuration is not supported.

No.: 4, 5, and 6

Goal: Determine and configure scheduling dependencies for the current node based on the table data lineage.

Description: Determine whether to configure same-cycle or previous-cycle scheduling dependencies between the current node and its ancestor nodes based on two conditions: whether the current node depends on the data that its ancestor nodes generate on the previous day or on the current day, and, if the current node is scheduled by hour or minute, whether the instance generated for the node in the current cycle depends on the instance generated for the same node in the previous cycle.

No.: 7, 8, and 9

Goal: Preview scheduling dependencies.

Description: After you configure the scheduling dependencies, you can check whether the scheduling dependencies of the node in the production environment meet your business requirements in three ways: preview the scheduling dependencies of the current node after you configure them, use the code parsing result comparison feature before you deploy the node, and view the current node on the Cycle Task page in Operation Center after you deploy the node to the production environment.
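
The choice between Method 1 and Method 2 can be summarized as a small decision helper. The following Python sketch only restates the checks described above; the function name and its return strings are hypothetical and are not part of DataWorks.

```python
def choose_configuration_method(strong_lineage, generated_by_auto_triggered_node):
    """Suggest how to configure scheduling dependencies for a node.

    strong_lineage: True if the node cannot obtain valid data when its
        ancestor nodes fail to generate data.
    generated_by_auto_triggered_node: True if the table data the node
        depends on is generated by an auto triggered node.
    """
    if not generated_by_auto_triggered_node:
        # Method 1, Scenario 2: DataWorks cannot tell from node status
        # whether this data has been generated.
        return "Method 1: custom scheduling dependencies"
    if not strong_lineage:
        # Method 1, Scenario 1: no strong lineage between the tables.
        return "Method 1: custom scheduling dependencies"
    return "Method 2: scheduling dependencies based on the table data lineage"

print(choose_configuration_method(True, True))
print(choose_configuration_method(True, False))
```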

Configure scheduling dependencies for a node based on the table data lineage

In DataWorks, the table data lineage is expressed through the scheduling dependencies between the nodes that generate the table data. After you confirm that a strong lineage exists between the table data that is generated by a node and its ancestor nodes, determine whether to configure same-cycle or previous-cycle scheduling dependencies between the nodes based on the following conditions:

  • Whether the current node depends on the data that its ancestor nodes generate on the previous day or on the current day.

  • If the current node is scheduled by hour or minute, whether the instance generated for the node in the current cycle depends on the instance generated for the same node in the previous cycle.

Note

DataWorks allows you to configure scheduling dependencies between nodes that have different scheduling frequencies. The number of instances of a node is determined by the scheduling frequency and the number of scheduling cycles of the node. The number of scheduling cycles of an ancestor node may be different from that of a descendant node. In different scheduling cycles, the scheduling dependencies between the ancestor and descendant instances may be different. To ensure that the scheduling dependencies meet your business requirements, we recommend that you use the preview feature to preview the scheduling dependencies between ancestor and descendant instances if the number of scheduling cycles and the scheduling time of the ancestor instances are different from those of the descendant instances. For more information, see Principles and samples of scheduling configurations in complex dependency scenarios.
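
To see why nodes with different scheduling frequencies do not map one-to-one, it can help to count the instances each node generates in a day. The following Python sketch is a simplified model under stated assumptions: the function instance_times and its parameters are hypothetical, and it models only a fixed interval within a daily window, not the full set of DataWorks scheduling options.

```python
from datetime import date, datetime, time, timedelta

def instance_times(day, first, last, interval):
    """Return the scheduling times of the instances generated for one day.

    first and last bound the daily scheduling window; interval is the
    length of one scheduling cycle.
    """
    times = []
    current = datetime.combine(day, first)
    end = datetime.combine(day, last)
    while current <= end:
        times.append(current)
        current += interval
    return times

day = date(2023, 8, 3)
hourly = instance_times(day, time(0, 0), time(23, 59), timedelta(hours=1))
daily = instance_times(day, time(2, 0), time(2, 0), timedelta(days=1))
print(len(hourly), len(daily))  # 24 1
```

An hourly ancestor produces 24 instances for the day while a daily descendant produces one, which is why previewing the instance-level dependencies is recommended in such cases.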

Scenario-specific selection and configuration of scheduling dependencies

When a node is scheduled, the scheduling parameters of the node in the node code are used to determine the specific ancestor instance on which the current node instance depends.

Note

The scheduling parameters of the node are automatically replaced with specific values based on the data timestamp and scheduling time of the node and on the value formats of the scheduling parameters. Because the values are replaced dynamically at the scheduling time of the node, the data that the node queries and the partition data that it generates change from one scheduling cycle to the next.
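
The effect of parameter replacement can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the placeholder names data_date and schedule_hour, the table names, and the function render_parameters are hypothetical and are not DataWorks parameter syntax, and the sketch assumes that the data timestamp is the day before the scheduling time.

```python
from datetime import datetime, timedelta

def render_parameters(code, scheduling_time):
    """Replace placeholders in node code with values for one instance."""
    # Assumption: the data timestamp is one day before the scheduling time.
    data_timestamp = scheduling_time - timedelta(days=1)
    values = {
        "data_date": data_timestamp.strftime("%Y%m%d"),
        "schedule_hour": scheduling_time.strftime("%Y%m%d%H"),
    }
    return code.format(**values)

# Hypothetical node code that reads and writes a date partition.
sql = (
    "INSERT OVERWRITE TABLE sales_daily PARTITION (ds='{data_date}') "
    "SELECT * FROM sales_raw WHERE ds='{data_date}';"
)
print(render_parameters(sql, datetime(2023, 8, 3, 0, 30)))
```

The same node code resolves to a different partition for each instance, so the partition that a descendant reads must match the partition that the ancestor instance it depends on actually writes.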

To configure scheduling dependencies for a node based on the table data lineage, perform the following operations:

  1. Confirm the table data lineage

    To ensure that the table data that is generated by the node meets your business requirements, you must make sure that its ancestor nodes generate the required business data on the current day. In other words, the data that the node obtains on the current day must be generated by the ancestor nodes on the current day.

    • If the node is scheduled by hour or minute, you must make sure that each instance of the node generates the required table partition data.

    • If you cannot view the scheduling parameter configurations of the ancestor nodes, for example, because the node depends on ancestor nodes in another workspace, see Confirm the lineage of a table for information about how to confirm the table data lineage.

  2. Select a scheduling dependency type based on the table lineage

    The following table describes the scheduling dependency types that you can select based on the lineage between tables.

    Scheduling dependency type: Configure same-cycle scheduling dependencies

    Lineage: A descendant node depends on the table data generated by an ancestor node on the current day.

    Scheduling dependency type: Configure cross-cycle scheduling dependencies

    Lineage:

    • A descendant node depends on the table data generated by an ancestor node on the previous day.

    • Special dependency scenarios for nodes scheduled by hour and minute:

      • The instance generated for the node scheduled by hour or minute in the current cycle depends on the instance generated for the same node in the previous cycle. For more information, see Dependency on the instance generated for the current node in the previous cycle.

      • Node A scheduled by hour depends on Node B scheduled by hour, and the scheduling time of the nodes is the same. You can configure the cross-cycle scheduling dependencies for Node A to allow the instance generated for Node A at 02:00 to depend on the instance generated for Node B at 01:00. The same logic applies to a node that is scheduled by minute and depends on another node scheduled by minute.
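
The difference between the two dependency types can be sketched as a lookup of the ancestor instance that a given instance waits for. This Python sketch is illustrative only: the function name and the type strings are hypothetical, and it assumes that both nodes have the same scheduling frequency and scheduling time, as in the Node A and Node B example.

```python
from datetime import datetime, timedelta

def upstream_instance_time(instance_time, cycle, dependency):
    """Return the scheduling time of the ancestor instance depended on.

    cycle is the length of one scheduling cycle (for example, one hour).
    """
    if dependency == "same-cycle":
        return instance_time
    if dependency == "cross-cycle":
        return instance_time - cycle
    raise ValueError(f"unknown dependency type: {dependency}")

# Node A's 02:00 instance with a cross-cycle dependency on Node B.
a_instance = datetime(2023, 8, 3, 2, 0)
print(upstream_instance_time(a_instance, timedelta(hours=1), "cross-cycle"))
# 2023-08-03 01:00:00
```

With a same-cycle dependency, the same call returns 02:00, so Node A's 02:00 instance waits for Node B's 02:00 instance instead.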

Scenarios in which scheduling dependencies cannot be configured

Scheduling dependencies between auto triggered nodes in DataWorks ensure that the tables generated by the nodes are updated regularly at specific points in time and that descendant auto triggered nodes obtain valid data from ancestor auto triggered nodes. Because DataWorks judges whether table data is generated based on node status, it cannot monitor tables that are not generated by auto triggered nodes in DataWorks. Tables whose data is not generated by periodic scheduling in DataWorks include but are not limited to the following tables:

  • Tables generated by real-time synchronization nodes

  • Tables uploaded from on-premises machines to DataWorks

  • Dimension tables

  • Tables generated by manually triggered nodes

  • Tables whose data is periodically updated but that are not generated by auto triggered nodes in DataWorks

For nodes that depend on table data that is not generated by periodic scheduling in DataWorks, you can configure scheduling dependencies based on your business requirements. For more information, see Configure scheduling dependencies.

Confirm the scheduling dependencies

After you configure the scheduling dependencies for the node, you can use the methods that are described in the following table to check whether the scheduling dependencies meet your business requirements.

Method: Preview scheduling dependencies of a node when you configure the scheduling dependencies

Description: You can use the preview feature to check whether the current scheduling dependencies of a node meet your business requirements.

DataWorks allows you to configure scheduling dependencies between nodes that are scheduled by minute, hour, day, week, month, or year. The number of scheduling cycles of a node varies based on the scheduling frequency of the node.

An instance is generated for a node in each scheduling cycle. The dependencies between ancestor and descendant instances vary based on the scheduling frequencies of the ancestor and descendant nodes that generate the instances. Use this method when a node scheduled by day depends on a node scheduled by hour, when a node scheduled by hour depends on a node scheduled by minute, or when you want to configure cross-cycle scheduling dependencies. Previewing helps ensure that nodes are scheduled as expected and prevents unexpected scheduling dependencies from delaying the running of nodes. For information about how to configure scheduling dependencies in complex dependency scenarios, see Principles and samples of scheduling configurations in complex dependency scenarios.

Method: Compare code parsing results when you commit a node

Description: You can use the code parsing result comparison feature to confirm whether the modifications that you made to the current scheduling dependencies of a node meet your business requirements, and to confirm the impact of the modifications on data in the production environment.

If you enable the automatic parsing feature and then modify the scheduling dependencies that the feature obtained for a node, you must confirm your modifications when you commit the node. This ensures that the modifications do not affect the generation of data by the node in the production environment.

Method: View the details of a node on the Cycle Task page after you deploy the node

Description: You can use this method to check, in Operation Center, whether the scheduling dependencies of a node in the production environment meet your business requirements after you deploy the node.

  • Confirm scheduling dependencies of a node in the production environment

    In a workspace in standard mode, the scheduling dependencies of a node in the development and production environments can be different. You must configure the scheduling dependencies for a node in the production environment on the DataStudio page, and deploy the node for the configurations to take effect.

    After you deploy the node, you can go to the Cycle Task page in Operation Center, and show the ancestor and descendant nodes of the current node to view the scheduling dependencies of the node.

    Important

    You can view the latest status of nodes in the production environment on the Cycle Task page. However, whether the corresponding instances are added or removed depends on the mode in which instances take effect. For more information, see Additional information.

  • Confirm the data of a node in the production environment

    After you confirm that the scheduling dependencies of a node are correct, you must check whether the partitions in the tables generated by the ancestor nodes are the partitions on which the current node depends. This prevents a mismatch in which the ancestor nodes generate data that is not the data on which the current node depends.

    Note

    If your node deployment procedure is subject to process control, we recommend that you go to the Cycle Task page in Operation Center in the production environment after you deploy a node. On this page, you can view the scheduling dependencies and related properties of the node. If the configurations do not meet your requirements, check whether the deployment procedure is blocked. For more information, see Deploy nodes.
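
The partition check described above amounts to comparing two sets of partitions. The following Python sketch is a minimal illustration: the function name and the partition strings are hypothetical, and how you list the partitions of a table depends on your compute engine and is outside the scope of the sketch.

```python
def missing_partitions(required, produced):
    """Return the partitions the current node reads that no ancestor wrote."""
    return sorted(set(required) - set(produced))

# Hypothetical partitions for a node that reads yesterday's data.
produced_by_ancestors = ["ds=20230801", "ds=20230802"]
print(missing_partitions(["ds=20230802"], produced_by_ancestors))  # []
print(missing_partitions(["ds=20230803"], produced_by_ancestors))  # ['ds=20230803']
```

An empty result means that every partition the node depends on is generated by its ancestor nodes. A non-empty result points to a mismatch between the scheduling parameters of the node and those of its ancestor nodes, even if the scheduling dependencies themselves are correct.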

Additional information

This section describes common scheduling dependency scenarios. For answers to frequently asked questions about scheduling dependencies, see Scheduling dependencies.

  • Node uniqueness

    • A node can have different scheduling dependency configurations in the development and production environments but the node must be unique.

    • Before you undeploy a node, you must remove all descendant nodes of the node from both the development and production environments. Because of node uniqueness, this means that you must remove all descendant nodes of the node, reconfigure another node as the ancestor node of those descendant nodes, and then commit and deploy the changes. This ensures that the descendant nodes can still obtain valid data and run as expected.

  • Instance generation modes