Nodes in a workflow in DataWorks are run in sequence based on the scheduling dependencies configured for each node. This ensures that business data is generated in an effective and timely manner. This topic describes the principles for configuring same-cycle scheduling dependencies and how scheduling dependencies work.

Reasons for configuring scheduling dependencies

Scheduling dependencies define the relationships between nodes. After you configure scheduling dependencies for a node, the node runs only after its ancestor node has run successfully.

After a node runs successfully, it generates output data, and its descendant node can then read that data. This mechanism ensures that a node obtains valid data from its ancestor node, and it prevents the node from reading the output of the ancestor node before the ancestor node has finished successfully.

We recommend that you plan and configure scheduling dependencies for nodes based on the lineage of the table data of each node. Make sure that the following principles are met:
  • A table is generated by only one node, and the table must be configured as the output of the node.
    Note
    • The system automatically adds the table generated by an SQL node to the output of the SQL node based on the automatic parsing feature.
    • You must manually add the table generated by a batch sync node to the output of the node. The name of the table is in the projectname.tablename format. This way, the output table of the node is configured as the input table of its descendant node based on the automatic parsing feature when the descendant node is run.
  • The output of a node must be configured as the input of its descendant node, which forms dependencies between nodes.
Note: For a node whose table data has no lineage, you can plan and configure scheduling dependencies for the node based on the upstream and downstream relationships between nodes in a workflow. The configuration principles and results must be consistent with those for a node whose table data has a lineage.
The scheduling dependencies of a node are configured in the Properties panel of the node. You must set the Parent Nodes and Output parameters for the node.
Figure: Same-cycle dependencies
Scheduling dependencies can be automatically or manually configured for a node.
  • In most cases, the system can automatically identify input and output commands such as SELECT and INSERT based on the standard code that you developed for a node. The system can also identify the lineage of table data based on the code. Then, the system automatically configures scheduling dependencies for the node based on the automatic parsing feature and the identified lineage.
  • In special cases, you can manually configure scheduling dependencies for a node. For example, the scheduling dependencies configured for a node contain a table that is not generated by an auto triggered node, such as a table uploaded from your on-premises machine. In this case, you can manually modify the scheduling dependencies.
When you commit a node, the system checks whether the scheduling dependencies configured for the node are consistent with the data lineage in the code developed for the node. If they are inconsistent, the system displays an error message. In this case, decide whether to modify the scheduling dependencies based on your business requirements.

Automatic parsing

For an SQL node, the system can automatically determine the scheduling dependencies of the node based on the code developed for the node. Then, the system automatically sets the Output or Parent Nodes parameters for the scheduling dependencies of the node.

Principles for adding scheduling dependencies based on automatic parsing
Figure: Automatic parsing
  • If the code developed for a node contains output commands such as INSERT and CREATE, the system automatically parses the commands and adds the table generated by the node to Output for the node.
  • If the code developed for a node contains input commands such as SELECT, the system automatically parses the commands and adds the input table to Parent Nodes for the node.
  • The output of a node is configured as the input of its descendant node. This way, a scheduling dependency is established between the nodes based on their data lineages.
In normal cases, the results of automatic parsing are consistent with data lineages. When you commit a node, the system checks whether the scheduling dependencies configured for the node are consistent with data lineages. If they are inconsistent, the system displays an error message. In this case, you can use one of the following methods to resolve the issue:
  • If every table in the scheduling dependencies of the node is generated by an auto triggered node, check whether the scheduling dependencies are correctly configured for the node.
  • If the scheduling dependencies of the node contain a table that is not generated by an auto triggered node, manually delete that dependency for the node.
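
The parsing principles above can be illustrated with a minimal Python sketch. It is not the DataWorks parser: the regular expressions, the `parse_dependencies` helper, and the `demo_project` table names are assumptions made for illustration, and a real SQL parser handles far more syntax (subqueries, CTEs, comments, and so on).

```python
import re

# Simplified, hypothetical patterns for output and input commands.
OUTPUT_RE = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?"
    r"|CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?)([\w.]+)",
    re.IGNORECASE,
)
INPUT_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def parse_dependencies(sql):
    """Return the tables that would go to Parent Nodes and to Output."""
    outputs = {name.lower() for name in OUTPUT_RE.findall(sql)}
    # A table that the node writes is not treated as one of its inputs.
    inputs = {name.lower() for name in INPUT_RE.findall(sql)} - outputs
    return {"parent_nodes": sorted(inputs), "output": sorted(outputs)}


sql = """
INSERT OVERWRITE TABLE demo_project.dws_orders
SELECT o.id, u.name
FROM demo_project.ods_orders o
JOIN demo_project.dim_users u ON o.user_id = u.id;
"""
result = parse_dependencies(sql)
print(result)
```

In this sketch, the table after INSERT lands in `output`, and the tables after FROM and JOIN land in `parent_nodes`, which mirrors how Output and Parent Nodes are filled in by automatic parsing.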

Requirements and principles for code development

Automatic parsing enables the system to identify the scheduling dependencies of a node based on the code that you developed for the node. Therefore, we recommend that you strictly comply with the following requirements when you develop code:
  • Requirements for code development: One node generates only one table, and a table is generated by only one node.
  • Requirements for node creation: The name of a node must be consistent with that of the table that is to be generated by the node.
  • Requirements for scheduling configurations: The table generated by a node must be added to Output for the node.
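
As a rough illustration, the requirements above could be checked before a node is committed. The `check_conventions` helper and the sample names are hypothetical; DataWorks does not expose such a function, and the sketch assumes that output tables use the projectname.tablename format.

```python
def check_conventions(node_name, output_tables):
    """Return a list of violations of the recommended conventions."""
    problems = []
    # One node generates only one table.
    if len(output_tables) != 1:
        problems.append(
            f"{node_name}: expected exactly one output table, "
            f"found {len(output_tables)}"
        )
    for table in output_tables:
        # Compare the node name with the table part of projectname.tablename.
        table_name = table.split(".", 1)[-1]
        if table_name != node_name:
            problems.append(
                f"{node_name}: node name differs from output table {table_name}"
            )
    return problems


print(check_conventions("dwd_orders", ["demo_project.dwd_orders"]))
print(check_conventions("clean_orders", ["demo_project.dwd_orders"]))
```

The first call reports no violations because the node is named after its only output table. The second call reports a naming mismatch.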

Manually configure scheduling dependencies between nodes

DataWorks allows you to manually modify the Parent Nodes and Output parameters for a node during the code development for the node. If the scheduling dependencies automatically generated by the system for your node do not meet your business requirements, you can manually modify the dependencies.

Scenarios

Scheduling dependencies ensure that a node can successfully obtain the table data generated by its ancestor node that is scheduled to run. However, if the ancestor node of a node is not scheduled to run, the system cannot monitor whether the ancestor node has generated the latest table data. If the table specified in the SELECT statement of the code for a node is not generated by an auto triggered node and the table name is automatically added to Parent Nodes for the node, you must manually delete the dependency for the node. Tables that are not generated by auto triggered nodes include the following types:
  • Tables uploaded from on-premises machines to DataWorks
  • Dimension tables
  • Tables that are not generated by nodes scheduled by DataWorks
  • Tables generated by manually triggered nodes
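
The decision described in this scenario can be sketched as a simple filter. The `split_parsed_inputs` helper and the table names are assumptions for illustration only; in practice you make this decision in the DataWorks console.

```python
def split_parsed_inputs(parsed_inputs, auto_triggered_outputs):
    """Split parsed input tables into dependencies to keep and to delete.

    `auto_triggered_outputs` is the set of tables generated by auto
    triggered nodes. Any other table (uploaded, dimension, external, or
    generated by a manually triggered node) should not remain a dependency.
    """
    keep, delete = [], []
    for table in parsed_inputs:
        if table in auto_triggered_outputs:
            keep.append(table)
        else:
            delete.append(table)
    return keep, delete


parsed = ["demo_project.ods_orders", "demo_project.dim_region", "demo_project.uploaded_targets"]
scheduled = {"demo_project.ods_orders"}
keep, delete = split_parsed_inputs(parsed, scheduled)
print("keep:", keep)
print("delete:", delete)
```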

Configuration methods

  • Delete a scheduling dependency in the code editor of a node
    Figure: Delete Input
    If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, you can delete the dependency for the node. In the code editor of the node, right-click the name of the table that you want to remove from the input, and then click Delete Input. The preceding figure shows the process. You can also add a rule as a comment at the top of the code. This way, the system does not automatically parse the dependency covered by the rule.
  • Delete a scheduling dependency in the Properties panel of a node
    Figure: Manually configure scheduling dependencies between nodes
    If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, you can manually remove the table from Parent Nodes for the node. In the Properties panel of the node, set Auto Parse to No, and then manually remove the table.

Configuration by drawing lines to connect nodes

DataWorks allows you to specify the relationships between nodes by drawing lines to connect nodes on the editing pages of workflows. After the nodes are connected, the system automatically adds scheduling dependencies for each node based on the connections.
Figure: Connect nodes by drawing lines
After all nodes are created, the system automatically adds an output whose name is suffixed with _out for each node. When you connect nodes by drawing lines, the system adds the _out output of each ancestor node to the input of its descendant node.
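
The effect of drawing a line can be modeled with a small Python sketch. The `create_node` and `draw_line` helpers are hypothetical, and the `<node name>_out` naming is an assumption; the exact output name that DataWorks generates may contain additional parts.

```python
def create_node(workflow, name):
    # Assumed naming: each new node gets a default output "<name>_out".
    workflow[name] = {"output": [f"{name}_out"], "parent_nodes": []}


def draw_line(workflow, ancestor, descendant):
    """Add the default output of `ancestor` to the input of `descendant`."""
    default_out = f"{ancestor}_out"
    if default_out not in workflow[descendant]["parent_nodes"]:
        workflow[descendant]["parent_nodes"].append(default_out)


workflow = {}
for name in ("start", "load", "report"):
    create_node(workflow, name)
draw_line(workflow, "start", "load")
draw_line(workflow, "load", "report")
print(workflow["report"])
```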

Scenario

After you create a workflow, you can connect nodes by drawing lines on the configuration tab of the workflow to configure scheduling dependencies for each node based on your business requirements. During subsequent code development, you can add or modify scheduling dependencies for each node manually or by using the automatic parsing feature. This way, all nodes in the workflow can be configured with correct scheduling dependencies.

Case study

In this section, an example is provided to demonstrate how scheduling dependencies work.
Figure: Process example
The preceding figure shows the following information:
  • The table generated by a node must be added to Output for the node. If the code developed for the node contains output commands such as INSERT, the system automatically parses the commands and adds the table to Output for the node.
  • The input of a node must be added to Parent Nodes for the node. If the code developed for the node contains input commands such as SELECT, the system automatically parses the commands and adds the table to Parent Nodes for the node.
  • The output of a node is configured as the input of its descendant node. This way, a scheduling dependency is established between the nodes based on their data lineages.
Then, the system runs the ancestor node first based on the established scheduling dependency. After the execution of the ancestor node is successful, the system begins to run its descendant node.
The preceding process indicates that the following principles must be met when you configure scheduling dependencies for each node:
  • For a node that has upstream and downstream relationships, the Output of the ancestor node of the node must be configured as the Parent Nodes of the node. This helps establish scheduling dependencies between nodes.
  • The values of the Output Table Name and Node ID parameters of Parent Nodes configured for a descendant node must be unique. This also means that the output names of all nodes must be unique. Otherwise, the descendant node cannot find its ancestor node based on the two pieces of information and cannot obtain the data generated by the ancestor node.
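
The case study can be summarized in a short Python sketch that resolves each Parent Nodes entry to the node that owns the matching output and then derives a run order. The `run_order` helper and the node and table names are hypothetical, and the topological sort from the standard `graphlib` module (Python 3.9 or later) stands in for the scheduler; it is not how DataWorks is implemented.

```python
from graphlib import TopologicalSorter


def run_order(nodes):
    """Return an order in which every ancestor runs before its descendants."""
    owner = {}
    for name, spec in nodes.items():
        for out in spec["output"]:
            if out in owner:
                # Output names must be unique across all nodes.
                raise ValueError(
                    f"output name {out!r} is used by {owner[out]} and {name}"
                )
            owner[out] = name

    graph = {}
    for name, spec in nodes.items():
        missing = [p for p in spec["parent_nodes"] if p not in owner]
        if missing:
            raise ValueError(f"{name}: no node generates {missing}")
        # Map each input back to the ancestor node that generates it.
        graph[name] = {owner[p] for p in spec["parent_nodes"]}
    return list(TopologicalSorter(graph).static_order())


nodes = {
    "node_c": {"parent_nodes": ["demo_project.table_b"], "output": ["demo_project.table_c"]},
    "node_a": {"parent_nodes": [], "output": ["demo_project.table_a"]},
    "node_b": {"parent_nodes": ["demo_project.table_a"], "output": ["demo_project.table_b"]},
}
order = run_order(nodes)
print(order)
```

If two nodes declared the same output name, the lookup from an input to its ancestor would be ambiguous, which is why the sketch rejects duplicates instead of picking one.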

Instructions on configuring scheduling dependencies

For more information about how to configure scheduling dependencies in common scenarios, see Configure same-cycle scheduling dependencies.

FAQ