Nodes in a workflow in DataWorks are run in an orderly manner based on the scheduling dependencies configured for each node. This ensures that business data is generated in an effective and timely manner. This topic describes the principles for configuring scheduling dependencies and how scheduling dependencies work.

Reasons for configuring scheduling dependencies

Scheduling dependencies define the relationships between nodes. After you configure scheduling dependencies for a node, the node runs only after its ancestor node runs successfully.

After a node runs successfully, it generates the latest table data, which its descendant node then reads. This mechanism ensures that a node can obtain valid data from its ancestor node.
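The following minimal Python sketch illustrates this run order. It is not DataWorks code: the node names and the `simulate` helper are hypothetical, and the only rule it models is that a node runs after all of its ancestor nodes have succeeded.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each node maps to the set of ancestor nodes it depends on.
dependencies = {
    "ods_orders": set(),
    "dwd_orders": {"ods_orders"},
    "ads_report": {"dwd_orders"},
}


def simulate(deps, failing=frozenset()):
    """Return the nodes that run successfully, in execution order.

    A node runs only if every one of its ancestors succeeded.
    Nodes listed in `failing` are treated as failed runs.
    """
    succeeded = []
    # static_order() yields each node after all of its ancestors.
    for node in TopologicalSorter(deps).static_order():
        ancestors_ok = all(a in succeeded for a in deps.get(node, ()))
        if ancestors_ok and node not in failing:
            succeeded.append(node)
    return succeeded


all_ok = simulate(dependencies)
one_failure = simulate(dependencies, failing={"dwd_orders"})
```

In this sketch, a failure of `dwd_orders` also prevents `ads_report` from running, because its ancestor never produced fresh data.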

We recommend that you plan and configure scheduling dependencies for nodes based on the lineage of the table data of each node. Make sure that the following principles are met:
  • A table is generated by only one node, and the table must be configured as the output of the node.
    Note
    • The system automatically adds the table generated by an SQL node to the output of the SQL node based on the automatic parsing feature.
    • You must manually add the table generated by a batch synchronization node to the output of the node.
  • The output of a node must be configured as the input of its descendant node, which forms dependency relationships between nodes.
Note: For a node whose table data has no lineage, you can plan and configure scheduling dependencies based on the upstream and downstream relationships between nodes in the workflow. The configuration principles and results must be the same as those for a node whose table data has a lineage.
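The two principles above can be sketched in Python as follows. This is an illustration under assumptions, not the way DataWorks is implemented: the `nodes` dictionary, the table names, and the `build_dependencies` helper are all hypothetical.

```python
def build_dependencies(nodes):
    """Derive the parent nodes of each node from configured outputs and inputs.

    nodes: {node_name: {"outputs": [table, ...], "inputs": [table, ...]}}
    Raises ValueError if a table is listed as the output of more than one node.
    """
    producer = {}
    for name, cfg in nodes.items():
        for table in cfg["outputs"]:
            if table in producer:
                raise ValueError(
                    f"table {table!r} is generated by both "
                    f"{producer[table]!r} and {name!r}"
                )
            producer[table] = name

    parents = {}
    for name, cfg in nodes.items():
        # An input creates a dependency only if some node declares it as an output.
        parents[name] = sorted({producer[t] for t in cfg["inputs"] if t in producer})
    return parents


# Hypothetical nodes; dim_region is not generated by any node in this workflow.
nodes = {
    "ods_orders": {"outputs": ["ods_orders"], "inputs": []},
    "dwd_orders": {"outputs": ["dwd_orders"], "inputs": ["ods_orders"]},
    "ads_report": {"outputs": ["ads_report"], "inputs": ["dwd_orders", "dim_region"]},
}
parents = build_dependencies(nodes)
```

Because each table has exactly one producer, every input table resolves to at most one parent node.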
The scheduling dependencies of a node are configured on the Properties panel of the node. You must specify Parent Nodes and Outputs.
Scheduling dependencies can be automatically or manually configured for a node.
  • In ideal cases, the system can automatically identify input and output commands such as SELECT and INSERT based on the standard task code that you developed for a node. The system can also identify the lineage of table data based on the code. Then, the system automatically configures scheduling dependencies for the node based on the automatic parsing feature and the identified lineage.
  • In special cases, you can manually configure scheduling dependencies for a node. For example, the scheduling dependencies configured for a node contain a table that is not generated by an auto triggered node, such as a table uploaded from your on-premises machine. In this case, you can manually modify the scheduling dependencies.
When you commit a node, the system checks whether the scheduling dependencies configured for the node are consistent with the data lineage in the code developed for the node. If they are inconsistent, the system displays an error message. In this case, you can determine whether to modify the scheduling dependencies based on actual situations.

Automatic parsing

For an SQL node, the system can automatically determine the scheduling dependencies of the node based on the task code developed for it. Then, the system automatically adds the required output or dependent ancestor node to the scheduling dependencies of the node.

Principles for adding scheduling dependencies based on automatic parsing
  • If the code developed for a node contains output commands such as INSERT and CREATE, the system automatically parses the commands and adds the table generated by the node to Outputs for the node.
  • If the code developed for a node contains input commands such as SELECT, the system automatically parses the commands and adds the input table to Parent Nodes for the node.
  • The output of a node is configured as the input of its descendant node. This way, a scheduling dependency is established between the nodes based on their data lineages.
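As a rough illustration of these parsing principles, the following Python sketch extracts output and input tables from a SQL statement by using regular expressions. It is a simplification under assumptions: the actual DataWorks parser is not described in this topic, the patterns cover only a few common statement forms, and the table names are hypothetical.

```python
import re

# Output commands: INSERT INTO/OVERWRITE [TABLE] <name>, CREATE TABLE [IF NOT EXISTS] <name>
OUTPUT_PATTERN = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?"
    r"|CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?)([\w.]+)",
    re.IGNORECASE,
)
# Input commands: tables that follow FROM or JOIN in a SELECT statement
INPUT_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def parse_lineage(sql):
    """Return the tables a statement writes (outputs) and reads (inputs)."""
    outputs = set(OUTPUT_PATTERN.findall(sql))
    # A table that the node itself writes is not treated as a parent.
    inputs = set(INPUT_PATTERN.findall(sql)) - outputs
    return {"outputs": sorted(outputs), "inputs": sorted(inputs)}


sql = """
INSERT OVERWRITE TABLE dwd_orders
SELECT o.id, u.name
FROM ods_orders o
JOIN ods_users u ON o.uid = u.id;
"""
lineage = parse_lineage(sql)
```

Here `dwd_orders` would go to Outputs, and `ods_orders` and `ods_users` would go to Parent Nodes. A regular expression cannot handle every SQL construct, so treat this only as a model of the idea.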
In principle, automatic parsing results are consistent with data lineages. When you commit a node, the system checks whether the scheduling dependencies configured for the node are consistent with data lineages. If they are inconsistent, the system displays an error message. In this case, you can use one of the following methods to resolve the issue:
  • If every table in the scheduling dependencies of the node is generated by an auto triggered node, check whether the scheduling dependencies are correctly configured for the node.
  • If the scheduling dependencies of the node contain a table that is not generated by an auto triggered node, manually delete that dependency for the node.

Requirements and principles for code development

Automatic parsing enables the system to automatically identify the scheduling dependencies of a node based on the task code that you developed for the node. Therefore, we recommend that you strictly comply with the following requirements when you develop code:
  • Requirements for code development: One node generates only one table, and a table is generated by only one node.
  • Requirements for node creation: The name of a node must be consistent with that of the table that will be generated by the node.
  • Requirements for scheduling configurations: The table generated by a node must be added to Outputs for the node.
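You can think of the first two requirements as a simple check on each node. The following Python sketch is hypothetical and is not a DataWorks feature: `check_node` and the sample names are assumptions made for illustration, including the assumption that a table may be written as `project.table`.

```python
def check_node(node_name, outputs):
    """Return a list of problems for a node, or an empty list if none are found.

    node_name: name of the node
    outputs:   tables that the node generates
    """
    problems = []
    if len(outputs) != 1:
        problems.append(
            f"node {node_name!r} should generate exactly one table, "
            f"found {len(outputs)}"
        )
    # Compare only the table part so that "project.table" names also match.
    elif outputs[0].split(".")[-1] != node_name:
        problems.append(
            f"node {node_name!r} generates table {outputs[0]!r}; "
            "the names should be consistent"
        )
    return problems


ok = check_node("dwd_orders", ["my_project.dwd_orders"])
mismatch = check_node("load_orders", ["dwd_orders"])
```

A node that passes both checks is straightforward to match with the table that it generates.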

Manual configuration

DataWorks allows you to manually modify Parent Nodes and Outputs for a node during the code development for the node. If the scheduling dependencies automatically generated by the system for your node do not meet your business requirements, you can manually modify the dependencies.

Scenarios

Scheduling dependencies ensure that a node can successfully obtain the table data generated by its ancestor node that is scheduled to run. However, if the ancestor node of a node is not scheduled to run, the system cannot monitor whether the ancestor node has generated the latest table data. If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, and the table is automatically added to Parent Nodes for the node, you must manually delete the dependency for the node. Tables that are not generated by auto triggered nodes include the following types:
  • Tables uploaded from on-premises machines to DataWorks
  • Dimension tables
  • Tables that are not generated by nodes scheduled by DataWorks
  • Tables generated by manually triggered nodes
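The decision described above can be sketched as follows. This Python example is an illustration under assumptions: `split_parsed_inputs` and the table names are hypothetical, and you supply the set of tables that auto triggered nodes generate.

```python
def split_parsed_inputs(parsed_inputs, auto_triggered_outputs):
    """Separate parsed input tables into dependencies to keep and to delete.

    parsed_inputs:          tables found in the SELECT statements of a node
    auto_triggered_outputs: tables generated by auto triggered nodes
    """
    keep, delete = [], []
    for table in parsed_inputs:
        if table in auto_triggered_outputs:
            keep.append(table)
        else:
            # No scheduled node generates this table, so the dependency
            # cannot be tracked and must be deleted manually.
            delete.append(table)
    return keep, delete


keep, delete = split_parsed_inputs(
    ["dwd_orders", "dim_region", "uploaded_targets"],
    {"ods_orders", "dwd_orders"},
)
```

In this sketch, `dim_region` and `uploaded_targets` stand for a dimension table and a table uploaded from an on-premises machine.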

Configuration methods

  • Delete a scheduling dependency in the code editor of a node
    [Figure: Delete Input]
    If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, you can delete the dependency for the node. To do so, go to the code editor of the node, right-click the name of the table whose dependency you want to delete, and then click Delete Input. The preceding figure shows the process. You can also add a rule as a comment at the top of the code so that the system does not automatically parse the dependency.
  • Delete a scheduling dependency on the Properties panel of a node
    If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, you can manually delete the dependent ancestor node for the node. To do so, go to the Properties panel of the node, set Auto Parse to No, and then manually delete the dependency.

Configuration by drawing lines to connect nodes

DataWorks allows you to specify the relationships between nodes by drawing lines to connect nodes on the editing pages of workflows. After the nodes are connected, the system automatically adds scheduling dependencies for each node based on the connections.
After all nodes are created, the system automatically adds an output whose name is suffixed with _out for each node. When you connect nodes by drawing lines, the system adds the _out output of each ancestor node to the input of its descendant node.
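The following Python sketch models this behavior. It rests on assumptions: the output name is shown as the node name followed by _out, which may differ from the full name that DataWorks generates, and `connect_nodes` and the node names are hypothetical.

```python
def connect_nodes(node_names, lines):
    """Model the outputs and inputs that result from drawing lines.

    node_names: names of the nodes in the workflow
    lines:      (ancestor, descendant) pairs, one per drawn line
    """
    # Each node gets a default output whose name is suffixed with _out.
    outputs = {name: [f"{name}_out"] for name in node_names}
    inputs = {name: [] for name in node_names}
    for ancestor, descendant in lines:
        # Drawing a line adds the ancestor's _out output to the descendant's input.
        inputs[descendant].append(outputs[ancestor][0])
    return outputs, inputs


outputs, inputs = connect_nodes(
    ["start", "load_orders", "build_report"],
    [("start", "load_orders"), ("load_orders", "build_report")],
)
```

In this model, `build_report` depends on `load_orders_out`, so it runs after `load_orders` even before any table lineage is parsed from code.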

Scenarios

After you create a workflow, you can connect nodes by drawing lines on the editing page of the workflow to configure scheduling dependencies for each node based on your business requirements. During subsequent code development, you can add or modify scheduling dependencies for each node manually or by using the automatic parsing feature. This way, all nodes in the workflow can be configured with correct scheduling dependencies.

Cases

This section provides an example to demonstrate how scheduling dependencies work.
[Figure: Process example]
The preceding figure shows the following information:
  • The table generated by a node must be added to Outputs for the node. If the code developed for the node contains output commands such as INSERT, the system automatically parses the commands and adds the table to Outputs for the node.
  • The input of a node must be added to Parent Nodes for the node. If the code developed for the node contains input commands such as SELECT, the system automatically parses the commands and adds the table to Parent Nodes for the node.
  • The output of a node is configured as the input of its descendant node. This way, a scheduling dependency is established between the nodes based on their data lineages.
Based on the established scheduling dependency, the system runs the ancestor node first. After the ancestor node runs successfully, the system starts to run its descendant node.
The preceding process indicates that the following principles must be observed when you configure scheduling dependencies for each node:
  • For nodes that have upstream and downstream relationships, the output of the ancestor node must be configured as the dependent ancestor node of the descendant node. This establishes the scheduling dependency between the nodes.
  • The output name and the node ID of the dependent ancestor node configured for a descendant node must be unique. This also means that the output names of all nodes must be unique. Otherwise, the descendant node cannot use these two pieces of information to find its ancestor node and obtain the data that the ancestor node generates.

Instructions on configuring scheduling dependencies

For more information about how to configure scheduling dependencies in common scenarios, see Instructions to configure scheduling dependencies.

FAQ