If same-cycle scheduling dependencies are configured for a node, the instance generated for the node in the current cycle depends on the instance generated for another node in the same scheduling cycle, and the current node can run only after that instance has run successfully. Configure same-cycle scheduling dependencies if the current node reads data from a table that another node generates in the same scheduling cycle. DataWorks provides several methods to configure same-cycle scheduling dependencies, together with a dependency preview feature that lets you find and correct incorrect scheduling dependencies early so that nodes are scheduled as expected. This topic describes the precautions, logic, and methods for configuring same-cycle scheduling dependencies.
Precautions
To ensure smooth configuration of scheduling dependencies, you must understand the information that is described in the "Configure scheduling settings" topic.
In the directed acyclic graph (DAG) of a node, same-cycle scheduling dependencies for the node are presented as solid lines.
If same-cycle scheduling dependencies between nodes cannot meet your requirements in specific complex scenarios, you can configure cross-cycle scheduling dependencies between the nodes. For example, if a node scheduled by day depends on a node scheduled by hour, the instance generated for the node scheduled by day depends on all instances generated for the node scheduled by hour on the current day by default. If you configure the self-dependency for the node scheduled by hour, you can specify that the node scheduled by day depends on the instance generated for the node scheduled by hour in a specific scheduling cycle. For information about how to configure scheduling dependencies in complex dependency scenarios, see Principles and samples of scheduling configurations in complex dependency scenarios.
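The default behavior in the preceding example can be sketched in Python. This is only an illustration, not DataWorks code: the function names are hypothetical, and the sketch assumes that the node scheduled by hour runs once every hour from 00:00 to 23:00.

```python
from datetime import date, datetime, timedelta


def hourly_instances(day: date) -> list[datetime]:
    """Scheduled times of the hourly node for one day (assumed: every hour)."""
    start = datetime(day.year, day.month, day.day)
    return [start + timedelta(hours=h) for h in range(24)]


def daily_instance_upstreams(day: date) -> list[datetime]:
    """By default, the daily instance waits for every hourly instance of that day."""
    return hourly_instances(day)


deps = daily_instance_upstreams(date(2024, 1, 1))
print(len(deps), deps[0], deps[-1])
```

In this model, the instance of the node scheduled by day cannot start until all 24 instances of the node scheduled by hour have finished.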
Configuration principles
To improve node development efficiency, we recommend that you use the automatic parsing feature to quickly configure scheduling dependencies for nodes. You must abide by the following principles during the development process:
Node creation: We recommend that you specify a node name that is the same as the name of the output table of the node.
Code development: Do not use multiple nodes to write data to the same table.
Dependency configuration: We recommend that you use the table generated by a node as the output of the node.
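The second principle can be checked with a short script before you rely on automatic parsing. The following Python sketch is an illustration under stated assumptions: the mapping of node names to output tables and all names in it are hypothetical sample data, not something that DataWorks exports in this form.

```python
from collections import defaultdict


def tables_with_multiple_writers(writes: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return each table that more than one node writes to, with its writers."""
    writers = defaultdict(list)
    for node, tables in writes.items():
        for table in set(tables):  # ignore a table listed twice by the same node
            writers[table].append(node)
    return {t: sorted(n) for t, n in writers.items() if len(n) > 1}


# Hypothetical sample: node name -> tables that the node code writes to.
writes = {
    "ods_orders": ["proj.ods_orders"],
    "dwd_orders": ["proj.dwd_orders"],
    "dwd_orders_fix": ["proj.dwd_orders"],
}
conflicts = tables_with_multiple_writers(writes)
print(conflicts)
```

A non-empty result means that the lineage of a table is ambiguous, which makes it unclear which node a descendant node needs to depend on.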
Configuration entry point and description
To configure scheduling dependencies for a node, go to the configuration tab of the node in Data Studio, click the Properties tab in the right-side navigation pane, and then find the Scheduling Dependencies section of the Properties tab.
When you configure same-cycle scheduling dependencies for a node, you must specify the nodes on which the current node needs to depend in the Node Dependencies section and specify other nodes that need to depend on the current node based on the output of the current node in the Node Outputs section.
By default, you can configure scheduling dependencies for a node based on the lineage between the table from which you want to read data and the table to which you want to write data in the node code. When you commit the node, DataWorks checks whether the scheduling dependencies are configured as expected. You can specify whether to perform automatic parsing for code before you commit the desired node based on your business requirements.
DataWorks supports multiple configuration methods, such as configuration based on the lineage in the code of a node, configuration by drawing lines on the configuration tab of a workflow, and manual configuration. You can select a configuration method based on your business requirements.
If the instance generated for a node in the current cycle needs to depend on the data of an instance generated for another node on the previous day or if the instance generated for a node scheduled by hour or minute in the current cycle needs to depend on the instance generated for the same node in the previous cycle, you can configure cross-cycle scheduling dependencies.
Ancestor nodes
You can specify the nodes on which the current node depends. After the nodes are specified, the current node can start to run only after the ancestor nodes are successfully run. You must enter the output of an ancestor node as the input of the current node. Take note of the following items when you specify the nodes on which the current node depends:
You must configure ancestor nodes for all nodes. We recommend that you configure ancestor nodes for a node based on the table lineage. If no table lineage exists, you can select the root node or zero load node of a workspace as the ancestor node for the current node based on your business requirements.
Make sure that ancestor nodes are committed. If an error indicating that the output of an ancestor node does not exist is reported when you commit the current node, check whether the ancestor node is committed.
Configuration methods:
Method 1: Configure scheduling dependencies based on the lineage in the code of a node
DataWorks generates the name of the output table in the projectName.tableName format based on the code parsing results. The system searches for and recommends the nodes on which the current node needs to depend based on the name of the output table.
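The lookup that this method relies on can be modeled as follows. This Python sketch is a simplified illustration, not the DataWorks implementation: it assumes that the input tables were already parsed from the node code, that a table name without a project prefix belongs to a default project, and that the committed outputs are available as a dictionary. All names are hypothetical.

```python
from typing import Optional


def qualify(table: str, default_project: str) -> str:
    """Return the table name in the projectName.tableName format."""
    return table if "." in table else f"{default_project}.{table}"


def recommend_ancestors(
    input_tables: list[str],
    default_project: str,
    committed_outputs: dict[str, str],
) -> dict[str, Optional[str]]:
    """Map each input table to the node whose output has that name, or None."""
    recommendations = {}
    for table in input_tables:
        output_name = qualify(table, default_project)
        recommendations[output_name] = committed_outputs.get(output_name)
    return recommendations


# Hypothetical committed outputs: output name -> node that produces it.
committed_outputs = {"sales.ods_orders": "ods_orders", "sales.dim_user": "dim_user"}
result = recommend_ancestors(["ods_orders", "crm.dim_customer"], "sales", committed_outputs)
print(result)
```

An input table that maps to None has no committed node that produces it, so no ancestor node can be recommended for that table.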
Method 2: Manually add scheduling dependencies for a node
In the Scheduling Dependencies section, click Add Dependency. In the form that appears, select a dependency type, search for a node by node name, output name, or scheduling task ID, and then add the node as an ancestor node of the current node.
If you use the scheduling dependencies obtained from the code parsing results of a node, make sure that the recommended nodes are committed and deployed to the production environment and generate the desired table. The recommended nodes must be committed to the scheduling system on the previous day. This way, the nodes can be identified by the automatic recommendation feature after data is generated on the current day. Therefore, automatically recommended nodes are updated with a delay of one day.
Output of the current node
You can configure the output of a node for establishing scheduling dependencies between the current node and other nodes. Other nodes find the current node by searching for the output name of the current node, and the current node is specified as the ancestor node of a node based on the scheduling dependency configurations. If the current node is configured as an ancestor node of a descendant node, the name of the output of the current node contains the name of the descendant node after the descendant node is committed. DataWorks does not allow you to manually modify the descendant node in the Node Outputs section of the current node. The following figures show all methods of specifying the output of the current node.
In DataWorks, the name of the output generated for a node is the same as the name of the node. If a workspace contains nodes that have the same name, the nodes may fail to be committed because of duplicate output names. Removing the output of a node that has descendant nodes can severely affect the descendant nodes. For more information, see the Appendix 1: Impacts of the removal or modification of the output of a node section in this topic.
Method 1: Use the default node output
By default, DataWorks generates the output for a node. You can click Modify in the Actions column to modify the name of the output table.
The default output name of a node is globally unique and cannot be modified or deleted. If you configure scheduling dependencies between nodes in a workflow by drawing lines on the configuration tab of the workflow, DataWorks automatically generates an output table name and an output name as the input of a descendant node.
Method 2: Manually add a node output
In the Node Outputs section, click Add Output. In the row that appears, manually add an output for the current node and configure the output name and output table name.
You must configure an output name in the workspace name.custom output name format. The output name must be globally unique.
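The two requirements for a manually added output name can be expressed as a small check. This Python sketch is only an illustration: the function is hypothetical, and it assumes that the workspace name and the custom output name are separated by the first period and that the existing output names are known in advance.

```python
def validate_output_name(name: str, existing: set[str]) -> tuple[str, str]:
    """Check the 'workspace name.custom output name' format and uniqueness."""
    workspace, sep, custom = name.partition(".")
    if not (sep and workspace and custom):
        raise ValueError(f"{name!r} is not in the 'workspace name.custom output name' format")
    if name in existing:
        raise ValueError(f"{name!r} is already used by another node output")
    return workspace, custom


# Hypothetical output names that are already in use.
existing = {"sales.ods_orders"}
print(validate_output_name("sales.orders_daily", existing))
```

The check rejects a name such as orders_daily, which has no workspace prefix, and a name such as sales.ods_orders, which is already in use.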
Configuration logic
To configure scheduling dependencies between nodes, you use the output of a node as the input of another node. This way, scheduling dependencies between the nodes are formed. We recommend that you configure scheduling dependencies between nodes based on the lineage between the table from which you want to read data and the table to which you want to write data. After the scheduling dependencies are configured, the descendant node can start to run only after the ancestor node is successfully run. Scheduling dependencies help ensure that a node can obtain the required data for its running from its ancestor nodes.
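The preceding logic can be modeled in a few lines of Python. This is a minimal sketch, not the DataWorks scheduler: the node names and the dictionary layout are hypothetical, and the run order is derived with the graphlib module of the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def build_dependencies(nodes: dict[str, dict[str, list[str]]]) -> dict[str, set[str]]:
    """Link each node to the ancestor nodes whose outputs it uses as inputs."""
    producer = {}
    for name, spec in nodes.items():
        for output in spec["outputs"]:
            if output in producer:
                raise ValueError(f"duplicate output name: {output}")
            producer[output] = name
    dependencies = {}
    for name, spec in nodes.items():
        missing = [i for i in spec["inputs"] if i not in producer]
        if missing:
            raise ValueError(f"output of an ancestor node does not exist: {missing}")
        dependencies[name] = {producer[i] for i in spec["inputs"]}
    return dependencies


# Hypothetical nodes: each input must match the output name of another node.
nodes = {
    "ads_report": {"inputs": ["proj.dwd_orders"], "outputs": ["proj.ads_report"]},
    "dwd_orders": {"inputs": ["proj.ods_orders"], "outputs": ["proj.dwd_orders"]},
    "ods_orders": {"inputs": [], "outputs": ["proj.ods_orders"]},
}
dependencies = build_dependencies(nodes)
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

In this sample, ods_orders runs first, then dwd_orders, and then ads_report, because each descendant node starts only after its ancestor node has run successfully.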
You can configure scheduling dependencies between nodes by using the methods described in the following table. The configuration logic is the same for the three methods.
Configuration method | Description |
Draw lines on the configuration tab of a workflow to connect nodes to establish scheduling dependencies between nodes | DataWorks automatically adds the default output of an ancestor node as the input of a descendant node. |
Manually add ancestor nodes for a node in the Scheduling Dependencies section | In most cases, you can use this method to modify scheduling dependencies of a node if the scheduling dependencies that are obtained by using the automatic parsing feature do not meet your business requirements. |
Use the automatic parsing feature to configure scheduling dependencies between nodes based on the table lineage | You can configure scheduling dependencies between nodes based on the automatic parsing feature. This feature can automatically parse the table lineage based on the node code and allows you to quickly configure the scheduling dependencies between nodes. |
Configuration methods
Draw lines on the configuration tab of a workflow to connect nodes to establish scheduling dependencies between nodes
Manually add ancestor nodes for a node in the Scheduling Dependencies section
Use the automatic parsing feature to configure scheduling dependencies between nodes based on the table lineage
Subsequent steps: Check whether the scheduling dependencies meet your expectations
After the scheduling dependencies are configured, you can perform the following operations to confirm that nodes are scheduled as expected:
Commit nodes: Check whether changes to scheduling dependencies between nodes meet your expectations when you commit the nodes.
Confirm the scheduling dependencies between nodes in Operation Center: After you deploy the nodes, go to Operation Center and check whether the scheduling dependencies between the auto triggered nodes in the production environment meet your expectations. Auto triggered nodes in the production environment always reflect the latest deployed status, whereas the scheduling dependencies between the instances generated for the nodes depend on the Instance Generation Mode parameter.