Configure same-cycle scheduling dependencies - DataWorks - Alibaba Cloud Documentation Center

When same-cycle scheduling dependencies are configured for a node, the instance generated for the node in the current cycle depends on the data from the instance generated for another node in the same scheduling cycle, and the current node can run only after that instance is successfully run. Configure same-cycle scheduling dependencies if the current node depends on the data in a table that is generated by another node in the same scheduling cycle. DataWorks provides several methods to configure same-cycle scheduling dependencies and a dependency preview feature that you can use to find and correct unexpected scheduling dependencies early, so that nodes are scheduled as expected. This topic describes the precautions, logic, and methods to configure same-cycle scheduling dependencies.

Precautions

To ensure smooth configuration of scheduling dependencies, you must understand the information that is described in Scheduling dependency configuration guide.

In the directed acyclic graph (DAG) of a node, same-cycle scheduling dependencies for the node are presented as solid lines.
If same-cycle scheduling dependencies between nodes cannot meet your requirements in specific complex scenarios, you can configure cross-cycle scheduling dependencies between the nodes. For example, if a node scheduled by day depends on a node scheduled by hour, the instance generated for the node scheduled by day depends on all instances generated for the node scheduled by hour on the current day by default. If you configure the self-dependency for the node scheduled by hour, you can specify that the node scheduled by day depends on the instance generated for the node scheduled by hour in a specific scheduling cycle. For more information, see Principles and samples of scheduling configurations in complex dependency scenarios.
To prevent delayed scheduling of nodes in the production environment because of unexpected scheduling dependencies, we recommend that you use the preview feature to check whether the scheduling dependencies for instances generated for the nodes meet your expectations before you deploy the nodes. For more information, see Preview scheduling dependencies of a node.
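The default behavior described above, in which the instance of a node scheduled by day depends on all instances that a node scheduled by hour generates on the same day, can be sketched as follows. This is an illustrative model only, not DataWorks code: the function names are hypothetical, and the sketch assumes that the hourly node runs once per hour from 00:00 to 23:00.

```python
from datetime import datetime, timedelta


def hourly_instances(day: datetime, interval_hours: int = 1) -> list:
    """Scheduled times of the hourly node's instances on the given day.

    Assumption: the hourly node starts at 00:00 and runs every `interval_hours`.
    """
    start = datetime(day.year, day.month, day.day)
    return [start + timedelta(hours=h) for h in range(0, 24, interval_hours)]


def daily_instance_dependencies(day: datetime, interval_hours: int = 1) -> list:
    """Default same-cycle behavior: the daily instance waits for every
    hourly instance generated on the same day."""
    return hourly_instances(day, interval_hours)


deps = daily_instance_dependencies(datetime(2024, 1, 1))
print(len(deps), deps[0], deps[-1])
# 24 2024-01-01 00:00:00 2024-01-01 23:00:00
```

The sketch covers only the default case. Narrowing the dependency to one hourly instance requires the self-dependency configuration described in the referenced topic.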

Configuration principles

To improve node development efficiency, we recommend that you use the automatic parsing feature to quickly configure scheduling dependencies for nodes. You must abide by the following principles during the development process:

Node creation: We recommend that you specify a node name that is the same as the name of the output table of the node.
Code development: Do not use multiple nodes to write data to the same table.
Dependency configuration: We recommend that you use the table generated by a node as the output of the node.

Configuration entry point and description

To configure scheduling dependencies for a node, go to the configuration tab of the node in DataStudio, click the Properties tab in the right-side navigation pane, and find the Dependencies section of the Properties tab.

When you configure same-cycle scheduling dependencies for a node, you must specify the nodes on which the current node needs to depend in the Parent Nodes section and specify other nodes that need to depend on the current node based on the output of the current node in the Output Name of Current Node section.

Note

By default, you can configure scheduling dependencies for a node based on the lineage between the table from which you want to read data and the table to which you want to write data in the node code. When you commit the node, DataWorks checks whether the scheduling dependencies are configured as expected. For more information, see Confirm the lineage of a table. You can specify whether to perform automatic parsing for code before you commit the desired node based on your business requirements. For more information, see Configure scheduling settings.
DataWorks supports multiple configuration methods, such as configuration based on the lineage in the code of a node, configuration by drawing lines on the configuration tab of a workflow, and manual configuration. You can select a configuration method based on your business requirements. For more information, see the Configuration methods section in this topic.
If the instance generated for a node in the current cycle needs to depend on the data of an instance generated for another node on the previous day or if the instance generated for a node scheduled by hour or minute in the current cycle needs to depend on the instance generated for the same node in the previous cycle, you can configure cross-cycle scheduling dependencies. For more information, see Configure cross-cycle scheduling dependencies.
For frequently asked questions about the configuration of scheduling dependencies and best practices for configuring scheduling dependencies between nodes that belong to different workspaces or workflows, see Appendix 1: FAQ and Appendix 2: Best practices.

Ancestor nodes

You can specify the nodes on which the current node depends. After the nodes are specified, the current node can start to run only after the ancestor nodes are successfully run. You must enter the output of an ancestor node as the input of the current node. Take note of the following items when you specify the nodes on which the current node depends:

You must configure ancestor nodes for all nodes. We recommend that you configure ancestor nodes for a node based on the table lineage. If no table lineage exists, you can select the root node or zero load node of a workspace as the ancestor node for the current node based on your business requirements. For more information, see Scheduling dependency configuration guide.
Make sure that ancestor nodes are committed. If an error indicating that the output of an ancestor node does not exist is reported when you commit the current node, check whether the ancestor node is committed.

The following figure shows all methods that you can use to specify an ancestor node for the current node.

When you use the automatic recommendation method to specify the nodes on which the current node depends, make sure that the recommended nodes are committed and deployed to the production environment and generate the desired table. A node must have been committed to the scheduling system on the previous day so that the automatic recommendation feature can identify it after data is generated on the current day. As a result, automatically recommended nodes are updated with a delay of one day.

Output of the current node

You can configure the output of a node to establish scheduling dependencies between the current node and other nodes. Other nodes find the current node by searching for its output name, and the scheduling dependency configurations then specify the current node as the ancestor node of those nodes. If the current node is configured as an ancestor node of a descendant node, the name of the descendant node appears with the output of the current node after the descendant node is committed. DataWorks does not allow you to manually modify the descendant node in the Output Name of Current Node section of the current node. The following figure shows all methods of specifying the output of the current node.

Important

In DataWorks, the name of the output generated for a node is the same as the name of the node. If a workspace contains nodes that have the same name, the nodes may fail to be committed due to the duplicate output names. If you remove the output of a node that has descendant nodes, the descendant nodes may be severely affected. For more information, see Appendix 3: Impacts of the removal or modification of the output of a node.

Configuration logic

To configure scheduling dependencies between nodes, you use the output of a node as the input of another node. This way, scheduling dependencies between the nodes are formed. We recommend that you configure scheduling dependencies between nodes based on the lineage between the table from which you want to read data and the table to which you want to write data. After the scheduling dependencies are configured, the descendant node can start to run only after the ancestor node is successfully run. Scheduling dependencies help ensure that a node can obtain the required data for its running from its ancestor nodes. For information about how to confirm a table lineage, see Confirm the lineage of a table.
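The matching of one node's output to another node's input can be illustrated with a small model. This is a minimal sketch, not the DataWorks implementation: the Node class, the resolve_ancestors function, and the demo_project names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Output names of ancestor nodes that this node consumes.
    inputs: list = field(default_factory=list)
    # Names under which other nodes can find this node.
    outputs: list = field(default_factory=list)


def resolve_ancestors(nodes: list) -> dict:
    """Map each node name to the names of the nodes whose outputs it uses."""
    producer = {out: n.name for n in nodes for out in n.outputs}
    resolved = {}
    for n in nodes:
        missing = [i for i in n.inputs if i not in producer]
        if missing:
            # Mirrors the situation in which the ancestor node is not committed.
            raise LookupError(f"{n.name}: ancestor output does not exist: {missing}")
        resolved[n.name] = [producer[i] for i in n.inputs]
    return resolved


nodes = [
    Node("ods_orders", outputs=["demo_project.ods_orders"]),
    Node("dwd_orders",
         inputs=["demo_project.ods_orders"],
         outputs=["demo_project.dwd_orders"]),
]
ancestors = resolve_ancestors(nodes)
print(ancestors)
# {'ods_orders': [], 'dwd_orders': ['ods_orders']}
```

In this model, a dependency exists only when an input string exactly matches an output string of another node. This is why output names have to be unique and why a missing ancestor output blocks the descendant node.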

You can configure scheduling dependencies between nodes by using the methods described in the following table. The configuration logic is the same for the three methods.

Configuration method	Description
Draw lines on the configuration tab of a workflow to connect nodes to establish scheduling dependencies between nodes	DataWorks automatically adds the output whose name ends with _out of an ancestor node as the input of the descendant node.
Use the automatic parsing feature to configure scheduling dependencies between nodes based on the table lineage	You can configure scheduling dependencies between nodes based on the automatic parsing feature. This feature can automatically parse the table lineage based on the node code and allows you to quickly configure the scheduling dependencies between nodes.
Manually add ancestor nodes for a node in the Dependencies section	In most cases, you can use this method to modify scheduling dependencies of a node if the scheduling dependencies that are obtained by using the automatic parsing feature do not meet your business requirements.

Configuration methods

Draw lines on the configuration tab of a workflow to connect nodes to establish scheduling dependencies between nodes

If you configure scheduling dependencies between nodes by drawing lines to connect the nodes on the configuration tab of a workflow, DataWorks automatically adds the output whose name ends with _out of an ancestor node as the input of the descendant node.

Note

If the connection lines for scheduling dependencies are removed from the configuration tab of the workflow, the scheduling dependencies are also removed from the node scheduling configurations.

Manually add ancestor nodes for a node in the Dependencies section

In the Parent Nodes section, you can enter the output of a node to add the node as an ancestor node of the current node. The output name is in the projectname.tablename format.

Use the automatic parsing feature to configure scheduling dependencies between nodes based on the table lineage

DataWorks allows you to quickly configure scheduling dependencies between nodes by using the table lineage in the node code. If you enable the automatic parsing feature, the system names the table generated by a node in the projectname.tablename format and uses the table as the output of the node. The system also adds the table that the node queries as the input of the node. For example, if a table is specified in the SELECT statement in the code of a node, the system adds the table to Parent Nodes for the node based on the automatic parsing feature. If a table is specified in the INSERT statement in the code of a node, the system adds the table to Outputs for the node based on the automatic parsing feature. For information about the keywords that are supported by the automatic parsing feature for different types of nodes, see Support for the automatic parsing feature.
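The idea behind the automatic parsing feature can be approximated with a short sketch: tables that are read become inputs, and tables that are written become outputs, both in the projectname.tablename format. The regular expressions below are a deliberate simplification and an assumption of this example. They do not reflect the parser that DataWorks actually uses or the full set of supported keywords, and the parse_lineage function and demo_project name are hypothetical.

```python
import re

# Simplified patterns (assumptions for this sketch only):
#   written table: INSERT INTO|OVERWRITE [TABLE] <name>
#   read table:    FROM|JOIN <name>
_OUTPUT_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
_INPUT_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def parse_lineage(sql: str, project: str) -> tuple:
    """Return (inputs, outputs) as sorted lists of projectname.tablename names."""

    def qualify(table: str) -> str:
        # Prefix the project name unless the table is already qualified.
        return table if "." in table else f"{project}.{table}"

    inputs = sorted({qualify(t) for t in _INPUT_RE.findall(sql)})
    outputs = sorted({qualify(t) for t in _OUTPUT_RE.findall(sql)})
    return inputs, outputs


sql = """
INSERT OVERWRITE TABLE dwd_orders
SELECT o.id, u.name
FROM ods_orders o
JOIN ods_users u ON o.user_id = u.id;
"""
inputs, outputs = parse_lineage(sql, "demo_project")
print(inputs)   # ['demo_project.ods_orders', 'demo_project.ods_users']
print(outputs)  # ['demo_project.dwd_orders']
```

In this example, the two tables that are read would be added to Parent Nodes, and the table that is written would be added as the output of the node.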

The following sections describe how to configure scheduling dependencies between nodes based on the automatic parsing feature and how to modify the parsing results:

Configure scheduling dependencies between nodes based on the automatic parsing feature
The following figures show how scheduling dependencies are configured between nodes based on the automatic parsing feature.

Modify the scheduling dependencies that are obtained based on the automatic parsing feature

If the scheduling dependencies that are configured between nodes based on the automatic parsing feature do not meet your expectations, or if a scheduling dependency cannot be configured for a table (for example, a table whose data is not generated based on periodic scheduling), you can refer to the following table to modify the scheduling dependencies that are obtained based on the automatic parsing feature.

Scenario and method	Description
Use code to remove the input and output of a node when the automatic parsing feature is enabled.	Run a command in the code of a node to remove or add the input and output of the node, and parse the input and output of the node again. After the input and output of the node are removed or added, comments are automatically added to automatic parsing results. Commands: --@exclude_input= (remove the input of a node), --@exclude_output= (remove the output of a node), --@extra_output= (add the output of a node), --@extra_input= (add the input of a node).
Modify the input and output of a node that are obtained based on the automatic parsing feature when the automatic parsing feature is disabled.	DataWorks does not allow you to remove the output of a node that has descendant nodes. If you remove the output of the node, an error occurs when the descendant nodes are run or the descendant nodes cannot obtain data. In this case, we recommend that you adjust the downstream business logic. You can remove the dependency on the current ancestor node for the descendant nodes before you remove the output of the node.
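The effect of the four comment commands on a parsing result can be sketched as follows. This example assumes that the value after the equal sign is a name in the projectname.tablename format. The source lists only the command names, so treat that value format, the apply_annotations function, and the table names as assumptions of this sketch rather than documented behavior.

```python
import re

# Matches lines such as: --@exclude_input=demo_project.tmp_lookup
_ANNOTATION_RE = re.compile(
    r"^\s*--@(exclude_input|exclude_output|extra_input|extra_output)=(\S+)\s*$",
    re.MULTILINE,
)


def apply_annotations(code: str, inputs: list, outputs: list) -> tuple:
    """Adjust automatically parsed inputs and outputs based on comment commands."""
    ins, outs = set(inputs), set(outputs)
    for kind, name in _ANNOTATION_RE.findall(code):
        if kind == "exclude_input":
            ins.discard(name)
        elif kind == "exclude_output":
            outs.discard(name)
        elif kind == "extra_input":
            ins.add(name)
        else:  # extra_output
            outs.add(name)
    return sorted(ins), sorted(outs)


code = """--@exclude_input=demo_project.tmp_lookup
--@extra_output=demo_project.dwd_orders_v2
INSERT OVERWRITE TABLE dwd_orders
SELECT o.id FROM ods_orders o JOIN tmp_lookup t ON o.id = t.id;
"""
parsed_inputs = ["demo_project.ods_orders", "demo_project.tmp_lookup"]
parsed_outputs = ["demo_project.dwd_orders"]
final_inputs, final_outputs = apply_annotations(code, parsed_inputs, parsed_outputs)
print(final_inputs)   # ['demo_project.ods_orders']
print(final_outputs)  # ['demo_project.dwd_orders', 'demo_project.dwd_orders_v2']
```

The commands act as overrides on top of the parsed lineage, so the node code stays unchanged while the dependency set is corrected.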

Nodes that do not support the automatic parsing feature
Temporary tables in DataWorks, which are defined on the Table Management tab and are in a fixed format, such as tables prefixed with t_, are not used as the output or input of the current node based on the automatic parsing feature.
Precautions for using the automatic parsing feature
When you use the automatic parsing feature to configure scheduling dependencies, you must make sure that the output name of a node is unique in the current region. Take note of the following items when you use the automatic parsing feature in node development in DataWorks:
- Node creation: In DataWorks, the name of the output generated for a node is the same as the name of the node. If a workspace contains nodes that have the same name, you must manually change the output name of one of the nodes.
- Code development: The automatic parsing feature uses the table generated by a node as the output of the node. In a workspace, if two nodes are used to write data to the same table, an error may be reported for one of the nodes in the automatic parsing scenario. For more information, see Can multiple nodes have the same output name?
- Dependency configuration: In scenarios in which an SQL node is used to process the table that is generated by a batch synchronization node, you can use the automatic parsing feature to configure scheduling dependencies for the batch synchronization node based on the lineage. You must manually configure the output table of the batch synchronization node as the output of the node, or use the name of the output table of the batch synchronization node as the name of the batch synchronization node. Otherwise, the following error may be reported when you commit the SQL node that depends on the batch synchronization node: The output that is named in the ${projectname.tablename} format for the ancestor node does not exist. The current node cannot be committed. Make sure that the ancestor node whose output name is in the ${projectname.tablename} format is committed. In DataWorks, the name of the output generated for a node is the same as the name of the node.
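The uniqueness requirement for output names can be checked with a simple grouping step before nodes are committed. The following sketch is illustrative only: the find_duplicate_outputs function and the node and table names are hypothetical, and DataWorks performs its own check when you commit a node.

```python
from collections import defaultdict


def find_duplicate_outputs(node_outputs: dict) -> dict:
    """Return each output name that is declared by more than one node."""
    owners = defaultdict(list)
    for node, outputs in node_outputs.items():
        for out in outputs:
            owners[out].append(node)
    return {out: nodes for out, nodes in owners.items() if len(nodes) > 1}


conflicts = find_duplicate_outputs({
    "load_orders_a": ["demo_project.dwd_orders"],
    "load_orders_b": ["demo_project.dwd_orders"],  # writes to the same table
    "load_users": ["demo_project.dwd_users"],
})
print(conflicts)
# {'demo_project.dwd_orders': ['load_orders_a', 'load_orders_b']}
```

A non-empty result corresponds to the case in which two nodes write to the same table or share a name, and one of the output names has to be changed manually.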

Subsequent steps: Check whether the scheduling dependencies meet your expectations

After the scheduling dependencies are configured, you can perform the following operations to confirm that nodes are scheduled as expected:

Preview scheduling dependencies: Prevent delayed scheduling of nodes in the production environment because of unexpected scheduling dependencies.
Commit nodes: Check whether changes to scheduling dependencies between nodes meet your expectations when you commit the nodes.
Confirm the scheduling dependencies between nodes in Operation Center: After you deploy the nodes, check whether the scheduling dependencies between the auto triggered nodes in the production environment meet your expectations in Operation Center. Auto triggered nodes in the production environment are the nodes in the latest status. Scheduling dependencies between instances generated for the nodes are relevant to the Instance Generation Mode parameter.

For more information, see Confirm the scheduling dependencies.

Appendix 1: FAQ

For more frequently asked questions, see Scheduling dependencies.

Appendix 2: Best practices

For information about how to configure scheduling dependencies between nodes that belong to different workspaces or between nodes that belong to different workflows in the same workspace, see Scenario 3: Configure dependencies for nodes across workflows or workspaces.

Appendix 3: Impacts of the removal or modification of the output of a node

If you change the table generated by a node, the output of the node is changed accordingly. You can also directly modify the output of a node. Take note of the following items when you remove or modify the output of a node:

The table generated by a node is not affected if you remove the output of the node.
If you remove or modify the output of a node that has descendant nodes, the descendant nodes may be severely affected.
- Removal of the table generated by a node: If the table generated by a node is removed, the descendant nodes of the node may become isolated nodes and cannot be scheduled, or downstream data may be affected due to missing scheduling dependencies for the descendant nodes.
- Modification of the table generated by a node: You can transfer the table generated by a node to other nodes. For more information, see Impacts of removing or changing the output of a node.
When you delete the output of a node that has descendant nodes, we recommend that you notify the owners of descendant nodes of the deletion in advance and ask the owners to adjust the scheduling dependencies of the descendant nodes at the earliest opportunity to prevent the descendant nodes from becoming isolated nodes.
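Before you delete an output, you can list the descendant nodes that would be affected so that you can notify their owners. The following sketch models this check under two stated assumptions: each node is described only by the list of output names that it uses as inputs, and a node counts as isolated when the removed output is its only input. The function names and sample names are hypothetical.

```python
def affected_descendants(node_inputs: dict, removed_output: str) -> list:
    """Nodes that use the removed output as one of their inputs."""
    return sorted(n for n, ins in node_inputs.items() if removed_output in ins)


def would_be_isolated(node_inputs: dict, removed_output: str) -> list:
    """Affected nodes that would have no ancestor left after the removal."""
    return sorted(
        n for n, ins in node_inputs.items() if set(ins) == {removed_output}
    )


node_inputs = {
    "dwd_orders": ["demo_project.ods_orders"],
    "dws_report": ["demo_project.ods_orders", "demo_project.dwd_users"],
    "dwd_users": ["demo_project.ods_users"],
}
affected = affected_descendants(node_inputs, "demo_project.ods_orders")
isolated = would_be_isolated(node_inputs, "demo_project.ods_orders")
print(affected)  # ['dwd_orders', 'dws_report']
print(isolated)  # ['dwd_orders']
```

In this example, dws_report keeps one ancestor but loses a dependency that it may still need for correct data, which matches the second risk described above.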