Scenario 1: Configure scheduling dependencies for batch synchronization nodes in a workflow - DataWorks

The automatic parsing feature cannot add scheduling dependencies to batch synchronization nodes in a workflow. If a node depends on the table generated by its ancestor batch synchronization node, you must manually add the table to the output of the batch synchronization node. This way, when the node queries the table data, the automatic parsing feature can match the table to the batch synchronization node that generates it.

Error message

If you do not manually add the table generated by a batch synchronization node to the output of that node, the system cannot use the automatic parsing feature to find the batch synchronization node. In this case, when you commit an SQL node that depends on the output of the batch synchronization node, the error message shown in the following figure appears.

An error that is reported for a batch synchronization node

This error occurs because the system cannot find the batch synchronization node on which the SQL node depends based on the upstream dependency that is automatically parsed. For more information, see Error analysis. To prevent this error, we recommend that you use one of the following methods to configure scheduling dependencies for a batch synchronization node:

Method 1: Manually add the table generated by a batch synchronization node to its output
Method 2: Keep the name of a batch synchronization node the same as that of its generated table

Method 1: Manually add the table generated by a batch synchronization node to its output

To prevent the preceding error, manually add the automatically parsed upstream dependency, which is the table name in the projectname.tablename format, to the output of the batch synchronization node. You can perform this operation on the Properties panel of the batch synchronization node. The following figure shows an example.

Manually add the table

Method 2: Keep the name of a batch synchronization node the same as that of its generated table

The preceding descriptions lead to the following conclusions:

When you create a batch synchronization node, the system automatically generates an output named in the projectname.nodename format for the node.
When the SQL node uses the generated table of the batch synchronization node, the system automatically generates a dependent ancestor node named in the projectname.tablename format for the SQL node.
To prevent errors, you must make sure that the name of the dependent ancestor node is the same as that of the output of the batch synchronization node.

Therefore, to prevent the preceding error when you commit the SQL node, you must keep the name of the batch synchronization node the same as that of the table generated by the batch synchronization node.

Note: When you create a node, the system automatically generates an output named in the projectname.nodename format for the node. If you change the name of the node after the node is created, the name of the output does not change. Therefore, this method works only if you choose the matching name when you create the batch synchronization node. If you later rename the node or the table generated by the batch synchronization node to make the names match, the error persists.
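
The naming rule behind Method 2 can be illustrated with a minimal Python sketch. The helper names `default_output`, `parsed_dependency`, and `names_match` are hypothetical and are not part of any DataWorks API. They only mimic the two name formats described above, using the project name doctest from the example in Error analysis.

```python
def default_output(project: str, node_name: str) -> str:
    """Output name generated when a node is created: projectname.nodename."""
    return f"{project}.{node_name}"


def parsed_dependency(project: str, table_name: str) -> str:
    """Ancestor name derived from the SQL code: projectname.tablename."""
    return f"{project}.{table_name}"


def names_match(project: str, node_name: str, table_name: str) -> bool:
    """True if the parsed dependency equals the node's default output."""
    return default_output(project, node_name) == parsed_dependency(project, table_name)


if __name__ == "__main__":
    # Node user_1 generates table_1: the two names differ.
    print(names_match("doctest", "user_1", "table_1"))
    # Node created with the same name as its table: the two names are equal.
    print(names_match("doctest", "table_1", "table_1"))
```

Because the default output is fixed at creation time, the comparison reflects the name that the node had when it was created, not a name assigned later.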

Error analysis

The following figure shows the nodes and scheduling dependencies configured in a workflow that contains batch synchronization nodes.


Step No.	Detailed step	Configured scheduling dependency
1	Create nodes in the workflow based on the planning of the workflow. In the preceding figure, virtual nodes, batch synchronization nodes, and MaxCompute nodes are created.	After the nodes are created in DataWorks, the system automatically generates two outputs for each node. One is named in the `projectname.nodename` format, and the name of the other is suffixed with `_out`. For example, the preceding figure shows that after the batch synchronization node user_1 is created, the system automatically generates two outputs for the node: `*******_out` and `doctest.user_1`.
2	Connect the nodes by drawing lines based on the planning of the workflow to determine the dependency relationships of the nodes.	After the nodes are connected, the system automatically adds dependency configurations for each ancestor node based on the connections. For example, after you connect the nodes, the MaxCompute node sql_1 in the preceding figure becomes a descendant node of the batch synchronization node user_1. In this case, the system automatically configures an output named `*******_out` of user_1 as a dependent ancestor node of sql_1.
3	Develop task code for each node.	When you develop task code for each node, the system automatically parses some input and output commands in the code and adds the corresponding output or dependent ancestor node for each node. For example, the MaxCompute node sql_1 needs to use the data in the `table_1` table generated by the batch synchronization node user_1, and the task code of sql_1 contains statements such as `select * from table_1`. In this case, the system automatically adds a dependent ancestor node for sql_1. The output of the automatically added ancestor node is named in the `projectname.tablename` format. In this example, the output name is `doctest.table_1`.
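
The parsing described in step 3 can be approximated with a toy Python sketch. This is not how DataWorks parses code. The function name `parsed_ancestor_outputs` and the regular expression are assumptions made for illustration, and the sketch handles only unqualified table names that follow FROM or JOIN.

```python
import re

# Toy pattern (an assumption): captures the identifier after FROM or JOIN.
_TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def parsed_ancestor_outputs(project: str, sql: str) -> list[str]:
    """Return projectname.tablename names for the tables that the SQL reads."""
    names: list[str] = []
    for table in _TABLE_REF.findall(sql):
        name = f"{project}.{table}"
        if name not in names:  # keep first occurrence only, preserve order
            names.append(name)
    return names


if __name__ == "__main__":
    print(parsed_ancestor_outputs("doctest", "select * from table_1"))
```

For the statement in the example, the sketch yields the single name `doctest.table_1`, which is the name that the system then tries to match against the outputs of existing nodes.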

After you perform the preceding operations, the table generated by the batch synchronization node is still not part of the output of that node, because the system does not add it automatically. If you overlook this, an error is reported when you commit a node in the workflow. The error indicates that the output name of the dependent ancestor node does not exist.

An error that is reported for a batch synchronization node

This error occurs for the following reasons:

The batch synchronization node user_1 does not support automatic parsing. Therefore, the table_1 table generated by user_1 is not automatically added to the output of user_1. This indicates that user_1 does not have an output named doctest.table_1.
The system automatically adds a dependent ancestor node named in the projectname.tablename format for the descendant node sql_1. In this example, the name of the dependent ancestor node is doctest.table_1. However, the system does not add doctest.table_1 to the output of user_1. Therefore, the system cannot find the ID of user_1.
When you commit sql_1, the system detects that sql_1 has an upstream dependency doctest.table_1. However, the system cannot associate the upstream dependency with the ID of an ancestor node and reports an error indicating that the output name of the dependent ancestor node of sql_1 does not exist.
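
The failed lookup and the effect of Method 1 can be sketched in Python as follows. The function `find_ancestor` and the dictionary layout are hypothetical and do not reflect the internal data model of DataWorks. The second automatically generated output of user_1, whose name ends with _out, is left out because its full name is masked in the example.

```python
from typing import Dict, Optional, Set


def find_ancestor(dependency: str, outputs: Dict[str, Set[str]]) -> Optional[str]:
    """Return the node whose outputs contain the dependency, or None."""
    for node, names in outputs.items():
        if dependency in names:
            return node
    return None


if __name__ == "__main__":
    # Outputs registered for user_1 when the node is created.
    outputs = {"user_1": {"doctest.user_1"}}

    # sql_1 depends on doctest.table_1, which no node declares as an output.
    print(find_ancestor("doctest.table_1", outputs))

    # Method 1: manually add the table to the output of user_1.
    outputs["user_1"].add("doctest.table_1")
    print(find_ancestor("doctest.table_1", outputs))
```

Before the table is added, the lookup finds no node, which corresponds to the error reported at commit time. After the table is added, the dependency resolves to user_1.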