To ensure that business data is produced effectively and on time, you must configure correct dependencies for nodes. After you configure correct dependencies for the current node, you do not need to track which node generates the table data that the current node requires. DataWorks automatically parses the node dependencies that you configured. You can configure dependencies in the following ways:
- Manually connect nodes in the workflow editing panel.
- Use the automatic parsing feature.
- Add custom dependencies.
You can combine the preceding methods to configure dependencies for the same node. For more information about how to configure dependencies, see Best practices for setting scheduling dependencies.
Manually connect nodes in the workflow editing panel
- Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
- On the DataStudio page that appears, click the Data Analytics tab. Click Business Flow and double-click the target workflow. The workflow editing panel appears. For more information about how to create a workflow, see Create a workflow.
- Draw lines to connect the nodes in the workflow.
- Double-click the first node. On the node configuration tab that appears, click Properties on the right side. In the Properties pane that appears, configure the parent node for the first node. You can click Use Root Node in the Dependencies section to specify the root node of the workspace as the parent node. You can also select the output name of a node from the Parent Nodes drop-down list to specify the node as the parent node.
- After the configuration is complete, click the icon in the toolbar.
After you connect the nodes in the workflow editing panel, DataWorks automatically uses the output name, which contains _out, of a node as the input name of its child node.
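The way connected nodes inherit input names can be sketched in a few lines of Python. This is only an illustration of the rule above, not DataWorks code: the node names, the `demo_project` prefix, and the exact shape of the `_out` output names are hypothetical placeholders.

```python
from collections import defaultdict


def derive_input_names(edges, output_names):
    """Map each child node to the output names of its parent nodes."""
    inputs = defaultdict(list)
    for parent, child in edges:
        # The parent's output name becomes an input name of the child.
        inputs[child].append(output_names[parent])
    return dict(inputs)


# Hypothetical workflow: names below are placeholders, not real DataWorks values.
output_names = {
    "ods_load": "demo_project.ods_load_out",
    "dwd_clean": "demo_project.dwd_clean_out",
    "ads_report": "demo_project.ads_report_out",
}
# Each tuple is one line drawn in the panel: (parent, child).
edges = [("ods_load", "dwd_clean"), ("dwd_clean", "ads_report")]

inputs = derive_input_names(edges, output_names)
print(inputs)
```

Each drawn line contributes exactly one input name to the child node, so a node with several parents ends up with several input names.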
Use the automatic parsing feature
If you set the Auto Parse parameter to Yes for a node, DataWorks automatically parses the node dependencies.
Automatic parsing
DataWorks parses dependencies from the node code based on the following rules:
- When a node executes the INSERT or CREATE statement to add a record to a table or create a table, DataWorks uses the table name as the output name of the node. The table name is in the format project_name.table_name.
- When a node executes the SELECT statement to query data in a table, DataWorks uses the table name as the output name of the parent node.
For example, a node runs the following statement:
insert overwrite table table_a select * from project_b_name.table_b;
DataWorks determines that the node depends on the table_b table and generates the table_a table. In this case, the output name of the parent node is project_b_name.table_b and that of the current node is project_name.table_a.
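The parsing rules above can be approximated with a small Python sketch. It is a simplified illustration under stated assumptions, not the parser that DataWorks uses: the regular expressions cover only plain INSERT, CREATE TABLE, FROM, and JOIN clauses, and the function names and the `default_project` parameter are hypothetical.

```python
import re

# Simplified patterns; a real SQL parser handles far more syntax than this.
OUTPUT_RE = re.compile(
    r"\b(?:insert\s+(?:overwrite|into)\s+table"
    r"|create\s+table(?:\s+if\s+not\s+exists)?)\s+([\w.]+)",
    re.IGNORECASE,
)
INPUT_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def qualify(table, default_project):
    """Prefix the table name with the project if the code omits it."""
    return table if "." in table else f"{default_project}.{table}"


def parse_dependencies(sql, default_project):
    """Return (input names, output names) in project_name.table_name format."""
    outputs = [qualify(t, default_project) for t in OUTPUT_RE.findall(sql)]
    inputs = [qualify(t, default_project) for t in INPUT_RE.findall(sql)]
    return inputs, outputs


sql = "insert overwrite table table_a select * from project_b_name.table_b;"
inputs, outputs = parse_dependencies(sql, default_project="project_name")
print(inputs)   # ['project_b_name.table_b']
print(outputs)  # ['project_name.table_a']
```

The written table becomes the output name of the current node, and the queried table becomes the name that is matched against the output name of a parent node.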
The automatic parsing feature provides the following benefits:
- Automatically parses the parent node of a node based on the code logic. Assume that you specify project_name.tablename as the output name of node A. If node B needs to query data in this table, you can directly specify the table name without knowing which node generates the table data. You can use the automatic parsing feature to configure node dependencies when you do not know the specific node that generates the required table data in the current workspace.
- Avoids incorrect lineage.
- Automatically parses the dependencies between nodes in different workspaces that belong to the same region.
Lineage
- You ignore the error message indicating incorrect lineage when you commit a node.
When child node A, for which Auto Parse is set to Yes, queries the table data of node B, DataWorks uses the table name of node B as the input name of node A. However, if you did not specify the table name as the output name of node B when you configured node B, an error message appears when you commit node A, indicating that the output name of the parent node does not exist.
- The Auto Parse parameter is set to No for a node.
When incorrect lineage is detected, an error message appears indicating that the input name of the node does not match the lineage specified in the code. This means that the output name of the parent node is different from the name of the table to query in the code.
Assume that you specify the xc_DPE_E2.xc_ods_user_info_d and xc_DPE_E2.xc_ods_log_info_d tables to query in the code of node A. However, you do not specify the node that generates the table data as the parent node of node A. When you commit node A, an error message appears indicating that the input name of the node does not match the lineage specified in the code.
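The mismatch described above amounts to comparing two sets of names. The following Python sketch shows that comparison under simple assumptions: the function name `check_lineage` and the message text are hypothetical, and the check that DataWorks runs at commit time may differ.

```python
def check_lineage(tables_in_code, declared_inputs):
    """Return the tables queried in the code that no declared input name covers."""
    declared = set(declared_inputs)
    return [t for t in tables_in_code if t not in declared]


# Tables queried in the code of node A, taken from the example above.
tables_in_code = [
    "xc_DPE_E2.xc_ods_user_info_d",
    "xc_DPE_E2.xc_ods_log_info_d",
]
# No parent node was configured, so node A has no declared input names.
declared_inputs = []

missing = check_lineage(tables_in_code, declared_inputs)
if missing:
    print("Input names do not match the lineage in the code:", missing)
```

Once both table names are declared as input names, the function returns an empty list and the node would pass this check.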
- Whenever possible, do not set Auto Parse to No or change the output names of nodes. When a node executes the SELECT statement to query data in a table, DataWorks uses the table name as the output name of the parent node. When a node executes the INSERT or CREATE statement to add a record to a table or create a table, DataWorks uses the table name as the output name of the node.
- When you commit a node, if an error message appears indicating that the specified table or the node that generates the table data cannot be found due to incorrect lineage, add custom dependencies.
Add or delete custom dependencies
You can add a custom dependency for a node by manually specifying the output name of another node as the input name of the current node.
When incorrect lineage is detected from the code of a node, you can add a custom dependency for the node. We recommend that you specify the correct lineage in the code of nodes to reduce custom dependencies. You can add or delete custom dependencies in the following ways:
- Use the automatic recommendation feature to specify the parent node for a node.
- Specify the parent node for a node by entering the output name of the parent node.
- Add or delete the input or output name of a node to add or delete dependencies.
- Use the automatic recommendation feature to specify the parent node for a node.
If you set the Auto Parse parameter to No for node A, you can use the automatic recommendation feature to specify the parent node for node A.
The automatic recommendation feature automatically parses all nodes that have been committed and published to the production environment and have generated data in the table that node A needs to query.
Note: The automatic recommendation feature only parses the nodes that have been committed and published to the production environment and have generated data in the target table. Therefore, certain nodes may be parsed one day later.
- Specify the parent node for a node by entering the output name of the parent node.
On the Properties tab of a node, you can select an output name, that is, a table name, from the drop-down list in the Dependencies section. Then, DataWorks uses the node that generates data in the table as the parent node of the current node.
- Add or delete the input or output name of a node to add or delete dependencies.
Assume that the code of node A references the output name of node B, but node A does not need to depend on node B. You can right-click the output name in the code and select Delete Input from the right-click menu to delete the input name. You can also select Add Input, Add Output, or Delete Output from the right-click menu to add an input name, add an output name, or delete an output name, respectively. You can also add a rule as a comment at the top of the code, as shown in the following figure. DataWorks then skips parsing the dependency based on the rule.
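The combined effect of parsed input names and manual adjustments can be modeled as a small lookup. The Python sketch below is an assumption-laden illustration, not DataWorks behavior: the `registry` mapping from output names to nodes, the function `resolve_parents`, and all node and table names are hypothetical.

```python
def resolve_parents(parsed_inputs, registry, added=(), deleted=()):
    """Apply manual Add Input / Delete Input changes, then look up parent nodes."""
    removed = set(deleted)
    # Drop input names that were deleted manually.
    effective = [name for name in parsed_inputs if name not in removed]
    # Append input names that were added manually, skipping duplicates.
    for name in added:
        if name not in effective:
            effective.append(name)

    parents, unresolved = [], []
    for name in effective:
        if name in registry:
            parents.append(registry[name])   # node that owns this output name
        else:
            unresolved.append(name)          # no node declares this output name
    return parents, unresolved


# Hypothetical mapping from output names to the nodes that declare them.
registry = {
    "demo_project.table_b": "node_B",
    "demo_project.table_c": "node_C",
}
parsed_inputs = ["demo_project.table_b"]  # parsed from the code of node A

parents, unresolved = resolve_parents(
    parsed_inputs,
    registry,
    added=["demo_project.table_c"],
    deleted=["demo_project.table_b"],
)
print(parents, unresolved)
```

In this model, deleting an input name removes the dependency on node B, and adding one creates a dependency on node C. Any name left in `unresolved` corresponds to the case where no parent node generates the table.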