To ensure that business data is produced correctly and on time, you must configure correct dependencies for nodes. After you configure correct dependencies for the current node, you do not need to track which node generates the table data that the current node requires. DataWorks automatically parses the node dependencies that you configured.

You can use the following three methods to configure node dependencies:
  • Manually connect nodes in the workflow editing panel.
  • Use the automatic parsing feature.
  • Add custom dependencies.

You can combine the preceding methods to configure dependencies for the same node. For more information about how to configure dependencies, see Best practices for setting scheduling dependencies.

Note We recommend that only one node generate the data of a table, and that you give the node the same name as the table.

Manually connect nodes in the workflow editing panel

Note In the editing panel of a workflow, you can connect only nodes that belong to that workflow.
  1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
  2. On the DataStudio page that appears, click the Data Analytics tab. Click Business Flow and double-click the target workflow. The workflow editing panel appears. For more information about how to create a workflow, see Create a workflow.
  3. Draw lines to connect the nodes in the workflow.
  4. Double-click the first node. On the node configuration tab that appears, click Properties on the right side. In the Properties pane that appears, configure the parent node for the first node. You can click Use Root Node in the Dependencies section to specify the root node of the workspace as the parent node. You can also select the output name of a node from the Parent Nodes drop-down list to specify the node as the parent node.
  5. After you complete the configuration, click the Commit icon in the toolbar.

After you connect the nodes in the workflow editing panel, DataWorks automatically uses the output name of each node, which contains _out, as the input name of its child node.

Use the automatic parsing feature

If you set the Auto Parse parameter to Yes for a node, DataWorks automatically parses the node dependencies.

Automatic parsing

DataWorks can automatically parse the output names of a node and its parent node based on the lineage parsed from the code. We recommend that you set the Auto Parse parameter to Yes for a node so that DataWorks can automatically parse the node dependencies.
  • When a node executes an INSERT statement to write data to a table or a CREATE statement to create a table, DataWorks uses the table name as the output name of the node. The table name is in the format project_name.table_name.
  • When a node executes a SELECT statement to query data in a table, DataWorks uses the table name as the output name of the parent node.
Note By default, a table whose name starts with t_ is recognized as a temporary table. Temporary tables are excluded from the automatic parsing process. If a table whose name starts with t_ is not a temporary table, contact the workspace administrator to modify the table property on the Workspace Management page.
Assume that an ODPS SQL node executes the following SQL statement:
insert overwrite table table_a select * from project_b_name.table_b;

DataWorks determines that the node depends on the table_b table and generates the table_a table. In this case, the output name of the parent node is project_b_name.table_b and that of the current node is project_name.table_a.
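The parsing rules above can be illustrated with a minimal Python sketch. It is not the parser that DataWorks uses. The function names, the regular expressions, and the `default_project` parameter are assumptions made for illustration, and the patterns cover only simple INSERT, CREATE TABLE, FROM, and JOIN clauses.

```python
import re

# Hypothetical helpers for illustration only; not a DataWorks API.
TEMP_PREFIX = "t_"  # default prefix of temporary tables, as described in the note above

_OUTPUT_RE = re.compile(
    r"\b(?:insert\s+(?:overwrite|into)\s+table"
    r"|create\s+table(?:\s+if\s+not\s+exists)?)\s+([\w.]+)",
    re.IGNORECASE,
)
_INPUT_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def qualify(table, default_project):
    """Return the name in project_name.table_name form."""
    return table if "." in table else f"{default_project}.{table}"


def is_temporary(table):
    """True if the unqualified table name starts with the temporary prefix."""
    return table.split(".")[-1].startswith(TEMP_PREFIX)


def parse_dependencies(sql, default_project):
    """Return (parent_output_names, own_output_names) for one SQL statement."""
    outputs = {qualify(t, default_project)
               for t in _OUTPUT_RE.findall(sql) if not is_temporary(t)}
    inputs = {qualify(t, default_project)
              for t in _INPUT_RE.findall(sql) if not is_temporary(t)}
    return sorted(inputs), sorted(outputs)


sql = "insert overwrite table table_a select * from project_b_name.table_b;"
parents, outputs = parse_dependencies(sql, default_project="project_name")
print(parents)  # ['project_b_name.table_b']
print(outputs)  # ['project_name.table_a']
```

The sketch qualifies an unqualified table name with the project of the current node, which matches the project_name.table_a output name in the example above.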

The automatic parsing feature provides the following benefits:
  • Automatically parses the parent node of a node based on the code logic. Assume that you specify project_name.tablename as the output name of node A. If node B needs to query data in this table, you can reference the table name directly in the code of node B without knowing which node generates the table data.

    You can use the automatic parsing feature to configure node dependencies when you do not know the specific node that generates the required table data in the current workspace.

  • Avoids incorrect lineage.
  • Automatically parses the dependencies between nodes in different workspaces that belong to the same region.

Lineage

Incorrect lineage has the following causes:
  • You ignore the error message indicating incorrect lineage when you commit a node.

    Assume that Auto Parse is set to Yes for child node A and that node A queries a table generated by node B. DataWorks uses the table name as the input name of node A. If you did not specify the table name as the output name of node B when you configured node B, an error message indicating that the output name of the parent node does not exist appears when you commit node A.

  • The Auto Parse parameter is set to No for a node.

When incorrect lineage is detected, an error message appears indicating that the input name of the node does not match the lineage specified in the code. This means that the output name of the parent node is different from the name of the table to query in the code.

Assume that you specify the xc_DPE_E2.xc_ods_user_info_d and xc_DPE_E2.xc_ods_log_info_d tables to query in the code of node A. However, you do not specify the node that generates the table data as the parent node of node A. When you commit node A, an error message appears indicating that the input name of the node does not match the lineage specified in the code.
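The mismatch described above can be pictured as a comparison between two sets of names: the tables that the code queries and the input names configured for the node. The following Python sketch shows that comparison under this assumption. The function `check_lineage` and its result keys are hypothetical and are not part of DataWorks.

```python
def check_lineage(tables_in_code, configured_inputs):
    """Compare the tables queried in code with the configured input names."""
    code = set(tables_in_code)
    configured = set(configured_inputs)
    return {
        # Queried in the code, but no parent output name is configured for them.
        "missing_parents": sorted(code - configured),
        # Configured as input names, but never queried in the code.
        "unused_inputs": sorted(configured - code),
    }


tables = ["xc_DPE_E2.xc_ods_user_info_d", "xc_DPE_E2.xc_ods_log_info_d"]
report = check_lineage(tables, configured_inputs=[])
print(report["missing_parents"])
# ['xc_DPE_E2.xc_ods_log_info_d', 'xc_DPE_E2.xc_ods_user_info_d']
```

In this sketch, a non-empty `missing_parents` list corresponds to the case in which committing node A reports that the input name does not match the lineage in the code.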

You can use the following methods to avoid incorrect lineage:
  • Avoid setting Auto Parse to No or changing the output names of nodes. When a node executes a SELECT statement to query data in a table, DataWorks uses the table name as the output name of the parent node. When a node executes an INSERT statement to write data to a table or a CREATE statement to create a table, DataWorks uses the table name as the output name of the node.
  • When you commit a node, if an error message appears indicating that the specified table or the node that generates the table data cannot be found due to incorrect lineage, add custom dependencies.

Add or delete custom dependencies

You can add a custom dependency for a node by manually specifying the output name of another node as the input name of the current node.
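Matching an input name to the node that declares the same output name can be sketched as a dictionary lookup. The following Python code is a simplified model and not the DataWorks implementation. The node names and the functions `build_output_index` and `resolve_parents` are hypothetical.

```python
from collections import defaultdict


def build_output_index(node_outputs):
    """Map each output name to the list of nodes that declare it."""
    index = defaultdict(list)
    for node, outputs in node_outputs.items():
        for name in outputs:
            index[name].append(node)
    return index


def resolve_parents(input_names, index):
    """Return the parent nodes for the given input names.

    Raises LookupError if no node declares an input name as its output name.
    """
    parents = []
    for name in input_names:
        producers = index.get(name, [])
        if not producers:
            raise LookupError(f"output name {name!r} of the parent node does not exist")
        parents.extend(producers)
    return parents


# Hypothetical nodes and their declared output names.
index = build_output_index({
    "node_b": ["project_name.table_b"],
    "node_c": ["project_name.table_c"],
})
print(resolve_parents(["project_name.table_b"], index))  # ['node_b']
```

If several nodes declare the same output name, this sketch returns all of them as parents, which is one reason to let only one node generate the data of a table.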

When incorrect lineage is detected from the code of a node, you can add a custom dependency for the node. We recommend that you specify the correct lineage in the code of nodes to reduce custom dependencies.

You can use the following methods to add or delete custom dependencies:
  • Use the automatic recommendation feature to specify the parent node for a node.
  • Specify the parent node for a node by entering the output name of the parent node.
  • Add or delete the input or output name of a node to add or delete dependencies.
Details are as follows:
  • Use the automatic recommendation feature to specify the parent node for a node.

    If you set the Auto Parse parameter to No for node A, you can use the automatic recommendation feature to specify the parent node for node A.

    The automatic recommendation feature parses all nodes that have been committed and published to the production environment and that have generated data in the table that node A needs to query.
    Note Because the automatic recommendation feature parses only such nodes, a recently published node may not be parsed until one day later.
  • Specify the parent node for a node by entering the output name of the parent node.

    On the Properties tab of a node, you can select an output name, that is, a table name, from the Parent Nodes drop-down list in the Dependencies section. Then, DataWorks uses the node that generates data in the table as the parent node of the current node.

  • Add or delete the input or output name of a node to add or delete dependencies.

    Assume that the code of node A contains the output name of node B, but node A does not need to depend on node B. You can right-click the output name in the code and select Delete Input from the right-click menu to delete the input name. You can also select Add Input, Add Output, or Delete Output from the right-click menu to add an input name, add an output name, or delete an output name, respectively. Alternatively, you can add a rule as a comment at the top of the code, as shown in the following figure. DataWorks then skips parsing the dependency based on the rule.

Note We recommend that you specify the correct lineage in the code of nodes to reduce custom dependencies. To guarantee that complete business data can be generated, you must configure node dependencies based on the business scenario and scheduling parameters. Use custom dependencies with caution.
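The effect of adding or deleting input names on top of the automatically parsed dependencies can be modeled as set operations. The following Python sketch shows this model. The function `apply_custom_dependencies` and the table names are hypothetical, and the sketch does not reproduce the comment rule syntax that DataWorks accepts.

```python
def apply_custom_dependencies(parsed_inputs, add_inputs=(), delete_inputs=()):
    """Return the effective input names after manual additions and deletions."""
    effective = set(parsed_inputs) | set(add_inputs)
    effective -= set(delete_inputs)
    return sorted(effective)


# Input names parsed from the code of node A (hypothetical).
parsed = ["project_name.table_b", "project_name.table_c"]

effective = apply_custom_dependencies(
    parsed,
    add_inputs=["project_name.table_d"],     # Add Input
    delete_inputs=["project_name.table_b"],  # Delete Input
)
print(effective)  # ['project_name.table_c', 'project_name.table_d']
```

In this model a deletion takes precedence over an addition of the same name. That ordering is a choice made for the sketch, not documented DataWorks behavior.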