Scheduling dependencies are the foundation for building orderly workflows. You must configure correct dependencies between nodes to make sure that business data is produced effectively and in a timely manner. This helps standardize data development activities.

DataWorks allows you to automatically parse node dependencies from the code or manually customize node dependencies. You can configure correct relationships between ancestor and descendant nodes and monitor the running status of nodes to ensure the orderly production of business data.

After node dependencies are configured, you can check the data output time of the tables that are queried by SQL statements, and use the node status to check whether data is properly produced by an ancestor node.

You can set the output of an ancestor node as the input of a descendant node to configure a dependency between the two nodes.

DataWorks allows you to configure dependencies in one of the following modes: automatic recommendation, automatic parsing, and custom configuration. For more information about dependency configuration examples, see Best practices for setting scheduling dependencies.

Regardless of the dependency configuration mode, the overall scheduling logic is that a descendant node can run only after its ancestor nodes finish running. Therefore, each node in a workflow must have at least one parent node. The dependencies between parent nodes and child nodes are the core of scheduling dependencies. The following sections describe the principles and configuration methods of scheduling dependencies in detail.

Note If a workspace is created before January 10, 2019, DataWorks may generate incorrect data for nodes in the workspace. If such an error occurs, submit a ticket to fix the problem. If a workspace is created after January 10, 2019, DataWorks can correctly generate data for nodes in the workspace.

Standardized data development scenarios

  • Before configuring scheduling dependencies, you need to understand the following basic concepts:
    • DataWorks node: the object that performs operations on data. For more information, see Concepts.
    • Output name: the default output name that DataWorks assigns to each node. Each output name ends with _out. You can also customize the output name, but make sure that each node output name is unique for a tenant. For more information, see Concepts.
    • Output table: the table whose name is used in the INSERT or CREATE statement in the SQL statements of a node.
    • Input table: the table whose name follows the FROM keyword in the SQL statements of a node.
    • SQL statement: an SQL statement that is executed on MaxCompute.

    In practice, a DataWorks node can contain one or more SQL statements.
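    For example, in the following minimal sketch (the table names are hypothetical), dwd_log_d is the output table because it is used in the INSERT statement, and ods_raw_log_d is the input table because it follows the FROM keyword:

    ```sql
    -- dwd_log_d is the output table: it is used in the INSERT statement.
    INSERT OVERWRITE TABLE dwd_log_d PARTITION (dt='${bdp.system.bizdate}')
    SELECT uid, url
    -- ods_raw_log_d is the input table: it follows the FROM keyword.
    FROM ods_raw_log_d
    WHERE dt = '${bdp.system.bizdate}';
    ```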

    Ancestor and descendant nodes are associated by output names. The root node of a workspace, which is named Workspace name_root, can be set as the topmost node.

  • Principles of standardized data development

    In a standardized data development process, multiple ancestor and descendant SQL nodes are created. We recommend that you follow these principles:

    • The input table of a descendant node must be the output table of its ancestor node.
    • One node produces only one table.
    • We recommend that the node name be the same as the output table name.

    This allows you to use the automatic parsing feature to quickly configure complex dependencies when a workflow contains many nodes.

  • Example of a standardized data development process
    The following section describes three example nodes and their code:
    • ods_log_info_d node: The input data of this node comes from the ods_raw_log_d table, and the data is exported to the ods_log_info_d table. The code of this node is as follows:
      INSERT OVERWRITE TABLE ods_log_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ……  -- Your SELECT operation.
      FROM (
        SELECT ……  -- Your SELECT operation.
        FROM ods_raw_log_d
        WHERE dt = '${bdp.system.bizdate}'
      ) a;
    • dw_user_info_all_d node: The input data of this node comes from the ods_user_info_d and ods_log_info_d tables, and the data is exported to the dw_user_info_all_d table. The code of this node is as follows:
      INSERT OVERWRITE TABLE dw_user_info_all_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ……  -- Your SELECT operation.
      FROM (
        SELECT *
        FROM ods_log_info_d
        WHERE dt = '${bdp.system.bizdate}'
      ) a
      LEFT OUTER JOIN (
        SELECT *
        FROM ods_user_info_d
        WHERE dt = '${bdp.system.bizdate}'
      ) b
      ON a.uid = b.uid;
    • rpt_user_info_d node: The input data of this node comes from the dw_user_info_all_d table, and the data is exported to the rpt_user_info_d table. The code of this node is as follows:
      INSERT OVERWRITE TABLE rpt_user_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ……  -- Your SELECT operation.
      FROM dw_user_info_all_d
      WHERE dt = '${bdp.system.bizdate}'
      GROUP BY uid;

Parent nodes

In the Dependencies section of a node, you must specify an ancestor node as the parent node on which the current node depends. You must enter the output name of the ancestor node, not the name of the ancestor node itself. A node can have multiple output names; enter the output name that you need. You can search for an output name of the ancestor node to be added, or click Parse I/O to obtain the output name from the lineage that is parsed from the code. For more information, see Lineage.

Note You must enter an output name or output table name to search for the ancestor node.

If you enter an output name to search for the ancestor node, DataWorks searches for the output name among the output names of nodes that have been committed to the scheduling system.

  • Search by entering an output name

    You can enter an output name to search for the ancestor node and configure the node as the parent node of the current node to create a dependency.

  • Search by entering an output table name
    When you use this method, make sure that the output table name that you enter is the name of the table used in the INSERT or CREATE statement of the ancestor node, in the format of Workspace name.Table name. Such output names can be automatically parsed.
    After you click the Submit icon for such a node, the output name of the node can be found when you enter an output table name to search for the ancestor node for other nodes.

Outputs

You can click the Properties tab in the right-side navigation pane to view and configure the output of the current node.

DataWorks assigns a default output name that ends with _out to each node. You can also customize an output name, or click Parse I/O to obtain the output name from the lineage that is parsed from the code.
Note The output name of a node must be globally unique for your Alibaba Cloud account.

Automatic dependency parsing

DataWorks can parse dependencies from the SQL statements of a node. A table that is queried in a SELECT statement is parsed as an output of the parent node, and a table that is written in an INSERT statement is parsed as an output of the current node. The parsed output names of the parent node and the current node are as follows:
  • Output name of the parent node: Workspace name.Table name that is queried in the SELECT statement
  • Output name of the current node:
    • Workspace name.Table name used in the INSERT statement
    • Workspace name.Table name used in the CREATE statement for temporary tables

If multiple INSERT and FROM clauses are used, multiple output names are automatically parsed for the parent node and the current node.
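For example, the following sketch uses the MaxCompute multi-insert syntax (the table names are hypothetical). Two output names are parsed for the current node, Workspace name.dws_pv_d and Workspace name.dws_uv_d, and one output name is parsed for the parent node, Workspace name.dwd_log_d:

```sql
-- The FROM table dwd_log_d is parsed as an output of the parent node.
FROM dwd_log_d
-- Each INSERT table is parsed as an output of the current node.
INSERT OVERWRITE TABLE dws_pv_d PARTITION (dt='${bdp.system.bizdate}')
SELECT uid, url
INSERT OVERWRITE TABLE dws_uv_d PARTITION (dt='${bdp.system.bizdate}')
SELECT DISTINCT uid;
```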

If you have created multiple nodes with dependencies and all input tables of descendant nodes are the output tables of ancestor nodes, you can use the automatic parsing feature to quickly configure dependencies for the entire workflow.

  • To make the nodes more flexible, we recommend that a node produce only one table. This simplifies dependency settings among nodes in a workflow.
  • If a table in the SQL statements of a node is both an output table and a referenced table on which another node depends, the table is parsed only as an output table.
  • If a table in the SQL statements of a node is used as an output table or a referenced table multiple times, only one scheduling dependency is parsed.
  • If the SQL code contains a temporary table whose name starts with t_, the table is not involved in a scheduling dependency. For more information about temporary tables, see Workspace settings.
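For example, in the following sketch (the table names are hypothetical), the temporary table t_log_tmp is skipped during parsing, so only ods_raw_log_d and dwd_log_d are involved in scheduling dependencies:

```sql
-- t_log_tmp starts with t_, so it is treated as a temporary table
-- and is not parsed into a scheduling dependency.
CREATE TABLE t_log_tmp AS
SELECT uid, url
FROM ods_raw_log_d
WHERE dt = '${bdp.system.bizdate}';

-- Only ods_raw_log_d (input) and dwd_log_d (output) generate dependencies.
INSERT OVERWRITE TABLE dwd_log_d PARTITION (dt='${bdp.system.bizdate}')
SELECT uid, url
FROM t_log_tmp;
```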
When automatic parsing is enabled, you can add an input or output to have specific table names in SQL statements parsed into input or output names, or delete an input or output to prevent table names from being parsed.

Select and right-click a table name in the code editor, and then select Add Input, Add Output, Delete Input, or Delete Output to modify the dependencies. This method applies to all table names in SQL statements. If you select Add Input, the table name is parsed as an output name of the parent node. If you select Add Output, the table name is parsed as an output name of the current node. If you select Delete Input or Delete Output, the table name is not parsed.

In addition to right-clicking table names in SQL statements, you can add comments to modify the dependencies. The comment format is as follows:
--@extra_input=Table name --Add an input.
--@extra_output=Table name --Add an output.
--@exclude_input=Table name --Delete an input.
--@exclude_output=Table name --Delete an output.
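For example, the following sketch (the table names are hypothetical) adds a dependency on a table that is not queried in a FROM clause and removes the dependency on a static dimension table:

```sql
--@extra_input=ods_user_info_d --Add ods_user_info_d as an input even though it is not queried in a FROM clause.
--@exclude_input=dim_region --dim_region is a static dimension table; do not depend on it.
INSERT OVERWRITE TABLE dwd_log_d PARTITION (dt='${bdp.system.bizdate}')
SELECT a.uid, a.url, b.region_name
FROM ods_raw_log_d a
JOIN dim_region b ON a.region_id = b.region_id
WHERE a.dt = '${bdp.system.bizdate}';
```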

Custom dependencies

When dependencies between nodes cannot be accurately parsed from the lineage of the current node, you can set Auto Parse to No and customize dependencies.

When Auto Parse is set to No, you can click Recommendation to enable automatic recommendation of ancestor dependencies. The system recommends all other SQL nodes that generate the input table of the current node based on the SQL lineage of the current workspace. You can select one or more nodes from the recommended nodes and configure them as the parent nodes of the current node.

Note A recommended node must have been committed to the scheduling system on the previous day. The node can be recognized by the automatic recommendation feature only after its lineage data is generated on the current day.
Common scenarios:
  • The input table of the current node is different from the output table of the ancestor node.
  • The output table of the current node is different from the input table of the descendant node.
In custom mode, you can configure dependencies in the following ways:
  • Manually add ancestor nodes
    1. Create three nodes in a workflow. By default, DataWorks assigns an output name for each of them.
    2. Configure the topmost node task_1 to depend on the root node of the workspace, and click the Save icon.
    3. Configure task_2 to depend on the output name of task_1, and click the Save icon.
    4. Configure task_3 to depend on the output name of task_2, and click the Save icon.
    5. After the configuration is complete, click the Submit icon to check whether the dependencies are correct. If the nodes can be committed, the dependencies are correct.
  • Draw lines between nodes to configure dependencies
    1. Create three nodes in a workflow. Configure the topmost node task_1 to depend on the root node of the workspace, and click the Save icon.
    2. Draw lines between the nodes on the workflow canvas to configure dependencies.
    3. View the dependency between task_2 and task_3. The output name of the parent node is automatically generated.
    4. After the configuration is complete, click the Submit icon to check whether the dependencies are correct. If the nodes can be committed, the dependencies are correct.

Advanced configuration

Click the Properties tab in the right-side navigation pane. On the tab that appears, click Advanced in the Dependencies section to configure cross-region node dependencies.

For example, if the current node is in the China (Shanghai) region, you can configure the node to depend on an ancestor node in the China (Beijing) region.

Note
  • Each time you configure a cross-region dependent node, two instances are generated to send Push and Check messages, respectively. These two instances consume your scheduling resources.
  • Currently, you cannot pass parameter values to cross-region dependent nodes, and retroactive data cannot be generated for these nodes.
To configure cross-region dependencies, follow these steps:
  1. Double-click a node. On the node configuration tab that appears, click the Properties tab in the right-side navigation pane.
  2. Click Advanced in the Dependencies section.
  3. In the Configure Cross-Region Node Dependency dialog box that appears, set Region and Parent Node.
    Note Make sure that you have selected the name of the correct workspace in which the node to be added resides.
  4. After the configuration is complete, click Add.

Cross-workspace dependencies

DataWorks allows you to configure dependencies across workspaces in the same region. The configuration method is the same as that of dependencies between nodes in the same workspace.
Note You cannot configure a node in a standard workspace created earlier to depend on a node in a basic workspace. To fix this problem, submit a ticket.