Scheduling dependencies are fundamental to building orderly workflows. You must configure correct dependencies between nodes to ensure that business data is produced on time and as expected. This also helps standardize data development scenarios.

DataWorks allows you to configure node dependencies by using the automatic parsing feature or customizing node dependencies. You can configure correct relationships between ancestor and descendant nodes and monitor the running status of nodes to ensure the orderly production of business data.

Configuring dependencies between nodes serves two purposes: it verifies the data output time of the tables that are queried by SQL statements, and it allows you to check, based on node status, whether an ancestor node has produced correct data.

You can set the output of an ancestor node as the input of a descendant node to configure a dependency between the two nodes.

DataWorks allows you to configure dependencies in one of the following modes: automatic parsing or custom configuration. For more information about how to configure dependencies, see Best practice to configure scheduling dependencies.

Regardless of how dependencies are configured, the overall scheduling logic is that descendant nodes can be run only after ancestor nodes are run. Therefore, each node in a workflow must have at least one parent node. The dependencies between parent nodes and child nodes are the core of scheduling dependencies. The following sections describe the principles and configuration methods of scheduling dependencies in detail.
Note If a workspace was created before January 10, 2019, some data generated in the workspace may be invalid. In this case, submit a ticket to resolve the issue. Workspaces created after January 10, 2019 are not affected.

Standardized data development scenarios

  • Before you configure scheduling dependencies, you must understand the following basic concepts:
    • DataWorks node: the object that performs operations on data. For more information, see Basic concepts.
    • Output name: The system assigns a default output name that ends with _out to each node. You can also customize the output name, but make sure that the output name of each node is unique for a tenant. For more information, see Basic concepts.
    • Output table: the table whose name is used in the INSERT or CREATE statement in the SQL statements of a node.
    • Input table: the table whose name follows the FROM keyword in the SQL statements of a node.
    • SQL statement: a MaxCompute SQL statement.

    In practice, a DataWorks node can contain a single SQL statement or multiple SQL statements.

    All ancestor and descendant nodes are associated by using output names. The root node of a workspace, which is named Project name_root, can be set as the topmost node.

  • Principles of standardized data development
    In a standardized data development process, multiple SQL nodes are created and linked to each other as ancestor and descendant nodes. We recommend that you follow these principles:
    • The input table of a descendant node must be the output table of its ancestor node.
    • One node produces only one table.
    • We recommend that the node name be the same as the output table name.

    If you follow these principles, you can use the automatic parsing feature to configure complex dependencies even when a workflow contains a large number of nodes.

  • Example of a standardized data development process
    The following section describes the nodes in the preceding figure and their code:
    • The input data of the ods_log_info_d node comes from the ods_raw_log_d table, and the data is exported to the ods_log_info_d table. The following code is used for the ods_log_info_d node:
      INSERT OVERWRITE TABLE ods_log_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ……  -- Your SELECT operation.
      FROM (
        SELECT ……  -- Your SELECT operation.
        FROM ods_raw_log_d
        WHERE dt = '${bdp.system.bizdate}'
      ) a;
    • The input data of the dw_user_info_all_d node comes from the ods_user_info_d and ods_log_info_d tables, and the data is exported to the dw_user_info_all_d table. The following code is used for the dw_user_info_all_d node:
      INSERT OVERWRITE TABLE dw_user_info_all_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ……  -- Your SELECT operation.
      FROM (
        SELECT *
        FROM ods_log_info_d
        WHERE dt = '${bdp.system.bizdate}'
      ) a
      LEFT OUTER JOIN (
        SELECT *
        FROM ods_user_info_d
        WHERE dt = '${bdp.system.bizdate}'
      ) b
      ON a.uid = b.uid;
    • The input data of the rpt_user_info_d node comes from the dw_user_info_all_d table, and the data is exported to the rpt_user_info_d table. The following code is used for the rpt_user_info_d node:
      INSERT OVERWRITE TABLE rpt_user_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ……  -- Your SELECT operation.
      FROM dw_user_info_all_d
      WHERE dt = '${bdp.system.bizdate}'
      GROUP BY uid;
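
The code above implies the following dependency chain, in which the input table of each node is the output table of its ancestor node (a sketch; the ods_raw_log_d and ods_user_info_d tables are assumed to be produced by upstream synchronization nodes):

```
ods_raw_log_d ──▶ ods_log_info_d ──┐
                                   ├──▶ dw_user_info_all_d ──▶ rpt_user_info_d
ods_user_info_d ───────────────────┘
```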

Ancestor node

An ancestor node is a parent node on which the current node depends. When you configure an ancestor node, you must enter the output name of the ancestor node, not the node name. A node may have multiple output names; enter the appropriate one. You can search for the ancestor node by entering an output name, or obtain the output name from the lineage that is parsed from the SQL code of the node. For more information, see Lineage.
Note You must enter an output name or output table name to search for the ancestor node.

If you enter an output name to search for the ancestor node, the search engine searches for the output name among the output names of nodes that have been committed to the scheduling system.

  • Search by entering the output name of the parent node

    You can enter an output name to search for a node and configure the node as the parent node of the current node to create a dependency.

  • Search by entering the output table name of the parent node
    When you use this method, make sure that the output table name that you enter is the table name used in the INSERT or CREATE statement of the parent node, in the Project name.Table name format. The output name can then be automatically parsed.
    After you click the Submit icon, other nodes can find the output name by entering the output table name.

Output of the current node

The Outputs parameter specifies the output of the current node. You can click the Properties tab in the right-side navigation pane to view and configure the output of the current node.

The system assigns a default output name that ends with _out to each node. You can also customize an output name or use the automatic parsing feature to obtain the output name.
Note The output name of a node must be globally unique for your Alibaba Cloud account.

Automatic dependency parsing

DataWorks can parse different dependencies based on the SQL statements of a node to obtain the following output names of the parent node and the current node:
  • Output name of the parent node: Project name.Table name used in the INSERT statement
  • Output name of the current node:
    • Project name.Table name used in the INSERT statement
    • Project name.Table name used in the CREATE statement for temporary tables
Automatic dependency parsing is based on the following principles:
  • A table in a FROM clause is parsed as the output of a parent node on which the current node depends.
  • A table in an INSERT statement is parsed as the output of the current node.

If multiple INSERT and FROM clauses are used, multiple output and input names are automatically parsed.
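
For example, the following sketch contains one FROM clause and two INSERT statements, so one input name and two output names are parsed (the project name my_project and all table names are hypothetical):

```sql
-- Parsed input (output name of the parent node): my_project.ods_trade_d
FROM ods_trade_d
-- Parsed output of the current node: my_project.dw_trade_cnt_d
INSERT OVERWRITE TABLE dw_trade_cnt_d PARTITION (dt='${bdp.system.bizdate}')
SELECT buyer_id, COUNT(*)
WHERE dt = '${bdp.system.bizdate}'
GROUP BY buyer_id
-- Parsed output of the current node: my_project.dw_trade_amt_d
INSERT OVERWRITE TABLE dw_trade_amt_d PARTITION (dt='${bdp.system.bizdate}')
SELECT buyer_id, SUM(amount)
WHERE dt = '${bdp.system.bizdate}'
GROUP BY buyer_id;
```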

If you have created multiple nodes with dependencies and all input tables of descendant nodes are the output tables of ancestor nodes, you can use the automatic parsing feature to configure dependencies for the entire workflow.
  • We recommend that you configure each node to produce only one table. This way, you can configure SQL workflows flexibly and keep the nodes decoupled.
  • If a table in the SQL statements is both an output table and a referenced table on which another node depends, the table is parsed only as an output table.
  • If a table in the SQL statements is used as an output table or a referenced table multiple times, only one scheduling dependency is parsed.
  • If you specify t_ as the prefix of the temporary table name on the Workspace Settings page and the SQL code contains a table whose name starts with t_, the table is not involved in a scheduling dependency. For more information about temporary tables, see Workspace settings.
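
For example, if t_ is configured as the temporary table prefix, the temporary table in the following sketch is excluded from parsing (all table names are hypothetical):

```sql
-- t_tmp_uid matches the configured temporary table prefix t_, so it is
-- parsed as neither an input nor an output of the current node.
CREATE TABLE t_tmp_uid AS
SELECT uid FROM ods_user_info_d WHERE dt = '${bdp.system.bizdate}';

-- Only ods_user_info_d (input) and dw_user_stat_d (output) are parsed.
INSERT OVERWRITE TABLE dw_user_stat_d PARTITION (dt='${bdp.system.bizdate}')
SELECT uid FROM t_tmp_uid;
```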
When automatic parsing is enabled, you can add inputs and outputs so that table names in SQL statements are parsed into input and output names. You can also delete inputs and outputs to prevent table names from being parsed into input and output names.

You can select a table name in the SQL statements and right-click it to add or delete an input or output. This method applies to all table names in SQL statements. If you select Add Input, the table name is parsed as the output name of the parent node. If you select Add Output, the table name is parsed as the output name of the current node. If you select Delete Input or Delete Output, the table name is not parsed.

In addition to selecting and right-clicking characters in SQL statements, you can add the following comments to modify the dependencies:
--@extra_input=Table name --Add an input.
--@extra_output=Table name --Add an output.
--@exclude_input=Table name --Delete an input.
--@exclude_output=Table name --Delete an output.
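
For example, the comments in the following sketch add a table that does not appear in the code as an input, and prevent a joined table from being parsed as an input (all table names are hypothetical):

```sql
--@extra_input=my_project.dim_region_d --Add an input.
--@exclude_input=my_project.ods_config_d --Delete an input.
INSERT OVERWRITE TABLE dw_order_d PARTITION (dt='${bdp.system.bizdate}')
SELECT o.order_id, c.rule_name
FROM ods_order_d o
JOIN ods_config_d c ON o.rule_id = c.rule_id
WHERE o.dt = '${bdp.system.bizdate}';
```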

Custom dependencies

When dependencies between nodes cannot be accurately and automatically parsed from the SQL lineage of the current node, you can set the Auto Parse parameter to No. Then, you can customize dependencies.

When the Auto Parse parameter is set to No, you can click Recommendation to enable automatic recommendation of ancestor dependencies.
The system recommends all other SQL nodes that generate the input table of the current node based on the SQL lineage of the current workspace. You can select one or more nodes from the recommended nodes as needed and configure them as the ancestor nodes of the current node.
Note Only nodes that have been committed and deployed to the production environment and have generated data in the required table can be recommended. Therefore, some nodes may not appear in the recommendation list until one day later.

A recommended node must have been committed to the scheduling system on the previous day. The automatic recommendation feature can recognize the node only after the node generates data on the current day.

Custom dependencies are commonly used in the following scenarios:
  • The input table of the current node is different from the output table of the ancestor node.
  • The output table of the current node is different from the input table of the descendant node.
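
For example, assume that an ancestor node produces the dw_user_info_all_d table, but the current node reads a mirror of that table, dw_user_info_mirror_d, that is synchronized by an external process (all names in this sketch are hypothetical). Automatic parsing cannot find the parent node from the code, so you manually add the output name of the ancestor node as an input:

```sql
-- dw_user_info_mirror_d is not produced by any node in this project, so
-- automatic parsing finds no parent node. Set Auto Parse to No and manually
-- add the output name of the node that produces dw_user_info_all_d (for
-- example, my_project.dw_user_info_all_d) as an input on the Properties tab.
INSERT OVERWRITE TABLE rpt_user_mirror_d PARTITION (dt='${bdp.system.bizdate}')
SELECT uid, gender
FROM dw_user_info_mirror_d
WHERE dt = '${bdp.system.bizdate}';
```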
In custom mode, you can configure dependencies in the following ways:
  • Example of manually adding ancestor nodes
    1. Create three nodes. By default, the system assigns an output name for each of them.
      • The output name of task_1 is workshop_y****.500022365.out.
      • The output name of task_2 is workshop_y****.500022366.out.
      • The output name of task_3 is workshop_y****.500022367.out.
    2. Configure the topmost node task_1 to depend on the root node of the workspace, and click the Save icon in the toolbar.
    3. Configure task_2 to depend on the output name of task_1, and click the Save icon.
    4. Configure task_3 to depend on the output name of task_2, and click the Save icon.
    5. After you configure the dependencies, click the Submit icon to check whether the dependencies are correct. If the nodes can be committed, the dependencies are correct.
  • Example of drawing lines between nodes to configure dependencies
    1. Create three nodes. Configure the topmost node task_1 to depend on the root node of the workspace, and click the Save icon.
    2. Draw lines between the three nodes to connect them.
    3. View the dependency between task_2 and task_3. The output name of the parent node is automatically generated.
    4. After you configure the dependencies, click the Submit icon to check whether the dependencies are correct. If the nodes can be committed, the dependencies are correct.

Dependencies across projects or workspaces

DataWorks allows you to configure dependencies across projects in the same region. The configuration method is the same as that of common dependencies.
Note You cannot configure a standard workspace that was created earlier to depend on a basic workspace. To resolve this issue, submit a ticket.