Scheduling dependencies are the foundation for building orderly business flows. You must configure the dependencies between nodes correctly to ensure that business data is produced effectively and on time, which helps standardize data development scenarios.

DataWorks V2.0 allows you to configure dependencies in any of the following modes: automatic recommendation, automatic parsing, and custom configuration. For more information about dependency configuration examples, see Best practices for setting scheduling dependencies.

Regardless of the dependency configuration mode, the overall scheduling logic is that a descendant node can be scheduled only after its ancestor nodes finish running. Therefore, each node in a workflow must have at least one parent node. The dependencies between parent nodes and child nodes are the core of scheduling dependencies. The following sections describe the principles and configuration methods of scheduling dependencies in detail.
Note

Data problems exist for workspaces created before January 10, 2019. If you are affected, submit a ticket to apply for troubleshooting. Workspaces created on January 10, 2019 or later are not affected.

Standardized data development scenarios

  • Before configuring scheduling dependencies, you need to understand the following basic concepts:
    • DataWorks node: defines the operations performed on data. For more information, see Concepts.
    • Output name: refers to the name of a node output. The system assigns each node a default output name that ends with .out. You can also customize the output name, but make sure that each node output name is unique within your tenant. For more information, see Concepts.
    • Output table: refers to the table following INSERT or CREATE in the SQL statement of a node.
    • Input table: refers to the table following FROM in the SQL statement of a node.
    • SQL statement: refers to MaxCompute SQL.

    In practice, a DataWorks node can contain a single SQL statement or multiple SQL statements.

    Ancestor and descendant nodes are associated by output names. For the topmost node that you create, you can configure the root node of the workspace (named projectname_root) as its ancestor node.

  • Principles of standardized data development
    In a standardized data development process, multiple ancestor and descendant SQL nodes are created. We recommend that you follow these principles:
    • The input table of a descendant node must be the output table of its ancestor node.
    • A table can be produced by only one node.

    These principles make it possible to quickly configure complex dependencies by using the automatic parsing feature as your workflows grow.

  • Example of a standardized data development process

    Each node and its code in this example are described as follows. The dependencies that automatic parsing derives from this code are summarized after the code blocks.
    • The code of Task_1 is as follows. The input data of this node comes from the ods_raw_log_d table, and the data is written to the ods_log_info_d table.
      INSERT OVERWRITE TABLE ods_log_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ...  -- Your SELECT operation.
      FROM (
        SELECT ...  -- Your SELECT operation.
        FROM ods_raw_log_d
        WHERE dt = ${bdp.system.bizdate}
      ) a;
    • The code of Task_2 is as follows. The input data of this node comes from the ods_user_info_d and ods_log_info_d tables, and the data is written to the dw_user_info_all_d table.
      INSERT OVERWRITE TABLE dw_user_info_all_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ...  -- Your SELECT operation.
      FROM (
        SELECT *
        FROM ods_log_info_d
        WHERE dt = ${bdp.system.bizdate}
      ) a
      LEFT OUTER JOIN (
        SELECT *
        FROM ods_user_info_d
        WHERE dt = ${bdp.system.bizdate}
      ) b
      ON a.uid = b.uid;
    • The code of Task_3 is as follows. The input data of this node comes from the dw_user_info_all_d table, and the data is written to the rpt_user_info_d table.
      INSERT OVERWRITE TABLE rpt_user_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ...  -- Your SELECT operation.
      FROM dw_user_info_all_d
      WHERE dt = ${bdp.system.bizdate}
      GROUP BY uid;
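
    If automatic parsing is enabled for these three nodes, and assuming the workspace is named projectname (a placeholder), the parsed input and output names are as follows:
      Task_1: input projectname.ods_raw_log_d; output projectname.ods_log_info_d
      Task_2: inputs projectname.ods_user_info_d and projectname.ods_log_info_d; output projectname.dw_user_info_all_d
      Task_3: input projectname.dw_user_info_all_d; output projectname.rpt_user_info_d

    Because the output name projectname.ods_log_info_d of Task_1 matches an input of Task_2, and the output name projectname.dw_user_info_all_d of Task_2 matches the input of Task_3, the dependency chain Task_1 -> Task_2 -> Task_3 is configured automatically. The ods_raw_log_d and ods_user_info_d tables are assumed to be produced by other ancestor nodes, such as data synchronization nodes.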

Ancestor node

An ancestor node is a parent node on which the current node depends. You must enter the output name of the ancestor node, rather than the name of the ancestor node itself. (A node may have multiple output names. Enter the one that you need.) You can search for the output name of the ancestor node to be added, or use automatic parsing of the SQL lineage to obtain it.

Note If you use the search method, the system searches among the output names of nodes that have been submitted to the scheduling system.
  • Search by entering the output name of the parent node
    You can search for the output name of a node and configure the node as the ancestor node of the current node to create a dependency.

  • Search by entering the name of the table that corresponds to the output name of the parent node
    When you use this method, make sure that one of the output names of the parent node is the table name that follows INSERT or CREATE in the SQL statement of that node, in the format projectname.table_name. (Such output names can be generated by automatic parsing.)

    After you click Submit, other nodes can find this output name by searching for the table name.
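
    For example (a minimal sketch that reuses tables from the earlier example and assumes a workspace named projectname), if the parent node contains the following statement and is submitted with automatic parsing enabled, its output name is projectname.dw_user_info_all_d, and other nodes can find this output name by searching for dw_user_info_all_d:
      INSERT OVERWRITE TABLE dw_user_info_all_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ...  -- Your SELECT operation.
      FROM ods_log_info_d
      WHERE dt = ${bdp.system.bizdate};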

Current node output

The current node output indicates the output of the current node.

The system assigns a default output name that ends with .out to each node. You can also customize the output name or obtain an output name by automatic parsing.
Note The output name of a node must be globally unique for your Alibaba Cloud account.

Automatic dependency parsing

DataWorks can parse different dependencies based on the actual SQL content of a node. The parsed output names of the parent node and the current node are as follows:
  • Output name of the parent node: projectname.table_name, where table_name is the table name that follows INSERT.
  • Output names of the current node:
    • projectname.table_name, where table_name is the table name that follows INSERT.
    • projectname.table_name, where table_name is the table name that follows CREATE (generally used for temporary tables).
Note If you upgrade from DataWorks V1.0 to DataWorks V2.0, the output name of the current node is projectname.node_name, where node_name is the name of the current node.
If an SQL statement contains multiple INSERT or FROM clauses, multiple output and input names are parsed.
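
For example (a minimal sketch in which the workspace name projectname and the table names ods_a and dws_b are placeholders), automatic parsing of the following node code parses projectname.ods_a as the output name of the parent node on which this node depends, and projectname.dws_b as the output name of the current node:
  INSERT OVERWRITE TABLE dws_b PARTITION (dt='${bdp.system.bizdate}')
  SELECT ...  -- Your SELECT operation.
  FROM ods_a
  WHERE dt = ${bdp.system.bizdate};

This means that the current node is scheduled after the node whose output name is projectname.ods_a, and descendant nodes that read dws_b can depend on projectname.dws_b.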

If you create multiple nodes with dependencies, and all input tables of descendant nodes come from the output tables of ancestor nodes, the automatic parsing feature can be used to quickly configure dependencies for the entire workflow.

Note
  • To keep nodes flexible, we recommend that each node produce only one output table. This makes it easier to assemble and decouple SQL workflows.
  • If a table in an SQL statement is both an output table and a referenced table (a depended table), the table is parsed only as an output table.
  • If a table in an SQL statement is referenced or written multiple times, only one scheduling dependency is parsed.
  • If the SQL code contains a temporary table (for example, if tables whose names start with t_ are specified as temporary tables in the workspace settings), the table is not involved in scheduling dependency parsing.
When automatic parsing is enabled, you can add a table name as an input or output so that it is parsed into an input or output name, or remove it so that it is not parsed.

Right-click a table name in the SQL statement and select Add Input, Add Output, Remove Input, or Remove Output to modify the dependencies. This method applies to all table names in SQL statements. If you select Add Input, the table name is parsed as an input of the current node, that is, as an output name of a parent node. If you select Add Output, the table name is parsed as an output name of the current node. If you select Remove Input or Remove Output, the table name is not parsed.

Note In addition to right-clicking table names in SQL statements, you can add comments to the code to modify the dependencies. The supported comment hints are as follows, and an example of using them appears after this note:
--@extra_input=table name    --Add an input.
--@extra_output=table name   --Add an output.
--@exclude_input=table name  --Remove an input.
--@exclude_output=table name --Remove an output.
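
For example, in the following sketch (source_table, mid_table, and result_table are placeholder names), the comment hints exclude the intermediate table mid_table from dependency parsing, so that only source_table is parsed as an input and result_table as an output:
  --@exclude_input=mid_table   --Do not parse the intermediate table as an input.
  --@exclude_output=mid_table  --Do not parse the intermediate table as an output.
  CREATE TABLE IF NOT EXISTS mid_table AS
  SELECT ...  -- Your SELECT operation.
  FROM source_table
  WHERE dt = ${bdp.system.bizdate};

  INSERT OVERWRITE TABLE result_table PARTITION (dt='${bdp.system.bizdate}')
  SELECT ...  -- Your SELECT operation.
  FROM mid_table;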

Custom dependencies

When the dependencies between nodes cannot be accurately parsed from the SQL lineage, you can set Auto Parse to No and configure the dependencies manually.

When Auto Parse is set to No, you can click Auto Recommendation to enable automatic recommendation of ancestor dependencies. Based on the SQL lineage of the workspace, the system recommends all the other SQL nodes that produce the input tables of the current node. You can select one or more nodes from the recommendation list as needed and configure them as the ancestor nodes of the current node.
Note The nodes to be recommended must have been submitted to the scheduling system on the previous day. They can be recognized by the automatic recommendation feature only after their data is generated on the next day.

Custom configuration is commonly used in the following scenarios:

  • The input table of the current node is not the same as the output table of its ancestor node.
  • The output table of the current node is not the same as the input table of its descendant node.

In custom mode, you can configure dependencies in the following ways:

  • Manually add ancestor nodes
    1. Create three nodes. The system configures an output name for each of them by default.




    2. Configure the topmost node task_1 to depend on the root node of the workspace, and click Save.

    3. Configure task_2 to depend on the output name of task_1, and click Save.

    4. Configure task_3 to depend on the output name of task_2, and click Save.

    5. After the configuration is complete, click Submit to check whether the dependency is correct. If the submission is successful, the dependency configuration is correct.

  • Build dependencies by dragging and dropping
    1. Create three nodes, configure the topmost node task_1 to depend on the root node of the workspace, and click Save.

    2. Drag the three nodes to connect them.

    3. View the dependency configuration of task_2 and task_3. The output name of the parent node is automatically generated.


    4. After the configuration is complete, click Submit to check whether the dependency is correct. If the submission is successful, the dependency configuration is correct.

FAQs

Q: After automatic parsing, the submission fails. The following error message is displayed: The output workshop_yanshi.tb_2 of the parent node does not exist. Submit this node after submitting the parent node.

A: This problem can be caused by either of the following reasons:
  • The ancestor node is not submitted. Submit the ancestor node and try again.
  • The ancestor node is submitted, but the output name of the ancestor node is not workshop_yanshi.tb_2.
Note Usually, the output names of the parent node and the current node are automatically parsed based on the table names that follow INSERT, CREATE, or FROM. Make sure that your configuration is consistent with the rules described in the section "Automatic dependency parsing."

Q: In the output of the current node, the descendant node name and ID are empty and cannot be specified. Why does this happen?

A: If the current node does not have any descendant node, the descendant node name and ID are empty. After a descendant node is configured for the current node, the corresponding content is automatically parsed.

Q: What is the output name of a node used for?

A: The output name of a node is used to establish dependencies with other nodes. If the output name of node A is ABC and node B takes ABC as its input, the dependency is established between nodes A and B.

Q: Can a node have multiple output names?

A: Yes. If a descendant node references an output name of the current node as its parent node output name, the dependency is established between the descendant node and the current node.

Q: Can multiple nodes have the same output name?

A: No. The output name of each node must be unique within your Alibaba Cloud account. If multiple nodes write data to the same MaxCompute table, we recommend that you use the format Table name_Partition ID as the output names of these nodes.

Q: How can I configure no parsing of intermediate tables during automatic dependency parsing?

A: Right-click the name of the intermediate table in the SQL code, select Remove Input or Remove Output, and then run automatic parsing of the input and output again.

Q: How do I configure dependencies for the topmost node?

A: In general, the topmost node depends on the root node of the workspace.

Q: Why do I find a non-existent output name of node B when searching for the ancestor node output name on node A?

A: The search feature works based on the information of submitted nodes. If you delete an output name of node B but do not submit node B to the scheduling system again, the deleted output name of node B can still be found when you search on node A.

Q: How do I implement the node flow of A->B->C once an hour (run node B after node A is completed, and run node C after node B is completed)?

A: Set the output of node A as the input of node B and the output of node B as the input of node C, and set the scheduling periods of nodes A, B, and C to 1 hour.