Scheduling dependencies are the foundation of orderly workflows. To ensure that business data is produced effectively and on time, you must configure correct dependencies between nodes. This topic describes how to configure scheduling dependencies.

Background information

Scheduling dependencies are configured mainly through two items: Parent Nodes and Outputs. To establish a dependency between two nodes, you must configure the output of the ancestor node as a parent node of the descendant node.
Scheduling dependencies can be configured manually or by the system based on the automatic parsing feature. They can also be generated by the system or configured by drawing lines to connect nodes. When you create nodes, the system automatically adds outputs for the nodes. For more information, see the following sections: Configuration based on the automatic parsing feature, Manual configuration, Generation by the system, and Configuration by drawing lines to connect nodes. For more information about the configurations in common scenarios, see Instructions to configure scheduling dependencies in common business scenarios.

After scheduling dependencies are configured for a node, the system checks whether the scheduling dependencies are consistent with the data lineage in the code developed for the node when you commit the node. For more information, see Operations performed by the system after scheduling dependencies are configured.

General principles for configuring scheduling dependencies

This section describes the general principles for configuring scheduling dependencies and the following configuration items:
  • Outputs: the output of the current node. The output name must be unique within your Alibaba Cloud account.
  • Parent Nodes: the ancestor node on which the current node depends. After you configure this parameter, the system can find the ancestor node based on the configured output name or output table name of the ancestor node.

    If you enter an output name to search for the ancestor node, the system searches for the output name among the output names of the nodes that have been committed to the scheduling system.

Regardless of the method that is used to configure scheduling dependencies, the scheduling dependencies must comply with the following principles:
  • A table is generated by only one node, and the table must be configured as the output of the node.
    Note
    • The system automatically adds the table generated by an SQL node to the output of the SQL node based on the automatic parsing feature.
    • You must manually add the table generated by a batch synchronization node to the output of the node.
  • The output of a node must be configured as the input of its descendant node, which forms dependency relationships between nodes.
For more information about the logic of scheduling dependencies, see Logic of scheduling dependencies. The following sections describe the principles and configuration methods for scheduling dependencies in detail.
Note If a workspace is created before January 10, 2019, some of the data generated in the workspace may be invalid. In this case, you must submit a ticket to resolve the issue. If a workspace is created after January 10, 2019, this issue does not occur.
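
The two principles above can be illustrated with a minimal Python sketch. It is not DataWorks code: the build_dependencies function, the dictionary layout, and the demo_ws names are assumptions made for illustration only. The sketch checks that every output has exactly one producing node, and then resolves each input to the node that declares it as an output.

```python
from collections import defaultdict


def build_dependencies(nodes):
    """Resolve each node's ancestors from declared outputs and inputs.

    nodes maps a node name to {"outputs": [...], "inputs": [...]}.
    This layout is hypothetical and only mirrors Outputs and Parent Nodes.
    """
    producers = defaultdict(list)
    for name, cfg in nodes.items():
        for out in cfg["outputs"]:
            producers[out].append(name)

    # Principle 1: a table is generated by only one node.
    conflicts = {out: names for out, names in producers.items() if len(names) > 1}
    if conflicts:
        raise ValueError(f"outputs produced by more than one node: {conflicts}")

    # Principle 2: the output of a node is the input of its descendant node.
    deps = {}
    for name, cfg in nodes.items():
        deps[name] = sorted({producers[i][0] for i in cfg["inputs"] if i in producers})
    return deps


# Hypothetical workspace and table names.
nodes = {
    "ods_orders": {"outputs": ["demo_ws.ods_orders"], "inputs": []},
    "dwd_orders": {"outputs": ["demo_ws.dwd_orders"], "inputs": ["demo_ws.ods_orders"]},
}
deps = build_dependencies(nodes)
```

In this sketch, dwd_orders depends on ods_orders because the input demo_ws.ods_orders matches the output declared by ods_orders. If two nodes declared the same output, the function would raise an error instead of guessing which node to depend on.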

Configuration based on the automatic parsing feature

  • Scenarios

    The system can automatically parse SQL statements in the code developed for a node to obtain the data lineage. Then, the system adds an output or a dependent ancestor node for the node based on the data lineage and automatic parsing feature. The automatic parsing feature is convenient, efficient, and suitable for most scenarios.

  • How it works
    The following figure shows the principles of automatic parsing for dependencies.
    Figure: Configuration based on automatic parsing
    • If the SELECT statement in the code of a node specifies a table, the system adds the table to Parent Nodes for the node based on the automatic parsing feature.
    • If the INSERT statement in the code of a node specifies a table, the system adds the table to Outputs for the node based on the automatic parsing feature.
    If multiple INSERT and SELECT statements are used, multiple output and input names are parsed based on the automatic parsing feature.
  • Configuration principles
    The automatic parsing feature enables the system to automatically identify relationships between nodes and configure scheduling dependencies for nodes. The following list describes the principles of automatic parsing by node type and SQL statement.
    • MaxCompute node
      • SQL statements: CREATE, INSERT
        Configuration based on automatic parsing: If the code developed for the node contains such SQL statements, the system automatically adds an output for the node.
        Configuration principle: The output automatically added by the system is named in the odps_project_name.table_name format. In the preceding format:
        • odps_project_name: the DataWorks workspace to which the node belongs.
        • table_name: the name of the generated table.
      • SQL statement: SELECT
        Configuration based on automatic parsing: If the code developed for the node contains the SQL statement, the system automatically adds a dependent ancestor node for the node.
        Configuration principle: The dependent ancestor node automatically added by the system is named in the project_name.table_name format. In the preceding format:
        • project_name: the name of the workspace to which the node that generates the table belongs.
        • table_name: the name of the generated table.
    • SQL nodes that are not of the MaxCompute type
      • SQL statements: ALTER, CREATE, UPDATE, INSERT
        Configuration based on automatic parsing: If the code developed for the node contains such SQL statements, the system automatically adds an output for the node.
        Configuration principle: The outputs automatically added by the system for different nodes are named in the following formats:
        • E-MapReduce (EMR): workspace_name.db_name.table_name
        • AnalyticDB for PostgreSQL: workspace_name.db_name.schema_name.table_name
        • AnalyticDB for MySQL: workspace_name.db_name.schema_name.table_name
        • Hologres: workspace_name.db_name.schema_name.table_name
        In the preceding formats:
        • workspace_name: the name of the DataWorks workspace to which the node belongs.
        • db_name: the name of the database to which the data is written.
        • schema_name: the name of the schema of the node.
        • table_name: the name of the generated table.
      • SQL statement: SELECT
        Configuration based on automatic parsing: If the code developed for the node contains the SQL statement, the system automatically adds a dependent ancestor node for the node.
        Configuration principle: The dependent ancestor node automatically added by the system is named in the project_name.table_name format. In the preceding format:
        • project_name: the name of the workspace to which the node that generates the table belongs.
        • table_name: the name of the generated table.
    • Batch synchronization nodes
      Batch synchronization nodes do not support the automatic parsing feature. You must manually configure scheduling dependencies for such nodes.
  • Precautions
    • Requirements for node development
      Automatic parsing enables the system to automatically identify the scheduling dependencies of a node based on the task code that you developed for the node. Therefore, we recommend that you strictly comply with the following requirements when you develop data:
      • Requirements for code development: One node generates only one table, and a table is generated by only one node.
      • Requirements for node creation: The name of a node must be consistent with that of the table that will be generated by the node.
      • Requirements for scheduling configurations: The table generated by a node must be added to Outputs for the node.
    • Nodes that do not support the automatic parsing feature
      • Batch synchronization nodes, AnalyticDB for PostgreSQL nodes, AnalyticDB for MySQL nodes, and EMR nodes do not support the automatic parsing feature. The tables generated by these nodes must be manually added to the outputs of these nodes.
      • Temporary tables created by executing SQL statements do not support the automatic parsing feature. For example, in the Workspace settings topic, the tables whose names are prefixed with t_ are specified as temporary tables. Such tables cannot be automatically added to Outputs or Parent Nodes for nodes.
    • Logic for handling non-standard configurations
      • If a table specified in SQL statements is used as both an output table and a referenced table, the table is parsed only as an output table.
      • If a table specified in the SQL statements is used as an output table or a referenced table multiple times, only one scheduling dependency is parsed.
    • Other precautions

      If an SQL statement in the code of a node specifies a table that is not generated by an auto triggered node, the system adds the table to Parent Nodes for the current node based on the automatic parsing feature. The tables that are not generated by auto triggered nodes include dimension tables and tables uploaded from on-premises machines to DataWorks. However, the system cannot find the node that generates the table added to Parent Nodes, and an error occurs. In this case, you must manually delete the table.
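
The parsing behavior described in this section can be approximated with the following Python sketch. It is only an illustration: DataWorks does not document its parser here, and the regular expressions, the parse_io function, and the demo_ws workspace name are assumptions. The sketch maps INSERT and CREATE targets to outputs and FROM and JOIN sources to parent nodes, names them in the project_name.table_name format, treats a table that is both written and read as an output only, removes duplicates, and skips temporary tables whose names start with the t_ prefix used as an example above.

```python
import re

# Simplified patterns, not a real SQL parser: they only cover plain
# INSERT INTO/OVERWRITE, CREATE TABLE, FROM, and JOIN clauses.
OUTPUT_RE = re.compile(
    r"\b(?:insert\s+(?:into|overwrite)\s+(?:table\s+)?"
    r"|create\s+table\s+(?:if\s+not\s+exists\s+)?)([\w.]+)",
    re.IGNORECASE,
)
INPUT_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def parse_io(sql, project, temp_prefix="t_"):
    """Return (outputs, parents) as lists of project.table names."""

    def qualify(table):
        # Tables without a project prefix are assumed to be in the current one.
        return table if "." in table else f"{project}.{table}"

    def is_temp(table):
        return table.split(".")[-1].startswith(temp_prefix)

    outputs, parents = [], []
    for table in OUTPUT_RE.findall(sql):
        name = qualify(table)
        if not is_temp(table) and name not in outputs:
            outputs.append(name)
    for table in INPUT_RE.findall(sql):
        name = qualify(table)
        # A table that is both written and read counts as an output only.
        if not is_temp(table) and name not in outputs and name not in parents:
            parents.append(name)
    return outputs, parents


sql = """
INSERT OVERWRITE TABLE dwd_orders
SELECT o.id, u.name
FROM ods_orders o
JOIN ods_users u ON o.uid = u.id
JOIN dwd_orders d ON d.id = o.id
JOIN t_scratch s ON s.id = o.id;
"""
outputs, parents = parse_io(sql, project="demo_ws")
```

For this statement, the sketch yields demo_ws.dwd_orders as the only output and demo_ws.ods_orders and demo_ws.ods_users as parents. The self-join on dwd_orders and the temporary table t_scratch produce no dependencies.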

Manual configuration

  • Scenarios

    DataWorks allows you to manually modify Parent Nodes and Outputs for a node during the code development for the node. If the scheduling dependencies automatically generated by the system for your node do not meet your business requirements, you can manually modify the dependencies.

    You can manually configure scheduling dependencies in the following scenarios:
    • You must delete a scheduling dependency configured by using a table that is not generated by an auto triggered node.
      Scheduling dependencies ensure that a node can successfully obtain the table data generated by its ancestor node that is scheduled to run. However, if the ancestor node of a node is not scheduled to run, the system cannot monitor whether the ancestor node has generated the latest table data. If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, and the table is automatically added to Parent Nodes for the node, you must manually delete the dependency for the node. Tables that are not generated by auto triggered nodes include the following types:
      • Tables uploaded from on-premises machines to DataWorks
      • Dimension tables
      • Tables that are not generated by nodes scheduled by DataWorks
      • Tables generated by manually triggered nodes
    • You must manually add the tables generated by nodes that do not support the automatic parsing feature to the outputs of these nodes.

      Batch synchronization nodes, AnalyticDB for PostgreSQL nodes, AnalyticDB for MySQL nodes, and EMR nodes do not support the automatic parsing feature. The tables generated by these nodes must be manually added to the outputs of these nodes.

  • Configuration methods
    • Delete a scheduling dependency in the code editor of a node
      Figure: Delete Input
      If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, you can delete the dependency for the node. To do so, go to the code editor of the node, right-click the table name that you want to delete, and then click Delete Input. The preceding figure shows the process. You can also add a rule as a comment at the top of the code. This way, the system does not automatically parse the dependency based on the rule.
    • Delete a scheduling dependency on the Properties panel of a node
      If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, you can manually delete the dependent ancestor node for the node. To do so, go to the Properties panel of the node, set Auto Parse to No, and then manually delete the dependent ancestor node.
  • Precautions
    • When you configure scheduling dependencies on the Properties panel of a node, we recommend that you click Parse I/O after you set Auto Parse to No. This way, the system automatically adds scheduling dependencies for the node by using a method that is similar to automatic parsing. Then, you can perform operations on the scheduling dependencies based on your business requirements.
    • You can click Clear I/O Parameters to delete the scheduling dependencies that are automatically added by the system. The scheduling dependencies that are manually added are not deleted.
    • If you click Recommendation, the system recommends all the other SQL nodes that generate the input table of the current node based on the SQL lineage of the current workspace. You can select your desired node from the recommended results and configure the node as the ancestor node on which the current node depends.
      Note Only the nodes that have been committed and deployed to the production environment and have generated data in the required table can be parsed. Therefore, the recommended nodes may be parsed one day later.

      The recommended nodes must be committed to the scheduling system on the previous day. This way, they can be identified by the automatic recommendation feature after data is generated on the current day.
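
The decision about which automatically parsed dependencies to delete can be sketched as follows. This is a hedged illustration, not a DataWorks API: the split_parents function and all names are hypothetical. A parsed parent table is kept only if a node committed to the scheduling system declares it as an output. The remaining tables, such as dimension tables or tables uploaded from on-premises machines, are the ones whose dependencies you would delete manually.

```python
def split_parents(parsed_parents, committed_outputs):
    """Split parsed parent tables into resolvable ones and ones to delete.

    committed_outputs is the set of output names declared by nodes that
    have been committed to the scheduling system (hypothetical input).
    """
    resolvable = [t for t in parsed_parents if t in committed_outputs]
    to_delete = [t for t in parsed_parents if t not in committed_outputs]
    return resolvable, to_delete


# Hypothetical names: dim_region is a dimension table with no producing node.
committed_outputs = {"demo_ws.ods_orders", "demo_ws.ods_users"}
parsed_parents = ["demo_ws.ods_orders", "demo_ws.dim_region"]
resolvable, to_delete = split_parents(parsed_parents, committed_outputs)
```

Here demo_ws.dim_region ends up in to_delete because no committed node generates it, which corresponds to the case in which the system cannot find the node that generates the table.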

Generation by the system

After you create nodes in DataWorks, the system automatically generates the following two outputs for each node:
  • One output whose name is suffixed with *******_out
  • The other output named in the projectname.nodename format

The output automatically generated by the system can be used in the same way as the output that is automatically parsed or manually added.

Configuration by drawing lines to connect nodes

  • Scenarios

    DataWorks allows you to specify the relationships between nodes by drawing lines to connect nodes on the editing page of workflows. After the nodes are connected, the system automatically adds scheduling dependencies for each node based on the connections.

    After you create a workflow, you can connect nodes by drawing lines on the editing page of the workflow to configure scheduling dependencies for each node based on your business requirements. During subsequent code development, you can add or modify scheduling dependencies for each node manually or by using the automatic parsing feature. This way, all nodes in the workflow can be configured with correct scheduling dependencies.

  • How it works

    When you connect nodes by drawing lines, the system automatically adds an output whose name is suffixed with *******_out to the input of each descendant node.

  • Configuration methods

    You can connect nodes by drawing lines on the editing page of workflows.


Instructions to configure scheduling dependencies in common business scenarios

Operations performed by the system after scheduling dependencies are configured

After scheduling dependencies are configured for a node, the system checks whether the configured scheduling dependencies are consistent with the data lineage in the code developed for the node. If they are inconsistent, the system displays an error message. In this case, you must modify the scheduling dependencies so that they match your actual requirements. For more information, see the FAQ entry "When I commit a node, the system reports an error that the input and output of the node are not consistent with the data lineage in the code developed for the node. What do I do?"
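
A consistency check of this kind can be pictured with the following Python sketch. The exact comparison rules that DataWorks applies are not described in this topic, so the check_consistency function, its input layout, and the message wording are assumptions. The sketch reports names that appear in the code lineage but not in the configuration, and the other way around.

```python
def check_consistency(configured, lineage):
    """Compare configured dependencies with lineage parsed from code.

    Both arguments are dicts with "parents" and "outputs" lists of names
    (a hypothetical layout). Returns a list of mismatch messages, which
    is empty if the two sides agree.
    """
    messages = []
    for key in ("parents", "outputs"):
        in_code_only = sorted(set(lineage[key]) - set(configured[key]))
        configured_only = sorted(set(configured[key]) - set(lineage[key]))
        if in_code_only:
            messages.append(f"{key} found in code but not configured: {in_code_only}")
        if configured_only:
            messages.append(f"{key} configured but not found in code: {configured_only}")
    return messages


# Hypothetical example: the code reads ods_users, but it is not configured.
configured = {"parents": ["demo_ws.ods_orders"], "outputs": ["demo_ws.dwd_orders"]}
lineage = {
    "parents": ["demo_ws.ods_orders", "demo_ws.ods_users"],
    "outputs": ["demo_ws.dwd_orders"],
}
messages = check_consistency(configured, lineage)
```

In this example the only message concerns demo_ws.ods_users, which the code reads but the configuration omits. Either adding the dependency or removing the reference from the code would make the two sides consistent.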

FAQ