Scheduling dependencies are the basis for orderly workflows. You must configure correct dependencies between nodes so that business data is produced correctly and on time. This topic describes how to configure scheduling dependencies.

Background information

The configurations of scheduling dependencies include Parent Nodes and Output.
  • Parent Nodes: the ancestor node of the current node. The current node can start to run only after its ancestor node is successfully run.
  • Output: the output of the current node. You can search for the current node by its output name and configure the current node as an ancestor node of another node.
Automatic parsing
You can use one of the following methods to configure scheduling dependencies for an auto triggered node in DataStudio:
  • On the configuration tab of the workflow to which the auto triggered node belongs, draw lines to connect the nodes in the workflow. For more information, see Configuration by drawing lines to connect nodes.
  • In the Dependencies section of the Properties tab on the configuration tab of the auto triggered node, click the Same Cycle tab, and enter the name of a node or the name of the output table of the node in the Parent Nodes field to add the node as an ancestor node of the auto triggered node. You can click Analyze Code and modify the analysis results based on your business requirements. For more information, see Manual configuration.
  • In the Dependencies section of the Properties tab on the configuration tab of the auto triggered node, select Yes for Auto Parse to enable the automatic parsing feature. When you commit the node, the system parses the code of the node to obtain the dependencies of the node and configures dependencies for the node in the Parent Nodes section. For more information, see Configuration based on the automatic parsing feature.
  • In the Dependencies section of the Properties tab on the configuration tab of the auto triggered node, select No for Auto Parse and click Recommendation. In the Recommended Parent Nodes dialog box, the system provides the recommended ancestor nodes for the auto triggered node based on the data lineage in the code of the auto triggered node.
    Recommendation
    Note The data lineage is updated one day later than the scheduling time of the auto triggered node. As a result, the recommendation of ancestor nodes may be delayed.
After the configuration is complete, you can preview the scheduling dependencies of the node. For more information, see Preview scheduling dependencies.

For more information about the configurations in common scenarios, see Instructions to configure scheduling dependencies in common business scenarios.

After scheduling dependencies are configured for a node, the system checks whether the scheduling dependencies are consistent with the data lineage in the code developed for the node when you commit the node. For more information, see Operations performed by the system after scheduling dependencies are configured.

General principles for configuring scheduling dependencies

This section describes the general principles for configuring scheduling dependencies and configuration items:
  • Output: the output of the current node. The output name must be unique within your Alibaba Cloud account.
    After you create nodes, the system automatically generates the following two outputs for each node:
    • One output whose name is suffixed with *******_out
    • The other output named in the projectname.nodename format
    Outputs generated by the system
  • Parent Nodes: the ancestor node on which the current node depends. After you configure this parameter, the system can find the ancestor node based on the configured output name or output table name of the ancestor node.

    If you enter an output name to search for the ancestor node, the system searches for the output name among the output names of the nodes that have been committed to the scheduling system.

    Fuzzy match is supported. After you enter a keyword, all nodes whose names contain the keyword are displayed. If the message The node is frozen is displayed on the right side of a node, do not use this node as an ancestor node of the current node. Otherwise, the current node may not run as expected.
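The search behavior described above can be pictured with a minimal Python sketch. It is only an illustration, not the way DataStudio implements the search: the CommittedNode class, the search_parent_candidates function, the sample node and output names, and the case-insensitive comparison are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CommittedNode:
    """Hypothetical model of a node that was committed to the scheduling system."""
    name: str
    outputs: tuple  # output names registered for the node
    frozen: bool = False


def search_parent_candidates(keyword, committed_nodes):
    """Return (node, usable) pairs whose name or output names contain the keyword.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    keyword = keyword.lower()
    matches = []
    for node in committed_nodes:
        searchable = (node.name, *node.outputs)
        if any(keyword in text.lower() for text in searchable):
            # A frozen node is still listed, but flagged as not usable as a parent.
            matches.append((node, not node.frozen))
    return matches


# Hypothetical sample data.
nodes = [
    CommittedNode("ods_orders", ("demo_project.ods_orders",)),
    CommittedNode("dwd_orders", ("demo_project.dwd_orders",), frozen=True),
    CommittedNode("dim_city", ("demo_project.dim_city",)),
]
results = search_parent_candidates("orders", nodes)
usable_names = [node.name for node, usable in results if usable]
```

In this sample, both ods_orders and dwd_orders match the keyword, but only ods_orders is reported as usable because dwd_orders is frozen.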
Regardless of the method that is used to configure scheduling dependencies, the scheduling dependencies must comply with the following principles:
  • A table is generated by only one node, and the table must be configured as the output of the node.
    Note
    • The system automatically adds the table generated by an SQL node to the output of the SQL node based on the automatic parsing feature.
    • You must manually add the table generated by a batch sync node to the output of the node. The name of the table is in the projectname.tablename format. This way, the output table of the node is configured as the input table of its descendant node based on the automatic parsing feature when the descendant node is run.
  • The output of a node must be configured as the input of its descendant node, which forms dependencies between nodes.
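The two principles above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the scheduling system's actual logic: the build_dependencies function, the dictionary layout, and the sample node and table names are hypothetical.

```python
def build_dependencies(nodes):
    """Derive parent nodes from configured inputs and outputs.

    nodes: dict that maps a node name to {"inputs": [...], "outputs": [...]}.
    Returns a dict that maps each node name to a sorted list of parent node names.
    Raises ValueError if the same table is configured as the output of two nodes.
    """
    producer = {}
    for name, config in nodes.items():
        for table in config["outputs"]:
            if table in producer and producer[table] != name:
                raise ValueError(
                    f"{table} is generated by both {producer[table]} and {name}"
                )
            producer[table] = name

    parents = {}
    for name, config in nodes.items():
        # An input resolves to a parent only if some node declares it as an output.
        found = {producer[t] for t in config["inputs"] if t in producer}
        found.discard(name)  # a node does not depend on itself
        parents[name] = sorted(found)
    return parents


# Hypothetical workflow: a sync node feeds an SQL node, which feeds a report node.
workflow = {
    "sync_orders": {"inputs": [], "outputs": ["demo_project.ods_orders"]},
    "clean_orders": {
        "inputs": ["demo_project.ods_orders"],
        "outputs": ["demo_project.dwd_orders"],
    },
    "report": {"inputs": ["demo_project.dwd_orders"], "outputs": []},
}
dependencies = build_dependencies(workflow)
```

The check for duplicate producers reflects the principle that a table is generated by only one node. Without it, the parent of a descendant node would be ambiguous.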
For more information about the logic of scheduling dependencies, see Logic of same-cycle scheduling dependencies. The following sections describe the principles and configuration methods for scheduling dependencies in detail.
Note If a workspace is created before January 10, 2019, some of the data generated in the workspace may be invalid. In this case, you must submit a ticket to resolve the issue. If a workspace is created after January 10, 2019, this issue does not occur.

Configuration based on the automatic parsing feature

  • Scenarios
    The system can automatically parse SQL statements in the code developed for a node to obtain the data lineage. Then, the system adds an output or an ancestor node for the node based on the data lineage and automatic parsing feature. The automatic parsing feature is convenient, efficient, and suitable for most scenarios.
    Note The automatic parsing feature cannot be used to configure scheduling dependencies for batch synchronization nodes. After a table is generated in a synchronization node, you must manually add the table as the output of the node. The table name is in the project_name.table_name format. This way, the scheduling dependencies of the synchronization node can be quickly configured by using the automatic parsing feature when data of the generated table is cleansed by the descendant nodes of the synchronization node.
  • How it works
    The following figure shows the principles of automatic parsing for dependencies.
    Configuration based on automatic parsing
    • If a table is specified in the SELECT statement of the code for a node, the system adds the table name to Parent Nodes for the node based on the automatic parsing feature.
    • If a table is specified in the INSERT statement of the code for a node, the system adds the table name to Output for the node based on the automatic parsing feature.
    If multiple INSERT and SELECT statements are used, multiple output and input names are parsed based on the automatic parsing feature.
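The parsing rules above can be approximated with a short Python sketch. It is a deliberately simplified illustration that uses regular expressions. The real parser is not described in this topic, so the parse_lineage function, the patterns, and the sample table names are assumptions. In particular, the sketch treats tables that follow FROM or JOIN as the tables specified in SELECT statements.

```python
import re

# A table reference such as project_name.table_name (simplified).
_TABLE = r"([A-Za-z_][\w.]*)"
# Tables written by INSERT INTO / INSERT OVERWRITE [TABLE] become outputs.
_OUTPUT_RE = re.compile(
    r"\bINSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?" + _TABLE, re.IGNORECASE
)
# Tables read in FROM / JOIN clauses become inputs (Parent Nodes).
_INPUT_RE = re.compile(r"\b(?:FROM|JOIN)\s+" + _TABLE, re.IGNORECASE)


def parse_lineage(sql):
    """Return (inputs, outputs) as sorted lists of table names found in the code."""
    outputs = sorted(set(_OUTPUT_RE.findall(sql)))
    inputs = sorted(set(_INPUT_RE.findall(sql)))
    return inputs, outputs


# Hypothetical node code with two INSERT statements.
code = """
INSERT OVERWRITE TABLE demo_project.dwd_orders
SELECT o.id, c.city_name
FROM demo_project.ods_orders o
JOIN demo_project.dim_city c ON o.city_id = c.id;

INSERT INTO demo_project.dws_payments
SELECT * FROM demo_project.ods_payments;
"""
inputs, outputs = parse_lineage(code)
```

Because the sample code contains two INSERT statements and three referenced tables, the sketch returns two output names and three input names. A regular expression cannot handle comments, string literals, or common table expressions, so treat this only as a way to reason about the rules.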
  • Configuration principles
    The automatic parsing feature enables the system to automatically identify relationships between nodes and configure scheduling dependencies for nodes. The following table describes the principles of automatic parsing.
    Node type: ODPS node
    • SQL statements: CREATE, INSERT
      • Configuration based on automatic parsing: If the code developed for the node contains such SQL statements, the system automatically adds an output for the node.
      • Configuration principle: The output that is automatically added by the system is named in the odps_project_name.table_name format. In the preceding format:
        • odps_project_name: the DataWorks workspace to which the node belongs.
        • table_name: the name of the generated table.
    • SQL statement: SELECT
      • Configuration based on automatic parsing: If the code developed for the node contains such an SQL statement, the system automatically adds an ancestor node for the node.
      • Configuration principle: The ancestor node that is automatically added by the system is named in the project_name.table_name format. In the preceding format:
        • project_name: the name of the workspace to which the node that generates the table belongs.
        • table_name: the name of the generated table.
    Node type: SQL node other than the ODPS node
    • SQL statements: ALTER, CREATE, UPDATE, INSERT
      • Configuration based on automatic parsing: If the code developed for the node contains such SQL statements, the system automatically adds an output for the node.
      • Configuration principle: The outputs that are automatically added by the system for different nodes are named in the following formats:
        • E-MapReduce (EMR): workspace_name.db_name.table_name
        • AnalyticDB for PostgreSQL: workspace_name.db_name.schema_name.table_name
        • AnalyticDB for MySQL: workspace_name.db_name.schema_name.table_name
        • Hologres: workspace_name.db_name.schema_name.table_name
        In the preceding formats:
        • workspace_name: the name of the DataWorks workspace to which the node belongs.
        • db_name: the name of the database to which the data is written.
        • schema_name: the name of the schema of the node.
        • table_name: the name of the generated table.
    • SQL statement: SELECT
      • Configuration based on automatic parsing: If the code developed for the node contains such an SQL statement, the system automatically adds an ancestor node for the node.
      • Configuration principle: The ancestor node that is automatically added by the system is named in the project_name.table_name format. In the preceding format:
        • project_name: the name of the workspace to which the node that generates the table belongs.
        • table_name: the name of the generated table.
    Node type: Batch synchronization node
    • Batch synchronization nodes do not support the automatic parsing feature. You must manually configure scheduling dependencies for such nodes.
      Note After a table is generated in a synchronization node, you must manually add the table as the output of the node. The table name is in the project_name.table_name format. This way, the scheduling dependencies of the synchronization node can be quickly configured by using the automatic parsing feature when data of the generated table is cleansed by the descendant nodes of the synchronization node.
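The output naming formats listed above can be summarized in a small Python helper. This is a sketch for illustration only. The output_name function and the node type labels ("odps", "emr", "adb_pg", "adb_mysql", "hologres") are hypothetical and are not identifiers used by DataWorks.

```python
# Node types whose output names include both a database and a schema.
_FOUR_PART_TYPES = {"adb_pg", "adb_mysql", "hologres"}


def output_name(node_type, workspace, table, db=None, schema=None):
    """Build an output name in the format listed for the given node type."""
    if node_type == "odps":
        # odps_project_name.table_name
        return f"{workspace}.{table}"
    if node_type == "emr":
        # workspace_name.db_name.table_name
        if db is None:
            raise ValueError("db is required for EMR nodes")
        return f"{workspace}.{db}.{table}"
    if node_type in _FOUR_PART_TYPES:
        # workspace_name.db_name.schema_name.table_name
        if db is None or schema is None:
            raise ValueError("db and schema are required for this node type")
        return f"{workspace}.{db}.{schema}.{table}"
    raise ValueError(f"unsupported node type: {node_type}")


odps_output = output_name("odps", "demo_project", "dwd_orders")
emr_output = output_name("emr", "demo_ws", "dwd_orders", db="sales")
holo_output = output_name(
    "hologres", "demo_ws", "dwd_orders", db="sales", schema="public"
)
```

A helper like this can be useful when you manually add outputs, because the name that you enter must follow the same format that descendant nodes are matched against.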
  • Precautions
    • Requirements for node development
      Automatic parsing enables the system to automatically identify the scheduling dependencies of a node based on the code that you developed for the node. Therefore, we recommend that you strictly comply with the following requirements when you develop data:
      • Requirements for code development: One node generates only one table, and a table is generated by only one node.
      • Requirements for node creation: The name of a node must be consistent with that of the table that is to be generated by the node.
      • Requirements for scheduling configurations: The table generated by a node must be added to Output for the node.
    • Nodes that do not support the automatic parsing feature
      • Batch synchronization nodes, AnalyticDB for PostgreSQL nodes, AnalyticDB for MySQL nodes, and EMR nodes do not support configuration of scheduling dependencies based on the automatic parsing feature. The tables generated by these nodes must be manually added to the outputs of these nodes.
      • Temporary tables created by executing SQL statements do not support the automatic parsing feature. For example, if tables whose names are prefixed with t_ are specified as temporary tables in the workspace settings, such tables cannot be automatically added to Output or Parent Nodes for nodes. For more information, see Workspace settings.
    • Logic for handling non-standard configurations
      • If a table specified in SQL statements is used as both an output table and a referenced table, the table is parsed only as an output table.
      • If a table specified in SQL statements is used as an output table or a referenced table multiple times, only the latest scheduling dependency that is recommended by the system is used.
    • Scheduling dependency inconsistencies

      After you enable automatic parsing for a node, the system automatically generates the input and output for the node based on the data lineage in the code of the node when you commit the node. This ensures that the node can successfully generate data. You can also modify the input and output of the node based on your business requirements.

      When you commit a node, if the scheduling dependencies of the node that are modified based on the automatic parsing result are different from those in the development or production environment, a message indicating the changes of the scheduling dependencies is displayed. The changes indicate that some inputs or outputs are added or deleted after you modify the scheduling dependencies in the Dependencies section of the Properties tab based on the automatic parsing result. You can choose whether to use the current scheduling dependencies and commit the current node that is scheduled based on the current scheduling dependencies. After the node is committed, the latest inputs and outputs are automatically added to Parent Nodes and Output in the Dependencies section of the Properties tab.
      Note When you commit a node, if the system detects that the current scheduling dependencies of the node are different from those in the production or development environment, check whether the current scheduling dependencies of the node meet your business requirements. This can prevent inappropriate changes that you make on the scheduling dependencies from affecting data generation. If the current node has a large number of descendant nodes, inappropriate scheduling dependencies may have a great impact on the descendant nodes and data generation. Do not change the scheduling dependencies unless necessary.

      For example, assume that A is the output name of an ancestor node of the current node. When you commit the current node, the system may detect that the scheduling dependencies of the node no longer include the input A, although A is included in the scheduling dependencies in the development or production environment. In this case, you must check whether the current scheduling dependencies are correct. If Table A is specified in the SELECT statement in the code developed for the current node but the node that generates the data of Table A is not configured as an ancestor node of the current node, the current node may start to run before the data of Table A is generated. As a result, the current node cannot run as expected or generate correct data.

    • Other precautions

      If an SQL statement in the code of a node specifies a table that is not generated by an auto triggered node, the system adds the table to Parent Nodes for the current node based on the automatic parsing feature. The tables that are not generated by auto triggered nodes include dimension tables and tables uploaded from on-premises machines to DataWorks. However, the system cannot find the node that generates the table added to Parent Nodes, and an error occurs. In this case, you must manually delete the table.
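Two of the precautions above, the handling of temporary tables and the handling of a table that is both written and read, can be sketched in Python as a post-processing step on parsed results. The normalize_parsed_dependencies function and the sample names are hypothetical. The t_ prefix is taken from the example above, and collapsing repeated tables into one entry is only an approximation of the rule that the latest recommended dependency is used.

```python
def normalize_parsed_dependencies(inputs, outputs, temp_prefix="t_"):
    """Apply the precaution rules to parsed input and output table names.

    - Tables whose table name starts with temp_prefix are treated as temporary
      tables and dropped from both lists.
    - A table that is both an output and a referenced table is kept only as an output.
    - Repeated tables are collapsed into a single entry (approximation).
    """
    def is_temporary(name):
        table = name.rsplit(".", 1)[-1]  # strip the project or workspace prefix
        return table.startswith(temp_prefix)

    kept_outputs = [t for t in dict.fromkeys(outputs) if not is_temporary(t)]
    output_set = set(kept_outputs)
    kept_inputs = [
        t
        for t in dict.fromkeys(inputs)
        if not is_temporary(t) and t not in output_set
    ]
    return kept_inputs, kept_outputs


# Hypothetical parsing result for one node.
parsed_inputs = [
    "demo_project.ods_orders",
    "demo_project.t_stage",
    "demo_project.dwd_orders",
    "demo_project.ods_orders",
]
parsed_outputs = ["demo_project.t_stage", "demo_project.dwd_orders"]
final_inputs, final_outputs = normalize_parsed_dependencies(
    parsed_inputs, parsed_outputs
)
```

In this sample, the temporary table t_stage disappears from both lists, and dwd_orders remains only as an output.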

Manual configuration

  • Scenarios

    DataWorks allows you to manually modify the Parent Nodes and Output parameters for a node during the code development for the node. If the scheduling dependencies automatically generated by the system for your node do not meet your business requirements, you can manually modify the dependencies.

    You can manually configure scheduling dependencies in the following scenarios:
    • You must delete a scheduling dependency configured by using a table that is not generated by an auto triggered node.
      Scheduling dependencies ensure that a node can successfully obtain the table data generated by its ancestor node that is scheduled to run. However, if the ancestor node of a node is not scheduled to run, the system cannot monitor whether the ancestor node has generated the latest table data. If the table specified in the SELECT statement of the code for a node is not generated by an auto triggered node and the table name is automatically added to Parent Nodes for the node, you must manually delete the dependency for the node. Tables that are not generated by auto triggered nodes include the following types:
      • Tables uploaded from on-premises machines to DataWorks
      • Dimension tables
      • Tables that are not generated by nodes scheduled by DataWorks
      • Tables generated by manually triggered nodes
    • You must manually add the tables generated by nodes that do not support the automatic parsing feature to the outputs of these nodes.

      Batch synchronization nodes, AnalyticDB for PostgreSQL nodes, AnalyticDB for MySQL nodes, and EMR nodes do not support configuration of scheduling dependencies based on the automatic parsing feature. The tables generated by these nodes must be manually added to the outputs of these nodes.
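The first scenario above can be illustrated with a minimal Python sketch that separates parsed parent tables into those that a scheduled node produces and those that you must delete manually. The split_parsed_parents function, the set of scheduled outputs, and the table names are assumptions made for this example.

```python
def split_parsed_parents(parsed_tables, scheduled_outputs):
    """Split parsed parent tables into (keep, delete_manually).

    scheduled_outputs: output names of auto triggered nodes that were committed
    to the scheduling system. Tables outside this set, such as uploaded tables,
    dimension tables, or tables generated by manually triggered nodes, have no
    node that the scheduler can wait for.
    """
    keep, delete_manually = [], []
    for table in parsed_tables:
        if table in scheduled_outputs:
            keep.append(table)
        else:
            delete_manually.append(table)
    return keep, delete_manually


# Hypothetical data: dim_city was uploaded from an on-premises machine.
scheduled = {"demo_project.ods_orders", "demo_project.ods_payments"}
parsed = ["demo_project.ods_orders", "demo_project.dim_city"]
keep, delete_manually = split_parsed_parents(parsed, scheduled)
```

The tables in delete_manually correspond to the dependencies that you would remove from Parent Nodes by using one of the methods described in this section.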

  • Configuration principles
    • Delete a scheduling dependency in the code editor of a node
      Delete Input
      If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, you can delete the dependency for the node. Specifically, you can go to the code editor of the node, right-click the name of the table that you want to remove from the input, and then click Delete Input. The preceding figure shows the process. You can also add a rule as a comment at the top of the code. This way, the system does not automatically parse the dependency based on the rule.
    • Delete a scheduling dependency in the Properties panel of a node
      Manually configure scheduling dependencies between nodes
      If the SELECT statement in the code of a node specifies a table that is not generated by an auto triggered node, you can manually remove the table from Parent Nodes for the node. Specifically, you can go to the Properties panel of the node, set Auto Parse to No, and then manually perform the operation.
  • Precautions
    • When you configure scheduling dependencies on the Properties tab of a node, we recommend that you click Analyze Code after you set Auto Parse to No. This way, the system automatically adds scheduling dependencies for the node by using a method that is consistent with automatic parsing. Then, you can perform operations on the scheduling dependencies based on your business requirements.
    • You can click Clear Code Analysis Results to delete the scheduling dependencies that are automatically added by the system. The scheduling dependencies that are manually added are not deleted.
    • If you click Recommendation, the system recommends all the other SQL nodes that generate the input table of the current node based on the SQL lineage of the current workspace. You can select your desired node from the recommended results and configure the node as the ancestor node on which the current node depends.
      Note The system can recommend only the nodes that are committed and deployed to the production environment and have generated data in the required table. Therefore, a node may appear in the recommendations one day after it is first scheduled.

      The recommended nodes must be committed to the scheduling system on the previous day. This way, they can be identified by the automatic recommendation feature after data is generated on the current day.

Configuration by drawing lines to connect nodes

  • Scenarios

    DataWorks allows you to specify the relationships between nodes by drawing lines to connect nodes on the editing page of a workflow. After the nodes are connected, the system automatically generates scheduling dependencies for each node based on the connections.

    After you create a workflow, you can connect nodes by drawing lines on the configuration tab of the workflow to configure scheduling dependencies for each node based on your business requirements. During subsequent code development, you can add or modify scheduling dependencies for each node manually or by using the automatic parsing feature. This way, all nodes in the workflow can be configured with correct scheduling dependencies.

  • How it works

    When you connect nodes by drawing lines, the system automatically adds an output whose name is suffixed with *******_out to the input of each descendant node.
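The effect of drawing a line can be sketched in Python as follows. This is an illustration under stated assumptions: the apply_connections function is hypothetical, and the output names in the sample are placeholders because the actual prefix of the _out output is masked in this topic.

```python
def apply_connections(edges, out_names):
    """Add each parent's system-generated _out output to the child's inputs.

    edges: list of (parent_node, child_node) pairs, one per drawn line.
    out_names: dict that maps a node name to its system-generated _out output name.
    Returns a dict that maps each child node to its list of input names.
    """
    inputs = {}
    for parent, child in edges:
        child_inputs = inputs.setdefault(child, [])
        name = out_names[parent]
        if name not in child_inputs:  # drawing the same line twice has no extra effect
            child_inputs.append(name)
    return inputs


# Placeholder output names; real names are generated by the system.
out_names = {"sync_orders": "sync_orders_out", "clean_orders": "clean_orders_out"}
lines = [("sync_orders", "clean_orders"), ("clean_orders", "report")]
node_inputs = apply_connections(lines, out_names)
```

Each drawn line results in exactly one input on the descendant node, which is why connecting nodes is enough to form a same-cycle dependency before any code is written.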

  • Configuration principles

    You can connect nodes by drawing lines on the editing page of a workflow.

    Connect nodes by drawing lines

Preview scheduling dependencies

After you configure scheduling dependencies for a node, click Preview Dependencies. In the Preview Dependencies dialog box, you can preview the scheduling dependencies of the node on the Node Dependency and Instance Dependency tabs. You can modify the scheduling dependencies that do not meet your business requirements.
Note
  • A directed acyclic graph (DAG) that is generated based on the scheduling dependencies is only for reference. The DAG that is generated may be different from the DAG in the production environment.
  • Only the following roles can be used to preview the scheduling dependencies of a node: Development, O&M, Project Owner, and Workspace Manager. If a user wants to preview the scheduling dependencies of a node, you must assign one of the preceding roles to the user. For more information, see Manage workspace-level roles and members.
  • You can preview only the ancestor nodes and descendant nodes that are at the nearest level of the current node.
  • If you do not save scheduling dependencies of a node before you click Preview Dependencies, click Confirm in the Attention dialog box. This way, you can view the latest scheduling dependencies of the node.
  • On the Instance Dependency tab, you can preview scheduling dependencies of an auto triggered node that generates multiple instances. For example, you can preview the scheduling dependencies of an auto triggered node scheduled by hour. The auto triggered node scheduled by hour depends on an auto triggered node scheduled by minute.
  • The scheduling dependencies of a node that you preview are as expected only after you save the ancestor nodes of the node.
  • Select a preview method.

    You can select Not Aggregate, Aggregate by Workspace, or Aggregate by Owner to preview the scheduling dependencies of a node. For more information about aggregation methods, see Manage instances in a DAG.

    The following figures show preview effects by using different aggregation methods on the Node Dependency tab.
    Node Dependency
    You can click a node to view the basic information about the node.
    Basic information about a node
  • Preview scheduling dependencies of an auto triggered node that generates multiple instances.

    For an auto triggered node that generates multiple instances, you can click the Instance Dependency tab and select a scheduling cycle to preview scheduling dependencies of the node.

    Scheduling dependencies of an auto triggered node that generates multiple instances

Instructions to configure scheduling dependencies in common business scenarios

Operations performed by the system after scheduling dependencies are configured

After scheduling dependencies are configured for a node, the system checks whether the configured scheduling dependencies are consistent with the data lineage in the code developed for the node. If they are inconsistent, the system displays an error message. In this case, you must modify the scheduling dependencies based on your business requirements. For more information, see "When I commit a node, the system reports an error that the input and output of the node are not consistent with the data lineage in the code developed for the node. What do I do?"
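The consistency check can be pictured as a set comparison between the configured dependencies and the tables found in the code. The following Python sketch shows that idea only. The diff_dependencies function and the sample names are hypothetical, and the actual check performed by the system is not specified in this topic.

```python
def diff_dependencies(configured, lineage):
    """Compare configured input names with the input names found in the code.

    Returns a dict with two sorted lists:
    - "missing": tables referenced in the code but not configured as inputs.
    - "extra": configured inputs that the code does not reference.
    """
    configured_set, lineage_set = set(configured), set(lineage)
    return {
        "missing": sorted(lineage_set - configured_set),
        "extra": sorted(configured_set - lineage_set),
    }


# Hypothetical node: the code reads two tables, but only one is configured.
configured_inputs = ["demo_project.ods_orders", "demo_project.old_table"]
lineage_inputs = ["demo_project.ods_orders", "demo_project.ods_payments"]
report = diff_dependencies(configured_inputs, lineage_inputs)
is_consistent = not report["missing"] and not report["extra"]
```

A non-empty "missing" list is the risky case, because the node may start to run before the data of a table that it reads is generated.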

FAQ