This topic provides answers to some frequently asked questions about scheduling dependencies.

What are scheduling dependencies?

Scheduling dependencies define the relationships between nodes. After you configure scheduling dependencies for a node, the node is run only after its ancestor node is successfully run.

Why do we need to configure scheduling dependencies?

Scheduling dependencies ensure that a node can obtain data from the ancestor node on which it depends. After the ancestor node is successfully run, it generates the latest table data, and its descendant node can then obtain the data.

How do I configure scheduling dependencies for a node?

Use the output of a node as the input of another node to establish a dependency between the two nodes.
Note
  • The system automatically configures an input or output for an SQL node by using the following methods:
    • If a table is specified in the SELECT statement of the code for a node, the system adds the table name to Parent Nodes for the node based on the automatic parsing feature.
    • If a table is specified in the INSERT or CREATE statement of the code for a node, the system adds the table to Output for the node based on the automatic parsing feature.
  • You must manually add the output of a Data Integration node to Output for the node in the format of Project name.Table name. This way, the system can find the node that generates the output for its descendant node based on the automatic parsing feature.
  • The output name of a node must be unique. This way, the system can find the node that generates the output based on the unique output name.
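
The parsing rules in the preceding note can be approximated with a short sketch. This is only an illustration and not the actual DataWorks parser: the function name parse_io, the regular expressions, and the sample project name are assumptions made for this example, and real SQL needs a proper parser.

```python
import re

# Hypothetical, simplified stand-in for automatic parsing.
# Tables after FROM or JOIN are treated as inputs (Parent Nodes).
_INPUT_RE = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
# Tables after INSERT INTO/OVERWRITE or CREATE TABLE are treated as outputs.
_OUTPUT_RE = re.compile(
    r"\b(?:insert\s+(?:into|overwrite)\s+(?:table\s+)?"
    r"|create\s+table\s+(?:if\s+not\s+exists\s+)?)"
    r"([A-Za-z_][\w.]*)",
    re.IGNORECASE,
)


def parse_io(sql: str, project: str) -> tuple[set[str], set[str]]:
    """Return (parent_node_names, output_names) as Project name.Table name."""

    def qualify(table: str) -> str:
        # Prefix the project name only if the table is not already qualified.
        return table if "." in table else f"{project}.{table}"

    inputs = {qualify(t) for t in _INPUT_RE.findall(sql)}
    outputs = {qualify(t) for t in _OUTPUT_RE.findall(sql)}
    return inputs, outputs


sql = (
    "INSERT OVERWRITE TABLE tb_2 "
    "SELECT a.id FROM tb_1 a JOIN other_proj.tb_3 b ON a.id = b.id;"
)
inputs, outputs = parse_io(sql, "my_project")
print(sorted(inputs))   # ['my_project.tb_1', 'other_proj.tb_3']
print(sorted(outputs))  # ['my_project.tb_2']
```

In this sketch, unqualified table names receive the project name of the current node, which mirrors the Project name.Table name format described above.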

Which scenarios do not support scheduling dependencies?

Node dependencies ensure that a node can successfully obtain the table data generated by its ancestor node that is scheduled to run. However, if the ancestor node is not scheduled to run, the system cannot detect whether the ancestor node has generated the latest table data. If the table specified in the SELECT statement of the code for a node is not generated by an auto triggered node and the table name is automatically added to Parent Nodes for the node, you must manually delete the dependency for the node. Tables that are not generated by auto triggered nodes include the following types:
  • Tables uploaded from on-premises machines to DataWorks
  • Dimension tables
  • Tables that are not generated by nodes scheduled by DataWorks
  • Tables generated by manually triggered nodes

How do I delete a table on which a node does not depend?

On the configuration tab of the node, find the table name in the code for the node, right-click the table name, and then select Delete Input. In the Dependencies section of the Properties tab, set the Auto Parse parameter to Yes.

The system automatically adds an output name to Parent Nodes for my node based on the automatic parsing feature, but an error message is displayed, indicating that the output represented by the output name does not exist. What do I do?

Commit failed

The system fails to find the node that generates the output based on the output name.

This error may be caused by the following reasons:
  • The node that generates the output is not committed. Commit the node and try again.
  • The node that generates the output is committed, but the output name of the node is different from the output name that is automatically added by the system.
Note
  • If tb_2 in the preceding figure is the output table of a node, you must add tb_2 to Output for the node in the format of Project name.Table name. For more information, see Logic of scheduling dependencies.
  • If tb_2 is a table that is not generated by an auto triggered node, you must right-click the table name in the code and select Delete Input to delete the table. In the Dependencies section of the Properties tab, set Auto Parse to Yes.
For more information, see the "Which scenarios do not support scheduling dependencies?" section of this topic.

The name and ID of the descendant node of my node are empty and cannot be specified in the output of my node. Why does this happen?

Dependencies between nodes are established after a node uses the output of its ancestor node. If a node has no descendant node, the name and ID of the descendant node are empty. After you configure a descendant node for the current node, the name and ID of the descendant node are automatically displayed.

Why do I need to configure dependencies between nodes?

In the scheduling system of DataWorks, dependencies are configured to ensure that a node can successfully obtain the data generated by another node. You can determine whether to configure dependencies between nodes based on the data lineage of the tables generated by the nodes. For more information, see Logic of scheduling dependencies.

What is the output name of a node used for?

The output name of a node is used to establish a dependency with another node. For example, if the output name of Node A is ABC and Node B uses ABC as its input name, a dependency is established between Node A and Node B.
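
The matching of input names to output names can be sketched as follows. The function build_dependencies and the dictionary layout are assumptions made for illustration only; they do not represent a DataWorks API.

```python
def build_dependencies(nodes: dict[str, dict[str, list[str]]]) -> dict[str, set[str]]:
    """Map each node to its ancestor nodes by matching inputs to outputs."""
    # Index every output name by the node that generates it.
    producer = {}
    for name, io in nodes.items():
        for output_name in io["outputs"]:
            producer[output_name] = name

    # A node depends on the producer of each input name that can be found.
    return {
        name: {producer[i] for i in io["inputs"] if i in producer}
        for name, io in nodes.items()
    }


# Node A outputs ABC, and Node B uses ABC as its input name.
nodes = {
    "Node A": {"inputs": [], "outputs": ["ABC"]},
    "Node B": {"inputs": ["ABC"], "outputs": []},
}
dependencies = build_dependencies(nodes)
print(dependencies["Node B"])  # {'Node A'}
```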

Can a node have multiple output names?

Yes, a node can have multiple output names. Each output name identifies the node that generates it. If a node (Node A) needs to depend on another node (Node B), Node A can reference any output name of Node B as its input name. This way, a dependency is established between Node A and Node B.

Can multiple nodes have the same output name?

No, the output name of each node must be unique. This way, if a node references the output of another node, the system can find the node that generates the output based on the unique output name and the automatic parsing feature, and a dependency can be established between the two nodes. If multiple nodes generate data to the same table, you must determine the last node that generates data to the table. This ensures that another node can successfully obtain data from the table. In addition, you must change the output names of the remaining nodes to ensure that the output names of all nodes are unique.
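
A uniqueness check of this kind can be sketched as follows. The helper find_duplicate_outputs and the sample node and table names are hypothetical and serve only to illustrate why a shared output name is ambiguous.

```python
from collections import defaultdict


def find_duplicate_outputs(outputs_by_node: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return each output name that is claimed by more than one node."""
    owners = defaultdict(list)
    for node, output_names in outputs_by_node.items():
        for output_name in output_names:
            owners[output_name].append(node)
    return {name: nodes for name, nodes in owners.items() if len(nodes) > 1}


# Two hypothetical nodes write to the same table and use the same output name.
outputs_by_node = {
    "load_part_1": ["my_project.tb_sales"],
    "load_part_2": ["my_project.tb_sales"],
    "report": ["my_project.tb_report"],
}
duplicates = find_duplicate_outputs(outputs_by_node)
print(duplicates)  # {'my_project.tb_sales': ['load_part_1', 'load_part_2']}
```

In this example, only the last node that writes to the table would keep the output name my_project.tb_sales, and the other node would be given a different output name.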

How do I avoid temporary tables when I enable DataWorks to automatically parse node dependencies for a node?

On the configuration tab of the node, right-click a temporary table name in the SQL code for the node and select Delete Input or Delete Output. In the Dependencies section of the Properties tab, set the Auto Parse parameter to Yes and click Parse I/O to parse the input and output for the node.

How do I configure an ancestor node for the start node of a workflow?

If you want to configure an ancestor node for the start node of a workflow, you can create a zero load node in the workflow and use the zero load node as the start node of the workflow. Then, you can configure the root node of the workflow as the ancestor node of the zero load node.

For more information about how to use a zero load node, see Create a zero-load node.

Why do I find a non-existent output name of Node B when I enter an output name to search for the ancestor node of Node A?

DataWorks searches for the ancestor node of a node among the output names of nodes that are committed and deployed to the scheduling system based on the automatic parsing feature. After Node B is committed, if you delete the output name of Node B and do not commit Node B to the scheduling system again, the deleted output name of Node B can still be found.

When I undeploy a node, the system displays an error message indicating that the node has descendant nodes and cannot be undeployed. However, no descendant nodes can be found for the node on the Properties tab. Why does this happen?

You can undeploy a node only after no nodes depend on the node in the development and production environments.

You can go to the development environment and production environment separately to check whether some nodes still depend on the node.

Why do some dependencies of nodes appear as dotted lines in Operation Center?

If the dependencies of a node appear as dotted lines, cross-cycle dependencies are configured for the node. For more information about cross-cycle dependencies, see Scenario 2: Configure scheduling dependencies for a node that depends on last-cycle instances.

I configure a node scheduled by hour to depend on its own instance that is generated in the previous cycle. What are the impacts on this node and its descendant node?

  • Impact on the current node: The instance of the node in the current cycle can run only after the instance of the node in the previous cycle is successfully run.

    Scenario: If a node that is scheduled by hour starts to run at 00:00, the instance of the node in the second cycle can run only after the instance of the node in the first cycle is successfully run.

  • Impact on the descendant node of the current node: If the current node has a descendant node scheduled by day, the descendant node no longer directly depends on multiple instances of the current node. Instead, it directly depends only on a specific instance of the current node. Because each instance of the current node depends on the instance in the previous cycle, the descendant node still indirectly depends on the earlier instances of the current node.
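
The impact on the current node can be illustrated with a small simulation. This sketch assumes that every instance succeeds and that an instance starts as soon as both its scheduled time and the finish time of the previous instance have passed. The function name and the sample durations are hypothetical.

```python
from datetime import datetime, timedelta


def simulate_self_dependency(scheduled, durations):
    """Return (start, finish) for each instance that waits for the previous one."""
    runs = []
    previous_finish = None
    for scheduled_time, duration in zip(scheduled, durations):
        # The instance cannot start before its scheduled time or before
        # the instance in the previous cycle has finished.
        start = scheduled_time
        if previous_finish is not None:
            start = max(scheduled_time, previous_finish)
        finish = start + duration
        runs.append((start, finish))
        previous_finish = finish
    return runs


scheduled = [datetime(2024, 1, 1, hour) for hour in range(3)]
durations = [timedelta(minutes=90), timedelta(minutes=10), timedelta(minutes=10)]
runs = simulate_self_dependency(scheduled, durations)
for start, finish in runs:
    print(start.strftime("%H:%M"), "->", finish.strftime("%H:%M"))
# 00:00 -> 01:30
# 01:30 -> 01:40
# 02:00 -> 02:10
```

The second instance is scheduled for 01:00 but starts at 01:30 because the first instance runs for 90 minutes.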

How do I configure a dependency in which a node scheduled by day depends on a node scheduled by hour?

  • Scenario 1: Configure a node scheduled by day to depend on all the instances that are generated on the current day for a node scheduled by hour.
    • Configure the node scheduled by day to directly depend on all the instances that are generated on the current day for the node scheduled by hour.
  • Scenario 2: Configure a node scheduled by day to depend on a specific instance that is generated on the current day for a node scheduled by hour.
    • Configure the node scheduled by hour to depend on its own instance that is generated in the previous cycle.
      • This indicates that you must select Dependency on Last Cycle and set the Depend On parameter to Instances of Current Node in the Schedule section of the Properties tab for the node scheduled by hour.
    • Configure the node scheduled by hour as the ancestor node of the node scheduled by day.
      • This indicates that you must add the output name of the node scheduled by hour to Parent Nodes in the Dependencies section of the Properties tab for the node scheduled by day.
  • Scenario 3: Configure a node scheduled by day to depend on all the instances that are generated on the previous day for a node scheduled by hour.
    • Configure the node scheduled by day to depend on the node scheduled by hour.
      • This indicates that you must select Dependency on Last Cycle and set the Depend On parameter to Instances of Custom Nodes. Then, enter the ID of the node scheduled by hour in the field that appears in the Schedule section of the Properties tab for the node scheduled by day.
    • Configure the node scheduled by day to depend on its own instance in the previous cycle and remove the dependency on the node scheduled by hour.
      • This indicates that you must remove the output name of the node scheduled by hour from Parent Nodes in the Dependencies section of the Properties tab for the node scheduled by day.
Note: You must remove the node scheduled by hour from Parent Nodes for the node scheduled by day. Otherwise, the node scheduled by day depends on all the instances that are generated on the previous day and the current day for the node scheduled by hour.
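
The three scenarios differ in which instances of the node scheduled by hour the instance of the node scheduled by day waits for directly. The following sketch models only that difference. It assumes that the node scheduled by hour runs every hour on the hour starting at 00:00, and the function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def hourly_instances(day: date) -> list[datetime]:
    """Scheduled times of the 24 hourly instances generated on a day."""
    midnight = datetime.combine(day, time(0, 0))
    return [midnight + timedelta(hours=hour) for hour in range(24)]


def direct_upstream_instances(scenario: int, day: date, run_at: time = time(0, 0)):
    """Hourly instances that the daily instance of `day` directly waits for."""
    if scenario == 1:
        # All instances generated on the current day.
        return hourly_instances(day)
    if scenario == 2:
        # Only the instance whose scheduled time matches the daily node.
        return [datetime.combine(day, run_at)]
    if scenario == 3:
        # All instances generated on the previous day (cross-cycle).
        return hourly_instances(day - timedelta(days=1))
    raise ValueError(f"unknown scenario: {scenario}")


day = date(2024, 1, 2)
print(len(direct_upstream_instances(1, day)))         # 24
print(direct_upstream_instances(2, day, time(12, 0)))  # [datetime.datetime(2024, 1, 2, 12, 0)]
print(direct_upstream_instances(3, day)[0])            # 2024-01-01 00:00:00
```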

When does a node scheduled by day start to run if I configure a node scheduled by hour as the ancestor node of the node scheduled by day?

Principle: If a node scheduled by hour is configured as the ancestor node of a node scheduled by day, the node scheduled by day depends on all instances that are generated on the current day for the node scheduled by hour. This indicates that the node scheduled by day starts to run only after the last instance that is generated on the current day for the node scheduled by hour is successfully run.

Scenario:
  • The node scheduled by hour starts to run at 00:00 and runs every hour. In this case, the node scheduled by day starts to run only after all the 24 instances of the node scheduled by hour are successfully run.
  • View the dependencies of the node scheduled by day in Operation Center: Find the node scheduled by day in Operation Center, open the directed acyclic graph (DAG) of the node, right-click the node name in the DAG, and then select Show Ancestor Nodes to view all the 24 instances that are generated on the current day for the node scheduled by hour. All the dependencies of the node scheduled by day in the DAG appear as solid lines.
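
The start time that results from this principle can be sketched as the later of two times: the scheduled time of the node scheduled by day and the finish time of the last hourly instance. The function name and the sample finish times are assumptions made for illustration.

```python
from datetime import datetime


def daily_start_time(daily_scheduled, hourly_finish_times):
    """Earliest time at which the daily instance can start to run."""
    # The daily instance waits for its own scheduled time and for the
    # slowest of the hourly instances generated on the same day.
    return max(daily_scheduled, max(hourly_finish_times))


# Hypothetical day: each hourly instance finishes 5 minutes after the hour.
hourly_finish_times = [datetime(2024, 1, 1, hour, 5) for hour in range(24)]
start = daily_start_time(datetime(2024, 1, 1, 0, 0), hourly_finish_times)
print(start)  # 2024-01-01 23:05:00
```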

How do I configure a node scheduled by day to depend on a specific instance that is generated on the current day for a node scheduled by hour?

Principle: If you want to configure a node scheduled by day to depend on a specific instance that is generated on the current day for a node scheduled by hour, you must configure the node scheduled by hour to depend on its instance in the previous cycle and set the scheduled time of the node scheduled by day to the scheduled time of the instance.

Scenario: Configure a node scheduled by day to depend on an instance that is generated on the current day for a node scheduled by hour and starts to run at 12:00.
  • Dependency configuration:
    • For the node scheduled by hour: Select Dependency on Last Cycle and set the Depend On parameter to Instances of Current Node in the Schedule section of the Properties tab.
    • For the node scheduled by day: Set the time when the node starts to run to 12:00.
  • View dependencies in Operation Center:
    • Find the node scheduled by day in Operation Center, open the DAG of the node, right-click the node name in the DAG, and then select Show Ancestor Nodes to view the instance that is generated on the current day for the node scheduled by hour and starts to run at 12:00. The dependency of the node scheduled by day in the DAG appears as a solid line.
    • Find the node scheduled by hour in Operation Center, open the DAG of the node, right-click the node name in the DAG, and then select Show Ancestor Nodes to view the instance that starts to run at 11:00. The instance that starts to run at 12:00 depends on the instance that starts to run at 11:00. The dependency of the node scheduled by hour appears as a dotted line. This is because the node scheduled by hour depends on its own instance in the previous cycle.

How do I configure a node scheduled by day to depend on all the instances that are generated on the previous day for a node scheduled by hour?

Principle: If you want to configure a node scheduled by day to depend on all the instances that are generated on the previous day for a node scheduled by hour, you must configure a cross-cycle dependency on the node scheduled by hour for the node scheduled by day.

Scenario: Configure a node scheduled by day to depend on all the instances that are generated on the previous day for the node scheduled by hour.
  • Dependency configuration:
    • For the node scheduled by day: Select Dependency on Last Cycle and set the Depend On parameter to Instances of Custom Nodes. Then, enter the ID of the node scheduled by hour in the field that appears in the Schedule section of the Properties tab.
    • For the node scheduled by hour: You do not need to configure dependencies.
  • View dependencies in Operation Center:

    Find the node scheduled by day in Operation Center, open the DAG of the node, right-click the node name in the DAG, and then select Show Ancestor Nodes to view all the instances that are generated on the previous day for the node scheduled by hour. The dependencies of the node scheduled by day appear as dotted lines. This is because the node is configured with a cross-cycle dependency on the node scheduled by hour.

In which scenarios do I need to configure a node to depend on its own instance in the previous cycle?

Scenario: If a node needs to use data that is generated by the node itself in the previous cycle, you can configure the node to depend on its own instance in the previous cycle. In this case, the instance of the node in the current cycle runs only after the instance in the previous cycle is successfully run. This ensures that the instance in the current cycle can obtain data from the instance in the previous cycle.
  • A node needs to use data generated by the node itself in the previous cycle. You must select Dependency on Last Cycle and set the Depend On parameter to Instances of Current Node for the node in the Schedule section of the Properties tab for the node.
  • A node scheduled by hour depends on a node scheduled by day. If the instance of the node scheduled by day finishes late, the scheduled time of multiple instances that are generated on the same day for the node scheduled by hour may already have arrived by the time the instance of the node scheduled by day is successfully run. As a result, these instances of the node scheduled by hour are run concurrently. To prevent this, select Dependency on Last Cycle and set the Depend On parameter to Instances of Current Node in the Schedule section of the Properties tab for the node scheduled by hour.
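
The second case can be illustrated with a small simulation that compares the start times of hourly instances with and without the dependency on the previous cycle. The sketch assumes unlimited resources, a fixed run duration, and successful runs. The function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def hourly_start_times(scheduled, parent_finish, duration, self_dependent):
    """Start time of each hourly instance that waits for the daily parent."""
    starts = []
    previous_finish = None
    for scheduled_time in scheduled:
        # Every hourly instance waits for the daily ancestor instance.
        start = max(scheduled_time, parent_finish)
        # With Dependency on Last Cycle, it also waits for the previous instance.
        if self_dependent and previous_finish is not None:
            start = max(start, previous_finish)
        starts.append(start)
        previous_finish = start + duration
    return starts


scheduled = [datetime(2024, 1, 1, hour) for hour in range(5)]
parent_finish = datetime(2024, 1, 1, 3, 30)  # the daily node finishes late
duration = timedelta(minutes=10)

concurrent = hourly_start_times(scheduled, parent_finish, duration, False)
sequential = hourly_start_times(scheduled, parent_finish, duration, True)
print([t.strftime("%H:%M") for t in concurrent])
# ['03:30', '03:30', '03:30', '03:30', '04:00']
print([t.strftime("%H:%M") for t in sequential])
# ['03:30', '03:40', '03:50', '04:00', '04:10']
```

Without the dependency on the previous cycle, the four instances scheduled between 00:00 and 03:00 all start at 03:30. With the dependency, they run one after another.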

How do I configure dependencies for a node that needs to depend on multiple nodes?

If a node needs to depend on multiple nodes, you must determine whether to configure dependencies between the node and these nodes. If the node strongly depends on the table data generated by these nodes, we recommend that you configure dependencies between the node and these nodes. For more information about how to determine whether to configure dependencies between nodes, see the "Why do we need to configure scheduling dependencies?" section of this topic.

For example, Node A is scheduled by hour and generates Table A, and Node B is scheduled by day and generates Table B. Node C depends on Node A and Node B and needs to use data in Table A and Table B.

If you add the output name of Node A to Parent Nodes for Node C, but do not add the output name of Node B to Parent Nodes for Node C, Node C may start to run even if Node B is still running. As a result, Node C fails to obtain data in Table B, and an error occurs on Node C. To resolve this issue, you must add the output names of both Node A and Node B to Parent Nodes for Node C.
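
The difference between the two configurations can be sketched with a readiness check in which a node can run only after every node listed in its Parent Nodes is successfully run. The function can_run and the data layout are assumptions made for this example.

```python
def can_run(node: str, parents: dict[str, set[str]], succeeded: set[str]) -> bool:
    """A node is ready only if all of its configured ancestor nodes succeeded."""
    return all(parent in succeeded for parent in parents.get(node, set()))


# Node A has finished, and Node B is still running.
succeeded = {"Node A"}

incomplete = {"Node C": {"Node A"}}           # Node B is missing
complete = {"Node C": {"Node A", "Node B"}}   # both ancestors are configured

print(can_run("Node C", incomplete, succeeded))  # True: starts before Table B is ready
print(can_run("Node C", complete, succeeded))    # False: waits for Node B
```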

If a node does not strongly depend on the table data generated by another node and the node can obtain the data even if the latest data is not generated by another node, you do not need to configure a dependency between the two nodes.

Node B scheduled by day depends on Node A scheduled by hour, and Node B starts to run only after all the instances that are generated on the current day for Node A are successfully run. Will the running of Node B be affected if Node A still runs on the next day?

Node B depends on all the instances that are generated on the current day for Node A. Node B automatically runs every day after all the instances of Node A are successfully run. If the last instance of Node A is successfully run on the next day, Node B still runs, but later than its scheduled time. Scheduling parameters are still replaced as expected.

Node A runs every hour on the hour, and Node B runs once every day. How do I configure Node B to automatically run after the first instance of Node A is successfully run every day?

When you configure time properties for Node A in the Schedule section of the Properties tab, select Dependency on Last Cycle and set the Depend On parameter to Instances of Current Node. You must set the Run At parameter to 00:00 for Node B in the Schedule section of the Properties tab. This way, Node B depends only on the first instance of Node A that is generated at 00:00 every day.

How do I configure Nodes A, B, and C to run in sequence once per hour?

  1. Dependencies: Configure the output of Node A as the input of Node B and the output of Node B as the input of Node C.
  2. Scheduling cycle: Configure Node A, Node B, and Node C to be scheduled by hour.
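
The run order that results from these dependencies can be sketched with the graphlib module of the Python standard library (Python 3.9 or later). The mapping below is a hypothetical representation of Parent Nodes for each node, not a DataWorks configuration format.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of its ancestor nodes (Parent Nodes).
parents = {
    "Node A": set(),
    "Node B": {"Node A"},
    "Node C": {"Node B"},
}

# Within each hourly cycle, the nodes run in dependency order.
order = list(TopologicalSorter(parents).static_order())
print(order)  # ['Node A', 'Node B', 'Node C']
```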

How do I configure dependencies between nodes that reside in the same region but belong to different workspaces and workflows?

Principle: Use the output of a node as the input of another node to establish a dependency between the two nodes.

Add the output name of a node to Parent Nodes for another node to establish a dependency between the two nodes. The two nodes can belong to different workspaces and workflows.