In DataWorks V2.0, scheduling dependencies between tasks are configured by using the output name of the current node as the association item. This article describes how to configure the inputs and outputs of task scheduling dependencies.

How to configure the node input of a task

There are two ways to configure the node input: use the automatic code parsing function to resolve the task's dependencies, or manually enter the dependency by typing the output name of the upstream node.

Note: When you enter an upstream node manually, enter the output name of the parent node, not its task name. If the parent node's task name differs from its output name, be sure to enter the output name correctly.
When you configure an upstream node, the automatically parsed upstream node may be an invalid upstream dependency. To identify whether a dependency is valid, view the parsed upstream dependencies and check whether a value is displayed in the Upstream Node ID column, as shown in the following figure.

Configuring a task dependency essentially sets the dependency between two nodes. Only when both nodes exist can a valid dependency be established and the task dependency be set successfully.

Invalid upstream dependency

Invalid upstream dependencies usually fall into two cases:
  1. The parent node does not exist.

  2. The parent node output does not exist.

Invalid upstream dependencies typically occur because the parsed parent node output name does not exist. For example, the table "project_b_name.pm_table_b" may have no output task, or the node output of the task that produces the table may be configured incorrectly and therefore cannot be parsed. There are two solutions:
  1. Confirm that the table has an output task.
  2. Confirm the output name of the task that produces the table, and manually enter that node output name as the upstream dependency.

Note: When you enter an upstream node manually, enter the parent node's output name. If the parent node's task name differs from its output name, be sure to enter the output name correctly.

For example, if the output name of upstream node A is A1 and downstream node B depends on node A, enter A1 in the upstream node input box of node B and click the plus sign on the right to add it.

How to configure upstream dependencies

If your table is extracted directly from a source database and has no upstream task, you can click Use The Workspace Root Node to use the workspace root node as the upstream dependency.

How to configure the node output of a task

The simplest and most efficient way to configure the node output is to keep the node name, the node output name, and the output table name identical (three in one). The advantages are as follows.

  1. You can quickly tell which table this task operates on.
  2. You can quickly assess the scope of impact if this task fails.
  3. When you use automatic parsing to configure task dependencies, keeping the node output consistent with the three-in-one rule greatly improves the accuracy of automatic parsing.
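
For example, here is a minimal sketch of the three-in-one convention, assuming a task that produces the table pm_table_a in the project project_name (the same names used in the example later in this section):

Node name:        pm_table_a
Node output name: project_name.pm_table_a
Task code:        INSERT OVERWRITE TABLE pm_table_a SELECT ... ;

With this naming, anyone looking at the schedule can immediately see which table the task writes, and a downstream task that reads pm_table_a can be parsed against the output name project_name.pm_table_a.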

Automatic parsing

Automatic parsing refers to automatically parsing scheduling dependencies from the task code. Implementation principle: only table names can be obtained from the code, so the automatic parsing function resolves the corresponding output task according to each table name.

For example, the node code is shown below.
INSERT OVERWRITE TABLE pm_table_a SELECT * FROM project_b_name.pm_table_b;
The dependencies parsed are as follows.

DataWorks automatically parses that this node depends on the node in project_b_name that outputs pm_table_b, and that this node finally outputs pm_table_a. Therefore, the parsed parent node output name is project_b_name.pm_table_b, and the node output name is project_name.pm_table_a (the project name here is MaxCompute_DOC).
  • If you do not want to use the dependencies that are parsed from the code, select No.
  • If the code contains many temporary tables (for example, tables whose names begin with t_), those tables are not parsed as scheduling dependencies. Which naming prefix marks a table as temporary is defined in the project configuration.
  • If a table in the code is both an output table and a referenced (depended-on) table, it is parsed only as an output table.
  • If a table in the code is referenced or output multiple times, only one scheduling dependency is parsed.
Note: By default, a table whose name starts with t_ is recognized as a temporary table, and automatic parsing does not resolve temporary tables. If a table whose name starts with t_ is not a temporary table, contact your project administrator to modify this rule in the project configuration.
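
The following minimal sketch shows how these rules would apply (t_staging_orders is a hypothetical table name; pm_table_a and project_b_name.pm_table_b come from the example above, and the temporary-table prefix is assumed to be the default t_):

INSERT OVERWRITE TABLE pm_table_a                  -- output table: parsed as this node's output
SELECT a.id, b.amount
FROM pm_table_a a                                  -- also referenced here, but parsed only as the output table
JOIN project_b_name.pm_table_b b ON a.id = b.id    -- parsed as a single upstream dependency
JOIN t_staging_orders t ON a.id = t.id;            -- name starts with t_: treated as a temporary table and not parsed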

How to delete the input and output of a table

During data development, you often use static tables (tables whose data is uploaded from a local file). A static table has no actual output task. Therefore, when you configure dependencies, you need to delete the input of the static table: if the static table's name does not match the temporary-table prefix (such as t_), it is not treated as a temporary table, and you must delete its input.

Select the table name in the code and click Remove Input.
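
For example (a minimal sketch; static_dim_region is a hypothetical static table whose data was uploaded from a local file):

INSERT OVERWRITE TABLE pm_table_a SELECT * FROM static_dim_region;  -- static_dim_region has no output task

If automatic parsing adds static_dim_region as an input, select static_dim_region in the code and click Remove Input to delete it.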

If you upgrade from DataWorks to DataWorks V2.0, the node output of each migrated task is set to ProjectName.NodeName by default.

Notes

When the task dependency configuration is complete and you submit the task, the submit window shows an option: whether to confirm and proceed with the submission when the input and output do not match the code lineage analysis.

Select this option only if you have confirmed that the dependencies are correct. If you are not sure, verify the dependencies as described above.