In DataWorks V2.0, scheduling dependencies between tasks are configured through node output names: a downstream node depends on an upstream node by referencing that node's output name. This article describes how to configure the input and output of a node for task scheduling dependencies.
How to configure the node input of a task
Configuring a task dependency essentially means setting a dependency between two nodes. A dependency is valid, and can be set successfully, only if the node it refers to exists.
Invalid upstream dependency
An upstream dependency is invalid in either of the following cases:
- The parent node does not exist.
- The parent node output does not exist.
To set a valid upstream dependency on a table:
- Confirm that the table has an output task.
- Confirm the output name of that table's output task, and manually enter that node output name as the dependent upstream node.
For example, the output name of the upstream node A is A1, and downstream node B depends on node A. At this point, enter A1 in the input box of the upstream node, and click the plus sign on the right to add it.
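The validity check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DataWorks code: the `node_outputs` registry and the `resolve_upstream` helper are hypothetical names introduced for this example.

```python
# Hypothetical registry for illustration: node output name -> producing node.
node_outputs = {"A1": "A"}


def resolve_upstream(output_name, registry):
    """Return the node that produces output_name, or raise if no node does."""
    if output_name not in registry:
        # Covers both invalid cases: the parent node or its output does not exist.
        raise ValueError(f"invalid upstream dependency: no node outputs {output_name!r}")
    return registry[output_name]


# Node B depends on node A by entering A's output name, A1.
parent_of_b = resolve_upstream("A1", node_outputs)
```

An output name that no node produces, such as a mistyped `A2`, would raise an error here, which mirrors a dependency that cannot be set successfully.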
How to configure upstream dependencies
How to configure the node output of a task
The simplest and most efficient way to configure the node output is to give the node name, the node output name, and the node output table name the same name (the three-in-one rule). The advantages are as follows.
- You can quickly know which table this task is operating on.
- You can quickly see how far the impact reaches if this task fails.
- When you use automatic parsing to configure task dependencies, its accuracy is greatly improved as long as the node output follows the three-in-one rule.
Automatic parsing: scheduling dependencies are parsed automatically from the code. Implementation principle: only table names can be obtained from the code, so the automatic parsing function uses each table name to find the corresponding output task. For example:
INSERT OVERWRITE TABLE pm_table_a SELECT * FROM project_b_name.pm_table_b;
From this code, DataWorks can automatically parse that this node depends on pm_table_b and that its final output is pm_table_a. The parsed result is therefore that the parent node output name is project_b_name.pm_table_b and the node output name is project_name.pm_table_a (here the project name is MaxCompute_DOC).
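To make the principle concrete, here is a minimal Python sketch of table-name-based parsing. It is not how DataWorks is implemented. The `parse_dependencies` function and its regular expressions are assumptions for illustration, and they only handle simple `INSERT ... SELECT` statements like the one above.

```python
import re


def parse_dependencies(sql, current_project):
    """Return (parent output names, node output names) found in a simple SQL statement."""

    def qualify(table):
        # Assumption: a table without a project prefix belongs to the current project.
        return table if "." in table else f"{current_project}.{table}"

    # Tables written by the statement become node outputs.
    written = re.findall(r"INSERT\s+(?:OVERWRITE|INTO)\s+TABLE\s+([\w.]+)", sql, re.IGNORECASE)
    # Tables read by the statement become parent node outputs.
    read = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return [qualify(t) for t in read], [qualify(t) for t in written]


sql = "INSERT OVERWRITE TABLE pm_table_a SELECT * FROM project_b_name.pm_table_b;"
parents, outputs = parse_dependencies(sql, "MaxCompute_DOC")
print(parents)  # ['project_b_name.pm_table_b']
print(outputs)  # ['MaxCompute_DOC.pm_table_a']
```

Because the parser sees only table names, it can find the right upstream task only when that task's output name matches the table name, which is why consistent naming matters.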
- If you do not want to use dependencies that are parsed from the code, select No.
- If the code contains temporary tables (for example, tables whose names begin with t_), those tables are not parsed as scheduling dependencies. Which prefix marks a table as temporary is defined in the project configuration.
- If a table in the code is both the output table and referenced table (depended table), it is parsed only as the output table.
- If a table in the code is referenced or output multiple times, only one scheduling dependency is parsed for it.
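The parsing rules in this list can be illustrated with a short Python sketch. The `apply_parsing_rules` function, its parameters, and the sample table name `t_stage` are hypothetical. The sketch also assumes that the temporary-table prefix is checked against the bare table name and applies to output tables as well as referenced tables, which the text does not state explicitly.

```python
def apply_parsing_rules(referenced, produced, temp_prefix="t_"):
    """Filter parsed table names into (inputs, outputs) following the listed rules."""

    def is_temp(table):
        # Compare against the bare table name, ignoring any project prefix.
        return table.split(".")[-1].startswith(temp_prefix)

    # dict.fromkeys drops duplicates while keeping first-seen order.
    outputs = [t for t in dict.fromkeys(produced) if not is_temp(t)]
    # A table that is both referenced and output counts only as an output.
    inputs = [t for t in dict.fromkeys(referenced) if not is_temp(t) and t not in outputs]
    return inputs, outputs


inputs, outputs = apply_parsing_rules(
    referenced=["pm_table_b", "pm_table_b", "t_stage", "pm_table_a"],
    produced=["pm_table_a", "t_stage"],
)
print(inputs)   # ['pm_table_b']
print(outputs)  # ['pm_table_a']
```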
How to delete the input and output of a table
During data development you often use static tables (tables whose data is uploaded from a local file). Such static data is not produced by any output task. If the name of a static table does not match the temporary-table form (for example, t_), it is not treated as a temporary table, so when you configure dependencies you need to delete the static table from the node input.
If you are upgrading from DataWorks to DataWorks V2.0, we set the node output of each migrated DataWorks task to ProjectName.NodeName for you by default.
When the task dependency configuration is complete, the submission window shows an option asking whether to proceed with the submission when the configured input and output do not match the code lineage analysis.
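A mismatch of this kind amounts to a difference between two sets of names. The Python sketch below shows one way to picture that comparison. The `lineage_mismatch` function and the table name `MaxCompute_DOC.static_table` are hypothetical and do not reflect how DataWorks performs the check.

```python
def lineage_mismatch(configured, parsed):
    """Return (configured but not in code, in code but not configured)."""
    configured, parsed = set(configured), set(parsed)
    return sorted(configured - parsed), sorted(parsed - configured)


only_configured, only_parsed = lineage_mismatch(
    configured=["project_b_name.pm_table_b", "MaxCompute_DOC.static_table"],
    parsed=["project_b_name.pm_table_b"],
)
print(only_configured)  # ['MaxCompute_DOC.static_table']
print(only_parsed)      # []
```

When both lists are empty, the configured dependencies agree with what was parsed from the code.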