Background of branch nodes
In daily use of DataWorks, you may often run into the following question: I have a node that needs to run on the last day of each month. How should I set it up?
Answer: Before branch nodes were available, this scenario could not be expressed with a Cron expression, so it was not supported.
Now, DataWorks officially supports branch nodes. With branch nodes, we can apply the switch-case programming model to meet this requirement.
Branch nodes and other control nodes
On the Data Development page, you can see the various control nodes currently supported by DataWorks, including assignment nodes, branch nodes, merge nodes and so on.
- Assignment node: passes its own result to downstream nodes:
The assignment node builds on the existing node-context dependency feature. In addition to the two existing kinds of node context (constant and variable), the assignment node comes with a custom context output. DataWorks captures the SELECT result or the printed result of the assignment node and exposes it as the value of a context output parameter named outputs, which downstream nodes can reference.
- Branch node: determines which downstream nodes are executed normally:
The branch node builds on the input and output settings of DataWorks dependencies.
For an ordinary node, each output is just a globally unique string. When a downstream node sets its dependencies, it searches for this string and adds it as an input, which places the downstream node in the upstream node's list of downstream nodes.
For a branch node, however, each output can be associated with a condition. When a downstream node sets its dependencies, it can choose the output bound to a specific condition as its input. The downstream node is then tied to that condition: if the condition is met, the downstream nodes attached to that output are executed normally; the downstream nodes attached to outputs whose conditions are not met are set to run empty.
- Merge node: is scheduled normally regardless of whether its upstream nodes ran normally:
For branches that the branch node does not select, DataWorks sets every node instance on the branch to an empty run. In other words, once any upstream of an instance runs empty, the instance itself also runs empty.
DataWorks uses the merge node to stop this empty-run attribute from propagating without limit: a merge node instance succeeds directly no matter how many of its upstream instances ran empty, and its downstream nodes no longer run empty.
- ASN: an assignment node that performs the more complex calculation needed to prepare for the branch nodes' condition selection.
- X/Y: branch nodes downstream of the assignment node ASN; they select branches according to the output of ASN. As shown by the green lines in the figure, node X selects the left branch, and node Y selects the two branches on
- Nodes A and C are executed normally because they are downstream of the outputs selected by nodes X and Y.
- Node B is downstream of a branch selected by node Y, but because node X does not select the output that B depends on, node B is set to run empty.
- Node E is not selected by node Y, so it is set to run empty even though it also has an ordinary upstream node, Z.
- Node E, an upstream of node G, runs empty, so node G also runs empty even though nodes C and F are both executed normally.
- How can the empty-run attribute be stopped from propagating further?
The JOIN node is a merge node, whose special function is to stop the propagation of the empty-run attribute. Because node D is downstream of the JOIN node, the empty-run attribute of node B is blocked, and node D runs normally.
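The propagation rules described above can be illustrated with a small simulation. This is only a sketch of the rules as stated in this article, not DataWorks code: the function name `resolve_states`, the state labels, and the exact edge set are assumptions (the edges are inferred from the bullet descriptions, and the upstream of node F in particular is a guess).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def resolve_states(upstreams, merge_nodes, unselected_edges):
    """Return {node: "run" | "empty"} for one scheduling cycle.

    upstreams        -- maps each node to the list of its upstream nodes
    merge_nodes      -- nodes that stop the empty-run attribute
    unselected_edges -- (branch_node, downstream) pairs whose condition was not met
    """
    states = {}
    # static_order() yields every node after all of its upstream nodes.
    for node in TopologicalSorter(upstreams).static_order():
        parents = upstreams.get(node, [])
        if node in merge_nodes:
            # A merge node succeeds no matter how many upstreams ran empty.
            states[node] = "run"
        elif any(states[p] == "empty" or (p, node) in unselected_edges
                 for p in parents):
            # One empty upstream, or one unselected branch output, is enough.
            states[node] = "empty"
        else:
            states[node] = "run"
    return states


# Hypothetical edge set inferred from the bullets (the figure is not reproduced).
upstreams = {
    "X": ["ASN"], "Y": ["ASN"],
    "A": ["X"],
    "B": ["X", "Y"],
    "C": ["Y"],
    "E": ["Y", "Z"],
    "F": ["Z"],          # assumption: F hangs under the ordinary node Z
    "G": ["C", "E", "F"],
    "JOIN": ["B"],
    "D": ["JOIN"],
}
states = resolve_states(
    upstreams,
    merge_nodes={"JOIN"},
    unselected_edges={("X", "B"), ("Y", "E")},
)
for name in ["A", "B", "C", "D", "E", "G"]:
    print(name, states[name])
```

Under these assumptions, B, E, and G run empty, while A, C, JOIN, and D run normally, which matches the behavior described in the bullets.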
By combining branch nodes with the other control nodes, we can implement a node that runs only on the last day of each month.
Use a branch node
Define task dependencies
- The root assignment node uses the scheduled time SKYNET_CYCTIME to calculate whether the current day is the last day of the month. It outputs 1 if it is and 0 if it is not. DataWorks captures this output and passes it downstream.
- The branch node defines its branches according to the output of the assignment node.
- The two Shell nodes are attached to the branch node and carry out the logic of the two branches.
Define assignment nodes
- For SQL types, DataWorks captures the result of the last SELECT statement as the value of outputs.
- For SHELL/Python types, DataWorks captures the last line of standard output as the value of outputs.
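The capture rule for Shell and Python nodes can be pictured with a tiny helper. This is a sketch of the rule as stated above, not DataWorks code: `capture_outputs` is a hypothetical name, and how DataWorks treats trailing blank lines is not specified in this article.

```python
def capture_outputs(stdout_text: str) -> str:
    """Return the last line of a node's standard output ("" if there is none)."""
    lines = stdout_text.splitlines()
    return lines[-1] if lines else ""


# Earlier lines (for example, log messages) are ignored; only the last one counts.
print(capture_outputs("checking date...\n1\n"))
```

The practical consequence is that the value you want to pass downstream must be the last thing the assignment node prints.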
In this article, the assignment node is written in Python. Its scheduling properties and code are set as follows.
- The code is as follows:
- Schedule configuration
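The assignment node's code for this scenario can be sketched as follows. This is a minimal sketch, not the original code from this article: it assumes that SKYNET_CYCTIME is available to the node as an environment variable in yyyymmddhhmmss format, and the helper name `is_last_day_of_month` is hypothetical. The fallback to the current time is only there so that the sketch also runs outside DataWorks.

```python
import os
from datetime import datetime, timedelta


def is_last_day_of_month(cyctime: str) -> int:
    """Return 1 if the date part of `cyctime` is the last day of its month, else 0."""
    day = datetime.strptime(cyctime[:8], "%Y%m%d")
    # A day is the last of its month exactly when the next day is in another month.
    return 1 if (day + timedelta(days=1)).month != day.month else 0


# Assumption: SKYNET_CYCTIME is an environment variable shaped like 20181231000000.
cyctime = os.environ.get("SKYNET_CYCTIME") or datetime.now().strftime("%Y%m%d%H%M%S")

# This must be the last line printed: it is the value captured as outputs.
print(is_last_day_of_month(cyctime))
```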
Define branch nodes
A branch node defines conditions as simple Python expressions, each bound to an output. When a condition is met, the downstream nodes attached to its output are executed; the downstream nodes attached to the other outputs are set to run empty.
- Schedule configuration
- Branch configuration
- The outputs bound to conditions, as generated in the schedule configuration
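The relationship between conditions and outputs can be sketched in plain Python. This only models the idea of binding each condition to an output: the parameter name `is_last_day`, the output names, and the `select_outputs` helper are all hypothetical, and the exact expression syntax accepted by the DataWorks branch configuration may differ.

```python
def select_outputs(branches, context):
    """Return the outputs whose bound condition evaluates to true.

    branches -- list of (condition_expression, output_name) pairs
    context  -- values received from the upstream assignment node
    """
    selected = []
    for condition, output in branches:
        # eval() is used only for illustration; builtins are disabled.
        if eval(condition, {"__builtins__": {}}, dict(context)):
            selected.append(output)
    return selected


# Hypothetical branch configuration: one output per condition.
branches = [
    ("is_last_day == 1", "branch_last_day"),
    ("is_last_day == 0", "branch_not_last_day"),
]
print(select_outputs(branches, {"is_last_day": 1}))
```

Downstream nodes attached to a selected output run normally; those attached to an output that is not selected are set to run empty.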
Attach the task nodes to different branches
Finally, take care when setting dependencies on the nodes that actually perform the tasks. The branch node now has three outputs, and under the previous dependency logic any one of them could be used as an input. Because each output of the branch node is now associated with a condition, the output must be chosen carefully.
- Dependencies of the node that runs on the last day of each month
- Dependencies of the node that runs on the other days of each month
After completing all of the above configuration, submit and publish the task. After publishing, you can run retroactive instances (patch data) to test the effect. Select the business dates 2018-12-30 and 2018-12-31, which correspond to scheduled times of 2018-12-31 and 2019-01-01 respectively. The first batch should trigger the "last day" logic and the second batch the "non-last day" logic. Compare the two results.
- Branch selection results of the branch node (business date 2018-12-30, scheduled time 2018-12-31)
- The node "RunOnLast" is executed normally.
- The node "RunExceptLast" is set to run empty.
- Branch selection results of the branch node (business date 2018-12-31, scheduled time 2019-01-01)
- The node "RunOnLast" is set to run empty.
- The node "RunExceptLast" is executed normally.
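The outcome of the two test runs can be checked by hand with a short calculation. This is a sketch that assumes, as in the test above, that the scheduled time is the business date plus one day; the helper name `expected_run` is hypothetical.

```python
import calendar
from datetime import date, timedelta


def expected_run(business_date: date) -> str:
    """Return the name of the node expected to run normally for a business date."""
    # Assumption taken from the test above: scheduled time = business date + 1 day.
    scheduled = business_date + timedelta(days=1)
    # monthrange() returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(scheduled.year, scheduled.month)[1]
    return "RunOnLast" if scheduled.day == last_day else "RunExceptLast"


for biz in (date(2018, 12, 30), date(2018, 12, 31)):
    print(biz.isoformat(), expected_run(biz))
```

For business date 2018-12-30 the scheduled date is 2018-12-31, the last day of December, so RunOnLast is the node that runs; for 2018-12-31 the scheduled date is 2019-01-01, so RunExceptLast runs instead.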
With the branch node, you have achieved the goal of running a node on the last day of each month. This is the simplest way to use a branch node; by combining an assignment node with a branch node, you can build a variety of conditions to meet your business needs.
- DataWorks captures the last SELECT statement or the last line of the standard output stream of the assignment node as the output of an assignment node for downstream references.
- Each output of a branch node is associated with a condition. When a downstream node uses the branch node as its upstream, understand the meaning of the condition associated with each output before selecting one.
- Unselected branches are set to run empty, and the empty-run attribute is passed down until a merge node is encountered.
- Besides blocking the empty-run attribute, the merge node has other, more powerful features for you to explore.