Learn how to create a workflow, create nodes, and configure their dependencies. You can then use the data development feature to analyze and compute data in your workspace.
Prerequisites
Before you begin, create the business data table bank_data and the sink table result_table in your workspace. The business data table must contain data. For more information, see Create a table and upload data.
Background
In DataWorks, you can visually configure dependencies between nodes within a workflow. Workflows let you process data and define its dependencies. You can create multiple workflows in a single workspace. For more information, see Create a workflow.
Create a workflow
-
Log on to the DataWorks console. In the target region, click in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
-
On the Data Studio page, move the pointer over the icon and click Create Workflow.
-
In the Create Workflow dialog box, enter the Business Name and Description.
-
Click Create.
Create nodes and configure dependencies
Create a zero load node (start) and an ODPS SQL node (insert_data) in the workflow. Then, configure the dependency so that the insert_data node depends on the start node.
-
A zero load node is a control node that does not process data. Instead, it only manages its downstream nodes.
-
If you manually set a zero load node to failed during an O&M task, its unexecuted downstream nodes are not triggered. This feature helps prevent the propagation of incorrect data from upstream sources.
-
The parent node of a zero load node is typically the workspace root node. The workspace root node name follows the format WorkspaceName_root.
-
DataWorks automatically adds an output for each node with the name WorkspaceName.NodeName. If two nodes in the same workspace have the same name, you must rename the output of one of the nodes.
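The default output naming rule above can be sketched in a few lines of Python. This is only an illustration of the WorkspaceName.NodeName format and of how a name clash arises. The workspace name `my_workspace` and both helper functions are hypothetical and are not part of any DataWorks API.

```python
from collections import Counter


def default_output_name(workspace: str, node: str) -> str:
    """Build a default output name in the WorkspaceName.NodeName format."""
    return f"{workspace}.{node}"


def find_output_collisions(workspace: str, node_names: list[str]) -> dict[str, int]:
    """Return the default output names that more than one node would share."""
    counts = Counter(default_output_name(workspace, name) for name in node_names)
    return {output: n for output, n in counts.items() if n > 1}


# "my_workspace" is a made-up workspace name used only for illustration.
nodes = ["start", "insert_data", "insert_data"]
collisions = find_output_collisions("my_workspace", nodes)
print(collisions)  # {'my_workspace.insert_data': 2}
```

Any name reported here corresponds to the case in which the output of one of the nodes must be renamed.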
We recommend creating a zero load node to serve as the root node of your workflow. This node can control the execution of the entire workflow. To design the workflow, follow these steps:
-
Double-click the workflow name to open the development panel. Click General > Zero Load Node.
You can also drag the Zero Load Node to the development panel on the right.
-
In the Create Node dialog box, select a Path, enter start for the Node Name, and click OK.
-
Repeat the preceding steps to create an ODPS SQL node named insert_data.
-
Drag a line from the start node to the insert_data node to set start as the parent node of insert_data.
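The dependency drawn in the steps above can be pictured as a small directed graph. The following Python sketch is a simplified model, not how DataWorks schedules nodes internally. It assumes a hypothetical root node named `my_workspace_root` as the parent of start, and it shows why nodes downstream of a zero load node are not triggered when that node is set to failed.

```python
from collections import deque

# Edges point from parent to child. "my_workspace_root" is a hypothetical
# root node name that follows the WorkspaceName_root format.
children = {
    "my_workspace_root": ["start"],
    "start": ["insert_data"],
    "insert_data": [],
}


def successful_run_order(children, failed=frozenset()):
    """Return the nodes that run successfully, in dependency order.

    A node becomes ready only after all of its parents have succeeded.
    A node listed in `failed` never releases its children.
    """
    # Count how many parents each node is still waiting for.
    pending = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            pending[kid] += 1

    ready = deque(node for node, count in pending.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        if node in failed:
            continue  # downstream nodes stay blocked
        order.append(node)
        for kid in children[node]:
            pending[kid] -= 1
            if pending[kid] == 0:
                ready.append(kid)
    return order


print(successful_run_order(children))
# ['my_workspace_root', 'start', 'insert_data']
print(successful_run_order(children, failed={"start"}))
# ['my_workspace_root']
```

With start marked as failed, insert_data never becomes ready, which mirrors the behavior described for zero load nodes.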
Configure the zero load node's upstream dependency
A zero load node often controls the entire workflow, acting as the ultimate parent node for all other nodes.
Typically, you set the workspace root node as the parent node of the zero load node:
-
Double-click the name of the zero load node to open its editor.
-
In the right-side pane, click Scheduling.
-
In the Scheduling Dependencies section, click Add Root Node to set the workspace root node as the parent node of the zero load node.
-
Save and commit the node.
Important: You must configure the Rerun and Parent Nodes properties before you can commit the node.
-
In the toolbar, click the icon to save the node.
-
In the toolbar, click the icon to commit the node.
-
In the Submit dialog box, enter a Change Description.
-
Click OK.
Edit and run the ODPS SQL node
This section describes how to use SQL code in the insert_data ODPS SQL node to query the number of single individuals with home loans, grouped by education level, and save the results for further analysis or presentation by downstream nodes.
-
Open the editor for the ODPS SQL node and enter the following code.
For more information about the syntax, see SQL overview.
    INSERT OVERWRITE TABLE result_table  -- Insert data into the result_table table.
    SELECT education
         , COUNT(marital) AS num
    FROM bank_data
    WHERE housing = 'yes'
      AND marital = 'single'
    GROUP BY education;
-
Right-click bank_data in the code and select Delete Input.
The bank_data table, created in Create a table and upload data, is not generated by a scheduled node. If a node reads from such a table, you can right-click the table name in the code editor to remove the automatically parsed dependency. Alternatively, you can add a specific comment at the beginning of the code to prevent the dependency from being created.
Note: Scheduling dependencies in DataWorks ensure that downstream nodes can reliably access data from upstream nodes that are updated on a regular schedule. The platform cannot monitor tables that are not updated by scheduled nodes within DataWorks. If a node selects data from a table that is not generated by a scheduled task, you must manually delete the dependency that DataWorks automatically creates.
-
Click the icon in the toolbar to save your code.
-
Click the icon to run the code. After the run is complete, you can view the run log and results at the bottom of the page.
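To see what the query in the insert_data node computes, you can try the same logic locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for MaxCompute, which is an assumption made purely for illustration. The sample rows are invented, the two-column layout of result_table is inferred from the query, and because SQLite has no INSERT OVERWRITE statement, the overwrite is imitated by clearing the table before inserting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bank_data (education TEXT, marital TEXT, housing TEXT);
    CREATE TABLE result_table (education TEXT, num INTEGER);
    """
)

# Invented sample rows; the real bank_data table has more columns and rows.
rows = [
    ("university.degree", "single", "yes"),
    ("university.degree", "single", "yes"),
    ("university.degree", "married", "yes"),
    ("high.school", "single", "yes"),
    ("high.school", "single", "no"),
]
conn.executemany("INSERT INTO bank_data VALUES (?, ?, ?)", rows)

# SQLite has no INSERT OVERWRITE, so clear the table first to imitate it.
conn.execute("DELETE FROM result_table")
conn.execute(
    """
    INSERT INTO result_table
    SELECT education, COUNT(marital) AS num
    FROM bank_data
    WHERE housing = 'yes' AND marital = 'single'
    GROUP BY education
    """
)

result = conn.execute(
    "SELECT education, num FROM result_table ORDER BY education"
).fetchall()
print(result)  # [('high.school', 1), ('university.degree', 2)]
```

Only rows with a home loan and a marital status of single are counted, so the married row and the row without a home loan do not contribute to either group.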
Commit the workflow
-
After running and debugging the insert_data ODPS SQL node, return to the workflow canvas.
-
Click the icon in the toolbar to commit the workflow.
-
In the Submit dialog box, select the nodes that you want to commit, enter a Change Description, and select Ignore I/O Inconsistency Alerts.
-
Click OK.
After you commit the workflow, you can check the commit status of each node in the node list under Workflow. An icon to the left of a node name indicates that the node has uncommitted changes. If no such icon is present, the node has no uncommitted changes.
Next steps
Now that you have created and committed a workflow, proceed to the next tutorial to learn how to create a sync task to write data back to different types of data sources. For more information, see Create a sync task.