DataWorks provides a data development function for designing data analysis flows graphically. Data is processed by flow tasks and their inner nodes, and the nodes can be configured to depend on one another. Currently, it supports multiple task types such as ODPS_SQL, data synchronization, OPEN_MR, SHELL, machine learning, and virtual nodes. For more information about the use of each task type, see Task type description.
This article uses the creation of a flow task named “work” as an example to show how to create nodes in a flow, configure dependencies, and lay out the steps and sequence of a data analysis. It explains how to use the data development function for further data analysis and computing in the workspace.
You have prepared the business data table bank_data, with its data, and the table result_table in the workspace by following the instructions in Upload a local file.
Log on to the DTplus console, and click Data Development > New > Create Task.
Select the relevant content in the dialog box and specify the task type as Flow task.
Note: Once selected, the scheduling attribute cannot be changed.
This section shows how to create a virtual node “start” and an ODPS_SQL node “insert_data”, and how to configure “insert_data” to depend on “start”.
- As a control-type node, the virtual node does not affect the data during flow operation and is only used for O&M control of downstream nodes.
- When a virtual node depends on other nodes and O&M personnel manually set its status to failure, its downstream nodes that have not yet run cannot be triggered. This prevents erroneous upstream data from propagating further during the O&M process. For more information, see the section on virtual nodes in Task type description.
In short, when designing a flow, we recommend that you create a virtual node as the root node to control the whole flow.
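The blocking behavior described above can be illustrated with a small Python sketch. This is a toy model, not the DataWorks scheduler: the flow is a plain dictionary, the node “export_result” is a hypothetical addition for illustration, and the function ignores whether a downstream node has already run.

```python
from collections import deque

# Toy flow: each key is a node, each value lists its direct downstream nodes.
# "start" and "insert_data" come from this tutorial; "export_result" is hypothetical.
FLOW = {
    "start": ["insert_data"],
    "insert_data": ["export_result"],
    "export_result": [],
}


def blocked_by_failure(flow, failed_node):
    """Return every node downstream of failed_node, in breadth-first order."""
    blocked = []
    seen = set()
    queue = deque(flow.get(failed_node, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        blocked.append(node)
        queue.extend(flow.get(node, []))
    return blocked


if __name__ == "__main__":
    # Setting the root node to failure blocks everything below it.
    print(blocked_by_failure(FLOW, "start"))
```

In this model, marking the root node as failed blocks every other node, which is why a virtual root node gives a single point of control over the whole flow.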
Double-click the virtual node, and enter the node name “start”.
Double-click ODPS_SQL and enter the node name “insert_data”.
Click the start node and draw a line from start to insert_data so that insert_data depends on start.
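The line drawn between the two nodes is a dependency edge, and it determines the order in which the nodes run. As a rough sketch of that idea (it assumes Python 3.9 or later for the standard `graphlib` module and does not reflect how DataWorks stores a flow), the same dependency can be written down and resolved into a run order:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the nodes in its set, mirroring the line drawn in the flow.
DEPENDS_ON = {
    "start": set(),
    "insert_data": {"start"},
}


def run_order(depends_on):
    """Return the nodes in an order where every node follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(run_order(DEPENDS_ON))
```

Because insert_data depends on start, start always comes first in the resulting order.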
This section describes how to use SQL code in the ODPS_SQL node insert_data to count the mortgages held by single individuals at each education level and save the results for analysis or display by subsequent nodes. For more information about the syntax, see the MaxCompute documentation. The SQL statements are as follows.
INSERT OVERWRITE TABLE result_table --Insert data to result_table
SELECT education
, COUNT(marital) AS num
FROM bank_data
WHERE housing = 'yes'
AND marital = 'single'
GROUP BY education;
After editing the SQL statements in the insert_data node, click Save to prevent code loss.
Click Run to view the operation logs and results.
Click Table Query in the left-side navigation pane to query data in the table.
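To get a feel for what the result table should contain, the same aggregation can be reproduced locally. The following sketch uses Python's built-in `sqlite3` module as a stand-in for MaxCompute. The sample rows and the three-column layout of bank_data are invented for illustration, the statement assumes that the query selects `education` from bank_data, and because SQLite has no INSERT OVERWRITE, the table is cleared before the insert to imitate it.

```python
import sqlite3

# Hypothetical sample rows (education, marital, housing); the real bank_data
# table has more columns and far more rows.
SAMPLE_ROWS = [
    ("university.degree", "single", "yes"),
    ("university.degree", "single", "yes"),
    ("university.degree", "married", "yes"),
    ("high.school", "single", "yes"),
    ("high.school", "single", "no"),
    ("basic.4y", "married", "no"),
]


def mortgage_counts(rows):
    """Return {education: num} for single people who have a housing loan."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE bank_data (education TEXT, marital TEXT, housing TEXT)"
        )
        conn.execute("CREATE TABLE result_table (education TEXT, num INTEGER)")
        conn.executemany("INSERT INTO bank_data VALUES (?, ?, ?)", rows)
        # SQLite has no INSERT OVERWRITE, so clear the table first to mimic it.
        conn.execute("DELETE FROM result_table")
        conn.execute(
            """
            INSERT INTO result_table
            SELECT education, COUNT(marital) AS num
            FROM bank_data
            WHERE housing = 'yes' AND marital = 'single'
            GROUP BY education
            """
        )
        return dict(conn.execute("SELECT education, num FROM result_table"))
    finally:
        conn.close()


if __name__ == "__main__":
    for education, num in sorted(mortgage_counts(SAMPLE_ROWS).items()):
        print(education, num)
```

Education levels with no matching rows, such as `basic.4y` in the sample data, do not appear in the result at all, so a missing level in Table Query is not necessarily an error.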
After running and debugging the ODPS_SQL node “insert_data”, return to the flow page. Click Save, and then click Submit to submit the whole flow.
You have now learned how to create, save, and submit a flow. You can proceed with the next tutorial, which demonstrates how to create a synchronization task to export data to different types of data sources. For more information, see Create a synchronization task to export results.