The data development function of DataWorks lets you design data analysis flows graphically. Flow tasks and their inner nodes process the data and define the dependencies between processing steps. Currently, it supports multiple task types, including ODPS_SQL, data synchronization, OPEN_MR, SHELL, machine learning, and virtual nodes. For more information about how to use each task type, see Task type description.
This section uses the creation of a flow task named “work” as an example to show how to create nodes in a flow, configure dependencies, and lay out the steps and sequence of a data analysis. It briefly describes how to use the data development function for further data analysis and computing in the workspace.
Before you begin, make sure that you have prepared the business data table bank_data, the data it contains, and the result table result_table in the workspace, as described in Create a table and upload data.
Log on to the project, and click Data Development > New > Create Task.
Select the relevant content in the dialog box and specify the task type as Flow task.
Note: Once selected, the scheduling attribute cannot be changed.
This section shows how to create a virtual node “start” and an ODPS_SQL node “insert_data”, and how to configure “insert_data” to depend on “start”.
- As a control-type node, the virtual node does not affect the data during flow operation and is only used for O&M control of downstream nodes.
- When a virtual node is depended on by other nodes and its status is manually set to failure by the O&M personnel, its downstream nodes that have not yet run cannot be triggered. This prevents further propagation of erroneous upstream data during the O&M process. For more information, see the section on virtual nodes in Task type description.
In summary, we recommend that you create a virtual node as the root node to control the whole flow when designing a flow.
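The blocking behavior described above can be illustrated with a small toy model. This is only a sketch, not the DataWorks scheduler: the `DOWNSTREAM` mapping and the `blocked_nodes` helper are hypothetical names introduced for illustration, and the two node names are the ones used in this tutorial.

```python
from collections import deque

# Hypothetical toy model: each key is a node, each value lists its direct downstream nodes.
DOWNSTREAM = {
    "start": ["insert_data"],
    "insert_data": [],
}


def blocked_nodes(failed_node, downstream):
    """Return every node that cannot be triggered once failed_node is set to failure."""
    blocked = []
    seen = {failed_node}
    queue = deque(downstream.get(failed_node, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        blocked.append(node)
        # Nodes further downstream are blocked as well.
        queue.extend(downstream.get(node, []))
    return blocked


print(blocked_nodes("start", DOWNSTREAM))
```

In this model, marking the root node as failed blocks everything below it, which is why a virtual root node is a convenient single switch for the whole flow.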
Double-click the virtual node, and enter the node name “start”.
Double-click ODPS_SQL and enter the node name “insert_data”.
Click the start node, and draw a line from start to insert_data to make insert_data dependent on start.
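Conceptually, the line drawn between the two nodes is an edge in a dependency graph, and an upstream node always runs before the nodes that depend on it. The sketch below shows that ordering with Python's standard `graphlib` module (Python 3.9 or later). The `dependencies` mapping is an assumption made for illustration and is not how DataWorks stores the flow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each node to the set of nodes it depends on (its upstream nodes).
dependencies = {
    "start": set(),
    "insert_data": {"start"},
}

# static_order() yields upstream nodes before the nodes that depend on them.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```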
This section describes how to use SQL code in the ODPS_SQL node insert_data to count the single persons who have a housing loan, grouped by education level, and save the results for analysis or display by subsequent nodes. The SQL statements are as follows. For more information about the syntax, see the MaxCompute documentation.
INSERT OVERWRITE TABLE result_table --Insert data to result_table
SELECT education
    , COUNT(marital) AS num
FROM bank_data
WHERE housing = 'yes'
    AND marital = 'single'
GROUP BY education;
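To see what this aggregation produces before running it in the workspace, the same logic can be tried locally. The sketch below uses Python's built-in `sqlite3` module rather than MaxCompute, so several things are assumptions: the three-column `bank_data` schema, the two-column `result_table` schema, and the sample rows are all made up for illustration, and `DELETE` followed by `INSERT INTO` stands in for `INSERT OVERWRITE`, which SQLite does not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed, simplified schemas; the real tables in the workspace may have more columns.
conn.execute("CREATE TABLE bank_data (education TEXT, marital TEXT, housing TEXT)")
conn.execute("CREATE TABLE result_table (education TEXT, num INTEGER)")

# Made-up sample rows for illustration only; they are not from the real bank_data table.
rows = [
    ("university.degree", "single", "yes"),
    ("university.degree", "single", "yes"),
    ("university.degree", "married", "yes"),
    ("high.school", "single", "yes"),
    ("high.school", "single", "no"),
]
conn.executemany("INSERT INTO bank_data VALUES (?, ?, ?)", rows)

# SQLite has no INSERT OVERWRITE, so clear the table first to imitate it.
conn.execute("DELETE FROM result_table")
conn.execute(
    """
    INSERT INTO result_table
    SELECT education, COUNT(marital) AS num
    FROM bank_data
    WHERE housing = 'yes' AND marital = 'single'
    GROUP BY education
    """
)

result = conn.execute(
    "SELECT education, num FROM result_table ORDER BY education"
).fetchall()
print(result)
```

Only rows with both `housing = 'yes'` and `marital = 'single'` are counted, so the married row and the row without a housing loan do not contribute to any group.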
After editing the SQL statements in the insert_data node, click Save to prevent code loss.
Click Run to view the operation logs and results.
Then, click Table Query on the left to query data in the table.
After running and debugging the ODPS_SQL node “insert_data”, return to the flow page, and save and submit the whole flow.
Now you know how to create, save, and submit a flow. Continue to the next tutorial, which shows you how to create a synchronization task to export data to data sources of different types. For more information, see Create a synchronization task to export results.