If you need to process a set of data or run a task repeatedly until a condition is met, you can use a do-while node to automate the loop. This simplifies complex tasks and improves their execution efficiency and reliability. This topic describes how to use a do-while node, using order statistics in the e-commerce industry as an example.
Background information
DataWorks provides do-while nodes. You can arrange a workflow inside a do-while node, write the logic to be executed in each loop iteration, and configure an end node to determine whether to exit the loop. You can use a do-while node alone, or together with an assignment node to loop through the result set that the assignment node passes. This topic describes how to use a do-while node to collect statistics on order data.
Example:
Raw data
In the e-commerce industry, the daily order volume can be large. Therefore, an order table that is partitioned by day is created in DataWorks. The orders that are created on the same day are stored in the same partition, and partitions are named in the YYYYMMDD format. For example, the data of all orders that are created on September 1, 2022 is stored in the partition `20220901`.

| id | user_id | order_amount | ds       |
|----|---------|--------------|----------|
| 1  | 1001    | 500          | 20220901 |
| 2  | 1002    | 1500         | 20220901 |
| 7  | 1003    | 890          | 20221021 |
| 8  | 1004    | 240          | 20221021 |

Data requirements
Collect statistics on the order amount and the number of orders for the first day of each month of 2022, over the last month (30 days), the last two months (60 days), and the last three months (90 days). For example, `stat_day=20220901, stat_type=30d` indicates the statistics for the last month (30 days) before September 1, 2022.

| stat_day | stat_type | order_total | order_amount_total |
|----------|-----------|-------------|--------------------|
| 20220901 | 30d       | 10          | 0                  |
| 20220901 | 60d       | 20          | 0                  |
| 20220901 | 90d       | 30          | 0                  |

Requirement analysis
Statistics must be generated for 12 statistical dates, one for the first day of each month of 2022. Each date has three statistical types, so 36 records are generated in total.
To calculate each record, MaxCompute must query all partitions in the last month (30 days), the last two months (60 days), or the last three months (90 days). The range of partitions to query is dynamically calculated from the first day of each month.

To meet the preceding requirements, you can use a do-while node to run 12 loop iterations, one for each month, and use three SQL nodes inside the loop to calculate the data of the last month (30 days), the last two months (60 days), and the last three months (90 days). The `${dag.loopTimes}` variable specifies the current iteration and therefore the current month. The SQL nodes use it to calculate the partition for the first day of the current month and the partitions for the first days of the last month, the last two months, and the last three months.
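For intuition, the mapping from the loop count to the partition ranges can be sketched outside DataWorks. The following Python snippet is an illustration only (the helper name is ours, not a DataWorks or MaxCompute API); it reproduces the boundary calculation that the SQL nodes later perform with CONCAT, LPAD, and ADD_MONTHS:

```python
from datetime import date

def partition_bounds(loop_times: int, months_back: int) -> tuple[str, str]:
    """Illustrative helper: given the loop count (1-12, one per month of 2022)
    and a lookback window in months, return the [start, end) partition range
    that the corresponding SQL node queries."""
    # The first day of the current month, e.g. loop 9 -> 20220901.
    end = date(2022, loop_times, 1)
    # Step back months_back months to get the start of the window.
    month0 = loop_times - 1 - months_back  # zero-based month offset
    start = date(2022 + month0 // 12, month0 % 12 + 1, 1)
    return start.strftime("%Y%m%d"), end.strftime("%Y%m%d")

# Loop 10 with a one-month (30-day) window queries September's partitions.
print(partition_bounds(10, 1))  # ('20220901', '20221001')
```

Note that for early months the window reaches back into 2021, for example loop 1 with a three-month window starts at `20211001`.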
Prerequisites
MaxCompute is activated and a MaxCompute project is created. For more information, see Activate MaxCompute and Create a MaxCompute project.
DataWorks Standard Edition or a more advanced edition is activated. For more information, see Purchase guide and Differences among DataWorks editions.
The workspace of DataWorks is in standard mode. For more information, see Scenario: Upgrade a workspace from the basic mode to the standard mode.
A MaxCompute compute engine is associated with the workspace of DataWorks. For more information, see Associate a MaxCompute compute engine with a workspace.
Prepare data
Log on to the MaxCompute console. In the left-side navigation pane, choose DataWorks > Data Development to go to the DataStudio page of DataWorks.
In the left-side navigation pane, click Scheduled Workflow.
Click the desired workflow. Then, right-click Data Analytics and choose Create Node > ODPS SQL to create an ODPS SQL node. In this example, the node is named `init_test_data`.
In the code editor for the `init_test_data` node, enter the following SQL script:

```sql
-- Create an order table.
CREATE TABLE orders
(
    id            BIGINT
    ,user_id      BIGINT  -- The user ID.
    ,order_amount BIGINT  -- The order amount.
)
PARTITIONED BY
(
    ds STRING  -- The order date.
)
;
-- Insert order data.
INSERT INTO orders PARTITION (ds = '20220901') VALUES (1,1001,500),(2,1002,1500);
INSERT INTO orders PARTITION (ds = '20220905') VALUES (3,1005,260),(4,1002,780);
INSERT INTO orders PARTITION (ds = '20221010') VALUES (5,1003,890),(6,1004,240);
INSERT INTO orders PARTITION (ds = '20221021') VALUES (7,1003,890),(8,1004,240);
INSERT INTO orders PARTITION (ds = '20221025') VALUES (9,1002,260),(10,1007,780);
-- Create an order statistics table.
CREATE TABLE orders_stat
(
    stat_day            STRING  -- The statistical date, which is the first day of each month.
    ,stat_type          STRING  -- The statistical type.
    ,order_total        BIGINT  -- The total number of orders.
    ,order_amount_total BIGINT  -- The total order amount.
)
;
```

Commit the `init_test_data` node to the production environment and backfill data.

Click the save and commit icons in sequence on the toolbar of the current node to save and commit the node. When you commit the node, enter a change description as prompted and specify whether to perform code review and smoke testing.

Note: You can commit the node only after you configure the Rerun and Parent Nodes parameters on the Properties tab.
If the code review feature is enabled, a node can be deployed only after the code of the node is approved by a specified reviewer. For more information, see Code review.
To ensure that the node you created can be run as expected, we recommend that you perform smoke testing before you deploy the node. For more information, see Perform smoke testing.
If the workspace that you use is in standard mode, click Deploy in the upper-right corner to deploy the node after you commit it. For more information, see Workspace in standard mode and Deploy nodes.
In the upper-right corner of the DataStudio page, click Operation Center. In the left-side navigation pane of the Operation Center page, go to the Cycle Task page. In the list of cycle tasks in the middle pane, click `init_test_data` in the Name column. Then, right-click the `init_test_data` node in the directed acyclic graph (DAG) on the right and select Test to run a test that initializes the data.
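After the test run finishes, the tables contain the sample rows inserted above. As a quick sanity check, the aggregation that the loop will later perform can be reproduced locally. The following Python sketch is illustrative only (it is not part of the DataWorks setup, and the helper name is ours); it computes the order count and amount for a partition range:

```python
# Sample order rows inserted by init_test_data: (id, user_id, order_amount, ds).
ORDERS = [
    (1, 1001, 500,  "20220901"), (2, 1002, 1500, "20220901"),
    (3, 1005, 260,  "20220905"), (4, 1002, 780,  "20220905"),
    (5, 1003, 890,  "20221010"), (6, 1004, 240,  "20221010"),
    (7, 1003, 890,  "20221021"), (8, 1004, 240,  "20221021"),
    (9, 1002, 260,  "20221025"), (10, 1007, 780, "20221025"),
]

def stat(start_ds: str, end_ds: str) -> tuple[int, int]:
    """Count orders and sum order amounts for partitions in [start_ds, end_ds)."""
    rows = [r for r in ORDERS if start_ds <= r[3] < end_ds]
    return len(rows), sum(r[2] for r in rows)

# The 30d statistics for stat_day=20221001 cover September's partitions.
print(stat("20220901", "20221001"))  # (4, 3040)
```

A first run of the loop should therefore produce 4 orders totaling 3,040 for the `20221001` statistical date with type `30d`.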
Freeze the node.

Note: This step performs one-time data initialization. After the node has run once, freeze it to prevent it from running again on schedule.
In the list of cycle tasks in the middle pane of the Cycle Task page, right-click the node and select Freeze.
Procedure
Create a do-while node
Make sure that the current workspace is associated with a MaxCompute compute engine. Otherwise, you cannot perform the following operations. For more information, see Associate a MaxCompute compute engine with a workspace.
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, go to the workspace list. Select the desired workspace and click Go to Data Development.
Create a do-while node.
On the DataStudio page, move the pointer over the create icon and create a do-while node. Alternatively, find the workflow in which you want to create a do-while node, click the workflow name, right-click General, and then create a do-while node.
In the Create Node dialog box, configure the Name and Path parameters.
Click Confirm.
Create a Shell node and specific ODPS SQL nodes
Double-click the name of the do-while node. The configuration tab of the node appears.
Delete the default sql node. Right-click the sql node in the middle of the do-while node and click Delete Node. In the Delete message, click OK.
On the configuration tab of the do-while node, create three ODPS SQL nodes and one Shell node. Name the three ODPS SQL nodes `30_day`, `60_day`, and `90_day`, and name the Shell node `echo`.
Configure the dependencies between the ODPS SQL nodes and the Shell node. On the canvas of the do-while node, drag lines to configure the `start` node as the ancestor node of the `echo` node, configure the `echo` node as the ancestor node of the `30_day`, `60_day`, and `90_day` nodes, and configure the `end` node as the descendant node of these three nodes.
Write node code
Note:
- The `${dag.loopTimes}` variable is a reserved system variable. It specifies the current loop iteration, starting from 1, and can be referenced by all inner nodes of the do-while node. For more information about built-in variables, see Built-in variables and Examples of variable values.
- After you modify the code of the Shell node, you must save the modification. No save prompt is displayed when you commit the node, and unsaved code is not included in the committed version.
Write the code of the `echo` node. This node logs the current loop iteration. Double-click the `echo` node. On the configuration tab of the Shell node, enter the following code:

```shell
#!/bin/bash
echo "loop times: ${dag.loopTimes}"
```

Write the code of the `30_day` node. This node collects the order data of the last month (30 days) for the first day of each month in 2022. Double-click the `30_day` node. On the configuration tab of the ODPS SQL node, enter the following code:

```sql
INSERT INTO orders_stat
SELECT  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01') AS stat_day
        ,'30d' AS stat_type
        ,COUNT(id) AS order_total
        ,NVL(SUM(order_amount),0) AS order_amount_total
FROM    orders
WHERE   -- The first day of the last month.
        ds >= REPLACE(ADD_MONTHS(TO_DATE(CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01'),'yyyyMMdd'),-1),'-','')
        -- The first day of the current month.
AND     ds <  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01')
;
```

The preceding code uses the INSERT INTO statement and the CONCAT, LPAD, COUNT, SUM, NVL, TO_DATE, ADD_MONTHS, and REPLACE functions.
Configure the `60_day` node. This node collects the order data of the last two months (60 days) for the first day of each month in 2022.

```sql
INSERT INTO orders_stat
SELECT  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01') AS stat_day
        ,'60d' AS stat_type
        ,COUNT(id) AS order_total
        ,NVL(SUM(order_amount),0) AS order_amount_total
FROM    orders
WHERE   -- The first day of the last two months.
        ds >= REPLACE(ADD_MONTHS(TO_DATE(CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01'),'yyyyMMdd'),-2),'-','')
        -- The first day of the current month.
AND     ds <  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01')
;
```

Configure the `90_day` node. This node collects the order data of the last three months (90 days) for the first day of each month in 2022.

```sql
INSERT INTO orders_stat
SELECT  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01') AS stat_day
        ,'90d' AS stat_type
        ,COUNT(id) AS order_total
        ,NVL(SUM(order_amount),0) AS order_amount_total
FROM    orders
WHERE   -- The first day of the last three months.
        ds >= REPLACE(ADD_MONTHS(TO_DATE(CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01'),'yyyyMMdd'),-3),'-','')
        -- The first day of the current month.
AND     ds <  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01')
;
```

Configure the `end` node. This node determines when to exit the loop.
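The loop of a do-while node continues while the result of the end node evaluates to true and exits when it evaluates to false. As a minimal sketch of the condition for this example (12 monthly iterations, with `${dag.loopTimes}` starting from 1), expressed as a standalone Python function with an illustrative name rather than actual end-node code:

```python
def should_continue(loop_times: int, total_loops: int = 12) -> bool:
    """Loop-exit condition: continue while fewer than 12 iterations have run.
    In DataWorks, ${dag.loopTimes} would supply loop_times at run time; this
    standalone function only illustrates the condition the end node evaluates."""
    return loop_times < total_loops

# The 12th iteration is the last one: after it, the condition reports False.
print(should_continue(11), should_continue(12))  # True False
```

Because `${dag.loopTimes}` starts at 1, comparing against 12 with a strict less-than yields exactly 12 iterations, one for each month of 2022.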
Commit nodes
Double-click the do-while node. On the configuration tab of the node, perform the following steps:
Click the save icon in the top toolbar to save the node. Then, click the commit icon in the top toolbar to commit the node. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review the node code after you commit the node, based on your business requirements.

Important: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the node.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the node code that is committed can be deployed only after the node code passes the code review. For more information, see Code review.
If the workspace that you use is in standard mode, you must click Deploy in the upper-right corner of the node configuration tab to deploy a task on the node to the production environment for running after you commit the task on the node. For more information, see Deploy tasks.
Test nodes
The procedure for committing, deploying, and running a do-while node is the same as that for a common node. However, you cannot directly test a do-while node in DataStudio. To test the do-while node and view the result, commit and deploy the workflow that contains the do-while node, and then run the node in Operation Center. If the do-while node uses a value passed by an assignment node, run both the assignment node and the do-while node during the test in Operation Center.
Go to the Cycle Task page to backfill data.
In the upper-right corner of the DataStudio page for the desired node, click Operation Center to go to the Operation Center page.
In the left-side navigation pane of the Operation Center page, go to the Cycle Task page.
Select the desired node. In the DAG on the right side of the Cycle Task page, right-click the `do_while_test` node and choose the option to backfill data. The SQL script in this example does not depend on the Data Timestamp parameter. Therefore, you can keep the default value of the Data Timestamp parameter when you run the script.
On the data backfilling page for the node, click `do_while_test` in the Name column. Then, right-click the node name in the DAG on the right and select View Runtime Log.
View data in the result table.
Create an ODPS SQL node and execute the following SQL statement:
```sql
SELECT * FROM orders_stat;
```

Returned result:

| stat_day | stat_type | order_total | order_amount_total |
|----------|-----------|-------------|--------------------|
| 20220101 | 60d       | 0           | 0                  |
| 20220101 | 30d       | 0           | 0                  |
| 20220101 | 90d       | 0           | 0                  |
| 20220201 | 60d       | 0           | 0                  |
| 20220201 | 30d       | 0           | 0                  |
| 20220201 | 90d       | 0           | 0                  |
| 20220301 | 30d       | 0           | 0                  |
| 20220301 | 60d       | 0           | 0                  |
| 20220301 | 90d       | 0           | 0                  |
| 20220401 | 30d       | 0           | 0                  |
| 20220401 | 90d       | 0           | 0                  |
| 20220401 | 60d       | 0           | 0                  |
| 20220501 | 30d       | 0           | 0                  |
| 20220501 | 90d       | 0           | 0                  |
| 20220501 | 60d       | 0           | 0                  |
| 20220601 | 60d       | 0           | 0                  |
| 20220601 | 30d       | 0           | 0                  |
| 20220601 | 90d       | 0           | 0                  |
| 20220701 | 90d       | 0           | 0                  |
| 20220701 | 30d       | 0           | 0                  |
| 20220701 | 60d       | 0           | 0                  |
| 20220801 | 60d       | 0           | 0                  |
| 20220801 | 90d       | 0           | 0                  |
| 20220801 | 30d       | 0           | 0                  |
| 20220901 | 90d       | 0           | 0                  |
| 20220901 | 60d       | 0           | 0                  |
| 20220901 | 30d       | 0           | 0                  |
| 20221001 | 90d       | 4           | 3040               |
| 20221001 | 60d       | 4           | 3040               |
| 20221001 | 30d       | 4           | 3040               |
| 20221101 | 30d       | 16          | 8860               |
| 20221101 | 90d       | 20          | 11900              |
| 20221101 | 60d       | 20          | 11900              |
| 20221201 | 90d       | 20          | 11900              |
| 20221201 | 60d       | 16          | 8860               |
| 20221201 | 30d       | 0           | 0                  |