All Products
Search
Document Center

MaxCompute:Use a do-while node to perform complex data analytics

Last Updated:Jan 22, 2025

If you want to process a group of data or a task multiple times until a condition is met, you can use do-while nodes to automatically repeat the process. This simplifies the procedure of complex tasks and improves the execution efficiency and reliability of tasks. This topic describes how to use a do-while node. In this topic, a do-while node is used to collect statistics on order data in the e-commerce industry.

Background information

DataWorks provides do-while nodes. You can rearrange the workflow inside a do-while node, write the logic to be executed in a loop in the node, and then configure an end node to determine whether to exit from looping. You can use a do-while node alone, or use a do-while node together with an assignment node to loop through the result set passed by the assignment node. This topic describes how to use a do-while node to collect statistics on order data.

Example:

  • Raw data

    In the e-commerce industry, the daily order volume may be large. Therefore, an order table that is partitioned by day is created in DataWorks. The orders that are created on the same day are stored in the same partition. The partition is named in the YYYYMMDD format. For example, the data of all orders that are created on September 1, 2022 is stored in the partition 20220901.

    | id | user_id | order_amount | ds       |
    |----|---------|--------------|----------|
    | 1  | 1001    | 500          | 20220901 |
    | 2  | 1002    | 1500         | 20220901 |
    | 7  | 1003    | 890          | 20221021 |
    | 8  | 1004    | 240          | 20221021 |
  • Data requirements

    Collect the statistics on the order amount and the number of orders based on the first day of each month in the last month (30 days), last two months (60 days), and last three months (90 days) in 2022. For example, stat_day=20220901 stat_type=30d indicates the statistics for the last month (30 days) of September 1, 2022.

    | stat_day | stat_type | order_total | order_amount_total |
    |----------|-----------|-------------|--------------------|
    | 20220901 | 30d       | 10          | 0                  |
    | 20220901 | 60d       | 20          | 0                  |
    | 20220901 | 90d       | 30          | 0                  |
  • Requirement analysis

    • A total of 12 data records need to be generated. The data records indicate the statistics of the 12 months of 2022.

    • During statistical collection of each record, MaxCompute must query all partitions in the last month (30 days), last two months (60 days), and last three months (90 days). The range of partitions that need to be queried is dynamically calculated based on the first day of each month.

    • To meet the preceding business requirements, you can perform 12 loops to calculate the data of each month by using a do-while node. You can also use three SQL nodes to calculate the data of the last month (30 days), the last two months (60 days), and the last three months (90 days). ${dag.loopTimes} specifies the current month of looping. This helps calculate the partition for the first day of the current month and the partitions for the first day of the last month, the first day of the last two months, and the first day of the last three months in the SQL nodes.

Prerequisites

Prepare data

  1. Log on to the MaxCompute console. In the left-side navigation pane, choose DataWorks > Data Development to go to the DataStudio page of DataWorks.

  2. In the left-side navigation pane, click Scheduled Workflow.

  3. Click the desired workflow and choose MaxCompute > Data Analytics. Then, right-click Data Analytics and choose Create Node > ODPS SQL to create an ODPS SQL node, such as init_test_data.

    image.png

  4. In the code editor for the init_test_data node, enter the following SQL script:

    -- Create an order table.
    CREATE TABLE orders
    (
        id            BIGINT
        ,user_id      BIGINT -- The user ID.
        ,order_amount BIGINT -- The order amount.
    )
    PARTITIONED BY 
    (
        ds            STRING -- The order date.
    )
    ;
    
    -- Insert order data.
    INSERT INTO orders PARTITION (ds = '20220901') VALUES (1,1001,500) ,(2,1002,1500);
    INSERT INTO orders PARTITION (ds = '20220905') VALUES (3,1005,260) ,(4,1002,780);
    INSERT INTO orders PARTITION (ds = '20221010') VALUES (5,1003,890) ,(6,1004,240);
    INSERT INTO orders PARTITION (ds = '20221021') VALUES (7,1003,890) ,(8,1004,240);
    INSERT INTO orders PARTITION (ds = '20221025') VALUES (9,1002,260) ,(10,1007,780);
    
    -- Create an order statistics table.
    CREATE TABLE orders_stat
    (
        stat_day            STRING -- The statistical date, which is the first day of each month.
        ,stat_type          STRING -- The statistical type.
        ,order_total        BIGINT -- The total number of orders.
        ,order_amount_total BIGINT -- The order amount.
    )
    ;
  5. Commit the init_test_data node to the production environment and backfill data.

    1. Click the 保存 and 提交 icons in sequence on the toolbar of the current node to save and commit the node. When you commit a node, enter a change description as prompted and specify whether to perform code review and smoke testing.

      Note
      • You can commit the node only after you configure the Rerun and Parent Nodes parameters on the Properties tab.

      • If the code review feature is enabled, a node can be deployed only after the code of the node is approved by a specified reviewer. For more information, see Code review.

      • To ensure that the node you created can be run as expected, we recommend that you perform smoke testing before you deploy the node. For more information, see Perform smoke testing.

    2. If the workspace that you use is in standard mode, click Deploy in the upper-right corner to deploy the node after you commit it. For more information, see Workspace in standard mode and Deploy nodes.

    3. In the upper-right corner of the DataStudio page for the current node, click Operation Center. In the left-side navigation pane of the Operation Center page, choose Cycle Task Maintenance > Cycle Task. In the list of cycle tasks in the middle pane of the Cycle Task page, click init_test_data in the Name column. Then, right-click the init_test_data node in the directed acyclic graph (DAG) on the right and select Test to run a task test to initialize data.

  6. Freeze a node.

    Note

    This step performs one-time data initialization. After the node is executed once, you can pause (freeze) the node to prevent continuous execution.

    In the list of cycle tasks in the middle pane of the Cycle Task page, right-click the node and select Freeze.

Procedure

Create a do-while node

Important

Make sure that the current workspace is associated with a MaxCompute compute engine. Otherwise, you cannot perform the following operations. For more information, see Associate a MaxCompute compute engine with a workspace.

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Create a do-while node.

    1. On the DataStudio page, move the pointer over the 新建 icon and choose Create Node > General > do-while.

      Alternatively, you can find the workflow in which you want to create a do-while node, click the workflow name, right-click General, and then choose Create Node > do-while.

    2. In the Create Node dialog box, configure the Name and Path parameters.

    3. Click Confirm.

Create a Shell node and specific ODPS SQL nodes

  1. Double-click the name of the do-while node. The configuration tab of the node appears.

  2. Delete the default sql node. Right-click the sql node in the middle of the do-while node and click Delete Node. In the Delete message, click OK.

  3. In the left-side navigation pane, choose MaxCompute > ODPS SQL to create three ODPS SQL nodes and choose General > Shell to create one Shell node.

    1. The three ODPS SQL nodes are named 30_day, 60_day, and 90_day.

    2. The Shell node is named echo.

  4. Configure the node dependencies between the ODPS SQL nodes and the Shell node. On the canvas of the do-while node, drag lines to configure the start node as the ancestor node of the echo node, configure the echo node as the ancestor node of the 30_day, 60_day, and 90_day nodes, and configure the end node as the descendant node of the three nodes.

Write node code

Important
  • The ${dag.loopTimes} variable is a reserved variable of the system. This variable specifies the current number of loops, and the value of this variable starts from 1. All inner nodes of the do-while node can reference this variable. For more information about built-in variables, see Built-in variables and Examples of variable values.

  • After you modify the code in the Shell node, you must save the modification. No message is displayed when you submit the modification. If you do not save the modification, the code cannot be updated to the latest version at the earliest opportunity.

  1. Write the code of the echo node. This node is used to generate logs for the current number of loops. Double-click the echo node. On the configuration tab for the Shell node, enter the following code:

    #!/bin/bash
    echo "loop times: ${dag.loopTimes}"
  2. Write the code of the 30_day node. This node is used to collect the order data of the last month for the first day of each month in 2022. Double-click the 30_day node. On the configuration tab for the ODPS SQL node, enter the following code:

    INSERT INTO orders_stat
    SELECT  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01') AS stat_day
            ,'30d' AS stat_type
            ,COUNT(id) AS order_total
            ,nvl(SUM(order_amount),0) AS order_amount_total
    FROM    orders
    WHERE   
    -- The first day of the last month.
    ds >= REPLACE(ADD_MONTHS(TO_DATE(CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01'),'yyyyMMdd'),-1),'-','') 
    -- The first day of the current month.
    AND ds < CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01')
    ;

    The following DML statement and functions are used in the preceding code:

  3. Configure the 60_day node. This node is used to collect the order data of the last two months for the first day of each month in 2022.

    INSERT INTO orders_stat
    SELECT  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01') AS stat_day
            ,'60d' AS stat_type
            ,COUNT(id) AS order_total
            ,nvl(SUM(order_amount),0) AS order_amount_total
    FROM    orders
    WHERE   
    -- The first day of the last two months.
    ds >= REPLACE(ADD_MONTHS(TO_DATE(CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01'),'yyyyMMdd'),-2),'-','') 
    -- The first day of the current month.
    AND ds < CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01')
    ;
  4. Configure the 90_day node. This node is used to collect the order data of the last three months for the first day of each month in 2022.

    INSERT INTO orders_stat
    SELECT  CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01') AS stat_day
            ,'90d' AS stat_type
            ,COUNT(id) AS order_total
            ,nvl(SUM(order_amount),0) AS order_amount_total
    FROM    orders
    WHERE   
    -- The first day of the last three months.
    ds >= REPLACE(ADD_MONTHS(TO_DATE(CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01'),'yyyyMMdd'),-3),'-','') 
    -- The first day of the current month.
    AND ds < CONCAT('2022',LPAD(${dag.loopTimes},2,'0'),'01')
    ;
  5. Configure the end node. This node controls when to exit from looping.

    1. Double-click the end node. The configuration tab of the node appears.

    2. Select Python from the Language drop-down list.

    3. Enter the following code to define the condition for exiting from looping for the do-while node:

      Important
      • The maximum number of loops for a do-while node is 128. Therefore, the maximum value of ${dag.loopTimes} is 128.

      • Do-while nodes do not support concurrent execution. A loop can start only if the previous loop ends.

      if ${dag.loopTimes} < 12:
          print True
      else:
          print False

      ${dag.loopTimes} < 12 indicates that a do-while node can be repeated only for 12 times. One loop is performed for data statistics of one month. Therefore, data of a total of 12 months in 2022 is collected.

Commit nodes

Double-click the do-while node. On the configuration tab of the node, perform the following steps:

Important
  • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the node.

  • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the node code that is committed can be deployed only after the node code passes the code review. For more information, see Code review.

  1. Click the 保存 icon in the top toolbar to save the node.

  2. Click the 提交 icon in the top toolbar to commit the node.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review node code after you commit the node based on your business requirements.

    Important
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the node.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the node code that is committed can be deployed only after the node code passes the code review. For more information, see Code review.

    If the workspace that you use is in standard mode, you must click Deploy in the upper-right corner of the node configuration tab to deploy a task on the node to the production environment for running after you commit the task on the node. For more information, see Deploy tasks.

Test nodes

The procedure for committing, deploying, and running a do-while node is the same as that for committing, deploying, and running a common node. However, you cannot test a do-while node in DataStudio.

Note

If the workspace that you use is in standard mode, you cannot directly perform a test to run a do-while node in DataStudio.

To perform a test to run the do-while node and view the result, you must commit and deploy the workflow that contains the do-while node to Operation Center and run the do-while node in Operation Center. If you use the value passed by an assignment node in the do-while node, run both the assignment node and do-while node during the test in Operation Center.

  1. Go to the Cycle Task page to backfill data.

    1. In the upper-right corner of the DataStudio page for the desired node, click Operation Center to go to the Operation Center page.

    2. In the left-side navigation pane of the Operation Center page, choose Cycle Task Maintenance > Cycle Task.

    3. Select the desired node. In the DAG on the right side of the Cycle Task page, right-click the do_while_test node and choose Run > Backfill Data for Current Node.

      The SQL script logic in this example does not depend on the Data Timestamp parameter. Therefore, you can use the default value of the Data Timestamp parameter to run the SQL script.

    4. On the data backfilling page for the node, click do_while_test in the Name column. Then, right-click the node name in the DAG on the right and select View Runtime Log.

  2. View run logs of the do-while node.

    1. On the data backfilling page for the node, right-click the node name in the DAG on the right and select View Internal Nodes.

      You can view run logs of a do-while node only if you view the inner nodes of the do-while node.

      The inner workflow of the do-while node is divided into three parts:

      • The left pane of the view displays the rerun history of the do-while node. A record is generated each time a do-while node instance is run.

      • The middle pane of the view displays a loop record list that shows all existing loops of the do-while node and the status of each loop.

      • The right pane of the view displays the details about each loop. You can click a record in the loop record list to view the details of each instance in the loop.

    2. On the inner node page, click a loop that is finished in the middle pane, right-click the desired node in the right pane, and then select View Runtime Log.

  3. View run logs for the nth loop.

    On the inner node page, click Loop 3 in the middle pane to view run logs of the echo node in the third loop.

    The preceding example shows that a do-while node works based on the following application logic:

    1. The system starts a loop from the start node.

    2. Other nodes inside the do-while node run in sequence based on the dependencies configured for them.

    3. The system executes the conditional statement defined in the code of the end node for exiting from looping.

    4. The system records the number of loops that are run, and the next loop starts if the conditional statement returns True in the run logs of the end node.

    5. The entire looping process ends if the conditional statement returns False in the run logs of the end node.

  4. View data in the result table.

    Create an ODPS SQL node and execute the following SQL statement:

    SELECT * FROM orders_stat;

    Returned result:

    | stat_day | stat_type | order_total | order_amount_total |
    |----------|-----------|-------------|--------------------|
    | 20220101 | 60d       | 0           | 0                  |
    | 20220101 | 30d       | 0           | 0                  |
    | 20220101 | 90d       | 0           | 0                  |
    | 20220201 | 60d       | 0           | 0                  |
    | 20220201 | 30d       | 0           | 0                  |
    | 20220201 | 90d       | 0           | 0                  |
    | 20220301 | 30d       | 0           | 0                  |
    | 20220301 | 60d       | 0           | 0                  |
    | 20220301 | 90d       | 0           | 0                  |
    | 20220401 | 30d       | 0           | 0                  |
    | 20220401 | 90d       | 0           | 0                  |
    | 20220401 | 60d       | 0           | 0                  |
    | 20220501 | 30d       | 0           | 0                  |
    | 20220501 | 90d       | 0           | 0                  |
    | 20220501 | 60d       | 0           | 0                  |
    | 20220601 | 60d       | 0           | 0                  |
    | 20220601 | 30d       | 0           | 0                  |
    | 20220601 | 90d       | 0           | 0                  |
    | 20220701 | 90d       | 0           | 0                  |
    | 20220701 | 30d       | 0           | 0                  |
    | 20220701 | 60d       | 0           | 0                  |
    | 20220801 | 60d       | 0           | 0                  |
    | 20220801 | 90d       | 0           | 0                  |
    | 20220801 | 30d       | 0           | 0                  |
    | 20220901 | 90d       | 0           | 0                  |
    | 20220901 | 60d       | 0           | 0                  |
    | 20220901 | 30d       | 0           | 0                  |
    | 20221001 | 90d       | 4           | 3040               |
    | 20221001 | 60d       | 4           | 3040               |
    | 20221001 | 30d       | 4           | 3040               |
    | 20221101 | 30d       | 16          | 8860               |
    | 20221101 | 90d       | 20          | 11900              |
    | 20221101 | 60d       | 20          | 11900              |
    | 20221201 | 90d       | 20          | 11900              |
    | 20221201 | 60d       | 16          | 8860               |
    | 20221201 | 30d       | 0           | 0                  |

References