This topic describes how to run a loop for SQL tasks in Data Lake Analytics (DLA).

  1. Log on to the DataWorks console. Click Workspaces in the left-side navigation pane. On the page that appears, find your workspace and click Data Integration in the Actions column.
  2. On the Welcome to Data Integration page, click Connection in the Data Store section. In the left-side navigation pane, click Data Source. On the page that appears, click New data source in the upper-right corner. In the Add data source dialog box, click Data Lake Analytics(DLA) in Big Data Storage.
  3. In the Add Data Lake Analytics(DLA) data source dialog box, configure the parameters.

    The following table describes the required parameters.

    Data Source Name: The name of the data source. We recommend that you specify an informative name that is easy to identify.
    Data source description: The description of the data source. This parameter is optional.
    Connection Url: The endpoint of DLA, in the format of IP address:Port number. For more information about how to obtain the IP address and port number, see Create an endpoint.
    Database: The name of the database in Object Storage Service (OSS) to which DLA is connected. In this topic, set this parameter to dataworks_demo.
    User name: The username that is used to log on to DLA.
    Password: The password of the username.
  4. Add the IP addresses or CIDR blocks of DataWorks to an IP address whitelist of DLA.

    DataWorks can access only the DLA data sources whose IP address whitelists include the IP addresses or Classless Inter-Domain Routing (CIDR) blocks of DataWorks. Therefore, you must add the IP addresses or CIDR blocks of DataWorks to the IP address whitelist of DLA based on the region where DLA is deployed.

    [Table: the DataWorks IP addresses and CIDR blocks to add to the whitelist for each region: China (Hangzhou), China (Shanghai), China (Shenzhen), China (Hong Kong), Singapore (Singapore), Australia (Sydney), China (Beijing), US (Silicon Valley), US (Virginia), Malaysia (Kuala Lumpur), Germany (Frankfurt), Japan (Tokyo), UAE (Dubai), India (Mumbai), UK (London), Indonesia (Jakarta), and China North 2 Ali Gov 1. For China North 2 Ali Gov 1, if the CIDR blocks cannot be added to the IP address whitelist, add the individual IP addresses instead.]
  5. After you configure the preceding parameters, find your resource group and click Test connectivity in the Operation column. After the connectivity test succeeds, click Complete.

Create a workflow and nodes

  1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the page that appears, find your workspace and click Data Analytics in the Actions column.
  2. In the left-side panel, right-click Business Flow and select Create Workflow to create a workflow that is used to run a loop.
  3. Create an assignment node and a do-while node for the workflow that you have created.

Configure the assignment node

  1. Double-click the date set node. On the page that appears, select SHELL for Language, write the required date values as an array, and then save the settings.

    Use only commas (,) to separate date values.

    echo "20190424,20190425,20190426,20190427,20190428,20190429,20190430"
  2. Click the Scheduling Configurations tab to configure an upstream node for the assignment node. You can use the root node of the current workspace as the upstream node. For example, if the workspace name is dla_project, the upstream node is dla_project_root.
  3. Click Save.
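As an illustration (plain shell, not DataWorks-specific), the comma-separated output above can be split back into individual date values, which is how the loop consumes one value per round:

```shell
# Sketch: split the assignment node's comma-separated output into
# individual date values, one per loop round. Illustrative only.
dates="20190424,20190425,20190426,20190427,20190428,20190429,20190430"

# Replace commas with newlines and read the values one at a time.
echo "$dates" | tr ',' '\n' | while read -r pure_date; do
    echo "processing pure_date=$pure_date"
done
```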

Configure the do-while node

  1. Double-click the do-while node. On the page that appears, configure the node.
  2. Create a DLA task.
  3. Click the Scheduling Configurations tab. On the page that appears, specify the node dependency and context. Configure the assignment node as the upstream node. The input of the do-while node is the output of the assignment node.

Configure the DLA_SQL node

Double-click the DLA_SQL node. On the page that appears, enter the following SQL statement:

INSERT INTO finished_orders
SELECT  *
FROM    orders
WHERE   pure_date = ${dag.input[${dag.offset}]}
  • The value of pure_date is read from the assignment node. Each time the loop runs, one value is read from the output array of the assignment node.
  • dag.offset is a reserved variable of DataWorks. This variable indicates the loop offset. For example, the offset is 0 when the loop is run for the first time, 1 for the second time, and 2 for the third time. This way, the offset is n-1 when the loop is run for the nth time.
  • dag.input is a variable that indicates the context of the do-while node that runs the loop. If internal nodes of the do-while node need to reference the value of the context, you can use ${dag.ctxKey}. In this topic, the key is input. Therefore, you can use ${dag.input} to reference the value.
  • The initial input of ${dag.input[${dag.offset}]} is the dataset table. You can use an offset to obtain a row of data from the table. The offset increments each time the loop runs, so the values that are read are ${dag.input[0]}, ${dag.input[1]}, and so on. This ensures that all data in the dataset is traversed.
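The offset-based lookup described above can be sketched in plain shell (illustrative only, not DataWorks-specific):

```shell
# Sketch: pick one element from a comma-separated array by loop offset,
# mirroring ${dag.input[${dag.offset}]}. Illustrative only.
input="20190424,20190425,20190426"
offset=1                                 # 0 on the first round, 1 on the second

# cut fields are 1-based, so the field number is offset + 1.
value=$(echo "$input" | cut -d ',' -f $((offset + 1)))
echo "$value"                            # the second round reads 20190425
```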

Configure the end node

To control the loop termination, the end node compares dag.loopTimes with dag.input.length. If the value of dag.loopTimes is less than the value of dag.input.length, True is returned to continue the loop. If the value of dag.loopTimes is greater than or equal to the value of dag.input.length, False is returned to terminate the loop. dag.input.length is a variable that indicates the number of rows in the array of the context parameter input. The system automatically delivers this variable based on the context configured for the end node.

if ${dag.loopTimes} < ${dag.input.length}:
    print True
else:
    print False
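The termination check can be sketched in plain shell (illustrative only, assuming dag.loopTimes counts completed rounds starting from 1):

```shell
# Sketch: keep looping while the round counter is below the array length,
# mirroring the end node's comparison. Illustrative only.
input="20190424,20190425,20190426"
length=$(echo "$input" | tr ',' '\n' | wc -l)   # dag.input.length is 3 here

loopTimes=1                                     # assumed to start at 1
while [ "$loopTimes" -lt "$length" ]; do        # True: run another round
    loopTimes=$((loopTimes + 1))
done
echo "$loopTimes"                               # the loop stops after round 3
```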

On the Scheduling Configurations tab, configure the DLA_SQL node as the upstream node of the end node.

After you configure and save the preceding settings, the loop flow diagram of the do-while node is changed.

Publish tasks

DataWorks DataStudio does not support the do-while node. You must run the do-while node in Operation Center after you submit the node.

On the Date and Data_cleanse_SQL tabs, click Submit to submit tasks. Select all nodes when you submit tasks on the Data_cleanse_SQL tab.

Run tasks

  1. Go to Operation Center of DataWorks and choose Cycle Task Maintenance > Cycle Task. On the page that appears, view the tasks that you have submitted in the task list.
  2. Right-click the date set node and choose Add Data > Current and Downstream Nodes to manually run the two tasks.

    After you run the tasks, you can view the running status of each node.