DataWorks is a platform that can be used to process and analyze large amounts of data in offline mode. It uses MaxCompute as a computing and storage engine to offer fully hosted services for visual workflow development, scheduling, and O&M. In DataWorks, tasks can be hosted and scheduled by time or dependency. This topic describes how to use DataWorks to schedule DLA tasks.

Purpose

Task scheduling requires dependencies between tasks. In this topic, two DLA tasks are created in DataWorks to demonstrate the dependencies between tables and tasks, as shown in the following figure.

  • Task 1

    DataWorks executes the o_orderstatus = 'F' statement to filter the finished orders in the orders table and then writes the finished orders into the finished_orders table.

  • Task 2

    DataWorks executes the o_totalprice > 10000 statement to filter the orders whose total price is greater than USD 10,000 from the finished_orders table, and then writes these orders into the high_value_finished_orders table.

The orders table is stored in the oss://dlaossfile1/dla/ path, of which dla indicates the object and dlaossfile1 indicates the bucket. You can click the orders.txt file in the preceding path to download source data. The empty files finished_orders.txt and high_value_finished_orders.txt are stored in the oss://dlaossfile1/dla/finished_orders/ path.

Prerequisites

  1. The DLA, DataWorks, and OSS services are activated and deployed in the same region. In this topic, the three services are all deployed in the China (Hangzhou) region.
  2. A workspace is created. For more information, see Create a workspace. In this topic, the liujing_dataworks_test workspace is created.
  3. An endpoint is created in DLA. For more information, see Set an endpoint.

Procedure

  1. Create an OSS schema in DLA by using the following statement:
    CREATE SCHEMA dataworks_demo with DBPROPERTIES(
      CATALOG = 'oss',  
    LOCATION = 'oss://dlaossfile1/dla/');
  2. Create the orders, finished_orders, and high_value_finished_orders table for the OSS file in DLA by using the following statements:
    • Statement to create the orders table
      CREATE EXTERNAL TABLE IF NOT EXISTS orders (
          O_ORDERKEY INT,
           O_CUSTKEY INT,
           O_ORDERSTATUS STRING,
           O_TOTALPRICE DOUBLE,
           O_ORDERDATE DATE,
           O_ORDERPRIORITY STRING,
           O_CLERK STRING,
           O_SHIPPRIORITY INT,
           O_COMMENT STRING
      )
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
       STORED AS TEXTFILE LOCATION 'oss://dlaossfile1/dla/';
    • Statement to create the finished_orders table
       CREATE EXTERNAL TABLE IF NOT EXISTS finished_orders (
            O_ORDERKEY INT,
              O_TOTALPRICE DOUBLE
        )
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
         STORED AS TEXTFILE
         LOCATION 'oss://dlaossfile1/dla/finished_orders/';
    • Statement to create the high_value_finished_orders table
        CREATE EXTERNAL TABLE IF NOT EXISTS high_value_finished_orders (
            O_ORDERKEY INT,
             O_TOTALPRICE DOUBLE
        )
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'   
      STORED AS TEXTFILE
        LOCATION 'oss://dlaossfile1/dla/finished_orders/';
  3. Add a DLA data source in DataWorks.
    1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the page that appears, find your workspace and click Data Integration in the Actions column.
    2. On the Welcome to Data Integration page, click the Connection icon. In the left-side navigation pane, click Data Source. On the page that appears, click New data source in the upper-right corner. In the Add data source dialog box, click the Data Lake Analytics(DLA) icon in Big Data Storage.

    3. In the Add Data Lake Analytics(DLA) data source dialog box, specify the parameters shown in the following figure.

      The following table describes the parameters that you must specify.

      Data Source Name The name of the data source. We recommend that you specify an informative name for easy management.
      Description The description of the data source. This parameter is optional.
      Connection Url The DLA endpoint in the Address:Port format. For more information about how to obtain Address:Port, see t1916498.html#topic-2566550.
      Database The name of the Object Storage Service (OSS) database that you create in DLA. In this topic, set this parameter to dataworks_demo.
      User name The username that is used to access DLA.
      Password The password that is used to access DLA.
    4. Modify the IP address whitelist for DLA in DataWorks.

      DataWorks allows you to add only the DLA data sources whose IP addresses are specified in an IP address whitelist of DataWorks. Therefore, you must add the IP addresses or Classless Inter-Domain Routing (CIDR) blocks of your DLA region to the IP address whitelist.

      China (Hangzhou) 100.64.0.0/8,11.193.102.0/24,11.193.215.0/24,11.194.110.0/24,11.194.73.0/24,118.31.157.0/24,47.97.53.0/24,11.196.23.0/24,47.99.12.0/24,47.99.13.0/24,114.55.197.0/24,11.197.246.0/24,11.197.247.0/24
      China (Shanghai) 11.193.109.0/24,11.193.252.0/24,47.101.107.0/24,47.100.129.0/24,106.15.14.0/24,10.117.28.203,10.117.39.238,10.143.32.0/24,10.152.69.0/24,10.153.136.0/24,10.27.63.15,10.27.63.38,10.27.63.41,10.27.63.60,10.46.64.81,10.46.67.156,11.192.97.0/24,11.192.98.0/24,11.193.102.0/24,11.218.89.0/24,11.218.96.0/24,11.219.217.0/24,11.219.218.0/24,11.219.219.0/24,11.219.233.0/24,11.219.234.0/24,118.178.142.154,118.178.56.228,118.178.59.233,118.178.84.74,120.27.160.26,120.27.160.81,121.43.110.160,121.43.112.137,100.64.0.0/8
      China (Shenzhen) 100.106.46.0/24,100.106.49.0/24,10.152.27.0/24,10.152.28.0/24,11.192.91.0/24,11.192.96.0/24,11.193.103.0/24,100.64.0.0/8,120.76.104.0/24,120.76.91.0/24,120.78.45.0/24
      China (Hong Kong) 10.152.162.0/24,11.192.196.0/24,11.193.11.0/24,100.64.0.0/8,11.192.196.0/24,47.89.61.0/24,47.91.171.0/24,11.193.118.0/24,47.75.228.0/24
      Singapore (Singapore) 100.106.10.0/24,100.106.35.0/24,10.151.234.0/24,10.151.238.0/24,10.152.248.0/24,11.192.153.0/24,11.192.40.0/24,11.193.8.0/24,100.64.0.0/8,100.106.10.0/24,100.106.35.0/24,10.151.234.0/24,10.151.238.0/24,10.152.248.0/24,11.192.40.0/24,47.88.147.0/24,47.88.235.0/24,11.193.162.0/24,11.193.163.0/24,11.193.220.0/24,11.193.158.0/24,47.74.162.0/24,47.74.203.0/24,47.74.161.0/24,11.197.188.0/24
      Australia (Sydney) 11.192.100.0/24,11.192.134.0/24,11.192.135.0/24,11.192.184.0/24,11.192.99.0/24,100.64.0.0/8,47.91.49.0/24,47.91.50.0/24,11.193.165.0/24,47.91.60.0/24
      China (Beijing) 100.106.48.0/24,10.152.167.0/24,10.152.168.0/24,11.193.50.0/24,11.193.75.0/24,11.193.82.0/24,11.193.99.0/24,100.64.0.0/8,47.93.110.0/24,47.94.185.0/24,47.95.63.0/24,11.197.231.0/24,11.195.172.0/24,47.94.49.0/24,182.92.144.0/24
      US (Silicon Valley) 10.152.160.0/24,100.64.0.0/8,47.89.224.0/24,11.193.216.0/24,47.88.108.0/24
      US (Virginia) 11.193.203.0/24,11.194.68.0/24,11.194.69.0/24,100.64.0.0/8,47.252.55.0/24,47.252.88.0/24
      Malaysia (Kuala Lumpur) 11.193.188.0/24,11.221.205.0/24,11.221.206.0/24,11.221.207.0/24,100.64.0.0/8,11.214.81.0/24,47.254.212.0/24,11.193.189.0/24
      Germany (Frankfurt) 11.192.116.0/24,11.192.168.0/24,11.192.169.0/24,11.192.170.0/24,11.193.106.0/24,100.64.0.0/8,11.192.116.14,11.192.116.142,11.192.116.160,11.192.116.75,11.192.170.27,47.91.82.22,47.91.83.74,47.91.83.93,47.91.84.11,47.91.84.110,47.91.84.82,11.193.167.0/24,47.254.138.0/24
      Japan (Tokyo) 100.105.55.0/24,11.192.147.0/24,11.192.148.0/24,11.192.149.0/24,100.64.0.0/8,47.91.12.0/24,47.91.13.0/24,47.91.9.0/24,11.199.250.0/24,47.91.27.0/24
      UAE (Dubai) 11.192.107.0/24,11.192.127.0/24,11.192.88.0/24,11.193.246.0/24,47.91.116.0/24,100.64.0.0/8
      India (Mumbai) 11.194.10.0/24,11.246.70.0/24,11.246.71.0/24,11.246.73.0/24,11.246.74.0/24,100.64.0.0/8,149.129.164.0/24,11.194.11.0/24
      UK (London) 11.199.93.0/24,100.64.0.0/8
      Indonesia (Jakarta) 11.194.49.0/24,11.200.93.0/24,11.200.95.0/24,11.200.97.0/24,100.64.0.0/8,149.129.228.0/24,10.143.32.0/24,11.194.50.0/24
      China North 2 Ali Gov If the CIDR block 11.194.116.0/24,100.64.0.0/8 cannot be added, add the following IP addresses: 11.194.116.160, 11.194.116.161, 11.194.116.162, 11.194.116.163, 11.194.116.164, 11.194.116.165, 11.194.116.167, 11.194.116.169, 11.194.116.170, 11.194.116.171, 11.194.116.172, 11.194.116.173, 11.194.116.174, and 11.194.116.175.
    5. After you specify the preceding parameters, click Test connectivity. After the connectivity test is passed, click Complete.
  4. Create a DLA scheduling task in DataWorks.
    1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the page that appears, find your workspace and click Data Analytics in the Actions column.
    2. In the left-side pane, right-click Business process and select New business process. In this topic, the dla_test_1 business process is created.
    3. Create a DLA task. In this topic, the finished_orders task is created.

    4. In the left-side pane, unfold Business process and click the finished_orders task. On the finished_orders tab, unfold Customize and click Data Lake Analytics.
      Note You can repeat this step to create multiple tasks. In this topic, the finished_orders and high_value_finished_orders tasks are created.
  5. Run a DLA scheduling task in DataWorks.

    • Task 1: DataWorks uses o_orderstatus = 'F' to filter the finished orders in the orders table and then writes the finished orders into the finished_orders table.
        insert into finished_orders
        select O_ORDERKEY, O_TOTALPRICE
        from orders
        where O_ORDERSTATUS = 'F';
    • Task 2: DataWorks uses o_totalprice > 10000 to filter the orders whose total price is greater than USD 10,000 from the finished_orders table and writes these orders into the high_value_finished_orders table.
         insert into high_value_finished_orders
         select * from finished_orders
         where O_TOTALPRICE > 10000;

What to do next

  • Task configuration

    In DataWorks, you can configure tasks to be triggered by time or dependency. You can also configure multiple tasks to be executed at a specified time based on specified dependencies.

    To run the finished_orders task at 02:00 every day, specify the parameter shown in the following figure.

    To run the high_value_finished_orders task after the finished_orders task is running normally, specify the time parameter shown in the following figure.

  • Task publish

    After you configure tasks, you can publish and maintain these tasks. For more information, see Deploy a node.