All Products
Search
Document Center

Use DataWorks to shcedule tasks in DLA

Last Updated: Oct 18, 2019

Background

DataWorks uses MaxCompute as the computing and storage engine to support visual workflow development, managed task scheduling, and O&M. This platform provides offline processing and analysis of massive amounts of data, and supports the managed scheduling of all tasks based on time and dependency. Data Lake Analytics (DLA) users can run DLA tasks by using the powerful managed task scheduling feature of DataWorks. In addition, DataWorks provides extensive task management functions, including task development, task dependency management, task scheduling, and task O&M.

Currently, only the region China (Hangzhou) allows you to schedule DLA tasks by using DataWorks. This feature will be launched in other regions soon.

Purpose

An important feature of task scheduling is the dependencies between tasks. This topic demonstrates this feature by creating two DLA tasks in DataWorks. The following figure shows the dependencies between tables and tasks.

1

  • Task 1

    Run o_orderstatus = 'F' to query the finished orders in the orders table, and then insert the query result into the finished_orders table.

  • Task 2

    Run o_totalprice > 10000 to query the orders with a total price of over 10,000 in the finished_orders table, and then insert the query result into the high_value_finished_orders table.

As the data source, the orders table is stored in the dla object of the OSS bucket dlaossfile1. That is, the storage path is oss://dlaossfile1/dla/. Click orders.txt to download the source data. The empty files finished_orders.txt and high_value_finished_orders.txt are stored in oss://dlaossfile1/dla/finished_orders/.

Follow these steps to schedule DLA tasks by using DataWorks:

  1. Create an OSS connection in DLA.

  2. Create external tables orders, finished_orders, and high_value_finished_orders in DLA.

  3. Add the DLA data source in DataWorks.

  4. Create a DLA scheduling task in DataWorks.

  5. Run the DLA scheduling task in DataWorks.

Prerequisites

  1. DLA, DataWorks, and OSS have been activated and are located in the same region. In this example, the three services are in China (Shanghai).

  2. You have created a project. If not, create one by following the instructions provided in Preparations for DataWorks.

    The project used in this example is liujing_dataworks_test.

  3. You have activated the service account for DLA in the current region. If not, activate it by following the instructions provided in Get started with DLA.

  4. You have created an endpoint in DLA. If not, create one by following the instructions provided in Create an endpoint.

Procedure

Step 1: Create an OSS connection in DLA

In this example, execute the following statement to create the OSS connection:

  1. CREATE SCHEMA dataworks_demo with DBPROPERTIES(
  2. CATALOG = 'oss',
  3. LOCATION = 'oss://dlaossfile1/dla/'
  4. );

Step 2: Create external tables that point to the OSS file in DLA

In this example, execute the following statements to create the external tables that point to the OSS file:

  • Table orders
  1. CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  2. O_ORDERKEY INT,
  3. O_CUSTKEY INT,
  4. O_ORDERSTATUS STRING,
  5. O_TOTALPRICE DOUBLE,
  6. O_ORDERDATE DATE,
  7. O_ORDERPRIORITY STRING,
  8. O_CLERK STRING,
  9. O_SHIPPRIORITY INT,
  10. O_COMMENT STRING
  11. )
  12. ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  13. STORED AS TEXTFILE
  14. LOCATION 'oss://dlaossfile1/dla/';
  • Table finished_orders
  1. CREATE EXTERNAL TABLE IF NOT EXISTS finished_orders (
  2. O_ORDERKEY INT,
  3. O_TOTALPRICE DOUBLE
  4. )
  5. ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  6. STORED AS TEXTFILE
  7. LOCATION 'oss://dlaossfile1/dla/finished_orders/';
  • Table high_value_finished_orders

    1. CREATE EXTERNAL TABLE IF NOT EXISTS high_value_finished_orders (
    2. O_ORDERKEY INT,
    3. O_TOTALPRICE DOUBLE
    4. )
    5. ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    6. STORED AS TEXTFILE
    7. LOCATION 'oss://dlaossfile1/dla/finished_orders/';

Step 3: Add the DLA data source in DataWorks

  1. Log on to the DataWorks console, and click Data Integration of the corresponding project.

  2. On the page that appears, click Add Data Source and select Data Lake Analytics(DLA) as the data source.

    1

    1

  3. In the Add Data Source Data Lake Analytics(DLA) dialog box, configure the expected parameters.

    1

    The following table describes the parameters.

    Parameter Description
    Connection Name An informative name that describes the data source.
    Description Optional. The description of the data source.
    URL The DLA endpoint, in the format of Address:Port. You can view endpoint information to obtain Address:Port.
    Databases The name of the OSS connection you create in DLA. In this example, set it to dataworks_demo.
    Username The name of the service account for DLA access.
    Password The password of the service account for DLA access.
  4. After you configure the preceding parameters, click Test Connection. After the test succeeds, click OK.

Step 4: Create a DLA scheduling task in DataWorks

  1. Log on to the DataWorks console, and click Data Analytics of the corresponding project.

  2. Right-click Business Flow to create a business process named dla_test_1.

  3. Create a DLA task named finished_orders.

    1

  4. Click the finished_orders task, and select the DLA data source in the Select a data source field.

You can repeat the preceding steps to create multiple tasks. In this example, two tasks, finished_orders and high_value_finished_orders, are created.

Step 5: Run the DLA scheduling tasks in DataWorks

1

  • Task 1: Run o_orderstatus = 'F' to query the finished orders in the orders table, and then insert the query result into the finished_orders table.

    1. insert into finished_orders
    2. select O_ORDERKEY, O_TOTALPRICE
    3. from orders
    4. where O_ORDERSTATUS = 'F';
  • Task 2: Run o_totalprice > 10000 to query the orders with a total price of over 10,000 in the finished_orders table, and then insert the query result into the high_value_finished_orders table.

    1. insert into high_value_finished_orders
    2. select * from finished_orders
    3. where O_TOTALPRICE > 10000;

Subsequent operations

Configure tasks

DataWorks can trigger tasks by time or dependency, and allows multiple tasks to run at the specified time based on their specified dependency. For more information, see Scheduling configuration.

For example, to run the finished_orders task at 2:00 every morning, perform the following configuration:

1

To run the high_value_finished_orders task after the finished_orders task is completed, perform the following configuration:

1

Publish tasks

You can publish and maintain the tasks that you configured in DataWorks. For more information, see Task management.