Background and preparations

Last Updated: Jun 17, 2020

Background

Data Lake Analytics (DLA) is a serverless big data analysis service that allows you to use standard SQL statements to cleanse data stored in Alibaba Cloud Object Storage Service (OSS) and Table Store. For example, you can use DLA to cleanse historical data in OSS by day. DataWorks uses MaxCompute as its computing and storage engine, and supports visual workflow development, managed node scheduling, and O&M. The platform provides offline processing and analysis of massive amounts of data, and schedules nodes based on time and dependencies. DLA users can schedule and run DLA nodes by using the managed node scheduling feature of DataWorks. In addition, DataWorks provides extensive node management functions, including node development, dependency management, node scheduling, and node O&M.

When you use DLA to cleanse historical data in OSS by day, the SQL statements are fixed, but a different date must be loaded for each execution. This makes the task well suited to cyclic scheduling in DataWorks. For this scenario, you need to do the following in DataWorks:

  • Deploy an assignment node that exports date values, which serve as the input of the downstream do-while node.

  • Deploy a do-while node that contains one or more SQL statements used to cleanse the data. The date is a variable, and the input value for each loop iteration is provided by the assignment node, as sketched after this list.
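
The exact variable syntax depends on how the do-while node is configured, so the following is only a minimal sketch. It assumes the assignment node emits one date per iteration and that the loop body can reference the current value through a placeholder written here as ${loop_date} (a hypothetical name, not a documented DataWorks parameter):

  -- Assignment node (sketch): emit the list of dates to be cleansed,
  -- one value per do-while iteration.
  SELECT DISTINCT o_orderdate FROM orders;

  -- Do-while node body (sketch): cleanse one day of data per iteration.
  -- ${loop_date} stands in for the value passed from the assignment node.
  INSERT INTO finished_orders
  SELECT o_orderkey, o_totalprice
  FROM orders
  WHERE o_orderstatus = 'F'
    AND o_orderdate = '${loop_date}';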

An important feature of node scheduling is the dependency between nodes. To demonstrate this feature, this tutorial uses cyclic scheduling in DataWorks to query completed orders (o_orderstatus = 'F') from the orders table and write the query results to the finished_orders table. The subsequent topics describe the detailed procedure.
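
In DLA, this cleansing step can be expressed as a single INSERT ... SELECT statement. A minimal, date-independent version based on the table definitions in Step 3 looks as follows (the scheduled version parameterizes the date, as sketched above):

  INSERT INTO finished_orders
  SELECT o_orderkey, o_totalprice
  FROM orders
  WHERE o_orderstatus = 'F';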

Prerequisites

Before using DataWorks to schedule DLA tasks cyclically, you need to prepare test data in OSS, create an OSS schema and tables in DLA, and create a DataWorks project.

Step 1: Prepare test data in OSS.

  1. Activate OSS.

  2. Create a bucket.

  3. Upload test data.

    Click orders to download the test data file, and then upload it to the OSS directory that the orders table points to (oss://bucket-name/dla/ in this tutorial). The expected file format is shown below.
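
    The table definitions in Step 3 expect pipe-delimited (|) text files. Each row of the test data therefore has nine fields, similar to the following (hypothetical values, shown for illustration only):

      1|370|F|172799.49|1996-01-02|5-LOW|Clerk#000000951|0|final requests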

Step 2: Create an OSS schema.

  CREATE SCHEMA dataworks_demo WITH DBPROPERTIES (
    CATALOG = 'oss',
    LOCATION = 'oss://bucket-name/dla/'
  );

LOCATION: indicates the directory in the OSS bucket where the data files are stored, which must end with a forward slash (/).
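
DLA is accessed through a MySQL-compatible endpoint, so you can verify the new schema with standard MySQL-style commands. A minimal sketch, assuming you are connected to your DLA endpoint:

  -- Assumption: the DLA endpoint accepts these MySQL-style commands.
  SHOW DATABASES;       -- dataworks_demo should appear in the list
  USE dataworks_demo;   -- switch to the schema before creating tables in Step 3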

Step 3: Create OSS tables.

  • orders table

    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
        O_ORDERKEY INT,
        O_CUSTKEY INT,
        O_ORDERSTATUS STRING,
        O_TOTALPRICE DOUBLE,
        O_ORDERDATE DATE,
        O_ORDERPRIORITY STRING,
        O_CLERK STRING,
        O_SHIPPRIORITY INT,
        O_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://bucket-name/dla/';

  • finished_orders table

    CREATE EXTERNAL TABLE IF NOT EXISTS finished_orders (
        O_ORDERKEY INT,
        O_TOTALPRICE DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://bucket-name/dla/finished_orders/';
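
After both tables are created, a quick query confirms that they resolve to the data in OSS. A sketch, assuming the test data was uploaded to oss://bucket-name/dla/:

  SELECT * FROM orders LIMIT 10;          -- rows parsed from the uploaded test file
  SELECT COUNT(*) FROM finished_orders;   -- returns 0 until the cleansing task runs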

Step 4: Create a DataWorks project.

If you have not created a DataWorks project, create one by following the instructions in Preparation for DataWorks. In this tutorial, the project is named dla_project.

Currently, you can use DataWorks to schedule DLA tasks only in the China (Hangzhou) region. This feature will be available in other regions soon.