
Data Lake Analytics - Deprecated: Background information and preparations

Last Updated: Sep 02, 2022

This topic describes the background information and preparations for cyclically scheduling Presto tasks of Data Lake Analytics (DLA).

Background information

DLA is a serverless big data analytics service. It uses standard SQL statements to cleanse data stored in Object Storage Service (OSS) and Tablestore (OTS). For example, you can use DLA to cleanse historical OSS data that was generated on specific dates. DataWorks is a platform for processing and analyzing large amounts of data in offline mode, and it uses MaxCompute as its computing and storage engine. DataWorks offers fully hosted services for visualized workflow development, scheduling, and O&M, and lets you manage and schedule tasks by time or dependency. DLA users can therefore use DataWorks to schedule and run DLA tasks, manage task dependencies, and perform O&M.

If you want to use DLA to cleanse historical OSS data that was generated on specific dates, you can use DataWorks to cyclically schedule DLA data cleansing tasks, substituting date values into otherwise fixed SQL statements. To cyclically schedule the tasks, you must perform the following operations in DataWorks:

  • Deploy an assignment node that generates a date value and provides the date value as the input for a downstream do-while node.

  • Deploy a do-while node that runs an SQL statement or a group of SQL statements to cleanse the data. The date value in the statements is a variable provided by the assignment node.

Task dependencies are required for scheduling tasks. This example demonstrates how to cyclically schedule DLA tasks that query finished orders from the orders table and write them to the finished_orders table.
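
The following statement is a minimal sketch of one iteration of such a cleansing task, assuming that the do-while node substitutes the date value from the assignment node into a placeholder. The ${bizdate} placeholder name is hypothetical and depends on how the scheduling variable is configured:

-- Minimal sketch of one iteration of the cleansing task.
-- ${bizdate} is a hypothetical placeholder that is assumed to be
-- replaced with the date value produced by the assignment node.
INSERT INTO finished_orders
SELECT O_ORDERKEY, O_TOTALPRICE
FROM orders
WHERE O_ORDERSTATUS = 'F'
    AND O_ORDERDATE = DATE '${bizdate}';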

Preparations

Before you use DataWorks to cyclically schedule DLA tasks, you must make the following preparations:

Step 1: Prepare test data in OSS

  1. Activate OSS

  2. Create buckets

  3. Upload objects

    Click orders to download the test data.

Step 2: Create an OSS schema

CREATE SCHEMA dataworks_demo WITH DBPROPERTIES (
  CATALOG = 'oss',
  LOCATION = 'oss://bucket-name/dla/'
);

LOCATION: the directory in the OSS bucket in which data files are stored. The value must end with a forward slash (/).
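
After the schema is created, you can switch to it so that the table statements in the next step run against it. A minimal sketch, assuming you are connected to DLA through its MySQL-compatible endpoint:

-- Run subsequent DDL statements against the new schema.
USE dataworks_demo;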

Step 3: Create OSS tables

  • orders table

CREATE EXTERNAL TABLE IF NOT EXISTS orders (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://bucket-name/dla/';

  • finished_orders table

CREATE EXTERNAL TABLE IF NOT EXISTS finished_orders (
    O_ORDERKEY INT,
    O_TOTALPRICE DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://bucket-name/dla/finished_orders/';
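
After both tables are created, a quick query can confirm that DLA reads the uploaded test data. A minimal check, assuming the orders test file has been uploaded to the directory specified in LOCATION:

-- Confirm that the external table maps onto the uploaded test data.
SELECT COUNT(*) FROM orders;

-- Preview the finished orders (O_ORDERSTATUS = 'F'), which are the
-- rows that the scheduled task will copy to finished_orders.
SELECT O_ORDERKEY, O_TOTALPRICE
FROM orders
WHERE O_ORDERSTATUS = 'F'
LIMIT 10;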

Step 4: Create a DataWorks workspace