Data Lake Analytics (DLA) is an end-to-end, on-demand, serverless data lake analytics and computing service. It offers a cost-effective platform to run Extract-Transform-Load (ETL), machine learning, streaming, and interactive analytics workloads. DLA can be used with various data sources, such as Object Storage Service (OSS) and databases.

Features

DLA provides an end-to-end data lake analytics and computing solution for data stored on OSS. It offers the following features:

  • End-to-end data lake solution that enables efficient data access, ETL, machine learning, and interactive analytics. DLA provides a platform for you to build a data lake and offers a serverless Presto-compatible SQL engine and a serverless Spark engine.
  • Secure data processing. DLA helps prevent data misuse because the tables in its databases and the data it stores are each protected by their own security mechanisms.
  • Cost-effective data processing. DLA is a serverless, cloud-native data processing service that is billed on a pay-as-you-go basis.
  • Smooth evolution. DLA supports a smooth evolution from a Hadoop system to a data lake solution.

Support for both serverless SQL and serverless Spark engines

The DLA serverless SQL engine is developed based on the open source Apache Presto engine. All computation is performed in memory, which delivers a high-performance, interactive analytics experience in which SQL queries return results within seconds. The DLA serverless Spark engine is developed based on the open source Apache Spark engine and is compatible with all Apache Spark APIs.

We recommend that you use the DLA serverless Spark engine in the following scenarios:

  • You need to write custom code, or SQL statements cannot meet your business requirements.
  • Large volumes of data need to be cleansed. For example, 1 TB to 1 PB of data stored on OSS is cleansed once a day.
  • A wide range of algorithms need to be supported. The DLA serverless Spark engine supports a complete library of Spark algorithms.
  • Streaming is required.
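
The scenarios above can be read as a simple decision rule. The following Python sketch is illustrative only: the function `recommend_engine` and its parameter names are hypothetical and are not part of any DLA API. Falling back to the serverless SQL engine when none of the scenarios applies is an assumption, not something stated above.

```python
def recommend_engine(
    needs_custom_code: bool = False,
    large_scale_cleansing: bool = False,
    needs_spark_algorithms: bool = False,
    needs_streaming: bool = False,
) -> str:
    """Return "spark" if any of the listed scenarios applies, else "sql".

    Hypothetical helper: each flag mirrors one bullet in the list above.
    """
    spark_scenarios = (
        needs_custom_code,       # SQL alone cannot express the logic
        large_scale_cleansing,   # for example, daily cleansing of TB to PB of OSS data
        needs_spark_algorithms,  # the workload relies on Spark's algorithm library
        needs_streaming,         # the workload is a streaming job
    )
    # Assumption: default to the SQL engine when no Spark scenario applies.
    return "spark" if any(spark_scenarios) else "sql"


if __name__ == "__main__":
    print(recommend_engine(needs_streaming=True))
    print(recommend_engine())
```

A single matching scenario is enough to recommend the Spark engine in this sketch, because the list above presents the scenarios as independent alternatives.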

Concepts

  • DLA is a region-based system. Account systems and metadata systems of different regions are completely isolated.
  • DLA uses the pay-as-you-go (post-paid) billing method, which charges fees based on the number of bytes scanned or based on the number of compute units (CUs) used. DLA supports only the serverless SQL engine if the billing method is based on the number of bytes scanned. DLA supports both the serverless SQL and serverless Spark engines if the billing method is based on the number of CUs used.
    • Based on the number of bytes scanned: After you run an SQL query, you are charged based on the number of bytes scanned.
    • Based on the number of CUs used: You are charged based on the number of CUs used. One CU equals 1 CPU core and 4 GB of memory.
  • A virtual cluster (VC) is an abstraction of the underlying resources. You can configure network connections and some basic information for a VC.
    • You must create a VC when the billing method is based on the number of CUs used.
    • If the billing method is based on the number of bytes scanned, you are not charged for creating VCs. Instead, you are charged only for the number of bytes scanned. This ensures that your queries are processed immediately even if you have not purchased resources.
  • Accounts: DLA provides two types of accounts: DLA accounts and RAM users. You can bind DLA accounts to RAM users.
  • Metadata: DLA supports various metadata, such as databases, tables, columns, and views. Each database corresponds to only one data source. Metadata can be securely accessed by using the serverless SQL and Spark engines.
  • Permissions: Fine-grained permissions at the database or table level are supported.
  • Syntax standards:
    • DDL: Refer to the Hive standards.
    • DCL: Refer to the MySQL database standards.
    • DML: The serverless SQL engine is compatible with the Presto standards, whereas the serverless Spark engine is compatible with the Spark standards.
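
The billing concepts above involve a small amount of arithmetic and one compatibility rule, which the following Python sketch makes concrete. The names `cus_required`, `engine_available`, and the billing keys `"bytes_scanned"` and `"cu"` are hypothetical and do not come from any DLA API. The sketch also assumes that a workload needs enough CUs to cover both its CPU and its memory demand, which the text above does not state explicitly.

```python
import math

# From the text above: one CU equals 1 CPU core and 4 GB of memory.
CU_CPU_CORES = 1
CU_MEMORY_GB = 4

# From the text above: bytes-scanned billing supports only the SQL engine,
# CU-based billing supports both engines. The key names are hypothetical.
ENGINES_BY_BILLING = {
    "bytes_scanned": frozenset({"sql"}),
    "cu": frozenset({"sql", "spark"}),
}


def cus_required(cpu_cores: float, memory_gb: float) -> int:
    """Return the smallest whole number of CUs covering both demands.

    Assumption: the tighter of the two constraints (CPU or memory)
    determines how many CUs are needed.
    """
    if cpu_cores <= 0 or memory_gb <= 0:
        raise ValueError("cpu_cores and memory_gb must be positive")
    by_cpu = math.ceil(cpu_cores / CU_CPU_CORES)
    by_memory = math.ceil(memory_gb / CU_MEMORY_GB)
    return max(by_cpu, by_memory)


def engine_available(billing_method: str, engine: str) -> bool:
    """Return True if the engine can be used under the billing method."""
    return engine in ENGINES_BY_BILLING[billing_method]


if __name__ == "__main__":
    print(cus_required(cpu_cores=2, memory_gb=16))
    print(engine_available("bytes_scanned", "spark"))
```

In this sketch, a memory-heavy job that needs 2 cores and 16 GB of memory is sized by its memory demand (16 / 4 = 4 CUs) rather than by its core count.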