Next-Generation Enterprise Cloud Data Analysis Service: Data Lake Analytics - Alibaba Cloud Developer Community

0.1 Traditional Data Warehouse

  • Data comes from various systems, including cloud data storage, NoSQL stores, and relational databases, for example OSS, Table Store, NAS, ApsaraDB for HBase, RDS, and PolarDB;
  • Data is extracted and synchronized to the data warehouse system in real time or at intervals of minutes, hours, or days;
  • Real-time or scheduled aggregate computing and analysis is performed in the data warehouse.

If you need real-time synchronization into the data warehouse and real-time analysis of massive amounts of data, consider Alibaba Cloud's AnalyticDB (https://www.aliyun.com/product/ads) for building a real-time data warehouse solution on the cloud.

The following compares traditional data warehouses and data lake solutions, along with their application scenarios, across feature dimensions such as data, schema, price, performance, data quality, users, and analysis workload (note: part of this content is translated from https://amazonaws-china.com/big-data/what-is-a-data-lake/):

0.2 Cloud Data Lake Analysis Layering

The architecture for building data lake analysis scenarios on the cloud is clearly layered, with three layers from bottom to top:

On the cloud, users can interact directly with the products and services in all three layers to build their own data lake analysis scenarios and solutions.

1.1 Data Lake Analytics Ecosystem Layering

Building on the cloud data lake analysis layers described in the previous section, the Data Lake Analytics ecosystem is divided into a Source layer, a Serverless analysis layer, a Result layer, and a BI SaaS layer.

  • Source layer: supports parallel federated analysis across data sources such as OSS, NoSQL (Table Store), and RDS (MySQL, PostgreSQL, and SQL Server);
  • Serverless analysis layer: the core computing and analysis layer of Data Lake Analytics;
  • Result layer: Data Lake Analytics has built-in multi-source, multi-channel ETL capabilities that write analysis results back to OSS, NoSQL stores, relational databases such as RDS, and other data caching systems;
  • BI SaaS layer: compatibility is continuously being enhanced. Currently, you can use Qlik, Tableau, MicroStrategy, Alibaba Cloud QuickBI, and mainstream MySQL client tools to connect to Data Lake Analytics as a MySQL data source for analysis.
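
Because the service can be added to client tools as a MySQL data source, a generic MySQL driver should in principle be enough to reach it from code. The sketch below is a hedged illustration only: the endpoint, port, credentials, and schema name are placeholders, and PyMySQL is just one possible third-party driver, not something the product prescribes.

```python
from typing import Any, Dict, List, Tuple


def dla_connection_params(host: str, user: str, password: str,
                          database: str, port: int = 3306) -> Dict[str, Any]:
    """Collect keyword arguments for a MySQL-protocol client.

    Every value is a placeholder supplied by the caller. The default port
    3306 is the standard MySQL port and is an assumption here; use the
    endpoint and port shown in your own console.
    """
    return {"host": host, "port": port, "user": user,
            "password": password, "database": database}


def run_query(params: Dict[str, Any], sql: str) -> List[Tuple]:
    """Run one statement through PyMySQL (third-party: pip install pymysql)."""
    import pymysql  # imported lazily so the helper above works without it

    conn = pymysql.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return list(cur.fetchall())
    finally:
        conn.close()


params = dla_connection_params(
    host="example-endpoint.invalid",  # hypothetical endpoint
    user="analyst", password="secret", database="my_schema")
# rows = run_query(params, "SHOW TABLES")  # requires a reachable service
```

Keeping the connection parameters separate from the driver call makes it easy to reuse the same settings in a BI tool or a different MySQL client.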

1.2 Product Features

Within a single Alibaba Cloud region, the deployment architecture of the Data Lake Analytics service is as follows:

3.1 Examples of target users and application scenarios

0) people who need to analyze data stored in OSS;

1) cloud developers and analysts who are familiar with SQL;

2) ad hoc data exploration and analysis;

3) users and enterprises seeking to build a data lake on OSS.

3.2 Typical procedure

0) Upload data generated by your business, such as log, CSV, and JSON files, directly to OSS. Then point Data Lake Analytics at the files or folders to create tables and run queries, and use BI tools to analyze and display the business data;

1) If you have data in Parquet, ORC, RCFile, Avro, or other formats on legacy systems in the Hadoop ecosystem, copy and upload it directly to OSS. Then point Data Lake Analytics at the files or folders to create tables and run queries, and use BI tools to analyze and present the business data;

2) To get better query performance and lower storage costs for subsequent analysis on OSS, you can convert the data format, for example to Parquet or ORC, which improves the cost-effectiveness of repeated analysis.
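
The "point at a folder, create a table, query it" step can be sketched as follows. This is an illustration under assumptions: the Hive-style `CREATE EXTERNAL TABLE ... LOCATION` statement, the bucket path, and the table and column names are all hypothetical, so check the product's DDL reference before relying on the exact syntax.

```python
from typing import Dict


def external_table_ddl(table: str, columns: Dict[str, str],
                       oss_location: str, delimiter: str = ",") -> str:
    """Build a Hive-style DDL statement for delimited text files in a folder.

    The syntax is an assumption modeled on Hive; all names are illustrative.
    """
    if not oss_location.startswith("oss://"):
        raise ValueError("location must be an oss:// path")
    cols = ",\n  ".join(f"{name} {sql_type}"
                        for name, sql_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{oss_location}'"
    )


ddl = external_table_ddl(
    "access_log",
    {"ts": "STRING", "user_id": "BIGINT", "url": "STRING"},
    "oss://my-bucket/logs/csv/",  # hypothetical bucket and folder
)

# A follow-up query that a BI tool or SQL client could then issue.
query = ("SELECT url, COUNT(*) AS hits FROM access_log "
         "GROUP BY url ORDER BY hits DESC LIMIT 10")
print(ddl)
```

The table definition only records where the files live and how to parse them; the data itself stays in OSS.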

For a TPC-H dataset of 1 GB, the relative data size of each data format is shown in the following figure.

3.3 Data backup and query analysis with DBS and Data Lake Analytics

  • DBS (https://www.aliyun.com/product/dbs) is the ApsaraDB data backup service, which provides full and incremental data backup;
  • DBS backup data is stored in OSS, which keeps storage costs low;
  • In the past, DBS could not analyze backup data. Combined with Data Lake Analytics (DLA), DBS can directly analyze historical backup data on OSS without restoring the backup;
  • The combination of DBS and DLA greatly improves the experience of backing up and restoring databases on the cloud, and of analyzing historical data.

OSS Select (https://yq.aliyun.com/articles/593910) is a simple single-file query and analysis service developed by the OSS team that runs close to OSS storage. Because OSS Select sits closer to the data, and adds sharded index optimization and data deserialization optimization for CSV files, Data Lake Analytics combined with OSS Select improves performance by 50% to 90% when analyzing CSV files. In extreme scenarios where large files are heavily filtered, performance improves by dozens of times. For the TPC-H SF 10 workload (10 GB of raw data), the following figure compares queries with DLA's OSS Select computation push-down enabled against queries without it.
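
The idea behind computation push-down can be illustrated with a toy model: filtering next to the storage means only matching rows travel to the query engine. The sketch below is not the OSS Select API. The in-memory `StorageNode` class, its `select` method, and the synthetic order rows are hypothetical stand-ins, used only to compare the amount of data moved with and without push-down.

```python
import csv
import io
from typing import Callable, List


class StorageNode:
    """Hypothetical stand-in for an object store holding one CSV file."""

    def __init__(self, csv_text: str) -> None:
        self._data = csv_text

    def get_object(self) -> str:
        """Return the whole file (no push-down)."""
        return self._data

    def select(self, predicate: Callable[[List[str]], bool]) -> str:
        """Filter rows at the storage side and return only the matches."""
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        for row in csv.reader(io.StringIO(self._data)):
            if predicate(row):
                writer.writerow(row)
        return out.getvalue()


def is_shipped(row: List[str]) -> bool:
    return row[1] == "SHIPPED"


# 1,000 synthetic order rows; every 100th one is SHIPPED.
rows = [[str(i), "SHIPPED" if i % 100 == 0 else "OPEN"] for i in range(1000)]
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
node = StorageNode(buf.getvalue())

# Without push-down: move the whole file, then filter in the engine.
full = node.get_object()
engine_side = [r for r in csv.reader(io.StringIO(full)) if is_shipped(r)]

# With push-down: the storage side filters, so less data is moved.
pushed = node.select(is_shipped)
storage_side = list(csv.reader(io.StringIO(pushed)))

print(len(full), len(pushed), engine_side == storage_side)
```

Both paths return the same rows; the difference is how much data crosses the boundary between storage and the engine, which is where highly selective filters on large files benefit most.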

Security is always the baseline for cloud data access and operations. Data Lake Analytics follows cloud security practices at every level of the product:

Across the upstream and downstream data flow ecosystem of Data Lake Analytics, the main supported systems are OSS, Table Store, AnalyticDB, and RDS (MySQL, PostgreSQL, and SQL Server).

List of source and destination data stores:

Data Lake Analytics (https://www.aliyun.com/product/datalakeanalytics) product beta QR code:

