0.1 Traditional data warehouse
- Data comes from a variety of systems, such as cloud storage, NoSQL stores, and relational databases: OSS, Table Store, NAS, ApsaraDB for HBase, RDS, and PolarDB;
- Data is extracted and synchronized into the warehouse at real-time, minute-level, hour-level, or day-level latency;
- Real-time and scheduled aggregate computation and analysis run inside the warehouse.
This pattern emphasizes real-time synchronization into the warehouse and real-time analysis over massive data volumes. For a real-time cloud data warehouse solution, see Alibaba Cloud AnalyticDB (https://www.aliyun.com/product/ads).
The differences between traditional data warehouses and data lake solutions, and their application scenarios, can be compared along feature dimensions such as data, schema, price, performance, data quality, users, and analysis workload (note: part of this content is adapted from https://amazonaws-china.com/big-data/what-is-a-data-lake/).
The architecture for data lake analysis scenarios on the cloud is clearly layered, with three layers from bottom to top:
- Layer 1: a variety of southbound, multi-model cloud-native data storage and database services. Users can land their data in a range of cloud-native storage services and cloud database systems, including cost-effective object storage (OSS, well suited to structured, semi-structured, and unstructured raw data and multimedia files), NoSQL services (Table Store, ApsaraDB for HBase, etc.), and cloud databases such as RDS and PolarDB.
- Layer 2: the cloud-native data lake analysis service layer built on top of the southbound multi-model storage. This layer embodies a key property of cloud-native design, namely Serverless, the foundation for building SaaS on the cloud. Beyond data lake analysis, the same model appears in Serverless PaaS (https://serverless.aliyun.com/) and FaaS (for example, Alibaba Cloud Function Compute: https://www.aliyun.com/product/fc). This article focuses on the serverless data lake analysis capabilities, which can be summarized in three aspects:
- First, Elastic: the service scales on demand and in time, with predictive, intelligent handling of mixed workloads;
- Second, Resilient: high availability that is transparent to users, including transparent rolling upgrades, fast failover, and cross-zone disaster tolerance;
- Third, Federated & Analytical: federated analysis across multiple data models, including the ability to read many formats and systems, comprehensive analysis functions, strong interactive analysis performance and experience, parallel processing of data and computation, and good interface compatibility.
- Layer 3: the data analysis application and visualization layer. This includes business logic built on the data lake analysis service layer, cloud analysis tools and products such as DataV (https://data.aliyun.com/visual/datav) and QuickBI (https://data.aliyun.com/product/bi), and other data analysis products and tools that come online in the cloud marketplace (https://market.aliyun.com/).
- Source layer: supports parallel federated analysis over data in OSS, NoSQL (Table Store), and RDS (MySQL, PostgreSQL, and SQL Server);
- Serverless analysis layer: the core computation and analysis layer of Data Lake Analytics;
- Result layer: Data Lake Analytics has built-in multi-source, multi-channel ETL capabilities that write analysis results back to OSS, NoSQL stores, relational databases such as RDS, and other data caching systems;
- BI SaaS layer: compatibility is continuously improving; currently, mainstream MySQL-compatible client tools such as Qlik, Tableau, MicroStrategy, and Alibaba Cloud QuickBI can connect to Data Lake Analytics as a MySQL data source for analysis.
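Because DLA presents itself as a MySQL-compatible endpoint, any MySQL client or driver can connect to it. A minimal sketch of building a stock `mysql` CLI invocation for such an endpoint; the hostname, user, and schema below are placeholders, not real DLA values:

```python
import shlex

def dla_mysql_cmd(host, user, schema, sql):
    """Build a `mysql` CLI invocation that runs one query against a
    MySQL-protocol endpoint such as DLA (default MySQL port 3306)."""
    args = ["mysql", "-h", host, "-P", "3306", "-u", user, "-p",
            "-D", schema, "-e", sql]
    # shlex.quote makes the command safe to paste into a shell.
    return " ".join(shlex.quote(a) for a in args)

# Hypothetical endpoint and credentials -- substitute the values from
# your DLA console.
cmd = dla_mysql_cmd("service.example-region.datalakeanalytics.aliyuncs.com",
                    "dla_user", "my_schema", "SELECT 1")
```

The same connection parameters work in any MySQL driver (JDBC, pymysql, etc.), which is why the BI tools listed above can treat DLA as an ordinary MySQL data source.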
1.2 Product Features
In a single Alibaba Cloud region, the Data Lake Analytics service is deployed as follows:
- The Admin Proxy is the product's control node. It is deployed in the OXS zone and interacts with Alibaba Cloud's sales, metering, ALB, SLA, CloudMonitor, SLS, and RAM systems; it implements the control and service functions common to all cloud products, exposes POP APIs, and serves as the service entry point for the product console;
- All DLA service roles run on ECS instances inside a VPC, including the Resource Manager, FrontNode, ComputeNode, and Meta Store;
- The Resource Manager is the brain of Data Lake Analytics resource scheduling, responsible for starting, maintaining, and upgrading service processes;
- The Resource Manager initializes the DLA service in each Alibaba Cloud region by invoking ROS templates customized for DLA;
- The Resource Manager also controls the load watermark of compute service nodes and ECS instances based on cluster resource utilization and busyness; it scales ECS instances automatically by calling ESS, keeping cluster resource utilization under control;
- The FrontNode is the entry point of the DLA query and analysis service. Multiple FrontNodes are mounted behind the EIP of an ALB, which load-balances incoming query connections;
- The ComputeNode is a stateless query-task compute node;
- The Meta Store service is the metadata center of the DLA cluster. Backed by RDS for MySQL storage inside the VPC, it provides unified metadata storage and query services for the other service roles;
- The service interfaces exposed through ALB support SingleTunnel VPC, classic-network IPv4, and classic-network IPv6 entry points (Data Lake Analytics was among the first Alibaba Cloud products to support IPv6 service entry points: https://www.aliyun.com/solution/ip/).
3.1 Examples of target users and application scenarios
0) People who need to analyze data stored in OSS;
1) cloud developers and analysts who are familiar with SQL;
2) users doing temporary, exploratory analysis;
3) users and enterprises seeking to build a data lake on OSS.
3.2 Typical procedure
0) You can upload data generated by your business, such as files in log, CSV, or JSON format, directly to OSS, then point Data Lake Analytics at the files or folders to create tables and run queries, and use BI tools to analyze and present the business data;
1) If you have data in Parquet, ORC, RCFile, Avro, or other formats on legacy systems in the Hadoop ecosystem, you can copy it to OSS, then point Data Lake Analytics at the files or folders to create tables and run queries, and use BI tools for analysis and presentation;
2) To get better query performance and lower storage costs for subsequent analysis on OSS, you can convert the data format, for example to Parquet or ORC, improving the cost-effectiveness of repeated analysis.
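The procedure above can be sketched as the SQL a client would send to DLA: an external table over raw CSV files in OSS, then a Parquet copy for cheaper repeated scans. The bucket name, paths, column list, and the exact DDL dialect (ROW FORMAT / STORED AS / LOCATION, in the Hive style) are illustrative assumptions; consult the DLA documentation for the authoritative grammar.

```python
def csv_table_ddl(table, columns, oss_location):
    """CREATE EXTERNAL TABLE over raw CSV files in an OSS folder
    (Hive-style DDL sketch, not the verified DLA grammar)."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (f"CREATE EXTERNAL TABLE {table} ({cols}) "
            f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
            f"STORED AS TEXTFILE LOCATION '{oss_location}'")

def to_parquet_ctas(parquet_table, csv_table, oss_location):
    """Materialize the CSV table as Parquet for cheaper repeated scans."""
    return (f"CREATE TABLE {parquet_table} STORED AS PARQUET "
            f"LOCATION '{oss_location}' AS SELECT * FROM {csv_table}")

# Hypothetical bucket, paths, and schema for illustration only.
ddl = csv_table_ddl("orders_csv",
                    [("order_id", "BIGINT"), ("amount", "DOUBLE")],
                    "oss://my-bucket/raw/orders/")
ctas = to_parquet_ctas("orders_parquet", "orders_csv",
                       "oss://my-bucket/warehouse/orders/")
```

After the conversion, BI tools query `orders_parquet` instead of the raw CSV table, which is the cost/performance trade the text describes.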
When the TPC-H data size is 1 GB, the relative size of each data format is shown in the following figure.
3.3 Data backup and query analysis with DBS and Data Lake Analytics
- DBS (https://www.aliyun.com/product/dbs) is the ApsaraDB data backup service, providing full and incremental data backups;
- DBS backup data is stored in OSS, keeping storage costs low;
- Previously, DBS could not analyze backup data. Combined with DLA, DBS can now analyze historical backup data on OSS directly, without a restore step;
- Together, DBS and DLA greatly improve the experience of backing up and restoring databases on the cloud, and of analyzing historical data.
OSS Select (https://yq.aliyun.com/articles/593910) is a simple single-file query and analysis service developed by the OSS team, running close to OSS storage. Because OSS Select sits closer to the data, Data Lake Analytics can push computation down to it; combined with DLA's own sharded-index and deserialization optimizations for CSV files, this improves CSV analysis performance by 50% to 90%, and by dozens of times in extreme scenarios that filter large files. In a TPC-H SF 10 workload (10 GB of raw data), the following figure compares queries with DLA's OSS Select computation pushdown enabled and disabled.
Data Lake Analytics secures access along several dimensions:
- access control for Alibaba Cloud accounts and sub-accounts in the console;
- user-created DLA service accounts are protected with KMS envelope encryption;
- a familiar database-style ACL mechanism authorizes and controls access to objects;
- all service roles in the sales region are deployed in VPCs, ensuring network isolation between compute and service instances;
- access across cloud services is authorized through RAM roles and STS.
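The database-style ACL in the list above follows the familiar MySQL GRANT/REVOKE model. A minimal sketch of rendering such a statement; the account, schema, and table names are placeholders, and the exact privilege set DLA supports should be confirmed in its documentation:

```python
def grant_stmt(privileges, obj, account):
    """Render a MySQL-style GRANT statement for a schema object.
    The grammar sketched here is the generic MySQL form, assumed
    (not verified) to match DLA's ACL syntax."""
    return f"GRANT {', '.join(privileges)} ON {obj} TO '{account}'"

# Hypothetical account and table for illustration.
stmt = grant_stmt(["SELECT"], "my_schema.orders_csv", "analyst_user")
```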
In the overall upstream and downstream data flow ecosystem of Data Lake Analytics, the main supported systems are OSS, Table Store, AnalyticDB, and RDS (MySQL, PostgreSQL, and SQL Server).
A list of supported source and destination data sources and sinks:
Data Lake Analytics (https://www.aliyun.com/product/datalakeanalytics) product beta QR code:
Technical introduction, tutorials, and usage:
- Data Lake Analytics: new changes in the data analysis era
- Empowering data: Alibaba Cloud launches the serverless data analysis engine Data Lake Analytics
- Data lake practices based on DataLakeAnalytics
- Tutorial: use Data Lake Analytics and OSS to analyze TPC-H datasets in CSV format
- Tutorial: Data Lake Analytics + OSS data file format processing
- Instructions for OSS LOCATION in Data Lake Analytics
- Tutorial: how to use Data Lake Analytics to create a partitioned table
- Use Data Lake Analytics to cleanse data from OSS into AnalyticDB
- OLAP on Table Store: serverless SQL big data analysis based on Data Lake Analytics
- How to use Data Lake Analytics to analyze Table Store data on Alibaba Cloud
- Tutorial: use Data Lake Analytics to read and write RDS data