
Data Lake House: Technical Principle Analysis of Hologres, Accelerating Cloud DLF

This article analyzes the technical principles behind how the Hologres high-performance analytics engine accelerates queries on data managed by Alibaba Cloud Data Lake Formation (DLF).

By Xuefu Zhang, a Senior Technical Expert in Hologres Research and Development at Alibaba

Hologres is a comprehensive real-time data warehouse developed by Alibaba Cloud. This cloud-native system integrates real-time serving and big data analysis scenarios. It is compatible with the PostgreSQL protocol and integrates seamlessly with the big data ecosystem. With a single data architecture, it supports real-time writes, real-time queries, and offline federated analysis, simplifying business architectures, providing real-time decision-making capabilities, and enabling big data to deliver greater business value. As business and technology evolve, Hologres continues to strengthen its core capabilities, from its origins within Alibaba Group to its commercialization on the cloud. To help everyone learn more about Hologres, we plan to launch a series of articles revealing its underlying technical principles, including the high-performance storage engine, the high-efficiency query engine, high-throughput writes, and high-QPS queries. Please stay tuned.


As cloud services gain wider acceptance, users are increasingly willing to store their data in low-cost object storage services such as OSS and S3. At the same time, cloud-based data management has become more common, with metadata stored in Alibaba Cloud Data Lake Formation (DLF). The combination of OSS and DLF provides a new way to build data lakes. As the volume of data in these cloud data lakes grows, so does the demand for the data lake house. The data lake house architecture is built on external storage in open formats plus high-performance query engines, making the data architecture flexible, scalable, and pluggable.

Hologres is seamlessly integrated with DLF in data lake house scenarios: data is stored in OSS, managed by DLF, and can be queried without importing or exporting it. Hologres is compatible with all file formats supported by DLF and enables second-level and sub-second-level interactive analysis of petabytes of offline data. All of this relies on DLF-Access, the engine Hologres uses to access DLF metadata and OSS data, and on the high-performance distributed execution engine HQE, which further improves the performance of accessing DLF and OSS.

Hologres DLF/OSS query acceleration provides the following benefits:

  • High Performance: Queries on DLF/OSS data are accelerated to second-level response times. This supports ad hoc queries in OLAP scenarios and covers most analysis workloads such as reporting.
  • Low Cost: Hologres accesses the data stored in DLF/OSS directly, without migrating or importing it, so you keep the low storage cost of a cloud data lake and avoid the cost of data migration (see the query sketch after this list).
  • Compatibility: All common DLF file formats can be accessed with high performance and full compatibility, including CSV, Parquet, ORC, Hudi, and Delta (Hudi and Delta are supported since Hologres V1.3), covering the formats typically found in data lakes.
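For example, once a DLF table has been mapped to a Hologres external table (the next sections show how this mapping is created), it can be queried with ordinary SQL while the data stays in OSS. The table and column names below are hypothetical; this is only a minimal sketch of querying lake data in place:

```sql
-- Hypothetical external table "dlf_orders" mapped to a DLF/OSS table.
-- The data remains in OSS; Hologres reads it through DLF-Access at query time.
SELECT order_date,
       COUNT(*)          AS order_cnt,
       SUM(order_amount) AS total_amount
FROM   dlf_orders
WHERE  order_date >= DATE '2022-01-01'
GROUP  BY order_date
ORDER  BY order_date;
```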

An Introduction to Data Lake House Architecture

As shown in the figure below, the architecture Hologres uses to access DLF/OSS in a data lake house is concise.

(Figure: Hologres data lake house architecture for accessing DLF/OSS)

Data in the cloud data lake is stored in OSS, and its metadata is stored in DLF. When Hologres executes a query to accelerate access to DLF or OSS data, the flow is as follows:

  • The frontend receives SQL requests, parses and transforms the SQL statements, and obtains metadata and other related information from DLF-Access through RPC.
  • HQE (Hologres Engine) uses DLF-Access to obtain the relevant data from OSS/DLF and returns it to the frontend.

DLF-Access is a distributed data access engine that consists of multiple peer processes and can scale out. Each process can serve either of two roles:

  • Handles metadata-related requests and is mainly responsible for obtaining tables, partition metadata, and file splits
  • Handles read and write requests for the actual data, involving column pruning, data conversion, data encapsulation, and other functions

Technical Innovations of DLF External Tables

With the DLF-Access architecture, Hologres can accelerate queries on DLF/OSS data with high performance, thanks to the following technical innovations:

1) Abstract Distributed External Tables

Leveraging the distributed nature of DLF/OSS, Hologres abstracts a distributed external table for accessing data in both traditional and cloud data lakes.

2) Seamless Interoperability with DLF Meta

DLF-Access interoperates seamlessly with the DLF metadata service, so metadata and data can be obtained in real time. You can use the IMPORT FOREIGN SCHEMA command to synchronize DLF metadata into Hologres and create the corresponding external tables automatically, as sketched below.
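The following is a minimal sketch of this workflow. IMPORT FOREIGN SCHEMA is standard PostgreSQL syntax that Hologres supports; the foreign server name, database name, table name, and schema used here are placeholders and may differ in your environment and Hologres version:

```sql
-- Assume a foreign server pointing at DLF has already been created
-- (named "dlf_server" here purely for illustration).
-- Synchronize the metadata of a DLF database into the local schema "public",
-- which automatically creates the matching Hologres external tables.
IMPORT FOREIGN SCHEMA my_dlf_database
    LIMIT TO (dlf_orders)          -- only map the tables you need
    FROM SERVER dlf_server
    INTO public;
```

After the import, the external table can be queried like any other Hologres table, as shown in the earlier example.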

3) Vectorized Data Reading and Conversion

DLF-Access makes full use of the characteristics of data lake file formats to read and convert data in a vectorized way, which improves performance.

4) Return Shared Data Formats

The converted data is returned in the shared Apache Arrow format, so Hologres can consume it directly and avoid the overhead of additional serialization and deserialization.

5) Block Mode IO

DLF-Access and Hologres transmit data in blocks (8,192 rows per block by default) to reduce the impact of network latency and load.

6) Programming Language Isolation

Hologres is a cloud-native OLAP engine developed in C/C++. DLF cloud data lakes are highly compatible with traditional open-source data lakes, and open-source data lakes mostly provide Java-based libraries. The independent DLF-Access engine isolates the two programming languages, avoiding costly conversion between the native engine and the Java virtual machine while preserving support for the flexible and diverse data formats of data lakes.

DLF External Tables Upgrade to HQE

As mentioned above, Hologres uses DLF-Access to accelerate queries on DLF external tables, and query performance is already good. However, there is still a layer of RPC interaction between Hologres and DLF-Access, and when the amount of data is large, the network can become a bottleneck.

Therefore, building on its existing capabilities, Hologres V1.3 optimizes the execution engine so that HQE can read DLF tables directly, improving read performance by more than 30% compared with earlier versions.

This is mainly due to the following aspects:

1) The RPC interaction and an extra round of data redistribution are eliminated, which improves performance.

2) The Hologres block cache is reused, so repeated queries read data directly from memory instead of storage, avoiding storage I/O and further accelerating queries.

3) The existing filter pushdown capability is reused to reduce the amount of data that needs to be processed (see the sketch after this list).

4) Prefetching and caching are implemented at the underlying I/O layer to speed up scans.
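As a rough illustration of filter pushdown, consider the following query against the hypothetical external table from the earlier examples. Predicates on the partition column and on ordinary columns can be pushed down to the scan layer so that irrelevant partitions and row groups are skipped rather than being transferred and filtered inside the engine (the exact pushdown behavior depends on the file format and Hologres version):

```sql
-- Hypothetical partitioned external table "dlf_orders" (partitioned by ds).
-- The filter on ds prunes partitions, and the filter on order_amount can be
-- pushed into the Parquet/ORC reader, so far less data is read from OSS.
SELECT order_id, order_amount
FROM   dlf_orders
WHERE  ds = '2022-03-01'
  AND  order_amount > 1000;
```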

Support for Transactional Data Lakes (Hudi and Delta)

Currently, DLF supports the table formats used in transactional data lakes, such as Hudi and Delta. Hologres reads these tables directly through DLF-Access without any additional operations, meeting the requirements of a real-time data lake house architecture.

Summary

Hologres integrates with DLF/OSS through DLF-Access and makes full use of their respective advantages. With extreme performance as the goal, Hologres directly accelerates queries on data in cloud data lakes, making interactive analytics easier and more efficient, reducing analysis costs, and realizing the data lake house solution.

We will launch a series of articles in the future to reveal the underlying principles of Hologres. Please stay tuned!
