Databricks Data Insight Open Course - An Introduction to Delta Lake (Commercial Edition)

By Li Jiexing, Senior Software Engineer of Databrick

1. Design Background of Lakehouse Search Engine

1.1 Data Warehouse and Lakehouse

The Data Management system has evolved from the early Data Warehouse to Lakehouse. Lakehouse can store structured, semi-structured, and unstructured data at the same time. It supports scenarios of streaming analytics, BI, data science, and machine analysis.

1.2 Lakehouse's Challenges in Query Performance

The data warehouse architecture can control the storage and query of data, so the query system can be designed at the same time. The data storage structure of the query system can be adapted to achieve superior query performance.

Under the Lakehouse architecture, data is stored in an open storage structure (such as Parquet format), so more systems can access data easily. However, the open storage format is not necessarily suitable for query operations, and the query system does not have enough statistical data to achieve an efficient query.

How does Lakehouse achieve efficient query performance with an open storage format?

1.3 Solution

Databricks Lakehouse has designed a new search engine with excellent SQL performance in both the Data Lake storage system and file format to solve these problems.

Its SQL performance optimization is achieved through the following techniques:

A. High-Speed Cache: Putting hot data into the high-speed cache

B. Establishment of Auxiliary Data Structures such as the collection of statistical data and indexing

C. Data Layout Optimization: Minimize I/O

D. Dynamic File Pruning: Minimize I/O

2. SQL Performance Optimization Techniques in Lakehouse

2.1 High-Speed Cache

Most workloads focus on accessing some hot data. Data Warehouse often uses SSD and memory as caches to speed up queries of hot data.

Lakehouse can cache the hot data with the same optimized data structure as a data warehouse to improve query performance.

As shown in the figure, using SSD as a cache in Databricks can increase the data reading speed by more than three times and using Delta engine can increase it by more than seven times.

2.2 Establishment of Auxiliary Data Structures

Even if the data is stored in Parquet format, many additional data structures can be established to speed up queries and maintain additional data.

Example One: Data Skippin g in Parquet Files

In Parquet files, the minimum/maximum statistics for each data file in the table are maintained, which helps skip some irrelevant data when performing queries.

As shown in the following figure, the query conditions are year=2020 and uid=24000. With the minimum/maximum statistics, the query information only exists in file3, so the reading of file1 and file2 can be skipped.

Example Two: Indexing Parquet Files

As shown in the following figure, if the query condition is type="DELETE_ACCOUNT" - you can use the index established on the type to go to the corresponding data and avoid reading irrelevant data.

Example Three: Create Bloom Filter on Parquet Files

Bloom Filter can be established for each file. Bloom Filter can quickly determine whether the table file contains the data to be queried. If not, the file can be skipped. It reduces the amount of scanned data and improves query performance.

Bloom Filter Principle:

Bloom Filter uses one or more hash tables to calculate the hash value of the data record in each file. Its starting value is 0, which is 1 when the hash value is mapped to the corresponding position, so the position with the value of 0 can be skipped when querying. It is also possible that all the corresponding positions are 1. At this time, the data may not be in this file (false positive). You can control the false positive rate by controlling the number of hash functions and the size of the Bloom Filter.

2.3 Data Layout

A. Small File Problem

Frequent MERGE, UPDATE, and INSERT operations executed in Delta Lake may generate many small files. These small files will reduce the reading performance of the system and increase the overhead of metadata operations.

Different techniques are used in Lakehouse to reduce small files.

Optimize Delta Table Write

As shown in the following figure, in the open-source Spark, when each executor writes data to a partition, it creates a table file for writing. Then, many small files are generated in one partition.

Databricks optimizes the writing process of Delta tables. A special executor is used to merge writing to the partition by other executors for each partition. This avoids the generation of small files.

Automatic Merging of Small Files

After data is written to the Delta table each time, the system checks the number of table files in the Delta table. If the number of small files (whose size is smaller than 128MB) in the Delta table reaches the threshold, the system merges the small files in the Delta table into a large file.

Manual Merging of Small Files

In addition to automatic merging, Databricks provides the Optimize command to enable users to merge small files manually and optimize the table structure to make the structure of table files more compact.

B. Query Time Issues

The query run time mainly depends on the amount of data accessed. If the Parquet format is used, you can reduce run time by optimizing the data layout within the table.

Sort Table File Data

Sort the data stored in table files and store a certain amount of data in each table file. In the following figure, file1 stores uid=0...1000, and file2 stores uid=1001...2000. As such, irrelevant table files can be skipped as needed during queries to reduce the number of file scans.

Z-Ordering Optimization

In actual queries, some queries need to look at the data of column1 in a certain range, and some queries need to look at the data of column2 in a certain range, or more. At this time, it is not enough to sort column1.

Z-Ordering can store the associated information in the same set of files in multiple dimensions (like col 1-4 in the following figure) to reduce unnecessary file reading.

2.4 Dynamic File Pruning

Dynamic file pruning is abbreviated as DFP. Let's take the following query as an example.

SELECT sum(ss_quantity) 
FROM store_sales 
JOIN item ON ss_item_sk = i_item_sk 
WHERE i_item_id ='AAAAAAAAICAAAAAA'

Query Description: Connect the store_sales table with the item table if item_sk values are equal and item_id is equal to one fixed value.

DFP Disabled

As shown in the preceding figure, a full table scan is performed on store_sales and then the rows in the filtered item table are joined. Although only more than 46,000 data records are returned, over 8.6 billion data records in the store_sales table are scanned.

DFP Enabled

After DFP is enabled, the item table is scanned to query i_item_id='AAAAAAAAICAAAAAA 'data rows in the table items. Then, the i_item_sk values of these data rows are used as the query conditions for the ss_item_sk of the table store_sales and are filtered in the SCAN stage of the table store_sales to skip many irrelevant data. Only more than 6.6 million store_sales entries are scanned, which is nearly 99% less than when DFP is not enabled.

From the results, after enabling DFP, the query achieved a 10-fold performance improvement.

This feature was tested on the TPC-DS (please see the figure below). The test found that after DFP was enabled, the TPC-DS query speed increased 4.5 to 8 times.

4.5 Combination Optimization

The preceding optimization technologies are used to work together, so data reading in Lakehouse is performed in the high-speed cache. An auxiliary data structure is established to reduce I/O for non-cached data reading through data layout optimization, and the Lakehouse engine can provide query performance similar to a data warehouse.

As shown in the following figure, the query performance of Delta Engine is similar to DW1 and exceeds DW2 and DW3.

3. Delta Clones

Delta Clones is an important technology of Lakehouse, which can efficiently copy large data sets and support different needs of testing, sharing, and Machine Learning.

3.1 What Is Clone?

Clone equals to copy. It is a copy of the original data at a given point in time.

It has the same metadata as the source table: the same table structure, constraints, column descriptions, statistics, and partitions.

There are two cloning methods: shallow and deep.

3.2 Deep Clone

Deep clone will completely copy the metadata and data files of the source table and generate a new and independent table.

A. Deep Clone Statements

Run the CREATE TABLE statement in SQL:

# SQL 
CREATE TABLE delta.`path/to/copy` CLONE customers

Run the DeltaTable statement in Python and Scala statements:

# Python and Scala
DeltaTable
.forName("park", "customers")
.clone("path/to/copy")

B. Characteristics of Deep Clone

Compared with the source table, the clone table has a separate history record.
Any changes to the source table during or after the cloning process will not be reflected in the cloning table.

3.3 Shallow Clone

Shallow cloning only copies the metadata of the table to be cloned. The data files of the table itself are not copied.

A. Shallow Clone Statement

Similar to the deep clone statement, except that the SHALLOW CLONE statement is added in SQL. isShallow=true is added in Python and Scala.

# SQL 
CREATE TABLE delta.`path/to/copy` SHALLOW CLONE customers

# Python and Scala 
DeltaTable 
.forName("spark", "customers") 
.clone("path/to/copy", isShallow=true)

B. Characteristics of Shallow Clone

Shallow clone is not self-contained, which means it is not a data source. If the source file data is deleted, the shallow clone data may be unavailable.
Shallow clone does not copy metadata related to flow transactions or COPY INTO.

3.4 Scenarios for Cloning

There are many applicable scenarios for cloning, such as data storage, short-term experiments, data sharing, and disaster recovery. Deep clone is required in other scenarios except for short-term experiments that use shallow clone.

Community