Delta Lake is a data lake solution developed by Databricks, and Delta is its core component. Delta provides the features needed to write data to a data lake and to manage, query, and read that data. Combined with other upstream and downstream tools, Delta lets you build an easy-to-use and secure data lake.

Background information

In a traditional data lake solution, you build a data lake on a big data storage engine and store various types of data in it. The storage engine can be an object storage service such as Alibaba Cloud Object Storage Service (OSS) or an on-premises system such as Hadoop Distributed File System (HDFS). You then connect an analytics engine such as Spark or Presto to the data lake to analyze the stored data. This solution has the following disadvantages:
  • A failed data ingestion job can leave dirty data behind, and cleaning up that data and recovering the job is troublesome.
  • Without extract, transform, load (ETL) operations, no data quality controls are enforced.
  • No transactions are available to isolate reads from writes, or streaming reads and writes from batch reads and writes.
Delta solves these issues:
  • Delta adds a data management layer on top of the big data storage layer, which works much like the metadata management module of a database. Delta metadata is stored together with the data and is visible to users. For more information, see Figure 1.
  • Based on this metadata management, Delta provides atomicity, consistency, isolation, durability (ACID) transactions that prevent malformed data from being ingested and isolate reads from writes during ingestion, so no dirty data is generated.
  • Field (schema) information is stored in the metadata, so Delta can validate data during ingestion to ensure data quality.
  • The transaction feature isolates streaming reads and writes from batch reads and writes. A minimal sketch of this behavior follows Figure 1.
Note ACID refers to the four basic properties of database transactions: atomicity, consistency, isolation, and durability.
Figure 1. Data warehouses and data lakes
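The following PySpark sketch illustrates this on a minimal level. It assumes a Spark session that already has the Delta Lake dependency on its classpath; the path /tmp/delta/events, the events data, and the checkpoint location are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-acid-sketch").getOrCreate()
    path = "/tmp/delta/events"

    # Batch write: each commit is recorded in the _delta_log directory that
    # sits next to the data files, so the metadata is stored with the data.
    spark.range(0, 100).withColumnRenamed("id", "event_id") \
        .write.format("delta").mode("overwrite").save(path)

    # Batch read: readers only ever see fully committed snapshots, so a
    # failed or in-flight write never shows up as dirty data.
    spark.read.format("delta").load(path).show(5)

    # Streaming read of the same table: the transaction log isolates this
    # stream from concurrent batch writers.
    query = (spark.readStream.format("delta").load(path)
             .writeStream.format("console")
             .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
             .start())
    # query.awaitTermination()  # keep the stream running in a real job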
The following table compares a data warehouse, a traditional data lake, and Delta Lake.
Item | Data warehouse | Data lake | Delta Lake
---- | -------------- | --------- | ----------
Architecture | Integrated or separated computing and storage | Separated computing and storage | Separated computing and storage
Storage management | Stringent, non-general format | Native format | General, lightweight format
Scenario | Reports and analytics | Reports, analytics, and data science | Reports, analytics, and data science
Flexibility | Low | High | Medium
Data quality and reliability | High | Low | Medium
Transactions | Supported | Not supported | Supported
Performance | High | Low | Medium
Scalability | Depends on the scenario | High | High
Typical users | Managerial staff | Managerial staff and data scientists | Managerial staff and data scientists
Costs | High | Low | Low

Scenarios

Delta is an ideal solution for managing cloud-based data lakes. You can use Delta in the following scenarios:
  • Real-time query: Upstream data is ingested into Delta in real time and can be queried immediately. ACID transactions isolate data writes from queries to avoid dirty reads.
  • Data deletes and updates for General Data Protection Regulation (GDPR) requests: Traditional data lakes do not support deletes or updates. To change data, you must manually delete the original data and then write the updated data back to storage. Delta supports deletes and updates directly; see the sketch after this list.
  • Real-time data synchronization based on change data capture (CDC): You can run a streaming job that uses the merge feature provided by Delta to apply upstream data changes to Delta Lake in real time, as also shown in the sketch after this list.
  • Data quality control: Delta provides a schema validation feature that rejects abnormal data during ingestion, so you can remove such data or route it elsewhere for separate processing.
  • Schema evolution: Your data schema may need to change over time. Delta provides an API to evolve the schema of an existing table.
  • Real-time machine learning: Machine learning pipelines typically require a significant amount of time to clean, transform, and featurize data, and historical data and real-time data usually have to be processed separately. Delta simplifies these workflows: it provides a single complete and reliable real-time stream for cleaning, transforming, and featurizing data, so you no longer need to process historical and real-time data separately.
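The following PySpark sketch illustrates the delete, update, merge, and schema-handling scenarios above. It uses the open source delta-spark Python API; the /tmp/delta/users path, the users table, and the user_id and email columns are hypothetical, and the merge is shown as a single batch that could equally run inside a streaming job's foreachBatch.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit
    from delta.tables import DeltaTable

    spark = SparkSession.builder.appName("delta-scenarios").getOrCreate()

    # Hypothetical Delta table of user records at this path.
    users = DeltaTable.forPath(spark, "/tmp/delta/users")

    # GDPR-style delete: remove one user's rows in place instead of
    # manually rewriting the underlying files.
    users.delete("user_id = 42")

    # In-place update, e.g. to redact a field rather than drop the row.
    users.update(condition="user_id = 43", set={"email": "NULL"})

    # CDC-style merge: upsert a batch of upstream changes into the table.
    updates = spark.createDataFrame([(44, "new@example.com")],
                                    ["user_id", "email"])
    (users.alias("t")
     .merge(updates.alias("s"), "t.user_id = s.user_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # Schema validation and evolution: by default, appending a DataFrame
    # whose schema does not match the table fails; mergeSchema explicitly
    # opts in to adding the new column.
    updates.withColumn("country", lit("DE")) \
        .write.format("delta").mode("append") \
        .option("mergeSchema", "true").save("/tmp/delta/users")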

EMR Delta

EMR Delta provides more features and higher performance than open source Delta. For example, EMR Delta supports more SQL statements and provides the optimize feature. The following table lists the basic features of Delta Lake and compares EMR Delta with open source Delta 0.5.0.

Feature | EMR Delta | Open source Delta
------- | --------- | -----------------
SQL | ALTER, CONVERT, CREATE, CTAS, DELETE, DESC HISTORY, INSERT, MERGE, OPTIMIZE, UPDATE, VACUUM | CREATE, CONVERT, DESC HISTORY, VACUUM
API | batch read/write, streaming read/write, optimize, delete, update, merge, convert, history, vacuum | batch read/write, streaming read/write, delete, update, merge, convert, history, vacuum
Hive connector | Supported | Supported
Presto connector | Supported | Supported
Parquet | Supported | Supported
ORC | Not supported | Not supported
Text format | Not supported | Not supported
Data skipping | Not supported | Not supported
Z-order | Supported | Not supported
Native DeltaLog | Supported | Not supported

Note For open source Delta, the CREATE TABLE syntax is CREATE TABLE <tbl> USING delta LOCATION <delta_table_path>. You can create a table only in an existing Delta directory, and you must not specify a schema when you create the table.
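The following sketch shows part of the SQL surface compared above, issued through spark.sql(). The events table name and path are hypothetical; the statements marked as EMR-only fail on open source Delta 0.5.0, and the OPTIMIZE ... ZORDER BY form is written in the Databricks Delta style on the assumption that EMR Delta follows it, so confirm the exact syntax against the EMR documentation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-sql-sketch").getOrCreate()

    # Supported by both: register a table over an existing Delta directory.
    # As the note above states, open source Delta 0.5.0 requires the
    # directory to already contain a Delta table and forbids a schema.
    spark.sql("CREATE TABLE events USING delta LOCATION '/tmp/delta/events'")

    spark.sql("DESC HISTORY events").show()  # audit the commit history
    spark.sql("VACUUM events")               # remove stale data files

    # EMR Delta only, per the table above; these fail on open source 0.5.0.
    spark.sql("DELETE FROM events WHERE event_id < 10")
    spark.sql("UPDATE events SET event_id = 0 WHERE event_id = 10")
    spark.sql("OPTIMIZE events ZORDER BY (event_id)")  # syntax assumed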