Apache Hudi is a data lake storage format that supports updating, deleting, and consuming changed data on Hadoop file systems.
Hudi table types
Hudi supports the following two table types:
-
Copy On Write
Stores data in the Parquet file format. Updates to a Copy On Write table require a rewrite operation.
-
Merge On Read
Stores base data in columnar format (Parquet) and incremental data in row-based format (Avro). A compaction operation periodically merges incremental data into columnar files based on a configurable policy.
The following table compares the two table types.
|
Trade-off |
Copy On Write |
Merge On Read |
|
Data Latency |
High |
Low |
|
Query Latency |
Low |
High |
|
Update cost (I/O) |
High (rewrites the entire Parquet file) |
Low (appends to delta logs) |
|
Parquet File Size |
Small (high update I/O overhead) |
Large (low update overhead) |
|
Write Amplification |
High (high write amplification) |
Low (depends on the merge policy) |
Hudi query types
Hudi supports the following three query types:
-
Snapshot Queries
Queries data from the latest commit snapshot. For a Merge On Read table, a snapshot query merges the base columnar data with the real-time log data. For a Copy On Write table, a snapshot query returns the latest version of the Parquet data.
Both Copy On Write and Merge On Read tables support this query type.
-
Incremental Queries
Queries data written after a specific commit.
Both Copy On Write and Merge On Read tables support this query type.
-
Read Optimized Queries
Returns the most recent compacted data. For Merge On Read tables, this query type reduces latency by skipping the online merge of log data, at the expense of data freshness.
The following table compares the query types.
|
Trade-off |
Snapshot Queries |
Read Optimized Queries |
|
Data Latency |
Low |
High |
|
Query Latency |
High for Merge On Read tables |
Low |
Scenarios
-
Near-real-time data ingestion
Hudi supports insert, update, and delete operations. You can ingest log data from services such as Kafka and Simple Log Service (SLS) into Hudi in real time. Hudi also supports real-time synchronization of change data from database binary logging (binlog).
Hudi optimizes small files that are generated during data writes. This makes Hudi more efficient on HDFS than traditional file formats.
-
Near-real-time data analytics
Hudi works with multiple analytics engines, including Hive, Spark, Presto, and Impala. As a file format, Hudi requires no additional service processes, which makes it lightweight.
-
Incremental data processing
Hudi supports incremental queries, which allow you to use Spark Streaming to retrieve data that changed after a specific commit and consume changed data from HDFS.