Hudi - E-MapReduce - Alibaba Cloud Documentation Center

Hudi table types

Hudi supports the following two table types:

Copy On Write

Stores data in the Parquet file format. Updates to a Copy On Write table require a rewrite operation.
Merge On Read

Stores base data in columnar format (Parquet) and incremental data in row-based format (Avro). A compaction operation periodically merges incremental data into columnar files based on a configurable policy.

The following table compares the two table types.

Trade-off	Copy On Write	Merge On Read
Data Latency	High	Low
Query Latency	Low	High
Update cost (I/O)	High (rewrites the entire Parquet file)	Low (appends to delta logs)
Parquet File Size	Small (high update I/O overhead)	Large (low update overhead)
Write Amplification	High (high write amplification)	Low (depends on the merge policy)

Hudi supports the following three query types:

Snapshot Queries

Queries data from the latest commit snapshot. For a Merge On Read table, a snapshot query merges the base columnar data with the real-time log data. For a Copy On Write table, a snapshot query returns the latest version of the Parquet data.

Both Copy On Write and Merge On Read tables support this query type.
Incremental Queries

Queries data written after a specific commit.

Both Copy On Write and Merge On Read tables support this query type.
Read Optimized Queries

Returns the most recent compacted data. For Merge On Read tables, this query type reduces latency by skipping the online merge of log data, at the expense of data freshness.

The following table compares the query types.

Near-real-time data ingestion

Hudi supports insert, update, and delete operations. You can ingest log data from services such as Kafka and Simple Log Service (SLS) into Hudi in real time. Hudi also supports real-time synchronization of change data from database binary logging (binlog).

Hudi optimizes small files that are generated during data writes. This makes Hudi more efficient on HDFS than traditional file formats.
Near-real-time data analytics

Hudi works with multiple analytics engines, including Hive, Spark, Presto, and Impala. As a file format, Hudi requires no additional service processes, which makes it lightweight.
Incremental data processing

Hudi supports incremental queries, which allow you to use Spark Streaming to retrieve data that changed after a specific commit and consume changed data from HDFS.