Qualitatively, all three are intermediate data storage layers for data lakes, and their data management functions are built on a set of meta files. A meta file plays a role similar to the catalog or WAL in a database: it handles schema management, transaction management, and data management. Unlike in a database, these meta files are stored in the storage engine alongside the data files and are directly visible to users. This approach inherits the big data analysis tradition of keeping data visible to users, but it also increases the risk of accidental damage. If a user accidentally deletes the meta directory, the table is destroyed and is difficult to restore.
The meta files contain the table's schema information, so the system can track schema changes and support schema evolution. The meta files also serve as a transaction log (the file system must support atomicity and consistency). Every change to a table generates a new meta file, so the system supports ACID and multiple versions, and can provide access to the table's history. In these respects, the three are the same.
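To make the meta file idea concrete, here is a minimal sketch in plain Python. It is not the on-disk format of Delta, Hudi, or Iceberg: the `_meta` directory name, the JSON layout, and the `commit` and `snapshot` helpers are all hypothetical. It only shows how one new meta file per change, created atomically, can give schema evolution, multiple versions, and access to history.

```python
import json
import os
import tempfile


def commit(table_dir, version, actions):
    """Write the meta file for `version`; fail if that version already exists."""
    meta_dir = os.path.join(table_dir, "_meta")  # hypothetical directory name
    os.makedirs(meta_dir, exist_ok=True)
    path = os.path.join(meta_dir, f"{version:08d}.json")
    # O_EXCL makes creation atomic: only one writer can claim a given version.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)


def snapshot(table_dir, as_of=None):
    """Replay meta files up to `as_of`; return (schema, set of live data files)."""
    meta_dir = os.path.join(table_dir, "_meta")
    schema, files = None, set()
    for name in sorted(os.listdir(meta_dir)):
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(meta_dir, name)) as f:
            actions = json.load(f)
        schema = actions.get("schema", schema)   # schema evolution
        files |= set(actions.get("add", []))     # files added by this commit
        files -= set(actions.get("remove", []))  # files logically deleted
    return schema, files


table = tempfile.mkdtemp()
commit(table, 0, {"schema": ["id", "name"], "add": ["part-0.parquet"]})
commit(table, 1, {"schema": ["id", "name", "age"], "add": ["part-1.parquet"]})
commit(table, 2, {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]})

latest_schema, latest_files = snapshot(table)      # current version
old_schema, old_files = snapshot(table, as_of=0)   # read an older version
```

Note that deleting the `_meta` directory in this sketch loses the table even though every data file is still there, which is exactly the risk described above.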
Let's talk about the differences between the three.
First, Hudi. Hudi's design goal is just what its name says: Hadoop Upserts Deletes and Incrementals (formerly Hadoop Upserts anD Incrementals). It emphasizes support for upserts, deletes, and incremental data processing. Its main writing tools are the Spark HudiDataSource API and its own DeltaStreamer, and three write operations are supported: UPSERT, INSERT, and BULK_INSERT. Deletes are also performed by specifying certain options during a write; a pure delete operation is not supported.
The typical usage is to write upstream data to Hudi through Kafka or Sqoop, using DeltaStreamer. DeltaStreamer is a long-running service that continuously pulls data from upstream and writes it to Hudi. Writes are done in batches, and you can set the scheduling interval between batches. The default interval is 0, which is similar to the as-soon-as-possible policy of Spark Streaming. As data is continuously written, small files are generated. DeltaStreamer can automatically trigger a task to merge these small files.
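As an illustration of what merging small files involves, the following sketch plans a compaction by greedily grouping small files up to a target size. This is not Hudi's actual algorithm; `plan_compaction`, `small_limit`, and `target_size` are names I made up for the example.

```python
def plan_compaction(file_sizes, small_limit, target_size):
    """Group files smaller than `small_limit` into merge groups of at most `target_size`."""
    # Consider only small files, smallest first.
    small = sorted(
        (name for name, size in file_sizes.items() if size < small_limit),
        key=lambda name: file_sizes[name],
    )
    groups, current, current_size = [], [], 0
    for name in small:
        size = file_sizes[name]
        if current and current_size + size > target_size:
            groups.append(current)          # close the group that is full
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    # A group with a single file has nothing to merge with.
    return [group for group in groups if len(group) > 1]


# Hypothetical file sizes in MB; "d" is already large and is left alone.
sizes = {"a": 10, "b": 20, "c": 30, "d": 200, "e": 15}
plan = plan_compaction(sizes, small_limit=64, target_size=128)
```

Each group in `plan` would be rewritten as one larger file, with the originals removed in the same commit.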
In terms of queries, Hudi supports Hive, Spark, and Presto.
In terms of performance, Hudi designed HoodieKey, which is similar to a primary key. Min/max statistics and a BloomFilter on HoodieKey are used to quickly locate the file that contains a record. During an upsert, if the HoodieKey is not in the BloomFilter, an insert is performed. Otherwise, Hudi checks whether the HoodieKey really exists, and if it does, an update is executed. This HoodieKey + BloomFilter approach to upserts is relatively efficient; without it, a full-table join would be required to implement upserts. For query performance, a system generally needs to generate filter conditions from the query predicates and push them down to the data source. Hudi does not do much in this area, and its performance depends entirely on the predicate push-down and partition pruning provided by the engine.
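The upsert path described above can be sketched in plain Python. This is a toy model under my own assumptions, not Hudi's code: `BloomFilter`, `DataFile`, and `upsert` are hypothetical names, and new records simply go to the last file.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DataFile:
    """A toy data file holding records by key, with min/max keys and a Bloom filter."""

    def __init__(self, name):
        self.name = name
        self.records = {}
        self.bloom = BloomFilter()
        self.min_key = None
        self.max_key = None

    def put(self, key, value):
        self.records[key] = value
        self.bloom.add(key)
        self.min_key = key if self.min_key is None else min(self.min_key, key)
        self.max_key = key if self.max_key is None else max(self.max_key, key)

    def may_hold(self, key):
        # Cheap checks first: key range, then the Bloom filter.
        if self.min_key is None or not (self.min_key <= key <= self.max_key):
            return False
        return self.bloom.might_contain(key)


def upsert(files, key, value):
    for f in files:
        # The Bloom filter may report a false positive, so confirm in the file itself.
        if f.may_hold(key) and key in f.records:
            f.records[key] = value
            return "update"
    files[-1].put(key, value)  # key not found anywhere: insert
    return "insert"


files = [DataFile("f0"), DataFile("f1")]
files[0].put("a", 1)
files[0].put("c", 3)
first = upsert(files, "a", 10)    # key exists in f0
second = upsert(files, "z", 26)   # key is outside every file's range
```

The point of the design is that most files are ruled out by the range and Bloom filter checks without being read, so no join against the whole table is needed.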
Another feature of Hudi is its support for Copy On Write and Merge On Read. The former merges data at write time, which gives poorer write performance but better read performance. The latter merges at read time, which costs some read performance but makes written data available promptly, so it can provide near real-time data analysis.
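The trade-off between the two modes can be shown with a toy model. The two classes below are my own simplification (a dict stands in for a base file, a list of dicts for delta logs) and are not Hudi's implementation.

```python
class CopyOnWriteTable:
    """Merge at write time: every upsert rewrites the base data."""

    def __init__(self):
        self.base = {}            # stands in for the columnar base file
        self.files_rewritten = 0

    def upsert(self, batch):
        self.base = {**self.base, **batch}  # rewrite the base with the new values
        self.files_rewritten += 1

    def read(self):
        return dict(self.base)    # no merge needed at read time


class MergeOnReadTable:
    """Merge at read time: upserts are appended to a log and applied when reading."""

    def __init__(self):
        self.base = {}
        self.log = []             # appended delta batches

    def upsert(self, batch):
        self.log.append(dict(batch))  # cheap append, data is available at once

    def read(self):
        merged = dict(self.base)
        for batch in self.log:    # the reader pays the merge cost
            merged.update(batch)
        return merged

    def compact(self):
        # Fold the log into the base so later reads are cheap again.
        self.base = self.read()
        self.log = []


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for batch in [{"k1": 1}, {"k1": 2, "k2": 5}]:
    cow.upsert(batch)
    mor.upsert(batch)
```

Both tables return the same rows from `read()`; they differ only in whether the writer or the reader does the merge work.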
Iceberg has no design similar to HoodieKey and does not emphasize a primary key. As mentioned above, without a primary key, operations such as update, delete, and merge must be implemented through a join, which requires an SQL-like execution engine. Iceberg is not bound to any engine and does not have its own engine, so Iceberg does not support update, delete, or merge. If users need to update data, the best approach is to work out which partitions need updating and then overwrite the data in those partitions. The quickstart and Spark documentation on the official Iceberg website only describe writing data to Iceberg through the Spark DataFrame API, and do not mention other data ingestion methods. As for writing with Spark Streaming, the code implements the corresponding StreamWriteSupport, so streaming writes should be supported, but the official website does not seem to mention this explicitly. Supporting streaming writes means there will be a small-file problem, and the official website does not mention how to merge small files. I suspect that Iceberg's streaming writes and small-file merging may not yet be production ready, which is why they are not mentioned (purely personal speculation).
In terms of query, Iceberg supports Spark and Presto.
Iceberg has done a lot of work on query performance. Its hidden partition feature is particularly worth mentioning. With hidden partitions, the user can choose some columns of the input data and apply an appropriate transform to them to form a new column that serves as the partition column. This partition column is only used to partition data and does not appear directly in the table's schema. For example, if you have a timestamp column, you can use hour(timestamp) to generate a new partition column, timestamp_hour. timestamp_hour is invisible to users and is only used to organize data. Iceberg keeps statistics for the partition column, such as the range of data contained in each partition. When a user runs a query, these partition statistics can be used for partition pruning.
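Here is a minimal sketch of the hour(timestamp) example in plain Python. It does not use Iceberg's API; `hour_transform` and `HiddenPartitionTable` are hypothetical names. Rows carry only the timestamp, the partition value is derived from it, and a time-range query is pruned to the matching hourly partitions.

```python
from collections import defaultdict
from datetime import datetime


def hour_transform(ts):
    """Partition transform: truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


class HiddenPartitionTable:
    def __init__(self, transform):
        self.transform = transform
        self.partitions = defaultdict(list)
        self.partitions_scanned = 0

    def append(self, row):
        # The row has no partition column; the partition value is derived from ts.
        self.partitions[self.transform(row["ts"])].append(row)

    def scan(self, start, end):
        """Return rows with start <= ts < end, reading only partitions that can match."""
        first_hour = self.transform(start)
        result = []
        self.partitions_scanned = 0
        for hour, rows in self.partitions.items():
            if hour < first_hour or hour >= end:
                continue  # partition pruning: this hour cannot overlap the range
            self.partitions_scanned += 1
            result.extend(r for r in rows if start <= r["ts"] < end)
        return result


table = HiddenPartitionTable(hour_transform)
day = datetime(2019, 12, 1)
for h, m in [(10, 15), (10, 45), (11, 30), (13, 5)]:
    table.append({"ts": day.replace(hour=h, minute=m), "value": f"{h}:{m:02d}"})

# The query filters on ts only; the user never mentions the hourly partition.
rows = table.scan(day.replace(hour=10, minute=30), day.replace(hour=11))
```

In this example the table has three hourly partitions, and the query reads only one of them.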
Besides hidden partitions, Iceberg also collects statistics on ordinary columns. These statistics are quite complete, including the size of the column, the value count, the null value count, and the maximum and minimum values of the column. This information can be used to filter data at query time.
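A rough sketch of how per-file column statistics let a query skip files, in plain Python. The functions `collect_stats` and `can_skip` are hypothetical and leave out column size; they are not Iceberg's API.

```python
def collect_stats(rows, columns):
    """Collect value count, null count, min, and max for each column of one file."""
    stats = {}
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        stats[col] = {
            "value_count": len(rows),
            "null_count": len(rows) - len(values),
            "min": min(values) if values else None,
            "max": max(values) if values else None,
        }
    return stats


def can_skip(stats, col, lo, hi):
    """True if no row in the file can satisfy lo <= col <= hi."""
    s = stats[col]
    if s["min"] is None:  # the column is entirely null in this file
        return True
    return s["max"] < lo or s["min"] > hi


# Two toy data files with an "age" column.
files = {
    "f0": [{"age": 18}, {"age": 25}, {"age": None}],
    "f1": [{"age": 40}, {"age": 61}],
}
stats = {name: collect_stats(rows, ["age"]) for name, rows in files.items()}

# Query: WHERE age BETWEEN 30 AND 50. Only files that might match are read.
to_read = [name for name, s in stats.items() if not can_skip(s, "age", 30, 50)]
```

File `f0` is skipped from its statistics alone, without opening the data.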
Iceberg provides an API for creating tables. You can use this API to specify the description, schema, and partition information, and then create tables in the Hive catalog.
Finally, let's talk about Delta. Delta is a data lake storage layer that integrates stream and batch processing. It supports update, delete, and merge. Because it comes from Databricks, all of Spark's data writing methods are supported, including DataFrame-based batch and streaming writes and SQL Insert and Insert Overwrite (SQL writes are not currently supported in the open-source version; EMR supports them). Like Iceberg, Delta does not emphasize a primary key, so its update, delete, and merge are implemented on top of Spark's join. In terms of data writing, Delta is tightly bound to Spark, which is different from Hudi: Hudi is not bound to Spark (you can use Spark, or you can use Hudi's own writing tool to write data).
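To show what a join-based merge looks like when there is no key index, here is a small sketch in plain Python. Delta does this with Spark joins; the `merge_by_join` function below is only my illustration of the idea and covers just the update and insert branches of a merge.

```python
def merge_by_join(target, source, key):
    """Upsert `source` rows into `target` by joining the two on `key`."""
    source_by_key = {row[key]: row for row in source}  # build side of a hash join
    merged, matched = [], set()
    # Without a key index, every target row must be scanned to find matches.
    for row in target:
        k = row[key]
        if k in source_by_key:
            merged.append(source_by_key[k])  # matched: take the updated row
            matched.add(k)
        else:
            merged.append(row)               # not matched: keep the old row
    # Source rows that matched nothing are inserts.
    merged.extend(row for row in source if row[key] not in matched)
    return merged


target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
merged = merge_by_join(target, source, "id")
```

Compare this with the HoodieKey + BloomFilter approach: here the whole target has to take part in the join, which is why an execution engine is needed.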
In terms of queries, open-source Delta currently supports Spark and Presto. However, Spark is indispensable, because processing the delta log requires Spark. This means that if you want to use Presto to query Delta, you still need to run a Spark job. What's more, Presto queries are based on SymlinkTextInputFormat: before querying, you must run a Spark job to generate such a symlink file. If the table data is updated in real time, this means running a SparkSQL job before every query and then running Presto. In that case, why not do everything in SparkSQL? This is a very painful design. EMR has made improvements here by supporting DeltaInputFormat, so users can query Delta data directly with Presto without starting a Spark task in advance.
In terms of query performance, open-source Delta has hardly been optimized at all. It does not even collect statistics on ordinary columns, let alone offer anything like Iceberg's hidden partitions. Databricks has kept the Data Skipping technology it is so proud of to itself. I have to say that this is not good for the adoption of Delta. The EMR team is doing some work in this area, hoping to make up for the gap.
Delta is inferior to Hudi in data merging and to Iceberg in querying. Does that mean Delta is useless? Not at all. One of Delta's advantages is its integration with Spark (it is still not perfect at present, but it will be much better after Spark 3.0). In particular, its stream-batch integration design and multi-hop data pipelines support multiple scenarios, such as analysis, machine learning, and CDC. Flexible use and comprehensive scenario support are its greatest advantages over Hudi and Iceberg. In addition, Delta is billed as an improved version of the Lambda and Kappa architectures: you do not need to care whether processing is stream or batch, or about the architecture. In this respect, Hudi and Iceberg fall short.
From the analysis above, we can see that the three started out with different target scenarios. Hudi targets incremental upserts, Iceberg focuses on high-performance analysis and reliable data management, and Delta focuses on processing data in a unified stream-batch manner. These different scenarios also led to the design differences among the three. Hudi in particular has a design that differs markedly from the other two. Over time, the three are steadily filling in their missing capabilities, and in the future they may converge and move into each other's territory. Of course, each may instead concentrate on the scenarios it is best at and build up barriers around its own advantages. So it is still unknown who will win and who will lose in the end.
The following table summarizes the three along several dimensions. Note that the capabilities listed in this table only reflect the state of things at the end of 2019.
Note: given the limits of my own knowledge, the content of this article may contain errors. Readers are welcome to criticize and correct it!