By Hu Zheng (Ziyi), a technical expert at Alibaba, an Apache Iceberg Committer, and a member of the Apache HBase PMC. He is currently responsible for the solution design and development of the Flink data lake, and is the author of HBase Principle and Practice.
On January 27, 2021, Apache Iceberg version 0.11.0 was released with the following core functions implemented:
1) Apache Iceberg supports partition spec evolution in the Core API. It also adds the SortOrder specification to Iceberg Format V2, which is mainly used to cluster high-cardinality columns into a small number of files, significantly reducing the number of small files. Read efficiency improves at the same time: after data is written in sorted order, the min-max ranges at the file level and page level become smaller, which enables more effective data filtering.
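As an illustrative sketch of the new SortOrder API (the catalog handle and column name here are hypothetical; the `replaceSortOrder()` entry point is the one added alongside the SortOrder specification):

```java
// Sketch only: define a sort order on an existing table so that subsequent
// writers cluster rows by "id" within each data file, shrinking min-max ranges.
Table table = catalog.loadTable(TableIdentifier.of("db", "sample"));  // hypothetical catalog handle

table.replaceSortOrder()
    .asc("id", NullOrder.NULLS_LAST)  // sort ascending by id, nulls last
    .commit();
```

With such an order committed, engines that respect the table's sort order write files whose per-file and per-page statistics are much tighter, which is what makes the filtering described above effective.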
2) In terms of integrating Flink and Iceberg, the community has achieved the following goals:
write.distribution-mode=hash is supported when writing data through the Flink Iceberg Sink, which can significantly reduce the number of small files at the source.
3) For integrating Spark 3 with Iceberg, the community now supports a large number of high-level SQL statements.
4) The community has also made further progress on broader ecosystem integration.
In the following sections, we discuss the details of Apache Flink integration in Apache Iceberg 0.11.0.
In Apache Iceberg 0.10.0, the following features are supported in the Flink SQL:
1) Write data to the Apache Iceberg table in streaming tasks;
2) Write data to the Apache Iceberg table in batch tasks; and
3) Read Apache Iceberg tables in batch tasks.
Reading Apache Iceberg tables from Flink streaming tasks has been integrated into Apache Iceberg 0.11.0. As a result, users can quickly implement data transfer and ETL operations between different Iceberg tables. If an original table A needs to be processed or widened into a table B, the Streaming Reader of Apache Iceberg is a good choice.
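For illustration, such a streaming ETL from table A into a widened table B might look like the following Flink SQL sketch (all table and column names here are hypothetical):

```sql
-- Hypothetical tables: continuously read new snapshots of table_a and
-- write the widened rows into table_b.
INSERT INTO table_b
SELECT
    a.id,
    a.data,
    d.category   -- extra column joined in from a dimension table
FROM table_a /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */ AS a
LEFT JOIN dim_table AS d ON a.id = d.id;
```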
Additionally, Netflix reports that it uses the Flink Streaming Reader to implement backfill and bootstrap of historical data, which will require integrating Iceberg with FLIP-27 in the future. For now, Netflix has shared some related practices and design work for reference.
Currently, both Flink SQL and the DataStream API are available (we recommend Flink SQL). Users can start the Flink SQL client as described in the documentation and then launch a streaming task to consume the incremental data of an Apache Iceberg table, as follows:
-- Submit the Flink job in streaming mode for the current session.
SET execution.type = streaming;
-- Enable this switch because the streaming read SQL passes job options through Flink SQL hint options.
SET table.dynamic-table-options.enabled = true;
-- Read all records from the current Iceberg snapshot, then read incremental data starting from that snapshot.
SELECT * FROM sample /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s') */;
-- Read all incremental data starting from snapshot-id '3821550127947089987' (records from this snapshot are excluded).
SELECT * FROM sample /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987') */;
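For users who prefer the DataStream API, a roughly equivalent streaming read might look like the following sketch (the warehouse path is a placeholder; `FlinkSource` and `TableLoader` are the builder classes shipped in the iceberg-flink module):

```java
// Sketch: incremental streaming read of an Iceberg table with the DataStream API.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Load the table from a (placeholder) Hadoop warehouse location.
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs:///warehouse/path/sample");

DataStream<RowData> stream = FlinkSource.forRowData()
    .env(env)
    .tableLoader(tableLoader)
    .streaming(true)                         // keep monitoring for new snapshots
    .startSnapshotId(3821550127947089987L)   // optional: start after this snapshot
    .build();

stream.print();
env.execute("iceberg-streaming-read");
```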
Limit and filter pushdown to the Iceberg table has been implemented in both the Flink batch source and streaming source. This means that when handling the following SQL, which reads an Apache Iceberg table,
SELECT * FROM sample LIMIT 10;
the data can be filtered at the storage layer instead of being read in full and then filtered by the compute engine. This significantly improves data access efficiency.
The filter pushdown operation is similar. Currently, the following filter pushdown operations are supported, which cover nearly all typical filter pushdown operations:
SELECT * FROM sample WHERE data = 'a';
SELECT * FROM sample WHERE data != 'a';
SELECT * FROM sample WHERE data >= 'a';
SELECT * FROM sample WHERE data <= 'a';
SELECT * FROM sample WHERE data < 'a';
SELECT * FROM sample WHERE data > 'a';
SELECT * FROM sample WHERE data = 'a' AND id = 1;
SELECT * FROM sample WHERE data = 'a' OR id = 1;
SELECT * FROM sample WHERE data IS NULL;
SELECT * FROM sample WHERE NOT (id = 1);
SELECT * FROM sample WHERE data LIKE 'aaa%';
This feature is in high demand in the Apache Flink community and mainly targets two core scenarios:
1) Users want to import binlog from a relational database to the Apache Iceberg data lake to provide near-real-time data analysis.
2) Users want to import the upsert stream generated by Flink streaming job AGG to the Apache Iceberg data lake. And they want to provide near-real-time data reports with the help of Apache Iceberg's storage capability and Apache Flink's analysis capability.
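As an illustrative sketch of the first scenario, binlog data can be ingested through a CDC source connector and continuously applied to an Iceberg table. The example below assumes the MySQL source from the flink-cdc-connectors project; all table names and connection options are placeholders:

```sql
-- Hypothetical CDC source table backed by a MySQL binlog.
CREATE TABLE mysql_orders (
    order_id BIGINT,
    amount   DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'localhost',
    'port'          = '3306',
    'username'      = 'user',
    'password'      = 'pass',
    'database-name' = 'shop',
    'table-name'    = 'orders'
);

-- Continuously apply the changelog to an Iceberg (Format V2) table.
INSERT INTO iceberg_orders SELECT * FROM mysql_orders;
```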
Generally speaking, each of the available open-source solutions has its shortcomings: Hive MR can only guarantee T+1 data timeliness; Apache Kudu is decoupled from HDFS and cloud object storage; HBase's row-based storage limits analytical capability; and with Spark plus Delta, Apache Flink's stream computing strengths cannot be fully leveraged. Users expect Apache Iceberg to solve these problems.
We have generally divided the integration of CDC and Upsert by Flink and Iceberg into two stages:
1) In the first phase, Flink can successfully write the CDC and Upsert data into Apache Iceberg and read a correct result; and
2) In the second stage, Flink and Iceberg can pass the stability and performance tests based on a massive data set, guaranteeing the stability and performance of the whole procedure for production.
So far, the first phase has been reached in version 0.11.0: streaming tasks can write CDC and Upsert data into Apache Iceberg. Partners in China, such as Autohome and Bilibili, have verified its correctness on medium-sized data sets.
A series of features related to performance and stability are planned for Apache Iceberg 0.12.0, the milestone version in which the Iceberg CDC and Upsert functions are expected to become production-ready.
When Flink streaming tasks write data files to a file system, small files can become a problem. If the source data is written to partitions without any shuffle or clustering, each task writes to many partitions and buckets. Consequently, each partition is written by multiple tasks, and each task generates at least one file. In a data lake architecture such as Apache Iceberg, every Flink checkpoint rolls over the file writers to commit a transaction. With checkpoints committed every few minutes, a large number of small files will inevitably be generated.
Currently, Apache Iceberg provides three methods to solve the problem of small files:
1) Set write.distribution-mode=hash in the Iceberg table properties. For example:
CREATE TABLE sample (
    id BIGINT,
    data STRING
) PARTITIONED BY (data)
WITH ('write.distribution-mode' = 'hash');
This ensures that each record is shuffled according to the partition key before being written, so each partition is written by at most one task, which significantly reduces the number of small files. However, this introduces another problem: data skew. Many business tables are partitioned by a time field, and new data arrives ordered by time, so it tends to land in the same partition, creating a write hotspot. We recommend setting hash buckets under the partitions so that the data of each partition is evenly distributed across the buckets. Then at most one task writes each bucket, which solves both problems: small files and the write hotspot.
Currently, creating buckets through SQL is not supported in Flink 1.11. However, users can use the Java API to add buckets to the table above (partitioned by the data field), as follows:
table.updateSpec()
    .addField(Expressions.bucket("id", 32))
    .commit();
2) Run a Major Compaction on the Apache Iceberg table regularly to merge its small files. This currently runs as a Flink batch job, which users can submit through the Java API. For more information, see the reference documentation.
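A minimal sketch of such a compaction job through the Java API might look like the following (assuming the `Actions` entry point in the iceberg-flink module; the warehouse path and target file size are illustrative values):

```java
// Sketch: merge the small data files of an Iceberg table with a Flink batch job.
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs:///warehouse/path/sample");
tableLoader.open();
Table table = tableLoader.loadTable();

RewriteDataFilesActionResult result = Actions.forTable(table)
    .rewriteDataFiles()
    .targetSizeInBytes(128L * 1024 * 1024)  // illustrative 128 MB target file size
    .execute();

System.out.println("Added data files: " + result.addedDataFiles().size());
System.out.println("Deleted data files: " + result.deletedDataFiles().size());
```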
3) Use plug-in operators to merge small files after each Flink Sink streaming task automatically. This feature is not available in the community version yet but will be released in version 0.12.0 since it involves the compaction of format v2.
Since Apache Flink's integration with Apache Iceberg began, the community has released two versions, both of which implement stream-batch unified read and write capability between Flink and Iceberg.
So far, there have been many successful production deployments of Flink and Iceberg, both in China and abroad.
The core functions shown in the figure above are planned for Apache Iceberg 0.12.0. Essentially, we will provide better support for CDC and Upsert scenarios with Flink and Iceberg, and we will do more work to optimize stability, performance, and usability.
Finally, let us look at the current status of Apache Iceberg in the computing ecosystem.
The Apache Iceberg 0.11.0 release shows more and more advantages in ecosystem integration as a unified and universal data lake table format. Since Apache Iceberg's table format does not lean towards any specific computing engine, engine integration is in full swing: almost all mainstream computing engines in the big data ecosystem are connected to Iceberg to varying degrees:
1) Contributors from Netflix, Tencent, and Apple are significantly promoting the integration of Spark and Iceberg. These companies have many Spark PMC members and committers in the Apache Spark community, who are very influential in both the Spark and Iceberg communities. We believe the integration of Apache Iceberg and Spark will become comparable to the commercial Databricks Delta in the future.
2) The Alibaba Flink team, Netflix, and the large Flink user base (both inside and outside China) are constantly promoting the integration of Flink and Iceberg;
3) The AWS Presto team and Trino team are continuously promoting the integration of Presto and Iceberg. The AWS Presto team has selected Iceberg as their table format for data lake construction. At the same time, the AWS team has done a lot of work in opening up the ecosystem of Iceberg with S3 and Glue. Apache Iceberg has become an essential part of the AWS data lake ecosystem; and
4) Cloudera has decided to build the commercial version of its data lake on Apache Iceberg. Those who have used Hadoop may be familiar with this company; it offers one of the best commercial Hadoop distributions. In the future, it will launch public cloud services based on Apache Iceberg, giving users a complete data lake integration experience across Flink, Spark, Hive, and Impala. Apache Impala deserves special mention here: in interactive analysis scenarios, Cloudera relies heavily on its open-source Apache Impala (which, in fact, performs better than Presto in big data benchmark tests). Apache Iceberg's clean abstraction of the storage layer and its inclusiveness toward diverse computing engines are among the core reasons Cloudera selected it.