About 7x Faster Performance: Integrating Apache Flink and Apache Hive - Alibaba Cloud Developer Community

Introduction: As Flink becomes increasingly popular for stream computing, letting it handle batch computing as well reduces users' development and maintenance costs and enriches the Flink ecosystem. SQL is a common tool in batch computing, so Flink also uses SQL as its main interface for batch computing. This article describes the design of Flink's batch processing and its integration with Hive. It covers three parts:

  1. Design architecture
  2. Project progress
  3. Performance test

Design Architecture

First, I would like to share the design architecture of Flink batch processing.

1. Background

Flink is improving its batch processing in order to reduce users' maintenance and upgrade costs and to round out the Flink ecosystem. SQL is a very important tool in batch computing, so we want SQL to be the main interface for batch computing and have focused on improving Flink SQL. Flink SQL currently has the following shortcomings:

  • It lacks a complete metadata management system.
  • It lacks support for DDL (data definition language, which is used to create database objects such as tables, views, indexes, synonyms, and clusters).
  • It is not convenient to connect to external systems, especially Hive. Hive is the earliest SQL engine in the big data field and has a wide user base. Newer SQL tools such as Spark SQL and Impala both provide Hive connectivity so that users can migrate their applications from Hive more easily. Connecting to Hive is therefore also very important for Flink SQL.

2. Objectives

We therefore need to achieve the following goals:

  • Define a unified Catalog interface. This is a prerequisite for connecting Flink SQL to external systems more conveniently. If you have used Flink's TableSource and TableSink to connect to tables in external systems, you will have found that, whether you write a program or configure a YAML file, the experience differs from traditional SQL usage. We do not want Hive users to have to define TableSources and TableSinks in order to migrate to Flink SQL and interact with Hive, so we provide a new set of Catalog interfaces that interact with Hive in a way closer to traditional SQL.
  • Provide both in-memory and persistent implementations. The in-memory implementation is Flink's original approach: the lifecycle of all of a user's metadata is bound to the session, and when the session ends, all metadata is gone. Interacting with Hive also requires a persistent Catalog.
  • Support Hive interoperability. With the Catalog, users can access Hive metadata, and a data connector lets users read and write actual Hive data through Flink, so that Flink and Hive can interoperate.
  • Support Flink as a Hive computing engine (a long-term goal), similar to Hive on Spark and Hive on Tez.
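
To make the difference between the in-memory and persistent goals concrete, here is a minimal sketch in Python. It is a toy model, not Flink's Catalog API: the class names `InMemoryCatalog` and `FileBackedCatalog`, the `run_session` helper, and the JSON file that stands in for an external metastore are all assumptions made for illustration.

```python
import json
import os
import tempfile


class InMemoryCatalog:
    """Toy catalog: metadata lives only as long as this object (the 'session')."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, schema):
        self._tables[name] = schema

    def list_tables(self):
        return sorted(self._tables)


class FileBackedCatalog(InMemoryCatalog):
    """Toy persistent catalog: a JSON file stands in for an external metastore."""

    def __init__(self, path):
        super().__init__()
        self._path = path
        # Reload whatever an earlier session wrote, if anything.
        if os.path.exists(path):
            with open(path) as f:
                self._tables = json.load(f)

    def create_table(self, name, schema):
        super().create_table(name, schema)
        with open(self._path, "w") as f:
            json.dump(self._tables, f)


def run_session(catalog):
    """One 'session': register a table, then let the catalog object go away."""
    catalog.create_table("orders", {"id": "BIGINT", "amount": "DOUBLE"})


# Hypothetical location for the toy metastore file.
store = os.path.join(tempfile.mkdtemp(), "metastore.json")

run_session(InMemoryCatalog())
run_session(FileBackedCatalog(store))

# A new session starts with fresh catalog objects.
in_memory_after = InMemoryCatalog().list_tables()
persistent_after = FileBackedCatalog(store).list_tables()
print(in_memory_after)   # []
print(persistent_after)  # ['orders']
```

In this sketch the table registered in the in-memory catalog disappears with the session, while the file-backed catalog sees it again in the next session, which is the property a Hive-facing Catalog needs.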

3. Newly designed Catalog API (FLIP-30)

When you submit a request through the SQL Client or the Table API, Flink creates a TableEnvironment. The TableEnvironment creates a CatalogManager instance, which loads and configures the Catalog instances. A Catalog supports multiple metadata types, such as tables, databases, functions, views, and partitions. In version 1.9.0, Catalog has two implementations:

  • One is the in-memory GenericInMemoryCatalog.
  • The other is HiveCatalog, which operates on Hive metadata by interacting with the Hive Metastore through HiveShim. HiveShim handles the incompatibilities between major versions of the Hive Metastore.

In this way, you can create multiple Catalogs, or access multiple Hive Metastores, and query data across Catalogs.
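
The idea of one manager holding several catalogs and resolving table names across them can be sketched as follows. This is a toy model in Python rather than Flink's CatalogManager: the class, its methods, the nested-dict catalogs, and the catalog names `memory` and `hive_a` are hypothetical. It only assumes the `catalog.database.table` naming scheme for fully qualified names.

```python
class CatalogManager:
    """Toy registry of named catalogs with a current catalog and database."""

    def __init__(self):
        self._catalogs = {}
        self.current_catalog = None
        self.current_database = "default"

    def register_catalog(self, name, catalog):
        # catalog: dict of database name -> dict of table name -> rows
        self._catalogs[name] = catalog
        if self.current_catalog is None:
            self.current_catalog = name

    def resolve(self, path):
        """Expand 'table', 'db.table' or 'catalog.db.table' to a full triple."""
        parts = path.split(".")
        if len(parts) == 1:
            return (self.current_catalog, self.current_database, parts[0])
        if len(parts) == 2:
            return (self.current_catalog, parts[0], parts[1])
        if len(parts) == 3:
            return tuple(parts)
        raise ValueError(f"invalid table path: {path}")

    def get_table(self, path):
        catalog, database, table = self.resolve(path)
        return self._catalogs[catalog][database][table]


manager = CatalogManager()
manager.register_catalog("memory", {"default": {"clicks": [("u1", 3), ("u2", 5)]}})
manager.register_catalog("hive_a", {"sales": {"users": [("u1", "Ann"), ("u2", "Bob")]}})

# "Join" a table from the current catalog with one from another catalog.
names = dict(manager.get_table("hive_a.sales.users"))
joined = [(names[user], count) for user, count in manager.get_table("clicks")]
print(joined)  # [('Ann', 3), ('Bob', 5)]
```

The short name `clicks` resolves against the current catalog and database, while the fully qualified name reaches into a second catalog, so a single query can combine both.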

4. Read and write Hive data

With the metadata available, we can implement the Flink SQL data connector to read and write actual Hive data. Data written by Flink SQL must be compatible with Hive's data formats: Hive must be able to read data written by Flink, and vice versa. To achieve this, we reuse many Hive APIs, such as InputFormat/OutputFormat and SerDe, both to reduce code redundancy and to stay as compatible as possible.

In the data connector, the classes that implement reading Hive tables are HiveTableSource and HiveTableInputFormat. The classes that implement writing Hive tables are HiveTableSink and HiveTableOutputFormat.

Project Progress

Next, I would like to share the current status of Flink 1.9.0, the new features in Flink 1.10.0, and future work.

1. Current status of Flink 1.9.0

The Hive integration in Flink SQL was released as a trial feature in version 1.9.0, and it is not yet complete:

  • The supported data types are incomplete. (Parameterized data types, such as DECIMAL and CHAR, are basically unsupported in 1.9.0.)
  • Support for partitioned tables is incomplete. Partitioned tables can only be read, not written.
  • INSERT OVERWRITE on tables is not supported.

2. New features in Flink 1.10.0

In version 1.10.0, Flink SQL's integration with Hive has been developed further:

  • Partitioned tables can be read and written, with both static and dynamic partitions.
  • INSERT OVERWRITE is supported at both the table level and the partition level.
  • More data types are supported (all types except UNION).
  • More DDL is supported (CREATE TABLE/DATABASE).
  • Hive built-in functions can be called from Flink (Hive has about 200 built-in functions).
  • More Hive versions are supported (Hive 1.0.0 to 3.1.1).
  • Many performance optimizations have been made, such as projection and predicate pushdown and vectorized reading of ORC data.
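
As an illustration of why projection and predicate pushdown help, here is a small Python sketch. It is not Flink's connector code: `ToySource`, its `scan` method, and the sample rows are hypothetical, and the "cost" is simply the number of field values the source hands to the engine.

```python
ROWS = [
    {"id": 1, "country": "US", "amount": 10.0, "note": "a"},
    {"id": 2, "country": "DE", "amount": 20.0, "note": "b"},
    {"id": 3, "country": "US", "amount": 30.0, "note": "c"},
]


class ToySource:
    """Toy table source that counts how many field values it emits."""

    def __init__(self, rows):
        self.rows = rows
        self.values_emitted = 0

    def scan(self, columns=None, predicate=None):
        for row in self.rows:
            # Predicate pushdown: drop non-matching rows inside the source.
            if predicate is not None and not predicate(row):
                continue
            # Projection pushdown: emit only the requested columns.
            keep = columns if columns is not None else list(row)
            self.values_emitted += len(keep)
            yield {c: row[c] for c in keep}


def is_us(row):
    return row["country"] == "US"


# Without pushdown: the engine receives every column of every row, then filters.
naive = ToySource(ROWS)
naive_result = [r["amount"] for r in naive.scan() if is_us(r)]

# With pushdown: the filter and the column list are applied in the source.
# "country" is still read inside the source to evaluate the predicate.
pushed = ToySource(ROWS)
pushed_result = [r["amount"] for r in pushed.scan(columns=["amount"], predicate=is_us)]

print(naive_result, naive.values_emitted)    # [10.0, 30.0] 12
print(pushed_result, pushed.values_emitted)  # [10.0, 30.0] 2
```

Both scans return the same answer, but the pushed-down scan passes far fewer values to the engine. With columnar formats such as ORC, a real reader can additionally skip unneeded columns on disk, which this toy does not model.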

3. Module interface

To enable users to call Hive built-in functions in Flink SQL, we introduced a Module interface in Flink 1.10. A Module makes it easy to plug the built-in functions of an external system into Flink.

  • As with Catalog, you can configure modules through the Table API or YAML files.
  • You can load multiple modules at the same time. When Flink resolves a function, it looks the function up in the modules in the order in which they were loaded. That is, if two modules contain a function with the same name, the module loaded first provides the definition.
  • Currently, Module has two implementations: CoreModule provides Flink's native built-in functions, and HiveModule provides Hive's built-in functions.
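
The load-order rule described above can be sketched in a few lines of Python. This is a toy model, not Flink's Module interface: `ToyModule`, `ToyModuleManager`, and the tiny function sets registered under the names `core` and `hive` are hypothetical stand-ins.

```python
class ToyModule:
    """Toy module: a named bag of built-in functions."""

    def __init__(self, name, functions):
        self.name = name
        self.functions = functions


class ToyModuleManager:
    def __init__(self):
        self._modules = []  # kept in load order

    def load_module(self, module):
        self._modules.append(module)

    def resolve(self, function_name):
        """Return (module name, function) from the first module defining it."""
        for module in self._modules:
            if function_name in module.functions:
                return module.name, module.functions[function_name]
        raise KeyError(function_name)


# Illustrative function sets only; the real modules provide far more functions.
core = ToyModule("core", {"upper": str.upper, "concat": lambda a, b: a + b})
hive = ToyModule("hive", {"concat": lambda a, b: f"{a}{b}", "reverse": lambda s: s[::-1]})

modules = ToyModuleManager()
modules.load_module(core)  # loaded first, so it wins name clashes
modules.load_module(hive)

concat_owner, _ = modules.resolve("concat")
reverse_owner, reverse = modules.resolve("reverse")
print(concat_owner)                   # core
print(reverse_owner, reverse("abc"))  # hive cba
```

Because `concat` exists in both toy modules, the one loaded first supplies it, while `reverse` falls through to the second module. Loading the modules in the opposite order would flip which definition of `concat` is used.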

4. Future work

Future work focuses first on completing functionality, including:

  • View support (may be completed in 1.11).
  • Continued improvement of SQL CLI usability. Query results are currently displayed page by page; scrolling display will be supported later, as will non-interactive usage similar to Hive's -e and -f options.
  • Support for all commonly used Hive DDL, such as CREATE TABLE AS.
  • Compatibility with Hive syntax, so that existing Hive projects can be migrated to Flink smoothly.
  • A remote connection mode for the SQL CLI, similar to the remote connection mode of HiveServer2.
  • Support for streaming writes of Hive data.

Performance Test

The following describes the test environment and the results of comparing Flink with Hive on MapReduce in batch jobs.

1. Test environment

Our test environment is a physical cluster of 21 nodes: one master node and 20 slave nodes. Each node has 32 cores, 64 threads, and 256 GB of memory. The network ports are aggregated, and each machine has 12 HDDs.

2. Test tools

The test tool is Hortonworks' hive-testbench, an open-source tool on GitHub. We used it to generate a 10 TB TPC-DS test dataset and then ran the TPC-DS tests against this dataset with Flink SQL and with Hive.

On the one hand, we compared the performance of Flink and Hive; on the other, we verified that Flink SQL can access Hive data correctly. The test used Hive 3.1.1 and Flink built from the master branch.

3. Test results

The test results show that Flink SQL performs about 7 times better than Hive on MapReduce. This benefits from a series of optimizations in Flink SQL, such as scheduling optimization and execution plan optimization. In general, if you use Hive on MapReduce, migrating to Flink SQL should improve performance substantially.

For the latest performance comparison details and analysis, see: Performance Comparison between Flink 1.10 and Hive 3.0

About the authors:

Li Rui (days away), Alibaba technical expert and member of the Apache Hive PMC. He previously worked at Intel, IBM, and other companies, mainly contributing to open-source projects such as Hive, HDFS, and Spark.

Wang Gang (Qiao burning), Alibaba senior development engineer and Flink contributor. After graduating from the computer science department of Zhejiang University, he worked on the data platform at Mogujie, developing data exchange systems. He currently focuses on building the Flink and Hive ecosystems at Alibaba.
