Many data scientists, analysts, and general business intelligence (BI) users rely on interactive Structured Query Language (SQL) queries to explore data. Flink SQL, as a core Flink module, functions as a distributed SQL query engine. Flink SQL allows users to query disparate data sources and write complex SQL statements for data analysis in a single program. Flink SQL achieves high query efficiency with the integration of cost-based optimizers (CBOs), column-oriented storage, and code generation. The fault tolerance mechanism and high scalability of the Flink runtime architecture also allow Flink SQL to easily process large amounts of data.
In addition to excellent performance, enhanced usability is one of the highlights of Flink SQL in Flink 1.11. To provide a better user experience, the Flink community made the following improvements:

- The LIKE clause in CREATE TABLE statements
- Dynamic table options
- Refactored TableEnvironment interfaces
- Hive dialect support for DDL statements
- Simplified connector properties
- JDBC catalog support
- Enhanced Python UDF support in PyFlink

These features are introduced in the following sections.
Table definitions must often be updated in production scenarios. For example, many users want to add watermark definitions to tables that already exist in external systems, such as the Hive metastore. Similarly, if data from multiple tables is merged into one destination table in an extract, transform, load (ETL) process, the schema of the destination table is the union of the schemas of all upstream tables. In such cases, a new approach is required to help you merge these definitions.
Starting with Flink 1.11, you can use the LIKE syntax to easily append new definitions to existing table definitions.
The following code demonstrates an example of adding a watermark definition to base_table:
CREATE [TEMPORARY] TABLE base_table (
  id BIGINT,
  name STRING,
  tstmp TIMESTAMP,
  PRIMARY KEY(id)
) WITH (
  'connector' = 'kafka'
);

CREATE [TEMPORARY] TABLE derived_table (
  WATERMARK FOR tstmp AS tstmp - INTERVAL '5' SECOND
)
LIKE base_table;
The derived_table definition is equivalent to the following definition:
CREATE [TEMPORARY] TABLE derived_table (
  id BIGINT,
  name STRING,
  tstmp TIMESTAMP,
  PRIMARY KEY(id),
  WATERMARK FOR tstmp AS tstmp - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka'
);
With the LIKE clause, users no longer need to repeat the full schema; they simply append the new attributes to the existing definition.
Some may ask: are we only allowed to append new attributes, or is there also a way to overwrite or exclude existing ones? There is. The LIKE clause provides a flexible strategy for handling table attributes.
The LIKE syntax supports the use of different keywords to categorize table attributes:

- ALL: all table attributes
- CONSTRAINTS: constraints such as primary and unique keys
- GENERATED: computed columns
- OPTIONS: the table options defined in the WITH clause
- PARTITIONS: partition definitions
- WATERMARKS: watermark definitions
Additionally, the following strategies can be applied to these attributes:

- INCLUDING: inherit the attributes of the base table, and fail on duplicate entries
- EXCLUDING: do not inherit the attributes of the base table
- OVERWRITING: inherit the attributes of the base table, and let attributes declared in the new table override any duplicates
The following table illustrates the supported combinations:

Attribute       Supported strategies
ALL             INCLUDING / EXCLUDING
CONSTRAINTS     INCLUDING / EXCLUDING
GENERATED       INCLUDING / EXCLUDING / OVERWRITING
OPTIONS         INCLUDING / EXCLUDING / OVERWRITING
PARTITIONS      INCLUDING / EXCLUDING
WATERMARKS      INCLUDING / EXCLUDING / OVERWRITING
Consider the following example statement:
CREATE [TEMPORARY] TABLE base_table (
  id BIGINT,
  name STRING,
  tstmp TIMESTAMP,
  PRIMARY KEY(id)
) WITH (
  'connector' = 'kafka',
  'scan.startup.specific-offsets' = 'partition:0,offset:42;partition:1,offset:300',
  'format' = 'json'
);

CREATE [TEMPORARY] TABLE derived_table (
  WATERMARK FOR tstmp AS tstmp - INTERVAL '5' SECOND
) WITH (
  'connector.starting-offset' = '0'
)
LIKE base_table (OVERWRITING OPTIONS, EXCLUDING CONSTRAINTS);
The resulting table in the preceding example will be equivalent to a table created with the following statement:
CREATE [TEMPORARY] TABLE derived_table (
  id BIGINT,
  name STRING,
  tstmp TIMESTAMP,
  WATERMARK FOR tstmp AS tstmp - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'connector.starting-offset' = '0',
  'scan.startup.specific-offsets' = 'partition:0,offset:42;partition:1,offset:300',
  'format' = 'json'
);
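To make the merge behavior concrete, here is a small Python sketch of the strategies used above. It is a toy model with hypothetical data structures, not Flink's actual implementation: physical columns are always inherited, EXCLUDING CONSTRAINTS drops the primary key, and OVERWRITING OPTIONS lets options declared on the derived table win on conflict.

```python
# Toy model of CREATE TABLE ... LIKE merge strategies.
# Hypothetical data structures -- not the actual Flink implementation.

def merge_like(base, derived, *, overwriting_options=False,
               excluding_constraints=False):
    """Merge a derived table definition with its base table."""
    result = {
        "columns": dict(base["columns"]),  # physical columns are always inherited
        "constraints": [] if excluding_constraints else list(base["constraints"]),
        "options": dict(base["options"]),
        "watermark": derived.get("watermark", base.get("watermark")),
    }
    for key, value in derived.get("options", {}).items():
        if key in result["options"] and not overwriting_options:
            raise ValueError(f"duplicate option {key!r}; use OVERWRITING OPTIONS")
        result["options"][key] = value  # new key, or overwriting an inherited one
    return result

base = {
    "columns": {"id": "BIGINT", "name": "STRING", "tstmp": "TIMESTAMP"},
    "constraints": ["PRIMARY KEY(id)"],
    "options": {"connector": "kafka", "format": "json"},
}
derived = {
    "watermark": "tstmp - INTERVAL '5' SECOND",
    "options": {"connector.starting-offset": "0"},
}

merged = merge_like(base, derived, overwriting_options=True,
                    excluding_constraints=True)
print(merged["constraints"])     # -> []
print(sorted(merged["options"])) # base options plus the derived one
```

Without overwriting_options=True, a duplicate key would raise an error, which mirrors the fact that duplicate options are rejected unless the OVERWRITING strategy is chosen.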
For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/create.html#create-table
In production, it is a common requirement to adjust table options temporarily, for example for an ad-hoc query or when displaying data on the terminal. Take the following Kafka table as an example:
create table kafka_table (
  id bigint,
  age int,
  name STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'employees',
  'scan.startup.mode' = 'timestamp',
  'scan.startup.timestamp-millis' = '123456',
  'format' = 'csv',
  'csv.ignore-parse-errors' = 'false'
)
Now consider making the following adjustments:

- When querying the table, start reading from a different position, for example by changing 'scan.startup.mode' to 'latest-offset'.
- When displaying the data on the terminal, skip dirty records by setting 'csv.ignore-parse-errors' to 'true'.
In previous versions of Flink, you could only make such changes by modifying the table definition itself, for example with an ALTER TABLE statement. However, starting from Flink 1.11, you can specify, override, or dynamically add table options directly in a query by using dynamic table options.
The syntax is as follows:
table_name /*+ OPTIONS('k1'='v1', 'aa.bb.cc'='v2') */
The key/value pairs in the OPTIONS hint override the corresponding options of the original table, and the hint can be used in different kinds of SQL statements. The following code is an example:
CREATE TABLE kafka_table1 (id BIGINT, name STRING, age INT) WITH (...);
CREATE TABLE kafka_table2 (id BIGINT, name STRING, age INT) WITH (...);

-- override table options in query source
select id, name from kafka_table1 /*+ OPTIONS('scan.startup.mode'='earliest-offset') */;

-- override table options in join
select * from
  kafka_table1 /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ t1
  join kafka_table2 /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ t2
  on t1.id = t2.id;

-- override table options for INSERT target table
insert into kafka_table1 /*+ OPTIONS('sink.partitioner'='round-robin') */ select * from kafka_table2;
There is no context restriction on dynamic table options: you can append them to any table referenced in a statement, and the specified options are merged into that table's definition for the scope of the query.
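The overlay semantics can be pictured with a short Python sketch. This is a toy illustration with a simplified regex-based parser, not Flink's actual SQL parser: the hint's key/value pairs are extracted and layered on top of the table's declared options, with the hint winning on conflict.

```python
import re

# Toy illustration of how an OPTIONS hint overlays table options.
# Hypothetical helper -- not Flink's actual SQL parser.

def parse_options_hint(sql):
    """Extract key/value pairs from a /*+ OPTIONS('k'='v', ...) */ hint."""
    match = re.search(r"/\*\+\s*OPTIONS\((.*?)\)\s*\*/", sql)
    if not match:
        return {}
    pairs = re.findall(r"'([^']*)'\s*=\s*'([^']*)'", match.group(1))
    return dict(pairs)

table_options = {
    "connector": "kafka",
    "scan.startup.mode": "timestamp",
    "scan.startup.timestamp-millis": "123456",
}
query = "select id, name from kafka_table /*+ OPTIONS('scan.startup.mode'='earliest-offset') */"

# keys from the hint override the table's declared options
effective = {**table_options, **parse_options_hint(query)}
print(effective["scan.startup.mode"])  # -> earliest-offset
```

A statement without a hint leaves the declared options untouched, which is why enabling the feature only affects queries that explicitly use the hint.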
Dynamic table options are disabled by default because they may change the semantics of the query. You can enable this function by setting the config option:
// instantiate table environment
TableEnvironment tEnv = ...

// access flink configuration
Configuration configuration = tEnv.getConfig().getConfiguration();
// set low-level key-value options
configuration.setString("table.dynamic-table-options.enabled", "true");
For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/hints.html
As Flink SQL evolves, it keeps providing support for more types of statements, which exposed several problems with the older APIs:

- The execution semantics of sqlUpdate() were inconsistent: some statements, such as DDL statements, were executed immediately, while INSERT INTO statements were buffered until TableEnvironment.execute() was called.
- Users could not obtain the results returned by statement execution, such as the output of a SHOW TABLES statement.
- The buffered execution model made it difficult to execute multiple statements with clear semantics.
In Flink 1.11, the SQL interfaces provided by TableEnvironment were reorganized to offer consistent execution semantics, access to the values returned by executed statements, and support for multiline statements.
TableResult is returned by the new TableEnvironment#executeSql interface and can be iterated to retrieve the execution results. The structure of the returned data depends on the executed statement: for example, a SELECT statement returns the query results, while an INSERT statement submits the job to the cluster asynchronously.
The new TableEnvironment#createStatementSet interface allows you to add multiple INSERT statements and execute them together. In multi-sink scenarios, the Blink planner optimizes the execution plans in a targeted manner.
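The buffering behavior described above can be sketched with a toy Python model (hypothetical classes, not the Flink API): a statement set collects INSERT statements and submits them together as one job, which is what allows the planner to optimize a multi-sink plan as a whole.

```python
# Toy model of the StatementSet semantics: INSERT statements are buffered
# and submitted as a single job. Hypothetical class -- not the Flink API.

class StatementSet:
    def __init__(self):
        self._inserts = []

    def add_insert_sql(self, sql):
        """Buffer an INSERT statement; returns self to allow chaining."""
        self._inserts.append(sql)
        return self

    def execute(self):
        # in Flink, all buffered inserts are compiled into one job graph,
        # so shared sources and operators can be reused across sinks
        return f"submitted 1 job with {len(self._inserts)} sinks"

stmt_set = StatementSet()
stmt_set.add_insert_sql("INSERT INTO sink1 SELECT * FROM src")
stmt_set.add_insert_sql("INSERT INTO sink2 SELECT id FROM src")
print(stmt_set.execute())  # -> submitted 1 job with 2 sinks
```

Contrast this with executeSql, where each INSERT is submitted immediately as its own job; the statement set exists precisely for the multi-sink case.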
The following examples showcase the usage of the old and new APIs:

Old API:
tEnv.sqlUpdate("CREATE TABLE ...");
tEnv.sqlUpdate("INSERT INTO ... SELECT ...");
tEnv.execute("jobName");

New API:
TableResult result = tEnv.executeSql("CREATE TABLE ...");
TableResult result = tEnv.executeSql("INSERT INTO ... SELECT ...");
For more information, visit https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
Starting from version 1.11, Flink allows users to write SQL statements in Hive syntax when the Hive dialect is enabled. The Hive dialect supports DDL statements for databases, tables, views, and functions. You can combine the Hive dialect with HiveCatalog to write SQL statements in Hive syntax.
Before you use the Hive dialect, you must set the dialect:
EnvironmentSettings settings = EnvironmentSettings.newInstance()...build();
TableEnvironment tableEnv = TableEnvironment.create(settings);
// to use hive dialect
tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
// use the hive catalog
tableEnv.registerCatalog(hiveCatalog.getName(), hiveCatalog);
tableEnv.useCatalog(hiveCatalog.getName());
Then, you can execute DDL statements in Hive syntax, such as the table creation statement:
create external table tbl1 (
  d decimal(10,0),
  ts timestamp
) partitioned by (p string)
location '%s'
tblproperties('k1'='v1');

create table tbl2 (s struct<ts:timestamp,bin:binary>) stored as orc;

create table tbl3 (
  m map<timestamp,binary>
) partitioned by (p1 bigint, p2 tinyint)
row format serde 'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe';

create table tbl4 (
  x int,
  y smallint
) row format delimited fields terminated by '|' lines terminated by '\n';
The Flink community is planning to provide Hive syntax compatibility for queries (DQL) as well. More syntax compatibility will be featured in Flink 1.12.
For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/hive_dialect.html
In Flink 1.11, connector properties are simplified and new property keys are introduced, which are more concise and understandable than the old keys. The major changes are as follows:

- The 'connector.type' key is shortened to 'connector', and the connector version is folded into its value, for example 'kafka' for the universal Kafka connector.
- The 'format.type' key is shortened to 'format', and format-specific options are prefixed with the format name, such as 'csv.ignore-parse-errors'.
- Options that are forwarded directly to the underlying client are grouped under the 'properties.' prefix, such as 'properties.bootstrap.servers'.
- The 'scan.' and 'sink.' prefixes distinguish source options from sink options, such as 'scan.startup.mode' and 'sink.partitioner'.
The following example shows how to create a Kafka table:
CREATE TABLE kafkaTable (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_behavior',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'csv',
  'scan.startup.mode' = 'earliest-offset'
)
For more information, visit https://cwiki.apache.org/confluence/display/FLINK/FLIP-122%3A+New+Connector+Property+Keys+for+New+Factory
In earlier versions of Flink, users could create mirror tables for relational databases only through explicit table creation, which required them to track schema changes of tables and databases in Flink SQL. Flink 1.11 introduces the JdbcCatalog interface that enables users to connect Flink to relational databases, such as Postgres, MySQL, MariaDB, and Amazon Aurora.
Currently, PostgresCatalog is the only implementation of Java Database Connectivity (JDBC) Catalog, which is configured as follows:
CREATE CATALOG mypg WITH (
  'type' = 'jdbc',
  'default-database' = '...',
  'username' = '...',
  'password' = '...',
  'base-url' = '...'
);

USE CATALOG mypg;
You can also implement the JdbcCatalog interface to provide JDBC Catalog implementations for other relational databases.
Flink 1.11 features enhanced support for Python UDFs in PyFlink, including defining Python UDFs with SQL DDL and SQL-CLI, vectorized Python UDFs, and user-defined metrics.
Support for Python UDFs was first introduced in Flink 1.10.0, but was limited to the Python Table API. Flink 1.11 adds support for Python UDFs in SQL DDL syntax, so users can register and call them from the Java and Scala Table APIs or from SQL-CLI.
The following example calls a Python UDF in a Java application:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
tEnv.getConfig().getConfiguration().setString("python.files", "/home/my/test1.py");
tEnv.getConfig().getConfiguration().setString("python.client.executable", "python3");
tEnv.sqlUpdate("create temporary system function func1 as 'test1.func1' language python");
Table table = tEnv.fromDataSet(env.fromElements("1", "2", "3")).as("str").select("func1(str)");
tEnv.toDataSet(table, String.class).collect();
Vectorized Python UDFs perform much better than ordinary UDFs, which process only one row of standard Python scalar values at a time. Vectorized UDFs also allow you to take full advantage of popular Python libraries such as Pandas and NumPy. To mark a UDF as a Pandas UDF, you simply need to add the extra parameter udf_type="pandas" to the udf decorator.
The following example shows how to define a vectorized Python scalar function and use it in a query:
@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT(), udf_type="pandas")
def add(i, j):
    return i + j

table_env = BatchTableEnvironment.create(env)

# register the vectorized Python scalar function
table_env.register_function("add", add)

# use the vectorized Python scalar function in Python Table API
my_table.select("add(bigint, bigint)")

# use the vectorized Python scalar function in SQL API
table_env.sql_query("SELECT add(bigint, bigint) FROM MyTable")
For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/python/vectorized_python_udfs.html
Flink 1.11 also provides a comprehensive metric system for Python UDFs that allows users to access and define user scopes and variables.
For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/python/metrics.html
The Flink community will focus on improving user experience in the versions to come, such as optimizing the usability of schemas, simplifying the Descriptor API, and enhancing support for Stream DDL statements. Stay tuned for more updates and features!