Flink Ecosystem: A Case to Quickly Get Started with PyFlink

Introduction

Flink has supported Python (PyFlink) since version 1.9.0. The newly released Flink 1.10 adds support for Python UDFs in PyFlink, so you can now register and use custom Python functions in the Table API and SQL. What is the architecture of PyFlink, and which scenarios is it suited for? This article analyzes these questions and walks through a CDN log analysis case in detail.

Author: Sun Jincheng (Golden Bamboo)

The need for PyFlink
Flink on Python and Python on Flink

What is PyFlink? The answer may seem obvious: Flink + Python, in other words Flink on Python. So what exactly does Flink on Python mean? The easy answer is that it lets Python users enjoy all of Flink's functionality. But that is not the whole story: PyFlink also has another important meaning, Python on Flink. We can run the rich computing capabilities of the Python ecosystem on the Flink framework, which will greatly promote the development of the Python ecosystem. If you dig a little deeper, you will find that this combination is not accidental.

Python Ecosystem and Big Data Ecosystem


The Python ecosystem is inextricably linked with the big data ecosystem. Let's first look at what practical problems people solve with Python. A user survey shows that most Python users work on data analysis and machine learning, and these problem scenarios also have mature solutions in the big data field. Combining the Python ecosystem with the big data ecosystem does more than expand the audience of big data products; for the Python ecosystem in particular, it brings the crucial upgrade from single-machine to distributed computing, which the analysis of massive data in the big data era strongly demands.

Why Flink and Why Python


So the combination of the Python ecosystem and big data is a requirement of the times, but why did Flink choose Python as the entry point for multi-language support rather than Go or R? And as a user, why choose PyFlink instead of PySpark or PyHive?

First, let's talk about the reasons for choosing Flink:
•First, and most important, is the architectural advantage: Flink is a unified stream-batch computing engine with a pure streaming architecture;
•Second, objective statistics from the ASF show that Flink was the most active open source project of 2019, which speaks to Flink's vitality;
•Third, Flink is not just an open source project; it has been battle-tested in the production environments of many big data companies and is trustworthy.
Now, why did Flink choose Python rather than another language for multi-language support? Looking at the statistics, the popularity of Python is second only to Java and C, and Python has been developing rapidly since 2018 with no sign of slowing down. Java and Scala are Flink's default languages, so choosing Python for Flink's multi-language support seems reasonable. You can find more details on these authoritative statistics at the link provided.

The emergence of PyFlink is thus inevitable, but pondering the meaning of its existence is not enough, because our ultimate goal is to benefit Flink and Python users and to solve real problems. Therefore, we need to go deeper and explore how PyFlink can be implemented.

PyFlink Architecture


Having thought this through, we need to identify the core goals PyFlink should achieve and the core problems that must be solved to reach them. So what are the core goals of PyFlink?
Core goals of PyFlink
We touched on them in the analysis above; here we state the core goals of PyFlink explicitly:
1.Export Flink capabilities to Python users, so that Python users can use all of Flink's capabilities.
2.Run the existing analytical and computing functions of the Python ecosystem on Flink, thereby enhancing the Python ecosystem's ability to solve big data problems.
With these two core goals in mind, what are the core problems that need to be solved to achieve them?

Pythonization of Flink functions

To make PyFlink a reality, do we need to develop a separate Python engine on Flink, parallel to the existing Java one? The answer is no; this was tried before Flink 1.8. Our guiding design principle is to achieve the stated goals at minimal cost, so the best approach is to provide only a Python API layer and reuse the existing computing engine.
So what kind of Python API do we want to provide for Flink? The ones we already know: the high-level Table API/SQL and the stateful DataStream API. We are getting closer and closer to Flink's internals. The next question is: how do we provide a Python Table API and DataStream API, and what is the core problem to solve?

The core problem of Pythonization of Flink functions
The core problem is obviously the handshake between the Python VM and the Java VM: they need to establish communication. This is the central problem of Flink's multi-language support. Facing it, we have to make a technology selection. Here we go…

Technology selection for Python VM and Java VM communication
For communication between the Java VM and the Python VM today, the two obvious candidates are Apache Beam, a well-known project supporting multiple languages and multiple engines, and Py4J, which specializes in communication between the Java VM and the Python VM. Comparing them from different perspectives, Py4J is like a pangolin boring a small tunnel, while Beam is like a powerful elephant: to get past a wall, we can drill a hole or push down the whole wall. For the pure VM-communication scenario, Beam is a bit heavyweight; because Beam puts a lot of effort into generality, it loses some flexibility in extreme cases.

From another perspective, Flink itself has requirements such as interactive programming (FLIP-36), and when supporting multiple languages it must also keep interface semantics consistent across languages. These requirements are hard to satisfy under Beam's existing architecture. For these reasons, we chose Py4J as the bridge between the Java VM and the Python VM.

Technical architecture for Pythonizing Flink functions
Once the communication problem between the Python VM and the Java VM is solved, we are essentially pursuing our first goal: exporting the existing Flink functionality to Python users. This is what we did in Flink 1.9. Let's look at the architecture of the Flink 1.9 PyFlink API:

We use Py4J to solve the communication problem: a Gateway is started in the Python VM, and the Java VM starts a GatewayServer to accept Python requests. The Python API provides the same objects as the Java API, such as TableEnvironment and Table. When you write against the Python API, you are essentially calling the Java API. Job deployment was also solved in Flink 1.9, and jobs can be submitted in various ways such as the Python command line, the Python shell, and the CLI.
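From the user's side this mapping is invisible: creating a table environment in Python simply drives the corresponding Java objects through the gateway. A minimal sketch against the Flink 1.10 Python Table API:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings

# Creating these objects starts the Py4J gateway under the hood; every call
# below is forwarded to the corresponding Java object in the JVM.
env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(
    env,
    environment_settings=EnvironmentSettings.new_instance()
        .in_streaming_mode().use_blink_planner().build())

# The returned Table wraps a Java Table object behind the gateway.
tab = t_env.from_elements([(1, 'Hello'), (2, 'PyFlink')], ['id', 'word'])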

So what are the advantages of such an architecture? First, simplicity: it guarantees semantic consistency between the Python API and the Java API. Second, Python jobs can achieve the same extreme performance as Java. How good is Java's performance? During last year's Double 11, the Flink Java API processed 2.551 billion records per second at peak.

Distributing the Python Ecosystem

Having exported the existing Flink functionality to Python users, let's now discuss how to bring Python ecosystem functionality into Flink and distribute it. There are generally two approaches:
1.Select representative Python libraries and add their APIs to PyFlink. This is a long process, because the Python ecosystem has far too many libraries; and in any case, before introducing those APIs, the first problem to solve is how to execute Python code.
2.Based on the characteristics of the existing Flink Table API and of existing Python libraries, we can treat all existing Python library functions as user-defined functions (UDFs) and integrate them into Flink. This is the way we found to integrate the Python ecosystem into Flink, and it is our work in Flink 1.10. So what is the core problem of this integration? As said earlier, it is the execution of Python UDFs.
Let's make a technology selection for this core problem. Here we go…

Technology selection for distributed Python UDF execution
Executing Python UDFs is not just a matter of communication between VMs. It involves managing the Python execution environment, parsing business data exchanged between Java and Python, exposing Flink's state backend capability to Python, monitoring Python UDF execution, and more. It is a very complex problem. Facing such complexity, Apache Beam, with its support for multiple engines and multiple languages, comes into play: the omnipotent elephant can finally appear. Let's see how Beam solves the Python UDF execution problem. :)

To support multiple languages and multiple engines, Beam abstracts a highly general architecture called the Portability Framework. As shown in the figure below, Beam currently supports languages such as Java, Go, and Python; the Fn Runners and execution layer underneath handle the engines and the UDF execution environments. Its core is to abstract data structures with Protobuf, communicate over the gRPC protocol, and encapsulate the core gRPC services. At this point Beam is more like a firefly, lighting the way for PyFlink to solve the UDF execution problem. :) (Incidentally, the firefly has since become the mascot of Apache Beam.)
Let's take a look at what gRPC services Beam provides.

As shown in the figure, the Runner part is the Java operator execution, the SDK Worker part is the Python execution environment, and Beam abstracts services such as Control, Data, State, and Logging. These services have been running stably and efficiently on Beam's Flink runner for a long time, so for PyFlink UDF execution we can stand on the shoulders of giants. :) We found that Apache Beam has solutions at both the API level and the UDF execution level; PyFlink only needs the latter. It uses Py4J at the API level to solve VM communication, and Beam's Portability Framework for the UDF execution environment.

This also shows that in technology selection PyFlink strictly follows the principle of achieving its goals at minimal cost, and always chooses the architecture that best fits PyFlink's long-term development. (Incidentally, while working with Beam, I also submitted 20+ optimization patches to the Beam community.)

Technical architecture for distributed Python UDFs
In the UDF architecture, we must consider not only communication between the Java VM and the Python VM, but also the different requirements of the compilation phase and the runtime phase. In the diagram, the behavior of the Java VM is shown in green and the behavior of the Python VM in blue. First, look at the compilation phase, that is, the local design. The local design is a pure API mapping: we still use Py4J to solve communication, so, as shown in the figure, each time an API is executed in Python, it synchronously calls the corresponding Java API.

To support UDFs, a UDF registration API, register_function, must be added. But registration alone is not enough: users often depend on third-party libraries in their custom Python UDFs, so we also need methods for adding dependencies, a family of add methods such as add_python_file(). While a Python job is being written, the Java API is called at the same time; before the job is submitted, the Java side builds the JobGraph, and the job is then submitted to the cluster through various means such as the CLI.
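As an illustration, registration and dependency declaration look roughly like this in the Python Table API (a minimal sketch assuming a TableEnvironment t_env and a Python UDF my_func that are already defined; the file path is hypothetical):

# Register the Python UDF so it can be referenced from the Table API / SQL.
t_env.register_function("my_func", my_func)

# Declare a local Python file the UDF depends on; it is shipped to the cluster
# together with the job at submission time. The path below is only an example.
t_env.add_python_file("/path/to/my_dependency.py")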

PyFlink scenarios: how do you use it?
Knowing this much about PyFlink's architecture and the thinking behind it, let's look at a concrete scenario to get a better feel for PyFlink.
Scenarios for PyFlink
Before the concrete case, let's briefly go over the business scenarios PyFlink applies to. Since PyFlink is Python plus Flink, its applicable scenarios can be viewed from both the Java side and the Python side. First, any scenario where Java Flink applies is also a scenario for PyFlink, as listed below.

•The first is event-driven applications, such as fraudulent-order detection and monitoring;
•The second is data analytics, such as inventory analysis and the Double 11 real-time dashboard;
•The third is data pipelines, that is, ETL scenarios, such as log parsing;
•The fourth is machine learning, such as personalized recommendation.
All of these can be tried with PyFlink. In addition, there are scenarios unique to the Python ecosystem, such as scientific computing. With so many application scenarios, what APIs does PyFlink provide?

Installation of PyFlink
Before developing against the APIs, you must first install PyFlink. PyFlink currently supports installation via pip; the exact command is: pip install apache-flink.

APIs of PyFlink
At present, the PyFlink API is fully aligned with the Java Table API: it supports the various relational operations and has good support for windows. It is also worth mentioning that some of PyFlink's ease-of-use APIs are even more convenient than SQL, for example the APIs that operate on columns. In addition to these APIs, PyFlink provides several ways to define Python UDFs.
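For example, the column operations mentioned above look roughly like this (a sketch assuming a Table named tab with columns a, b, and c, using the string-expression syntax of the Flink 1.10 Python Table API):

# Add a derived column, rename an existing one, and drop one.
tab = tab.add_columns("concat(a, '_tagged') as a_tagged")
tab = tab.rename_columns("b as b_renamed")
tab = tab.drop_columns("c")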

UDF definition in PyFlink
First, you can extend ScalarFunction, which provides access to more auxiliary features, such as adding metrics. Beyond that, PyFlink UDFs support every way of defining a function that the Python language supports, such as lambda functions, named functions, and callable classes.
After defining the function, we mark it with the decorators provided by PyFlink and describe the data types of its inputs and output. In later versions this can be further simplified by using Python type hints for type inference. To make this concrete, let's look at an example of a specific UDF definition:

Python UDF Definition Example


We define an example of adding two numbers: first import the necessary classes, then write the several kinds of definitions just mentioned. It is simple and straightforward, so let's cut the small talk and move on to the actual case. :)
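A sketch of those definition styles, written against the Flink 1.10 Python Table API:

from pyflink.table import DataTypes
from pyflink.table.udf import udf, ScalarFunction


# 1) Extend ScalarFunction, which also gives access to helpers such as metrics.
class Add(ScalarFunction):
    def eval(self, i, j):
        return i + j

add_scalar = udf(Add(), [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())


# 2) Named function with the udf decorator.
@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def add(i, j):
    return i + j


# 3) Lambda function.
add_lambda = udf(lambda i, j: i + j,
                 [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())


# 4) Callable class.
class CallableAdd(object):
    def __call__(self, i, j):
        return i + j

add_callable = udf(CallableAdd(), [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())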

PyFlink case - Alibaba Cloud CDN real-time log analysis
Here we use Alibaba Cloud CDN real-time log analysis as an example to show how PyFlink solves a practical business problem. We are all familiar with CDNs, which are used to accelerate resource downloads. CDN log analysis generally follows a common architectural pattern: first collect the log data from the edge nodes, usually into a message queue, then integrate the message queue with a real-time computing cluster for real-time log analysis, and finally write the analysis results to a storage system. In today's case the architecture is instantiated as follows: the message queue is Kafka, the real-time computation is Flink, and the results are stored in MySQL.

Description of Alibaba Cloud CDN real-time log analysis requirements
Let's look at the business statistics requirements. For ease of presentation we simplify the real requirements: in this example we only group by region and compute resource access counts, download volume, and download speed. For the data format we keep only the core fields: uuid, the unique log identifier; client_ip, the access source; request_time, the resource download time; and response_size, the size of the downloaded resource. The requirement is to group by region, but the raw log contains no region field, so we need to define a Python UDF that looks up the region from client_ip. Let's first see how to define this UDF.

Alibaba Cloud CDN real-time log analysis UDF definition

Here we define the ip_to_province() UDF using the named-function style just mentioned. The input is an IP address and the output is a region name string; we declare both the input type and the result type as strings. Note that the lookup service used here is for demonstration only; in your own production environment you should replace it with a reliable region lookup service.

import re
import json
from pyflink.table import DataTypes
from pyflink.table.udf import udf
from urllib.parse import quote_plus
from urllib.request import urlopen


@udf(input_types=[DataTypes.STRING()], result_type=DataTypes.STRING())
def ip_to_province(ip):
    """
    Look up the province for an IP address. The lookup service returns JSON
    shaped roughly like:
    {
        'ip': '27.184.139.25',
        'pro': 'Hebei Province',
        'proCode': '130000',
        'city': 'Shijiazhuang City',
        'cityCode': '130100',
        'region': 'Lingshou County',
        'regionCode': '130126',
        'addr': 'Telecommunications in Lingshou County, Shijiazhuang City, Hebei Province',
        'regionNames': '',
        'err': ''
    }
    """
    try:
        # Demo lookup service, for illustration only; replace it with a reliable
        # region lookup service in your production environment.
        urlobj = urlopen(
            'http://whois.pconline.com.cn/ipJson.jsp?ip=%s' % quote_plus(ip))
        data = str(urlobj.read(), "gbk")
        pos = re.search(r"\{[^{}]+\}", data).span()
        geo_data = json.loads(data[pos[0]:pos[1]])
        return geo_data['pro'] if geo_data['pro'] else geo_data['err']
    except Exception:
        return "Unknown"

Definition of Alibaba Cloud CDN Real-time Log Analysis Connector
We have finished the requirement analysis and the UDF definition, so now we can develop the job. Following the usual job structure, we need to define a source connector to read Kafka data and a sink connector to store the computed results in MySQL, and then write the statistical logic.
Notably, PyFlink also supports SQL DDL statements, so we can describe the source connector with a simple DDL, setting connector.type to kafka. The sink connector can likewise be described with a DDL, setting connector.type to jdbc. Describing the connectors is very simple; let's see whether the core statistical logic is just as simple. :)

kafka_source_ddl = """
CREATE TABLE cdn_access_log (
    uuid VARCHAR,
    client_ip VARCHAR,
    request_time BIGINT,
    response_size BIGINT,
    uri VARCHAR
) WITH (
    'connector.type' = 'kafka',
    'connector.version' = 'universal',
    'connector.topic' = 'access_log',
    'connector.properties.zookeeper.connect' = 'localhost:2181',
    'connector.properties.bootstrap.servers' = 'localhost:9092',
    'format.type' = 'csv',
    'format.ignore-parse-errors' = 'true'
)
"""

mysql_sink_ddl = """
CREATE TABLE cdn_access_statistic (
    province VARCHAR,
    access_count BIGINT,
    total_download BIGINT,
    download_speed DOUBLE
) WITH (
    'connector.type' = 'jdbc',
    'connector.url' = 'jdbc:mysql://localhost:3306/Flink',
    'connector.table' = 'access_statistic',
    'connector.username' = 'root',
    'connector.password' = 'root',
    'connector.write.flush.interval' = '1s'
)
"""

Alibaba Cloud CDN real-time log analysis core statistical logic
First read the data from the source, then convert client_ip into a concrete region with the ip_to_province(ip) UDF we just defined. After that, group by region and compute the access count, download volume, and resource download speed. Finally, store the statistics in the result table. Note that this statistical logic uses not only the Python UDF but also Flink's built-in Java aggregate functions, sum and count.
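A sketch of this logic in the Python Table API, assuming a TableEnvironment t_env with the two tables registered from the DDL above and the ip_to_province UDF registered under that name:

(t_env.from_path("cdn_access_log")
    .select("uuid, "
            "ip_to_province(client_ip) as province, "   # the Python UDF defined above
            "response_size, request_time")
    .group_by("province")
    .select("province, "
            "count(uuid) as access_count, "              # built-in count aggregate
            "sum(response_size) as total_download, "     # built-in sum aggregate
            "sum(response_size) * 1.0 / sum(request_time) as download_speed")
    .insert_into("cdn_access_statistic"))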

Complete code for Alibaba Cloud CDN real-time log analysis
Let's look at the complete code as a whole. First we import the core dependencies, then create an environment and set the planner to use (Flink currently supports both the Flink and Blink planners; the Blink planner is recommended).
Next, register the tables with the Kafka and MySQL DDL we just wrote, then register the Python UDF. A special reminder: other files that the UDF depends on can also be specified through the API, so that they are submitted to the cluster together with the job. Then comes the core statistical logic, and finally a call to execute submits the job. With that, a real CDN log real-time analysis job has been developed. Let's look at the actual statistical results.
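Putting the pieces together, a skeleton of the job could look like this (module paths and the job name are illustrative; kafka_source_ddl, mysql_sink_ddl, and ip_to_province are the definitions shown earlier and are assumed to be in scope):

import os
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings

# 1. Create the environment and choose the Blink planner.
env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(
    env,
    environment_settings=EnvironmentSettings.new_instance()
        .in_streaming_mode().use_blink_planner().build())

# 2. Register the Kafka source table and the MySQL sink table with the DDL above.
t_env.sql_update(kafka_source_ddl)
t_env.sql_update(mysql_sink_ddl)

# 3. Register the Python UDF and declare the files it depends on, so they are
#    shipped to the cluster together with the job (the path is illustrative).
t_env.register_function("ip_to_province", ip_to_province)
t_env.add_python_file(
    os.path.join(os.path.dirname(os.path.abspath(__file__)), "cdn_udf.py"))

# 4. Core statistical logic (see the previous section).
(t_env.from_path("cdn_access_log")
    .select("uuid, ip_to_province(client_ip) as province, response_size, request_time")
    .group_by("province")
    .select("province, count(uuid) as access_count, "
            "sum(response_size) as total_download, "
            "sum(response_size) * 1.0 / sum(request_time) as download_speed")
    .insert_into("cdn_access_statistic"))

# 5. Submit the job.
t_env.execute("pyflink_parse_cdn_log")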

What is the future of PyFlink?
Overall, business development with PyFlink is very simple: you don't need to care about the underlying implementation details, you only need to describe the business logic with SQL or the Table API. So let's take an overall look at the future of PyFlink.
PyFlink's goal-driven roadmap
The development of PyFlink has always been driven by its core goals: to export existing Flink functionality to Python users and to integrate Python ecosystem functionality into Flink. The PyFlink roadmap, shown in the figure, is as follows. First solve the communication problem between the Python VM and the Java VM, then expose the existing Table API to Python users as the Python Table API; this was the work done in Flink 1.9. Next, to prepare for integrating Python functionality into Flink, we integrated Apache Beam to provide the Python UDF execution environment, added management of the Python libraries that UDFs depend on, and provided user-defined function interface definitions to support Python UDFs; this is what Flink 1.10 does.

To further extend the distributed capabilities of the Python ecosystem, PyFlink will provide support for Pandas Series and DataFrame, so users can use Pandas UDFs directly in PyFlink. To improve ease of use and let users work with PyFlink in more ways, support for Python UDFs in the SQL Client will be added later. For Python users' machine learning problems, a Python ML Pipeline API will be added. Monitoring the execution of Python UDFs is critical for real production workloads, so PyFlink will also add metrics management for Python UDFs. These features will reach users in Flink 1.11.

But these features are only the tip of the iceberg of PyFlink's plans. We will continue working on performance optimization, a graph computing API, a Pandas-native API on Flink, and more, continuously pushing Flink's existing functionality to the Python ecosystem and integrating the powerful capabilities of the Python ecosystem into Flink, thereby fulfilling the original intention of distributing the Python ecosystem.

PyFlink 1.11 Preview
Let's take a quick look at the key PyFlink content in Flink 1.11, which will meet you soon.

Functionality
Bringing the view back from the distant future to the core functionality of PyFlink in Flink 1.11: PyFlink will continue to focus on functionality, performance, and ease of use. In 1.11, support for Pandas UDFs will be added, so that practical functions from the Pandas ecosystem, such as the cumulative distribution function (CDF), can be used directly in PyFlink.

It will also add support for the ML Pipeline API, so that you can use PyFlink for some machine learning business scenarios; an example is using PyFlink to implement KMeans.
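As a taste of the planned Pandas UDF support, a vectorized add could be declared roughly as follows (a sketch; the udf_type parameter follows the 1.11 preview API and its final name may differ):

from pyflink.table import DataTypes
from pyflink.table.udf import udf

# A vectorized (Pandas) UDF: i and j arrive as pandas.Series, so a whole batch
# of rows is processed at once instead of row by row.
@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT(), udf_type="pandas")
def pandas_add(i, j):
    return i + j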

Performance
PyFlink will also invest more in performance. We use Codegen and CPython, and optimize serialization and deserialization, to improve Python UDF execution performance. A preliminary comparison of 1.10 and 1.11 shows roughly a 15x performance improvement in 1.11.

Ease of use
In terms of ease of use, PyFlink will add support for Python UDFs in SQL DDL and in the SQL Client, giving users more ways to use PyFlink.

PyFlink Big Picture (Mission and Vision)
We have covered a lot today: what PyFlink is, the meaning of its existence, the PyFlink API architecture and UDF architecture, the trade-offs behind those architectures and the advantages of the current design, the CDN case, PyFlink's roadmap, and a preview of PyFlink in Flink 1.11. What else is there?

Finally, let's look at the future of PyFlink. Driven by the mission of "making Flink functionality available in Python and making the Python ecosystem distributed", what will PyFlink's layout look like? A quick overview: PyFlink is part of Apache Flink and involves both the runtime level and the API level.
How will PyFlink develop at these two levels? At the runtime level, PyFlink will build general gRPC services (Control, Data, State, and so on) to solve the communication problem between the Java VM and the Python VM. On top of this framework, the Java operator for Python UDFs will be abstracted and Python execution containers will be built, supporting multiple kinds of Python execution such as Process, Docker, and External. It is particularly worth emphasizing that External provides unlimited extension capability through sockets, which is crucial for subsequent integration with the Python ecosystem.
At the API level, we will be mission-driven and Pythonize the APIs on Flink, based of course on the Py4J VM communication framework. PyFlink will gradually add support for more APIs: the Python Table API, the UDX interface APIs, ML Pipeline, DataStream, CEP, Gelly, State, and other Flink Java APIs, as well as the Pandas APIs beloved by Python ecosystem users. On top of these APIs, PyFlink will keep integrating with the ecosystem, for example with notebooks such as Zeppelin and Jupyter for developer convenience, and with Alibaba's open source Alink; PyAlink already makes full use of what PyFlink provides. PyFlink will also be integrated with existing AI platforms, such as the well-known TensorFlow.

You can see that this mission-driven force will carry PyFlink forward, and of course its continued life requires more contributors to join. Let me emphasize PyFlink's mission again: "make Flink's capabilities available to Python users, and make the Python ecosystem distributed". The core contributors of PyFlink are actively working in the community with this mission.

PyFlink core contributors and issue support
At the end of the sharing, I would like to introduce the current core contributors to PyFlink.
The first is Fu Dian. Fu Dian is currently a committer of Flink and of two other Apache top-level projects, and has made great contributions to the PyFlink module.
The second is Huang Xingbo, who currently focuses on PyFlink UDF performance optimization. He was once the champion of the Alibaba Security Algorithm Challenge, and has also achieved good results in AI and middleware performance competitions.
The third is the well-known Cheng Hequn, who has shared with you many times. I believe you still remember the sharing of "Flink Knowledge Graph" that he brought to you.
The fourth is Zhong Wei, who focuses on PyFlink's UDF dependency management and usability and has already made many code contributions. The last is myself. When you use PyFlink in the future and have any questions, you can contact any of us for support.

Of course, if you encounter general problems, it is recommended to mail Flink's user list or the Chinese user list so the problem can be shared; if you run into a particularly urgent individual problem, you are also very welcome to email the friends just introduced. For problems to accumulate and be shared effectively, I also hope everyone will ask questions on Stack Overflow: first search whether your problem has already been answered, and if not, describe it clearly and tag it with pyflink so that we can subscribe to the tag and respond to your questions promptly.

Summary
Today, we analyzed the deeper meaning of PyFlink; introduced the PyFlink API architecture, whose core is using the Py4J framework for communication between the VMs while keeping the Python API semantically consistent with the Java API; and introduced the Python UDF architecture, which integrates Apache Beam's Portability Framework to obtain efficient and stable Python UDF support. We also analyzed in detail the thinking behind these architectures, the technology selection, and the advantages of the existing design.

We then introduced the business scenarios PyFlink applies to, and used the Alibaba Cloud CDN real-time log analysis case to show PyFlink in use.
Finally, we introduced PyFlink's roadmap and previewed the focus of PyFlink in Flink 1.11; PyFlink 1.11 is expected to deliver a performance improvement of more than 15x over 1.10. We closed with PyFlink's mission: make Flink's capabilities available to Python users, and make the Python ecosystem distributed.

What remains is to offer you an effective way to get help: if you have any questions, you can raise them with the PyFlink contributors I just introduced at any time, and we can discuss them together. :)
