In-depth Analysis of Flink Job Execution: Flink Advanced Tutorials


By Yue Meng and compiled by Maohe

This article was prepared based on the live courses on Apache Flink given by Yue Meng, an Apache Flink contributor and R&D engineer for the real-time computing platform of NetEase Cloud Music. It describes two aspects of the Flink job execution process: (1) how to go from a program to a physical execution plan and (2) how to schedule and execute the physical execution plan.

Flink's Four-Level Transformation

Flink implements transformations at four levels: transformation from a program to a StreamGraph, transformation from the StreamGraph to a JobGraph, transformation from the JobGraph to an ExecutionGraph, and transformation from the ExecutionGraph to a physical execution plan. After a program is executed, a directed acyclic graph (DAG) or logical execution graph is generated, as shown in the following figure.

This article first introduces the four levels of transformations and uses case studies to provide a more detailed explanation.

The first-level transformation from a program to a StreamGraph starts with source nodes. Each transform operation generates one StreamNode. Two StreamNodes are connected by a StreamEdge to form a DAG.
The second-level transformation from the StreamGraph to a JobGraph also starts with source nodes. Operators are traversed to find those that can be chained (nested) together. Chained operators share one JobVertex, and each operator that cannot be chained gets a separate JobVertex. Upstream and downstream JobVertexes are connected by a JobEdge to form a DAG at the JobVertex level.
After the JobGraph (the JobVertex DAG) is submitted, the third-level transformation sorts the nodes starting from source nodes. An ExecutionJobVertex is created based on each JobVertex, and an IntermediateResult is created based on each IntermediateDataSet of the JobVertex. The IntermediateResults are used to establish upstream-downstream dependencies, which form a DAG at the ExecutionJobVertex level, namely, the ExecutionGraph.
Finally, the ExecutionGraph is transformed into a physical execution plan.

Transformation from a Program to a StreamGraph

The process of transforming a program to a StreamGraph is as follows:

Execute the program from StreamExecutionEnvironment.execute. Add transforms to the transformations of the StreamExecutionEnvironment.
Call the generateInternal method of the StreamGraphGenerator to traverse transformations, so as to create StreamNodes and StreamEdges.
Connect the StreamNodes with a StreamEdge.

Use WindowWordCount to follow the transformation from code to the StreamGraph. In the flatMap transform, the slot sharing group is set to flatMap_sg, and parallelism is set to 4. In the aggregate operation, the slot sharing group is set to sum_sg and parallelism is set to 3 for sum() and counts(). These settings help demonstrate the subsequent operator chaining (nesting) process, which depends on the parallelism of the upstream and downstream nodes and on their slot sharing groups.

Based on the code of WindowWordCount, readTextFile() creates a transform with the ID 1. flatMap() creates a transform with the ID 2. keyBy() creates a transform with the ID 3. sum() creates a transform with the ID 4. counts() creates a transform with the ID 5.

The following figure shows the transform structure. The first transform belongs to flatMap. The second transform belongs to Window. The third transform belongs to SinkTransform. The transform structure also shows the input of each transform.

The concepts of StreamNode and StreamEdge are explained as follows.

A StreamNode is a logical node used to describe an operator and includes key variables such as slotSharingGroup, jobVertexClass, inEdges, outEdges, and transformationUID.
A StreamEdge is a logical edge between two operators and includes key variables such as sourceVertex and targetVertex.

The following figure shows how the WindowWordCount transforms into a StreamGraph. The transformations of StreamExecutionEnvironment include three transforms: Flat Map (ID: 2), Window (ID: 4), and Sink (ID: 5).

During a transform operation, the input is recursively processed to create a StreamNode. Then upstream and downstream StreamNodes are connected by a StreamEdge. Some transform operations, such as PartitionTransformation, create virtual nodes rather than StreamNodes.

After the transformation, the StreamGraph has four StreamNodes: Source, Flat Map, Window, and Sink.

Each StreamNode object contains runtime information, including the parallelism, slotSharingGroup, and execution class.
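The recursive construction described above can be sketched in plain Java. This is a minimal illustration, not Flink's implementation: the classes StreamGraphSketch and Transform, the "virtual" flag, and the edge representation are all hypothetical, and the handling of virtual nodes is simplified so that downstream edges attach directly to the upstream StreamNode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; these are not Flink's real classes.
final class StreamGraphSketch {

    /** A transform with an ID, a name, and at most one input (null for a source). */
    static final class Transform {
        final int id;
        final String name;
        final Transform input;
        final boolean virtual; // true for partition-only transforms such as keyBy

        Transform(int id, String name, Transform input, boolean virtual) {
            this.id = id;
            this.name = name;
            this.input = input;
            this.virtual = virtual;
        }
    }

    final Map<Integer, String> streamNodes = new LinkedHashMap<>(); // node ID -> operator name
    final List<int[]> streamEdges = new ArrayList<>();              // {sourceId, targetId}
    private final Map<Integer, Integer> transformed = new HashMap<>();

    /** Processes the input first, then the transform; returns the ID of the producing StreamNode. */
    int transform(Transform t) {
        Integer known = transformed.get(t.id);
        if (known != null) {
            return known; // already handled through another path
        }
        Integer upstream = null;
        if (t.input != null) {
            upstream = transform(t.input); // recursion into the input
        }
        int producer;
        if (t.virtual) {
            if (upstream == null) {
                throw new IllegalArgumentException("a virtual transform needs an input");
            }
            producer = upstream; // no StreamNode is created for a virtual node
        } else {
            streamNodes.put(t.id, t.name);
            if (upstream != null) {
                streamEdges.add(new int[] {upstream, t.id});
            }
            producer = t.id;
        }
        transformed.put(t.id, producer);
        return producer;
    }

    public static void main(String[] args) {
        Transform source = new Transform(1, "Source", null, false);
        Transform flatMap = new Transform(2, "Flat Map", source, false);
        Transform keyBy = new Transform(3, "keyBy", flatMap, true);
        Transform window = new Transform(4, "Window", keyBy, false);
        Transform sink = new Transform(5, "Sink", window, false);

        StreamGraphSketch graph = new StreamGraphSketch();
        // Only Flat Map, Window, and Sink are registered; Source is reached by recursion.
        for (Transform t : Arrays.asList(flatMap, window, sink)) {
            graph.transform(t);
        }
        System.out.println(graph.streamNodes.values());
        for (int[] edge : graph.streamEdges) {
            System.out.println(edge[0] + " -> " + edge[1]);
        }
    }
}
```

In this sketch, the keyBy transform (ID 3) produces no StreamNode, so the Window node is connected directly to the Flat Map node, which yields four StreamNodes and three StreamEdges.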

Transformation from a StreamGraph to a JobGraph

The process of transforming a StreamGraph to a JobGraph is as follows:

Set the scheduling mode and immediately start all nodes in Eager mode.
Traverse the StreamGraph in breadth-first mode and generate a hash value of the byte array type for each StreamNode.
Perform a recursive search starting from source nodes to find the operators that can be chained (nested). For each node that cannot be chained, create a separate JobVertex. For nodes that can be chained, create a JobVertex for the starting node, serialize the other nodes to StreamConfig, and merge them into CHAINED_TASK_CONFIG. Then, connect upstream and downstream JobVertexes with a JobEdge.
Serialize the incoming edge (StreamEdge) of each JobVertex to StreamConfig.
Specify a SlotSharingGroup for each JobVertex based on the group name.
Configure checkpoints.
Add the file storage configuration of cached files to "configuration".
Set ExecutionConfig.

All of the following conditions must be met for two operators to be chained (nested):

The downstream node has only one input.
The operator of the downstream node is not null.
The operator of the upstream node is not null.
The upstream and downstream nodes are located in the same slot sharing group.
The downstream node uses the ALWAYS chaining strategy.
The upstream node uses the HEAD or ALWAYS chaining strategy.
The partition function for the edge is an instance of ForwardPartitioner.
The upstream and downstream nodes have the same parallelism.
Operator chaining is enabled for the job.
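The conditions above can be combined into a single predicate. The following is a minimal plain-Java sketch rather than Flink's code: the Node class and its fields are hypothetical, the two "operator is not null" checks are omitted because every Node here carries an operator, and the final condition is modeled as a job-wide chainingEnabled flag, which is an interpretation. The parallelism and slot sharing group of the Source node are assumed values, because the article does not state them.

```java
// Hypothetical model of the chaining check; not Flink's real classes.
final class ChainingSketch {

    enum ChainingStrategy { ALWAYS, HEAD, NEVER }

    static final class Node {
        final String name;
        final int parallelism;
        final String slotSharingGroup;
        final ChainingStrategy strategy;
        final int inputCount;

        Node(String name, int parallelism, String slotSharingGroup,
             ChainingStrategy strategy, int inputCount) {
            this.name = name;
            this.parallelism = parallelism;
            this.slotSharingGroup = slotSharingGroup;
            this.strategy = strategy;
            this.inputCount = inputCount;
        }
    }

    /** Returns true only if every listed condition holds for the edge up -> down. */
    static boolean isChainable(Node up, Node down, boolean forwardPartitioner,
                               boolean chainingEnabled) {
        return down.inputCount == 1
                && up.slotSharingGroup.equals(down.slotSharingGroup)
                && down.strategy == ChainingStrategy.ALWAYS
                && (up.strategy == ChainingStrategy.HEAD
                        || up.strategy == ChainingStrategy.ALWAYS)
                && forwardPartitioner
                && up.parallelism == down.parallelism
                && chainingEnabled;
    }

    public static void main(String[] args) {
        // Source values are assumptions; the other values follow the WindowWordCount settings.
        Node source = new Node("Source", 1, "default", ChainingStrategy.HEAD, 0);
        Node flatMap = new Node("Flat Map", 4, "flatMap_sg", ChainingStrategy.ALWAYS, 1);
        Node window = new Node("Window", 3, "sum_sg", ChainingStrategy.ALWAYS, 1);
        Node sink = new Node("Sink", 3, "sum_sg", ChainingStrategy.ALWAYS, 1);

        // Different slot sharing groups and parallelism: not chainable.
        System.out.println(isChainable(source, flatMap, false, true)); // false
        // keyBy repartitions the data, so the edge is not a forward edge: not chainable.
        System.out.println(isChainable(flatMap, window, false, true)); // false
        // Same group, same parallelism, forward edge: chainable.
        System.out.println(isChainable(window, sink, true, true));     // true
    }
}
```

With these settings, only the Window and Sink nodes pass the check, which matches a JobGraph with three vertexes.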

The preceding figure shows the JobGraph object structure. taskVertices includes only three JobVertexes: Window, Flat Map, and Source. The Sink operator is chained (nested) into the JobVertex of the Window operator.

What is the purpose of generating a hash value for each operator?

When a Flink task fails, each operator can be restored from the checkpoint to its pre-failure state based on the JobVertexID, namely, the hash value. The hash value of each operator must remain unchanged when the same task is recovered so that the operator can locate its corresponding state.

How is a hash value generated for each operator?

If we specify a hash value for a node, a byte array with a length of 16 is generated based on the hash value. If no hash value is specified for a node, a hash value is generated based on the location of the node.

Three things must be considered in this process:

1) The number of nodes that are processed before the current StreamNode is specified as the ID of this StreamNode and added to the hasher.
2) The system traverses all outgoing StreamEdges of the current StreamNode and determines whether this StreamNode can be chained to the target StreamNode of each StreamEdge. If so, the ID of the target StreamNode is set to the same value as the ID of the current StreamNode and is also added to the hasher.
3) Bitwise operations are performed on the byte data that is generated in the preceding two steps and the byte data of all the input StreamNodes of the current StreamNode. The resulting byte data is used as the byte array of the current StreamNode, with a length of 16.
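The three steps above can be sketched as follows. This is an illustration under stated assumptions, not Flink's hasher: MD5 from the JDK is swapped in only because it yields 16 bytes without extra dependencies, the method name deterministicHash and its parameters are hypothetical, and the exact bitwise mixing formula (multiply by 37, then XOR) is an assumption about how the input hashes are combined.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of position-based hash generation; MD5 is a stand-in hash function.
final class NodeHashSketch {

    static byte[] deterministicHash(int processedNodes, int chainableOutputs,
                                    List<byte[]> inputHashes) {
        MessageDigest hasher;
        try {
            hasher = MessageDigest.getInstance("MD5"); // 16-byte digest
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        byte[] id = ByteBuffer.allocate(4).putInt(processedNodes).array();

        // Step 1: the number of nodes processed so far serves as this node's ID.
        hasher.update(id);

        // Step 2: each chainable output contributes the same ID value.
        for (int i = 0; i < chainableOutputs; i++) {
            hasher.update(id);
        }
        byte[] hash = hasher.digest();

        // Step 3: mix in the hashes of all input nodes with bitwise operations.
        for (byte[] inputHash : inputHashes) {
            for (int j = 0; j < hash.length; j++) {
                hash[j] = (byte) (hash[j] * 37 ^ inputHash[j]);
            }
        }
        return hash;
    }

    public static void main(String[] args) {
        byte[] source = deterministicHash(0, 0, new ArrayList<byte[]>());
        byte[] flatMap = deterministicHash(1, 0, Arrays.asList(source));
        StringBuilder hex = new StringBuilder();
        for (byte b : flatMap) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex + " (" + flatMap.length + " bytes)");
    }
}
```

Because the result depends only on the node's position and its inputs, rebuilding the same topology yields the same byte array, which is the property that state recovery relies on.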

Transformation from a JobGraph to an ExecutionGraph and then to a Physical Execution Plan

The process of transforming a JobGraph to an ExecutionGraph and then to a physical execution plan is as follows:

Sort the JobVertexes of a JobGraph starting from source nodes.
In the executionGraph.attachJobGraph(sortedTopology) method, create an ExecutionJobVertex based on each JobVertex. In the ExecutionJobVertex construction method, create an IntermediateResult based on each IntermediateDataSet of the JobVertex. Create ExecutionVertexes based on the parallelism of the JobVertex. When an ExecutionVertex is created, create IntermediateResultPartitions in the same quantity as the number of IntermediateResults. Connect the created ExecutionJobVertexes to the prior IntermediateResults.
Construct an ExecutionEdge and connect it to the prior IntermediateResultPartition. Finally, transform the ExecutionGraph to a physical execution plan.
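To make the expansion concrete, the following plain-Java sketch counts the objects that the steps above would create. It is a hypothetical counting model and not Flink's ExecutionGraph: the classes JobVertex and Summary and the method expand are invented for illustration, every connection is assumed to be all-to-all, and the Source parallelism of 1 is an assumed value.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical counting model; not Flink's ExecutionGraph classes.
final class ExecutionGraphSketch {

    static final class JobVertex {
        final String name;
        final int parallelism;
        final List<JobVertex> inputs; // each input stands for one consumed IntermediateDataSet

        JobVertex(String name, int parallelism, JobVertex... inputs) {
            this.name = name;
            this.parallelism = parallelism;
            this.inputs = Arrays.asList(inputs);
        }
    }

    static final class Summary {
        int executionVertices;
        int resultPartitions;
        int executionEdges;
    }

    /** Expands JobVertexes, given in topological order, into parallel execution objects. */
    static Summary expand(List<JobVertex> sortedTopology) {
        // Number of IntermediateResults produced by each vertex (one per consumer).
        Map<JobVertex, Integer> producedResults = new IdentityHashMap<>();
        for (JobVertex vertex : sortedTopology) {
            for (JobVertex input : vertex.inputs) {
                producedResults.merge(input, 1, Integer::sum);
            }
        }
        Summary summary = new Summary();
        for (JobVertex vertex : sortedTopology) {
            int results = producedResults.getOrDefault(vertex, 0);
            // One ExecutionVertex per parallel subtask.
            summary.executionVertices += vertex.parallelism;
            // Each ExecutionVertex owns one partition per IntermediateResult.
            summary.resultPartitions += vertex.parallelism * results;
            // Assumed all-to-all: every subtask reads every upstream partition.
            for (JobVertex input : vertex.inputs) {
                summary.executionEdges += vertex.parallelism * input.parallelism;
            }
        }
        return summary;
    }

    public static void main(String[] args) {
        JobVertex source = new JobVertex("Source", 1);
        JobVertex flatMap = new JobVertex("Flat Map", 4, source);
        JobVertex window = new JobVertex("Window -> Sink", 3, flatMap);

        Summary s = expand(Arrays.asList(source, flatMap, window));
        System.out.println(s.executionVertices + " ExecutionVertexes, "
                + s.resultPartitions + " partitions, "
                + s.executionEdges + " ExecutionEdges");
        // prints "8 ExecutionVertexes, 5 partitions, 16 ExecutionEdges"
    }
}
```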

Flink Job Execution Process

Flink on YARN Mode

The YARN-based architecture is similar to the Spark on YARN mode. Clients submit applications to the Resource Manager (RM) to run the applications. The RM allocates the first container to run the Application Master (AM), which monitors and manages resources. The Flink on YARN mode is similar to the cluster mode of Spark on YARN, in which a driver runs like a thread of the AM. In Flink on YARN mode, the JobManager is started inside a container to schedule and allocate tasks in a way similar to the scheduling of drivers. The YARN AM and Flink JobManager are inside the same container, which allows the YARN AM to know the Flink JobManager's address and apply for a container to start the Flink TaskManager. After Flink successfully runs in the YARN cluster, the Flink YARN client submits a Flink job to the Flink JobManager and implements mapping, scheduling, and computing.

Disadvantages of Flink on YARN

Resources are allocated statically. A job must acquire the required resources upon startup and hold these resources throughout its lifecycle. As a result, jobs cannot be dynamically adjusted based on load changes. For example, idle resources cannot be released when loads drop, and resources cannot be dynamically scaled out when loads increase.
In On-YARN mode, all containers are of a fixed size, making it impossible to adjust the container structure based on job requirements. For example, CPU-intensive jobs may require more cores but do not require much memory. The fixed structure of containers may cause memory waste.
The interaction with the container management infrastructure is clumsy. A Flink job starts in two steps: 1. Start the Flink daemon; 2. Submit the job. Step 2 is not required if the job is containerized and deployed with a container.
In On-YARN mode, the job management page disappears after the job is completed.
Flink recommends the per-job cluster deployment method but also supports the session mode, which allows multiple jobs to run in a single cluster. Having both modes can be confusing.

Flink 1.5 introduces a new concept called Dispatcher. The Dispatcher receives job submission requests from clients and starts jobs on a cluster manager on behalf of the clients.

The Purpose of the Dispatcher

The dispatcher serves the following two purposes:

1) It provides some cluster managers with a centralized instance to create and monitor jobs.
2) It assumes the JobManager role in Standalone mode and waits for job submissions. The Dispatcher is optional in YARN and not applicable in Kubernetes.

Flink on YARN Mode with Resource Scheduling Model Refactoring

Dispatcher-free Job Execution

After a client submits a JobGraph and a dependency JAR package to the YARN RM, the YARN RM allocates the first container to start the AM. The AM starts a Flink RM and JobManager. The JobManager applies for slots from the Flink RM based on the ExecutionGraph and physical execution plan that is created based on the JobGraph. The Flink RM manages these slots and requests. If no slots are available, the JobManager applies for a container from the YARN RM. The container is started and registered to the Flink RM. Finally, the JobManager deploys the subtask to a slot of the container.
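The slot request flow in this paragraph can be sketched in plain Java. This is a toy model and not Flink's API: the classes FlinkResourceManager and JobManager, the method names, the container naming scheme, and the fixed number of slots per container are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of the slot request flow; not Flink's real classes.
final class SlotRequestSketch {

    /** Stand-in for the Flink RM, which tracks the free slots of registered containers. */
    static final class FlinkResourceManager {
        private final Deque<String> freeSlots = new ArrayDeque<>();

        /** Returns a free slot, or null if none is available. */
        String requestSlot() {
            return freeSlots.poll();
        }

        void registerContainer(String containerId, int slots) {
            for (int i = 0; i < slots; i++) {
                freeSlots.add(containerId + "/slot-" + i);
            }
        }
    }

    /** Stand-in for the JobManager, which deploys subtasks into slots. */
    static final class JobManager {
        final FlinkResourceManager resourceManager;
        final int slotsPerContainer;
        final Map<String, String> deployments = new LinkedHashMap<>();
        int containersStarted;

        JobManager(FlinkResourceManager resourceManager, int slotsPerContainer) {
            if (slotsPerContainer < 1) {
                throw new IllegalArgumentException("slotsPerContainer must be positive");
            }
            this.resourceManager = resourceManager;
            this.slotsPerContainer = slotsPerContainer;
        }

        void deploy(String subtask) {
            String slot = resourceManager.requestSlot();
            if (slot == null) {
                // No free slot: a new container is requested from YARN (simulated here),
                // started, and registered with the Flink RM.
                containersStarted++;
                resourceManager.registerContainer("container-" + containersStarted,
                        slotsPerContainer);
                slot = resourceManager.requestSlot();
            }
            deployments.put(subtask, slot);
        }
    }

    public static void main(String[] args) {
        JobManager jobManager = new JobManager(new FlinkResourceManager(), 2);
        for (String subtask : new String[] {"window-0", "window-1", "window-2"}) {
            jobManager.deploy(subtask);
        }
        System.out.println(jobManager.deployments);
        System.out.println("containers started: " + jobManager.containersStarted);
    }
}
```

With two slots per container, the third subtask does not fit into the first container, so a second container is started on demand.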

Dispatcher-based Job Execution

The Dispatcher allows the client to submit jobs to the YARN RM directly through an HTTP server.

The new framework provides four benefits:

The client starts a job directly on YARN, without having to start a cluster first and then submit the job to it. In this way, the client receives a response immediately after job submission.
All user-dependent libraries and configuration files are placed in the classpath of applications instead of being loaded by using the dynamic user code classloader.
Containers are requested on-demand and released when they are no longer used.
The on-demand allocation of containers allows different operators to use containers with different profiles, meaning different CPU configurations and memory structures.

The "Single Cluster Job on YARN" Process in the New Resource Scheduling Framework

The Single Cluster Job on YARN mode involves three instance objects:

1) CliFrontend

The application code is invoked.
A StreamGraph is created and then transformed into a JobGraph.

2) YarnJobClusterEntrypoint (Master)

The YARN RM, MiniDispatcher, and JobManagerRunner are started in sequence and follow the distributed collaboration policy.
The JobManagerRunner transforms the JobGraph to an ExecutionGraph and then to a physical execution task, and then deploys the task by applying for slots from the YARN RM. If slots are available, the physical execution task is directly deployed in the slot of the YarnTaskExecutor. If no slots are available, the JobManagerRunner applies for a container from the YARN RM and deploys the physical execution task after the container is started.

3) YarnTaskExecutorRunner (slave)

This instance object receives and runs subtasks.

The following figure shows the task execution code.

How Subtasks Run During Task Execution

The system calls the StreamTask's invoke method. The execution process is as follows:

initializeState() calls the initializeState() method of each operator.
openAllOperators() calls the open() method of each operator.
The system calls the run method for task processing.
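The order of these calls can be illustrated with a small plain-Java sketch. It is not Flink's StreamTask: the class LoggingOperator and the invoke method here are hypothetical stand-ins that only record the call sequence, and the iteration order across several chained operators is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the invoke lifecycle; not Flink's StreamTask.
final class StreamTaskSketch {

    /** An operator stand-in that records each lifecycle call in a shared log. */
    static final class LoggingOperator {
        final String name;
        final List<String> log;

        LoggingOperator(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        void initializeState() {
            log.add(name + ".initializeState");
        }

        void open() {
            log.add(name + ".open");
        }

        void processElement(String record) {
            log.add(name + ".processElement(" + record + ")");
        }
    }

    /** Runs the three phases in order and returns the recorded call sequence. */
    static List<String> invoke(List<String> operatorNames, List<String> records) {
        List<String> log = new ArrayList<>();
        List<LoggingOperator> operators = new ArrayList<>();
        for (String name : operatorNames) {
            operators.add(new LoggingOperator(name, log));
        }
        // Phase 1: restore or initialize the state of every operator.
        for (LoggingOperator operator : operators) {
            operator.initializeState();
        }
        // Phase 2: open every operator.
        for (LoggingOperator operator : operators) {
            operator.open();
        }
        // Phase 3: run, here simplified to pushing each record through the operators.
        for (String record : records) {
            for (LoggingOperator operator : operators) {
                operator.processElement(record);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> log = invoke(Arrays.asList("flatMap"), Arrays.asList("a", "b"));
        for (String entry : log) {
            System.out.println(entry);
        }
    }
}
```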

The following code shows how the task is processed by the run method of OneInputStreamTask that corresponds to flatMap:

    @Override
    protected void run() throws Exception {
        // cache processor reference on the stack, to make the code more JIT friendly
        final StreamInputProcessor<IN> inputProcessor = this.inputProcessor;

        while (running && inputProcessor.processInput()) {
            // all the work happens in the "processInput" method
        }
    }

Data is processed by calling processInput() of the StreamInputProcessor. The following code includes the user's processing logic:

    public boolean processInput() throws Exception {
        if (isFinished) {
            return false;
        }
        if (numRecordsIn == null) {
            try {
                numRecordsIn = ((OperatorMetricGroup) streamOperator.getMetricGroup()).getIOMetricGroup().getNumRecordsInCounter();
            } catch (Exception e) {
                LOG.warn("An exception occurred during the metrics setup.", e);
                numRecordsIn = new SimpleCounter();
            }
        }

        while (true) {
            if (currentRecordDeserializer != null) {
                DeserializationResult result = currentRecordDeserializer.getNextRecord(deserializationDelegate);

                if (result.isBufferConsumed()) {
                    currentRecordDeserializer.getCurrentBuffer().recycleBuffer();
                    currentRecordDeserializer = null;
                }

                if (result.isFullRecord()) {
                    StreamElement recordOrMark = deserializationDelegate.getInstance();
                    // check for watermarks first
                    if (recordOrMark.isWatermark()) {
                        // handle watermark; this may cause registered timers to fire
                        statusWatermarkValve.inputWatermark(recordOrMark.asWatermark(), currentChannel);
                        continue;
                    } else if (recordOrMark.isStreamStatus()) {
                        // handle stream status
                        statusWatermarkValve.inputStreamStatus(recordOrMark.asStreamStatus(), currentChannel);
                        continue;
                        // latency markers are handled in the next branch
                    } else if (recordOrMark.isLatencyMarker()) {
                        // handle latency marker
                        synchronized (lock) {
                            streamOperator.processLatencyMarker(recordOrMark.asLatencyMarker());
                        }
                        continue;
                    } else {
                        // the user's actual processing logic runs here
                        // now we can do the actual processing
                        StreamRecord<IN> record = recordOrMark.asRecord();
                        synchronized (lock) {
                            numRecordsIn.inc();
                            streamOperator.setKeyContextElement1(record);
                            // process the record
                            streamOperator.processElement(record);
                        }
                        return true;
                    }
                }
            }
            
            // Checkpoint barriers are detected and aligned here; while the barriers from different partitions are not yet aligned, the data is buffered.

            final BufferOrEvent bufferOrEvent = barrierHandler.getNextNonBlocked();
            if (bufferOrEvent != null) {
                if (bufferOrEvent.isBuffer()) {
                    currentChannel = bufferOrEvent.getChannelIndex();
                    currentRecordDeserializer = recordDeserializers[currentChannel];
                    currentRecordDeserializer.setNextBuffer(bufferOrEvent.getBuffer());
                }
                else {
                    // Event received
                    final AbstractEvent event = bufferOrEvent.getEvent();
                    if (event.getClass() != EndOfPartitionEvent.class) {
                        throw new IOException("Unexpected event: " + event);
                    }
                }
            }
            else {
                isFinished = true;
                if (!barrierHandler.isEmpty()) {
                    throw new IllegalStateException("Trailing data in checkpoint barrier handler.");
                }
                return false;
            }
        }
    }

streamOperator.processElement(record) calls the user's code processing logic. Here, we assume that the operator is StreamFlatMap.

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        collector.setTimestamp(element);
        userFunction.flatMap(element.getValue(), collector); // user code
    }
