
Packaging Issues in Datastream Development

DataStream job development often runs into JAR package conflicts and similar problems. This article explains which dependencies need to be introduced during job development and which need to be packaged into the job JAR, so that you can avoid putting unnecessary dependencies into the job JAR and prevent possible dependency conflicts.

A DataStream job involves the following dependencies:

The Core Dependencies of Flink and the Dependencies of the Application

Every Flink application depends on a set of Flink libraries, at the very least Flink's APIs. Many applications also rely on connector libraries such as the Kafka and Cassandra connectors. Whether a Flink application runs in a distributed environment or is tested in a local IDE, the Flink runtime dependencies must be available.

Like most systems running user-defined applications, there are two broad categories of dependencies in Flink:

  • Flink Core Dependencies: Flink itself consists of a set of classes and dependencies necessary to run the system, covering coordination, networking, checkpointing, fault tolerance, APIs, operators (such as windows), and resource management. The collection of all these classes and dependencies forms the core of the Flink runtime and must be present when a Flink application starts. These core classes and dependencies are packaged in the flink-dist JAR; they are part of Flink's lib folder and of Flink's base container images. They are to Flink what the core libraries (such as rt.jar or charsets.jar) containing classes like String and List are to Java: required for the system to run. The core dependencies of Flink do not contain any connectors or extension libraries, such as CEP, SQL, or ML. This keeps the core dependencies as small as possible, avoids an overly large classpath by default, and reduces dependency conflicts.
  • User Application Dependencies: All connectors, formats, or extension libraries that a specific user application requires. The application is usually packaged into a JAR file that contains the application code along with the required connector and library dependencies. User application dependencies should not include the Flink DataStream API or runtime dependencies, since those are already part of Flink's core dependencies.

Dependency Configuration Steps

1. Add Basic Dependencies

Every Flink application requires, at a minimum, the basic dependencies on the relevant APIs.

When you configure a project manually, add a dependency on the Java/Scala API. (Maven is used as an example here; the same dependencies apply to other build tools such as Gradle and SBT.)

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.11</artifactId>
  <version>1.12.3</version>
  <scope>provided</scope>
</dependency>

Attention: All of these dependencies have their scope set to provided. This means they are needed at compile time but should not be packaged into the application JAR file produced by the project. These dependencies are Flink core dependencies and are already available at runtime.

We recommend keeping these dependencies in the provided scope. Otherwise, in the best case the generated JAR becomes bloated because it contains all of Flink's core dependencies. In the worst case, the Flink core dependencies added to the application JAR conflict with some of your own dependencies (this is usually avoided by Flink's inverted class loading mechanism).

Note on IntelliJ: To run the application in IntelliJ IDEA, you need to check the Include dependencies with "Provided" scope box in the run configuration. If this option is not available (because of an older IntelliJ IDEA version), a simple workaround is to create a test case that calls the application's main() method.
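
The following is a minimal sketch of such a test case, assuming JUnit 4 is available on the test classpath; MyStreamingJob is a placeholder for your application's main class:

import org.junit.Test;

public class RunJobInIdeTest {

    @Test
    public void runMain() throws Exception {
        // Running the job's main() from a test puts "provided" dependencies on the
        // test classpath, so the job can be started from the IDE.
        // Replace MyStreamingJob with the main class of your application.
        MyStreamingJob.main(new String[0]);
    }
}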

2. Add Dependencies for Connectors and Libraries

Most applications need specific connectors or libraries to run, such as Kafka and Cassandra connectors. These connectors are not part of Flink core dependencies and must be added to the application as additional dependencies.

The following code is an example of adding Kafka connector dependencies (in Maven syntax):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.12.3</version>
</dependency>
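
For illustration, the following is a minimal DataStream sketch that uses this connector; the topic name, broker address, and consumer group are placeholders. The FlinkKafkaConsumer class comes from the connector dependency above, which is why the connector must be bundled into the application JAR:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder Kafka connection settings.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "my-group");

        // FlinkKafkaConsumer is provided by flink-connector-kafka, not by the Flink core.
        env.addSource(new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props))
           .print();

        env.execute("Kafka example");
    }
}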

We recommend packaging the application code and all of its required dependencies into one application JAR (jar-with-dependencies). This application JAR can be submitted to an existing Flink cluster or added to the container image of a Flink application.

For projects created from the Maven job template (see Maven Job Template below), the dependencies are bundled into the application JAR by the mvn clean package command. If you do not configure the project from the template, we recommend using the Maven Shade Plugin to build a JAR that contains the dependencies. (The configuration is shown in the appendix.)

Attention: For Maven (and other build tools) to package the dependencies into the application JAR, the scope of these application dependencies must be set to compile (unlike the core dependencies, whose scope must be set to provided).

Usage Notes

Scala Version

Different Scala versions (such as 2.11 and 2.12) are incompatible with each other. Therefore, a Flink build for Scala 2.11 cannot be used by applications that use Scala 2.12.

All Flink dependencies that depend on Scala (directly or transitively) are suffixed with the Scala version they were built against, such as flink-streaming-scala_2.11.

If you develop in Java only, you can use any Scala version. If you develop in Scala, you must select the Flink dependency version that matches the Scala version of your application.

Note: Scala versions later than 2.12.8 are binary incompatible with earlier 2.12.x versions. Therefore, the Flink project does not upgrade its 2.12.x builds beyond 2.12.8. You can build Flink against a newer Scala version locally; for this to work, you need to add -Djapicmp.skip to skip the binary compatibility check during the build.

Hadoop Dependencies

General Rule: Never add Hadoop-related dependencies to your application, except when using existing Hadoop input/output formats with Flink's Hadoop compatibility package.

If you want to use Flink together with Hadoop, your Flink setup must include the Hadoop dependencies instead of adding Hadoop as an application dependency. Flink uses the Hadoop dependencies specified by the HADOOP_CLASSPATH environment variable, which can be set as follows:

export HADOOP_CLASSPATH=`hadoop classpath`

There are two main reasons for this design:

  • Some interactions with Hadoop happen in Flink's core modules and before the user application is started, such as setting up HDFS for checkpoints, authenticating via Hadoop's Kerberos tokens, or deploying on YARN.
  • Flink's inverted class loading mechanism hides many transitive dependencies from the core dependencies. This applies not only to Flink's own core dependencies but also to Hadoop's dependencies when they are present in the setup. That way, applications can use different versions of the same dependency without running into dependency conflicts. (This matters a lot because Hadoop's dependency tree is huge.)

If you need Hadoop dependencies (such as HDFS access) during testing or development within the IDE, configure the scope of these dependencies to test or provided.
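
As an illustration of the first point above, the sketch below points the state backend at an HDFS path; the namenode address and path are placeholders. The file system implementation behind the hdfs:// scheme is resolved from HADOOP_CLASSPATH at runtime rather than from the application JAR:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HdfsCheckpointJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds to HDFS. The Hadoop file system classes that back
        // the hdfs:// scheme come from HADOOP_CLASSPATH, not from the job JAR.
        env.enableCheckpointing(60_000);
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        env.fromElements(1, 2, 3).print();
        env.execute("HDFS checkpoint example");
    }
}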

Transform Table Connector/Format Resources

Flink uses Java's Service Provider Interface (SPI) mechanism to load table connector/format factories by a specific identifier. Because the SPI resource file of every table connector/format has the same name, org.apache.flink.table.factories.Factory, and sits in the same directory, META-INF/services, these resource files overwrite each other when an uber JAR is built for a project that uses multiple table connectors/formats, which causes Flink to fail to load the factory classes.

Therefore, the recommended approach is to merge these resource files in the META-INF/services directory with the ServicesResourceTransformer of the Maven Shade plugin. The following pom.xml excerpt shows an example that contains the flink-sql-connector-hive-3.1.2 connector and the flink-parquet format.

    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>myProject</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!--  other project dependencies  ...-->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-sql-connector-hive-3.1.2_2.11</artifactId>
            <version>1.13.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-parquet_2.11</artifactId>
            <version>1.13.0</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <executions>
                    <execution>
                        <id>shade</id>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers combine.children="append">
                                <!-- The service transformer is needed to merge META-INF/services files -->
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                <!-- ... -->
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

After the ServicesResourceTransformer is configured, the resource files in the META-INF/services directory are merged rather than overwriting each other when the project builds the uber JAR.

Maven Job Template

We highly recommend using this template for configuration because it avoids a lot of repetitive configuration work.

Prerequisites

The environment requirements are Maven 3.0.4 (or later) and Java 8.x.

Create a Project

Create a project using one of the following methods:

  • Use Maven archetypes
$ mvn archetype:generate                               \
  -DarchetypeGroupId=org.apache.flink              \
  -DarchetypeArtifactId=flink-quickstart-java      \
  -DarchetypeVersion=1.12.3

This allows you to name the newly created project. It will interactively ask you for the groupId, artifactId, and package name.

  • Run the quickstart script
$ curl https://flink.apache.org/q/quickstart.sh | bash -s 1.12.3

We recommend importing the project into your IDE for development and testing. IntelliJ IDEA supports Maven projects out of the box. If you use Eclipse, you can import Maven projects with the m2e plugin; some Eclipse bundles include the plugin by default, while others require you to install it manually.

Note: The default JVM heap size may be too small for Flink, and you have to increase it manually. In Eclipse, choose Run Configurations -> Arguments and enter -Xmx800m in the VM Arguments box. In IntelliJ IDEA, we recommend changing the JVM options via Help | Edit Custom VM Options. Please see this article for details.

Build a Project

If you want to build/package the project, go to the project directory and run the mvn clean package command. After execution, you will get a JAR file, target/<artifact-id>-<version>.jar, which contains your application along with the connectors and libraries added to it as dependencies.

Appendix: Templates for Building JAR Packages with Dependencies

You can use the following Maven Shade plugin configuration to build an application JAR that contains all the dependencies required by the connectors and libraries:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <exclude>com.google.code.findbugs:jsr305</exclude>
                                <exclude>org.slf4j:*</exclude>
                                <exclude>log4j:*</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <!-- Do not copy the signatures in the META-INF folder.
                                Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>my.programs.main.clazz</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>