Flink FAQ - E-MapReduce - Alibaba Cloud Documentation Center

View cluster logs

View logs based on the status of the JobManager:

If the JobManager of the Flink cluster has exited, you can pull the logs to your local machine for viewing by running the yarn logs -applicationId application_xxxx_yy command. You can also view the logs in a web browser by accessing the log link for the finished job in the YARN web UI.
If the JobManager of the Flink cluster is still running, use one of the following methods:
- Access the corresponding Flink web UI to view the logs.
- Use command-line tools. Run yarn logs -applicationId application_xxxx_yy -am ALL -logFiles jobmanager.log to view JobManager logs, or run yarn logs -applicationId application_xxxx_yy -containerId container_xxxx_yy_aa_bb -logFiles taskmanager.log to view TaskManager logs.
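The choice between the two command-line forms can be wrapped in a small helper. The following is a minimal sketch, not part of EMR or YARN: the function name `flink_yarn_logs_cmd` is hypothetical, and it only prints the command so that you can review it before you run it.

```shell
# Hypothetical helper: print the yarn logs command for a running Flink application.
# With only an application ID, it targets the JobManager (ApplicationMaster) log.
# With an additional container ID, it targets the TaskManager log of that container.
flink_yarn_logs_cmd() {
  app_id="${1:-}"
  container_id="${2:-}"
  if [ -z "$app_id" ]; then
    echo "usage: flink_yarn_logs_cmd <applicationId> [containerId]" >&2
    return 1
  fi
  if [ -z "$container_id" ]; then
    echo "yarn logs -applicationId $app_id -am ALL -logFiles jobmanager.log"
  else
    echo "yarn logs -applicationId $app_id -containerId $container_id -logFiles taskmanager.log"
  fi
}

# Print the JobManager log command for a placeholder application ID.
flink_yarn_logs_cmd application_xxxx_yy
```

Replace the placeholder IDs with the values shown in the YARN web UI.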

Resolve JAR package conflicts

This issue typically causes errors such as NoSuchFieldError/NoSuchMethodError/ClassNotFoundException in the job logs. To troubleshoot and resolve this issue, follow these steps:

Identify the conflicting dependency class. Based on the exception class in the error message, find the dependency JAR package that contains this class. Then, in the directory where your job's pom.xml file is located, run mvn dependency:tree to view the dependency tree and determine its origin.
Exclude the conflicting dependency class.
- If the scope of the JAR package is incorrectly set in the pom.xml file, change the scope to provided to exclude the JAR package.
- If you must use the JAR package that contains the exception class, you can add an exclude rule to remove the specific conflicting class.
- If you must use the exception class and cannot replace it with the corresponding version from the cluster, use the Maven Shade Plugin to shade the class.
Additionally, if multiple versions of a JAR package exist in the classpath, the class version used by the job depends on the class loading order. To confirm which JAR package a specific class is loaded from, you can set the JVM parameter env.java.opts: -verbose:class in the flink-conf.yaml file or specify the dynamic parameter -Denv.java.opts="-verbose:class" to print the loaded classes and their sources.

Note
For a JobManager or TaskManager, this information is printed to the jobmanager.out or taskmanager.out file.
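Once -verbose:class is enabled, the .out file can be searched for the class in question. The following sketch assumes the Java 8 output format, in which each loaded class appears as `[Loaded <class> from <source>]`. Newer JVMs print a different format. The function name `class_source` is hypothetical.

```shell
# Hypothetical helper: print the source (usually a JAR) that a class was loaded
# from, based on -verbose:class output in jobmanager.out or taskmanager.out.
# Assumes the Java 8 line format: [Loaded <class> from <source>]
class_source() {
  class_name="$1"
  out_file="$2"
  grep -F "[Loaded ${class_name} from " "$out_file" \
    | sed -e 's/^.* from //' -e 's/]$//' \
    | sort -u
}

# Example call (the file path is a placeholder):
#   class_source org.apache.flink.configuration.Configuration /path/to/taskmanager.out
```

If the printed source is not the JAR that you expect, that JAR is a likely candidate for the conflict.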

Submit jobs from an external machine

To submit a job to a DataFlow cluster from an external machine, follow these steps:

Ensure that the external machine can connect to the DataFlow cluster over the network.
Configure the Hadoop YARN environment on the client machine that submits the Flink job.

In a DataFlow cluster, the Hadoop YARN software is installed in the /opt/apps/YARN/yarn-current directory, and its configuration files are in the /etc/taihao-apps/hadoop-conf/ directory. You must download the yarn-current directory and the hadoop-conf directory to the client machine.

Then, configure the following environment variables on the client machine.
```
export HADOOP_HOME=/path/to/yarn-current && \
export PATH=${HADOOP_HOME}/bin/:$PATH && \
export HADOOP_CLASSPATH=$(hadoop classpath) && \
export HADOOP_CONF_DIR=/path/to/hadoop-conf
```
Important
Hadoop configuration files, such as yarn-site.xml, use a fully qualified domain name (FQDN) for service addresses like the ResourceManager. For example, master-1-1.c-xxxxxxxxxx.cn-hangzhou.emr.aliyuncs.com. If you submit jobs from an external machine, ensure that these FQDNs can be resolved, or replace the FQDNs with their corresponding IP addresses in the configuration files.
After you complete the configuration, start a Flink job on the external machine. For example, run the command flink run -d -t yarn-per-job -ynm flink-test $FLINK_HOME/examples/streaming/TopSpeedWindowing.jar. You can then see the corresponding Flink job in the YARN web UI of the DataFlow cluster.
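Before you submit a job, it can help to verify that the directories copied from the cluster are where the environment variables point. The following is a minimal sketch under that assumption. The function name `check_hadoop_client_env` is hypothetical, and it checks only for a bin/ directory and a yarn-site.xml file, not for network reachability.

```shell
# Hypothetical preflight check for a client machine outside the cluster.
# Returns 0 only if the Hadoop directories copied from the cluster are in place.
check_hadoop_client_env() {
  status=0
  if [ -z "${HADOOP_HOME:-}" ] || [ ! -d "$HADOOP_HOME/bin" ]; then
    echo "HADOOP_HOME is unset or has no bin/ directory" >&2
    status=1
  fi
  if [ -z "${HADOOP_CONF_DIR:-}" ] || [ ! -f "$HADOOP_CONF_DIR/yarn-site.xml" ]; then
    echo "HADOOP_CONF_DIR is unset or has no yarn-site.xml" >&2
    status=1
  fi
  return "$status"
}

# Example: submit only if the check passes.
#   check_hadoop_client_env && flink run -d -t yarn-per-job ...
```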

Resolve cluster hostnames from an external machine

Use one of the following methods to resolve DataFlow cluster hostnames from an external machine:

Modify the /etc/hosts file on the client machine to add mappings between the hostnames and IP addresses.
Use the DNS service provided by Alibaba Cloud DNS PrivateZone.
If you have your own domain name resolution service, you can also configure the following JVM runtime parameters to use it.
```
env.java.opts.client: "-Dsun.net.spi.nameservice.nameservers=xxx -Dsun.net.spi.nameservice.provider.1=dns,sun -Dsun.net.spi.nameservice.domain=yyy"
```
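For the /etc/hosts method, each mapping is a single line that lists the IP address followed by the names it should resolve. The sketch below prints such a line. The function name `hosts_entry` and the IP address are placeholders, and adding the short hostname as an alias is an assumption for convenience, not a requirement stated by EMR.

```shell
# Hypothetical helper: print one /etc/hosts line that maps an IP address to a
# node's FQDN and to its short hostname (the first label of the FQDN).
hosts_entry() {
  ip="$1"
  fqdn="$2"
  short="${fqdn%%.*}"
  printf '%s %s %s\n' "$ip" "$fqdn" "$short"
}

# Placeholder IP address and the example FQDN pattern; replace both with your values.
hosts_entry 192.168.0.10 master-1-1.c-xxxxxxxxxx.cn-hangzhou.emr.aliyuncs.com
```

Append the printed lines to /etc/hosts on the client machine, one line per cluster node.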

Check Flink job status

Use the EMR console.

EMR supports Knox, which allows you to access the web UIs of services like YARN and Flink over the internet. You can access the Flink web UI through YARN. For more information, see View the job status on the web UI.
Use an SSH tunnel. For more information, see Create an SSH tunnel to access web UIs of open source components.
Access the YARN REST API directly.
```
curl --compressed -v  -H "Accept: application/json" -X GET "http://master-1-1:8088/ws/v1/cluster/apps?states=RUNNING&queue=default&user.name=***"
```
Note
Ensure that your security group allows access to ports 8443 and 8088 to reach the YARN REST API. Alternatively, ensure the DataFlow cluster and your client node are in the same Virtual Private Cloud (VPC).

Access Flink job logs

For a running job, you can access its logs through the Flink web UI.
For a finished job, you can view its statistics on the Flink HistoryServer or access its logs by running the command yarn logs -applicationId application_xxxx_yyyy. Logs for finished jobs are stored by default in the hdfs:///tmp/logs/$USERNAME/logs/ directory on the HDFS cluster.

Access the Flink HistoryServer

A DataFlow cluster starts a Flink HistoryServer by default on the master-1-1 node (the first machine in the master server group) at port 18082. This server collects statistics for finished jobs. To access it, follow these steps:

Configure a security group rule to allow access to port 18082 on the master-1-1 node.
Access http://$master-1-1-ip:18082 directly.

Important

The Flink HistoryServer does not store the detailed logs of finished jobs. To view logs, use the YARN API or the YARN web UI.

Use commercial connectors

A DataFlow cluster provides many commercial connectors, such as those for Hologres, SLS, MaxCompute, DataHub, Elasticsearch, and ClickHouse. In your Flink jobs, you can use these commercial connectors in addition to open source ones. The following example demonstrates how to use the included Hologres connector.

Job development

Download the JAR package of the commercial connector from the DataFlow cluster (located in the /opt/apps/FLINK/flink-current/opt/connectors directory). Then, install the connector in your local Maven environment by running the following command.
```
mvn install:install-file -Dfile=/path/to/ververica-connector-hologres-1.13-vvr-4.0.7.jar -DgroupId=com.alibaba.ververica -DartifactId=ververica-connector-hologres -Dversion=1.13-vvr-4.0.7 -Dpackaging=jar
```

Add the following dependency to your project's pom.xml file.

<dependency>
    <groupId>com.alibaba.ververica</groupId>
    <artifactId>ververica-connector-hologres</artifactId>
    <version>1.13-vvr-4.0.7</version>
    <scope>provided</scope>
</dependency>

Run the job
- Method 1:
  1. Copy the Hologres connector to a separate directory.
```
hdfs dfs -mkdir -p hdfs:///flink-current/opt/connectors/hologres/
hdfs dfs -cp hdfs:///flink-current/opt/connectors/ververica-connector-hologres-1.13-vvr-4.0.7.jar hdfs:///flink-current/opt/connectors/hologres/ververica-connector-hologres-1.13-vvr-4.0.7.jar
```
  2. When you submit the job, add the following parameter to the command.
```
-D yarn.provided.lib.dirs=hdfs:///flink-current/opt/connectors/hologres/
```
- Method 2:
  1. Copy the Hologres connector to the path /opt/apps/FLINK/flink-current/opt/connectors/ververica-connector-hologres-1.13-vvr-4.0.7.jar on the job submission client. This path must match the path of the connector in the DataFlow cluster.
  2. When you submit the job, add the following parameter to the command.
```
-C file:///opt/apps/FLINK/flink-current/opt/connectors/ververica-connector-hologres-1.13-vvr-4.0.7.jar
```
- Method 3: Package the Hologres connector into your job's JAR package.

Use GeminiStateBackend

A DataFlow cluster provides the enterprise-grade GeminiStateBackend, which offers 3 to 5 times the performance of open source state backends. The DataFlow cluster uses GeminiStateBackend by default. For more information about advanced configurations for GeminiStateBackend, see Enterprise-grade state backend configurations.

Use an open source state backend

A DataFlow cluster uses the enterprise-grade GeminiStateBackend by default. If you want to use an open source state backend, such as rocksdb, for a specific job, you can specify it by using the -D flag. For example:

flink run-application -t yarn-application -D state.backend=rocksdb  /opt/apps/FLINK/flink-current/examples/streaming/TopSpeedWindowing.jar

Alternatively, to apply this change to all subsequent jobs, go to the EMR console and change the value of the state.backend parameter to the desired state backend (for example, rocksdb). Click Save, and then click Deploy Client Configuration.

View client logs

In an EMR cluster environment, the FLINK_LOG_DIR environment variable specifies where Flink client logs are stored. Its default value is /var/log/taihao-apps/flink (the default was /mnt/disk1/log/flink in versions earlier than 3.43.0). If you need to view complete client logs, such as SQL Client logs, you can find the corresponding files in this directory.

Job parameters not taking effect

When you run a Flink job from the command line, place the job parameters after the Flink job JAR package. For example: flink run -d -t yarn-per-job test.jar arg1 arg2.
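The ordering rule can be made explicit with a small sketch: anything before the JAR path is parsed as a Flink option, and anything after it is passed to the job. The function name `flink_run_cmd` is hypothetical, and it only prints the assembled command.

```shell
# Hypothetical helper: assemble a flink run command with the job arguments in
# the correct position. The first parameter is the JAR; the rest are job arguments.
flink_run_cmd() {
  jar_path="$1"
  shift
  # Flink options come first, then the JAR, then the arguments for the job itself.
  echo "flink run -d -t yarn-per-job $jar_path $*"
}

flink_run_cmd test.jar arg1 arg2
```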

Resolve "Multiple factories..." error

Cause

This error indicates that the classpath contains multiple implementations of a connector. This typically happens when a connector is bundled in the job's JAR package and the same connector is also placed manually in the $FLINK_HOME/lib directory, which causes a dependency conflict.
Solution

The solution is to remove the duplicate dependency. For detailed troubleshooting steps, see What do I do if a job JAR package conflicts with the cluster's Flink JAR package?
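One possible way to look for such duplicates is to compare the table factories that each JAR registers through Java SPI. The sketch below assumes that the connectors register their factories in META-INF/services/org.apache.flink.table.factories.Factory and that unzip is installed. Both function names are hypothetical.

```shell
# Hypothetical helper: list the table factory classes that a JAR registers
# through Java SPI. Prints "<factory class> <jar>" for each entry.
list_factories() {
  jar_path="$1"
  unzip -p "$jar_path" META-INF/services/org.apache.flink.table.factories.Factory 2>/dev/null \
    | grep -Ev '^[[:space:]]*(#|$)' \
    | while read -r factory; do echo "$factory $jar_path"; done
}

# Read "<factory class> <jar>" lines on stdin and print every factory class
# that is registered by more than one JAR.
duplicate_factories() {
  sort -u | awk '{ print $1 }' | sort | uniq -d
}

# Example (the job JAR path is a placeholder):
#   for jar in "$FLINK_HOME"/lib/*.jar /path/to/job.jar; do list_factories "$jar"; done | duplicate_factories
```

Any class that the example prints is registered in more than one JAR and is a candidate for removal from either the job JAR or the lib directory.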

Enable JobManager HA

A DataFlow cluster deploys and runs Flink jobs in YARN mode. You can enable high availability (HA) for the JobManager for more stable Flink job execution by following the community's Configuration guide. The following is an example configuration.

high-availability: zookeeper
high-availability.zookeeper.quorum: 192.168.**.**:2181,192.168.**.**:2181,192.168.**.**:2181
high-availability.zookeeper.path.root: /flink
high-availability.storageDir: hdfs:///flink/recovery

Important

After you enable high availability, the JobManager restarts at most once upon failure by default. If you want the JobManager to restart multiple times, you must also set YARN's yarn.resourcemanager.am.max-attempts parameter and Flink's yarn.application-attempts parameter. For more information, see the Apache Flink official documentation. Based on experience, you should also increase the value of the yarn.application-attempt-failures-validity-interval parameter from the default of 10,000 milliseconds (10 seconds) to a larger value, such as 300,000 milliseconds (5 minutes), to prevent the JobManager from restarting continuously.
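The two attempt settings interact: YARN treats yarn.resourcemanager.am.max-attempts as an upper bound, so a larger yarn.application-attempts value in Flink has no additional effect. The sketch below illustrates this with made-up numbers. The function name `effective_am_attempts` is hypothetical.

```shell
# Sketch: YARN caps the attempts requested by Flink (yarn.application-attempts)
# at the cluster-wide yarn.resourcemanager.am.max-attempts value.
effective_am_attempts() {
  flink_attempts="$1"
  yarn_max_attempts="$2"
  if [ "$flink_attempts" -lt "$yarn_max_attempts" ]; then
    echo "$flink_attempts"
  else
    echo "$yarn_max_attempts"
  fi
}

# Illustrative values only: Flink asks for 10 attempts, YARN allows 4.
effective_am_attempts 10 4   # prints 4
```

To allow more restarts, raise both parameters rather than only the Flink one.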

View Flink job metrics

In the EMR console, navigate to the Monitoring page of the target cluster and click Metric Monitoring.
From the Dashboard drop-down list, select FLINK.
Select the application ID and job ID for the job you want to view. The monitoring metrics for the job will then appear.
Note
- The application ID and job ID options are available only if Flink jobs are running in the cluster.
- Some metrics, such as sourceIdleTime, are generated only if the corresponding source and sink are configured.

Troubleshoot connector issues

For common questions about upstream and downstream storage, see Connectors.

Resolve errors with password-free OSS access

Handle the issue based on the specific error message:

Error message: java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS.
- Cause: A DataFlow cluster uses the built-in JindoSDK to support password-free access to OSS and APIs like StreamingFileSink. You do not need to perform additional configuration as described in the community documentation. Doing so may cause a dependency conflict that results in this error.
- Solution: On the job submission machine in your cluster, check the $FLINK_HOME/plugins directory for an oss-fs-hadoop directory. If it exists, delete the directory and resubmit the job.
Error message: Could not find a file system implementation for scheme 'oss'. The scheme is directly supported by Flink through the following plugin: flink-oss-fs-hadoop. .....
- Cause: In EMR clusters of version 3.40 and earlier, machines in the master server group other than master-1-1 may be missing Jindo-related JAR packages.
- Solution:
  - For EMR 3.40 and earlier: Check if Jindo-related JAR packages, such as jindo-flink-4.0.0-full.jar, exist in the $FLINK_HOME/lib directory on the job submission machine. If they are missing, run the following command to copy the required JAR packages to the $FLINK_HOME/lib directory and then resubmit the job.
```
cp /opt/apps/extra-jars/flink/jindo-flink-*-full.jar $FLINK_HOME/lib
```
  - For EMR versions later than 3.40:
    - For Flink on YARN mode: Newer versions have an optimized mechanism for OSS support. Jobs that read from and write to OSS can run normally even if Jindo-related JAR packages are not present in the $FLINK_HOME/lib directory.
    - For other deployment modes: Check if Jindo-related JAR packages, such as jindo-flink-4.0.0-full.jar, exist in the $FLINK_HOME/lib directory of the job submission machine. If they are missing, run the following command to copy them to the $FLINK_HOME/lib directory and then resubmit the job.
```
cp /opt/apps/extra-jars/flink/jindo-flink-*-full.jar $FLINK_HOME/lib
```
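Both checks described above can be combined into one script. The following is a minimal sketch. The function name `check_oss_setup` is hypothetical, and a "missing" finding is not a problem for Flink on YARN in EMR versions later than 3.40.

```shell
# Hypothetical check of a Flink installation for the two OSS problems above.
# Prints one line per finding and prints nothing if neither problem is detected.
check_oss_setup() {
  flink_home="$1"
  if [ -d "$flink_home/plugins/oss-fs-hadoop" ]; then
    echo "conflict: $flink_home/plugins/oss-fs-hadoop exists"
  fi
  # ls fails when the pattern matches no file, which means no Jindo JAR is present.
  if ! ls "$flink_home"/lib/jindo-flink-*-full.jar >/dev/null 2>&1; then
    echo "missing: no jindo-flink-*-full.jar in $flink_home/lib"
  fi
}

# Example call on the job submission machine:
#   check_oss_setup "$FLINK_HOME"
```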

Resolve "TaskManager heartbeat timed out" error

Cause

This error means that the JobManager did not receive a heartbeat from a TaskManager within the timeout period. Check the TaskManager logs for the specific error message to pinpoint the reason. Common underlying causes include insufficient TaskManager heap memory and an out of memory (OOM) error caused by a memory leak in the job code. For more information, see How do I resolve the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error?
Solution

If you encounter this error, increase the memory allocation or analyze the job's memory usage to further diagnose the issue.

Resolve "GC overhead limit exceeded" error

Cause

This error indicates that the JVM spends most of its time on garbage collection (GC) while reclaiming very little memory, because the memory allocated to the job is insufficient. Common causes include a memory leak in the code (such as in a UDF) and a memory configuration that is too small for the job's requirements.
Solution
Configure the JVM to save a heap dump when an OutOfMemoryError occurs. Use one of the following methods:
- For a single job: Before you rerun the job, pass the JVM parameters with the -D flag: -D env.java.opts="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump.hprof".
- For all jobs: Add the parameter env.java.opts: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump.hprof to the flink-conf.yaml file.
After the error occurs again, analyze the heap dump file at the path specified by HeapDumpPath with a tool such as MAT or jvisualvm to determine the root cause. The file is written on the machine where the failed JobManager or TaskManager process ran.

Zero "Records Received" for single-operator jobs

This is expected. The Flink Records Received metric counts records exchanged between different operators. When Flink optimizes a job into a single operator, no records cross an operator boundary, so this metric is always 0.

Enable flame graphs for Flink jobs

A flame graph visualizes the CPU consumption of various methods within a process, helping you identify and resolve performance bottlenecks. Flink has supported flame graphs since version 1.13, but the feature is disabled by default to avoid affecting production jobs. If you need to use flame graphs to analyze job performance, go to the Flink service Configure tab in the EMR console. In the flink-conf.yaml file, add a new configuration item with the parameter rest.flamegraph.enabled and set its value to true. For instructions on adding a configuration item, see Manage configuration items.

For more information about flame graphs, see Flame Graphs.

Resolve "NoSuchFieldError: DEPLOYMENT_MODE" error

Cause

Your job's JAR package directly or indirectly includes a flink-core dependency that is incompatible with the Flink version in the cluster, causing a dependency conflict.
Solution
Add the following configuration to your pom.xml file to set the scope of the flink-core dependency to provided. This resolves the issue.
```
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-core</artifactId>
  <version>1.16.1</version>
  <scope>provided</scope>
</dependency>
```
Note
Replace the version with the Flink version that your cluster runs.
To further locate the source of this dependency, see What do I do if a job JAR package conflicts with the cluster's Flink JAR package?
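To confirm whether a job JAR bundles flink-core, you can inspect its file listing. The sketch below reads a listing on stdin and looks for org.apache.flink.configuration.Configuration. Using this class as the marker for flink-core is an assumption made for illustration, and the function name `bundles_flink_core` is hypothetical.

```shell
# Hypothetical check: read a JAR file listing (for example, from `jar tf`) on
# stdin and succeed if it contains the marker class from flink-core.
bundles_flink_core() {
  grep -q '^org/apache/flink/configuration/Configuration\.class$'
}

# Example (the JAR path is a placeholder):
#   jar tf /path/to/job.jar | bundles_flink_core && echo "flink-core is bundled in the job JAR"
```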