This topic describes common issues and solutions for job execution errors in Realtime Compute for Apache Flink.
What should I do if data in a data link is not consumed after a job starts?
What should I do if a job restarts after running for a period of time?
Why is data stuck in the LocalGroupAggregate node for a long time with no output?
How can I quickly identify the cause of a JobManager failure?
Error: Can not retract a non-existent record. This should never happen.
Error: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051
How do I troubleshoot a job that fails to start?
Problem description
After you click the Start button, the job status changes from Starting to Failed.
Solution
On the Events tab, click the expand icon to view the details and identify the cause from the error message.
On the Logs tab, check the startup logs for exceptions and troubleshoot the issue based on the information provided.
If the JobManager starts successfully, you can view the detailed JobManager or TaskManager logs on the Logs tab.
Common errors
Error details: ERROR: exceeded quota: resourcequota
Cause: The resource queue has insufficient resources.
Solution: Scale the resources of the resource queue up or down, or reduce the resources required to start the job.

Error details: ERROR: the vswitch ip is not enough
Cause: The number of available IP addresses in the project is less than the number of TaskManagers (TMs) required by the job.
Solution: Reduce the parallelism, configure the slots properly, or change the vSwitch of the workspace.

Error details: ERROR: pooler: ***: authentication failed
Cause: The AccessKey provided in the code is invalid or lacks the required permissions.
Solution: Provide a valid AccessKey that has the required permissions.
What do I do if a database connection error pop-up appears on the right side of the page?
Details

Cause
A registered catalog is invalid and cannot be connected to.
Solution
On the Data Management page, view all catalogs. Delete the catalogs that are grayed out and then register them again.
What should I do if data in a data link is not consumed after a job starts?
Check network connectivity
If upstream and downstream components are not producing or consuming data, first check the Startup Logs page for error messages. If a timeout error is reported, troubleshoot the network connectivity for the corresponding component.
Check task execution status
On the Overview page, check whether the source is sending data and the sink is receiving data to identify where the issue occurs.

Perform a detailed data link check
Add a print sink table to each data link to troubleshoot the issue.
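For example, a print sink can be attached as follows. This is a minimal sketch: the table name `debug_print`, the columns, and the view `intermediate_view` are placeholders for your own schema and the stage you want to inspect.

```sql
-- Hypothetical debugging sink: prints every record it receives
-- to the TaskManager logs.
CREATE TEMPORARY TABLE debug_print (
  a INT,
  b INT
) WITH (
  'connector' = 'print'
);

-- Route the output of the stage under investigation to the print sink
-- to confirm whether data actually reaches this point of the data link.
INSERT INTO debug_print
SELECT a, b FROM intermediate_view;
```

If records appear in the logs, the problem lies downstream of the inspected stage; if not, move the print sink upstream and repeat.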
What should I do if a job restarts after running for a period of time?
You can troubleshoot the issue on the Logs tab of the job:
Check for exceptions.
On the Exceptions tab, view the thrown exceptions and troubleshoot the issue based on the information provided.
View the JobManager and TaskManager logs.

View the logs of failed TaskManagers.
Some exceptions can cause a TaskManager to fail. The logs of a newly scheduled TaskManager may be incomplete. You can view the logs of the previously failed TaskManager for troubleshooting.

View the logs of historical job runs.
Select the logs from a historical run of the current job to find the cause of the failure.

Why is data stuck in the LocalGroupAggregate node for a long time with no output?
Job code
CREATE TEMPORARY TABLE s1 (
  a INT,
  b INT,
  ts AS PROCTIME(),
  PRIMARY KEY (a) NOT ENFORCED
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1',
  'fields.b.kind' = 'random',
  'fields.b.min' = '0',
  'fields.b.max' = '10'
);

CREATE TEMPORARY TABLE sink (
  a BIGINT,
  b BIGINT
) WITH (
  'connector' = 'print'
);

CREATE TEMPORARY VIEW window_view AS
SELECT window_start, window_end, a, SUM(b) AS b_sum
FROM TABLE(TUMBLE(TABLE s1, DESCRIPTOR(ts), INTERVAL '2' SECONDS))
GROUP BY window_start, window_end, a;

INSERT INTO sink
SELECT COUNT(DISTINCT a), b_sum
FROM window_view
GROUP BY b_sum;

Problem description
Data is stuck in the LocalGroupAggregate node for a long time with no output, and a MiniBatchAssigner node is not present.

Cause
If a job contains both a WindowAggregate and a GroupAggregate, and the time column of the WindowAggregate is processing time (proctime), the MiniBatch processing mode uses managed memory to cache data when the table.exec.mini-batch.size parameter is not configured or is set to a negative value. In this case, no MiniBatchAssigner node is generated. As a result, the compute node cannot receive the watermark messages that a MiniBatchAssigner node would send to trigger calculation and output. Calculation and output are triggered only when one of three conditions is met: the managed memory is full, a checkpoint is about to occur, or the job stops. For more information, see table.exec.mini-batch.size. If the checkpoint interval is too long, data accumulates in the LocalGroupAggregate node and is not output for an extended period.
Solutions
Decrease the checkpoint interval. This allows the LocalGroupAggregate node to automatically trigger an output before a checkpoint occurs. For more information about how to set the checkpoint interval, see Tuning Checkpointing.
Use heap memory to cache data. This allows the LocalGroupAggregate node to automatically trigger an output when the number of cached data records reaches N. To do this, set the table.exec.mini-batch.size parameter to a positive value N. For more information about parameter configuration, see How do I configure custom job running parameters?
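For example, you can add the following line in the Additional Configurations section. The value 100000 is only an illustrative choice for N; tune it based on your record size and heap budget:

```yaml
# Cache at most 100000 records in heap memory before the
# LocalGroupAggregate node flushes its output (illustrative value).
table.exec.mini-batch.size: 100000
```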
What should I do if an upstream connector partition receives no data, causing the watermark to stall and window output to be delayed?
For example, consider a Kafka source with five partitions. Two new records arrive every minute, but not every partition receives data in real time. If a source does not receive any elements within the timeout period, it is marked as temporarily idle. As a result, the watermark cannot advance, the window cannot close promptly, and the result is not output in real time.
In this case, you can set an idle timeout for the source. A partition that produces no data within the timeout is marked idle and excluded from the watermark calculation. When data arrives again, the partition is included in the calculation again. For more information, see Configuration.
In the Additional Configurations section, add the following code. For more information, see How do I configure custom job running parameters?.
table.exec.source.idle-timeout: 1s

How can I quickly identify the cause of a JobManager failure?
If the JobManager is not running, you cannot access the Flink UI page. In this case, you can identify the cause of the failure by performing the following steps:
On the page, click the target job name.
Click the Events tab.
Use a keyboard shortcut to search for "error" and retrieve the exception information.
Windows: Ctrl+F
macOS: Command+F

INFO: org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss
Error details

Cause
The job stores data in an OSS bucket. Before OSS creates a new folder, it checks whether the folder already exists. If the folder does not exist, this INFO message is reported. The message does not affect the Flink job.
Solution
Add the following line to the log template:
<Logger level="ERROR" name="org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss"/>
For more information, see Configure job log output.
Error: akka.pattern.AskTimeoutException
Causes
Continuous garbage collection (GC) occurs because of insufficient memory in the JobManager or TaskManager. This causes heartbeats and Remote Procedure Call (RPC) requests between the JobManager and TaskManagers to time out.
The job is large, which means the volume of RPC requests is high. However, the JobManager resources are insufficient, causing a backlog of RPC requests. This leads to timeouts for heartbeats and RPC requests between the JobManager and TaskManagers.
The job timeout parameter is set to a small value. When a connection to a third-party product fails, the system retries the connection multiple times. This prevents the connection failure error from being thrown before the timeout is reached.
Solutions
If the error is caused by continuous GC, check the time consumed and frequency of GC based on the job's memory usage and GC logs. If high-frequency GC or long GC times are found, increase the JobManager and TaskManager memory.
If the error is caused by a large job, increase the CPU and memory resources of the JobManager. You can also increase the values of the akka.ask.timeout and heartbeat.timeout parameters.
Important: We recommend that you adjust these two parameters only for large-scale jobs. For small-scale jobs, this error is usually not caused by small parameter values. Set these parameters as needed. If the values are too large, the job recovery time increases when a TaskManager exits unexpectedly.
If the timeout is caused by a failed connection to a third-party product, first increase the values of the following four parameters to allow the third-party error to be thrown. Then, you can resolve the third-party error.
client.timeout: Default: 60 s. Recommended: 600 s.
akka.ask.timeout: Default: 10 s. Recommended: 600 s.
client.heartbeat.timeout: Default: 180000 ms. Recommended: 600000 ms. Do not include the unit when you enter the value. Otherwise, a startup error may occur. For example, enter client.heartbeat.timeout: 600000.
heartbeat.timeout: Default: 50000 ms. Recommended: 600000 ms. Do not include the unit when you enter the value. Otherwise, a startup error may occur. For example, enter heartbeat.timeout: 600000.
For example, the error Caused by: java.sql.SQLTransientConnectionException: connection-pool-xxx.mysql.rds.aliyuncs.com:3306 - Connection is not available, request timed out after 30000ms indicates that the MySQL connection pool is full. In this case, increase the value of the connection.pool.size parameter in the MySQL WITH options. The default value is 20.
Note: You can determine the minimum values for the preceding parameters based on the timeout error message. For example, in pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#1064915964]] after [60000 ms]., 60000 ms is the current value of client.timeout.
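Putting the recommended values above together, the Additional Configurations entries might look like the following. These are the recommendations from this section, not universal defaults; note that the two heartbeat parameters are entered in milliseconds without a unit:

```yaml
# Recommended timeouts so that a third-party connection failure is
# surfaced instead of a generic AskTimeoutException.
client.timeout: 600 s
akka.ask.timeout: 600 s
# Heartbeat timeouts are entered in milliseconds, without a unit.
client.heartbeat.timeout: 600000
heartbeat.timeout: 600000
```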
Error: Task did not exit gracefully within 180 + seconds.
Error details
Task did not exit gracefully within 180 + seconds.
2022-04-22T17:32:25.852861506+08:00 stdout F org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
2022-04-22T17:32:25.852865065+08:00 stdout F at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1709) [flink-dist_2.11-1.12-vvr-3.0.4-SNAPSHOT.jar:1.12-vvr-3.0.4-SNAPSHOT]
2022-04-22T17:32:25.852867996+08:00 stdout F at java.lang.Thread.run(Thread.java:834) [?:1.8.0_102]
log_level:ERROR

Cause
This error is not the root cause of the job exception. The task.cancellation.timeout parameter, which specifies the timeout for a task to exit, defaults to 180 seconds. During a job failover or exit, a task's exit may be blocked for some reason. When the blocking time reaches the timeout, Flink determines that the task is stuck and cannot recover. Flink then actively stops the TaskManager where the task is located so that the failover or exit process can continue. This is why the error appears in the logs. The actual cause may be a problem in your user-defined function (UDF) implementation, such as a long-running block in the close method or a calculation method that does not return for a long time.
Solution
Set the task.cancellation.timeout parameter to 0. For more information about how to configure this parameter, see How do I configure custom job running parameters? When the value is 0, a blocked task exit does not time out, and the task waits indefinitely to complete its exit. After you restart the job, if the job is again blocked for a long time during failover or exit, find the task that is in the Cancelling state, check its stack to investigate the root cause, and then resolve the issue accordingly.
Important: The task.cancellation.timeout parameter is intended for job debugging. Do not set it to 0 for production jobs.
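For debugging only, the setting can be added in the Additional Configurations section like this:

```yaml
# Debugging only: disable the task-exit timeout so a blocked task
# stays visible in the Cancelling state. Do not use in production.
task.cancellation.timeout: 0
```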
Error: Can not retract a non-existent record. This should never happen.
Error details
java.lang.RuntimeException: Can not retract a non-existent record. This should never happen.
    at org.apache.flink.table.runtime.operators.rank.RetractableTopNFunction.processElement(RetractableTopNFunction.java:196)
    at org.apache.flink.table.runtime.operators.rank.RetractableTopNFunction.processElement(RetractableTopNFunction.java:55)
    at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:135)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:106)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:424)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:685)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:640)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:651)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:624)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:799)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
    at java.lang.Thread.run(Thread.java:877)

Causes and solutions
Scenario 1
Cause: The now() function is used in the code. TopN does not support non-deterministic fields in the ORDER BY or PARTITION BY clause, and now() returns a different value each time, which prevents the retraction from finding the previous value.
Solution: Use fields from the source table that produce only deterministic values in the ORDER BY and PARTITION BY clauses.

Scenario 2
Cause: The value of the table.exec.state.ttl parameter is too small. The state is cleared after it expires, and the corresponding key state cannot be found during retraction.
Solution: Increase the value of the table.exec.state.ttl parameter. For more information about how to configure this parameter, see How do I configure custom job running parameters?
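For example, the TTL can be raised in the Additional Configurations section. The 36-hour value is illustrative; choose it based on how late retractions can arrive in your data:

```yaml
# Keep aggregate state for 36 hours (illustrative value) so that
# a retraction can still find the record it refers to.
table.exec.state.ttl: 36 h
```

Note that a larger TTL increases state size, so balance it against your state backend capacity.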
Error: The GRPC call timed out in sqlserver
Error details
org.apache.flink.table.sqlserver.utils.ExecutionTimeoutException: The GRPC call timed out in sqlserver, please check the thread stacktrace for root cause: Thread name: sqlserver-operation-pool-thread-4, thread state: TIMED_WAITING, thread stacktrace: at java.lang.Thread.sleep0(Native Method) at java.lang.Thread.sleep(Thread.java:360) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:130) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:107) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy195.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683) at org.apache.flink.connectors.hive.HiveSourceFileEnumerator.getNumFiles(HiveSourceFileEnumerator.java:118) at org.apache.flink.connectors.hive.HiveTableSource.lambda$getDataStream$0(HiveTableSource.java:209) at org.apache.flink.connectors.hive.HiveTableSource$$Lambda$972/1139330351.get(Unknown Source) at org.apache.flink.connectors.hive.HiveParallelismInference.logRunningTime(HiveParallelismInference.java:118) at org.apache.flink.connectors.hive.HiveParallelismInference.infer(HiveParallelismInference.java:100) at org.apache.flink.connectors.hive.HiveTableSource.getDataStream(HiveTableSource.java:207) at org.apache.flink.connectors.hive.HiveTableSource$1.produceDataStream(HiveTableSource.java:123) at 
org.apache.flink.table.planner.plan.nodes.exec.common.CommonExecTableSourceScan.translateToPlanInternal(CommonExecTableSourceScan.java:127) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226) at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241) at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecExchange.translateToPlanInternal(StreamExecExchange.java:87) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226) at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source) at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241) at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecGroupAggregate.translateToPlanInternal(StreamExecGroupAggregate.java:148) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226) at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) 
at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241) at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecSink.translateToPlanInternal(StreamExecSink.java:108) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226) at org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$1.apply(StreamPlanner.scala:74) at org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$1.apply(StreamPlanner.scala:73) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.flink.table.planner.delegation.StreamPlanner.translateToPlan(StreamPlanner.scala:73) at org.apache.flink.table.planner.delegation.StreamExecutor.createStreamGraph(StreamExecutor.java:52) at org.apache.flink.table.planner.delegation.PlannerBase.createStreamGraph(PlannerBase.scala:610) at org.apache.flink.table.planner.delegation.StreamPlanner.explainExecNodeGraphInternal(StreamPlanner.scala:166) at org.apache.flink.table.planner.delegation.StreamPlanner.explainExecNodeGraph(StreamPlanner.scala:159) at org.apache.flink.table.sqlserver.execution.OperationExecutorImpl.validate(OperationExecutorImpl.java:304) at org.apache.flink.table.sqlserver.execution.OperationExecutorImpl.validate(OperationExecutorImpl.java:288) at 
org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.lambda$validate$22(DelegateOperationExecutor.java:211) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor$$Lambda$394/1626790418.run(Unknown Source) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapClassLoader(DelegateOperationExecutor.java:250) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.lambda$wrapExecutor$26(DelegateOperationExecutor.java:275) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor$$Lambda$395/1157752141.run(Unknown Source) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) at java.lang.Thread.run(Thread.java:834) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapExecutor(DelegateOperationExecutor.java:281) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.validate(DelegateOperationExecutor.java:211) at org.apache.flink.table.sqlserver.FlinkSqlServiceImpl.validate(FlinkSqlServiceImpl.java:786) at org.apache.flink.table.sqlserver.proto.FlinkSqlServiceGrpc$MethodHandlers.invoke(FlinkSqlServiceGrpc.java:2522) at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172) at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:820) at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) at java.lang.Thread.run(Thread.java:834) Caused by: java.util.concurrent.TimeoutException at java.util.concurrent.FutureTask.get(FutureTask.java:205) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapExecutor(DelegateOperationExecutor.java:277) ... 11 more

Cause
An overly complex SQL statement can cause a timeout exception.
Solution
In the Additional Configurations section, add the following code to increase the default 120-second timeout limit. For more information, see How do I configure custom job running parameters?.
flink.sqlserver.rpc.execution.timeout: 600s
Error: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051
Error details
Caused by: io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:244)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:225)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
    at org.apache.flink.table.sqlserver.proto.FlinkSqlServiceGrpc$FlinkSqlServiceBlockingStub.generateJobGraph(FlinkSqlServiceGrpc.java:2478)
    at org.apache.flink.table.sqlserver.api.client.FlinkSqlServerProtoClientImpl.generateJobGraph(FlinkSqlServerProtoClientImpl.java:456)
    at org.apache.flink.table.sqlserver.api.client.ErrorHandlingProtoClient.lambda$generateJobGraph$25(ErrorHandlingProtoClient.java:251)
    at org.apache.flink.table.sqlserver.api.client.ErrorHandlingProtoClient.invokeRequest(ErrorHandlingProtoClient.java:335)
    ... 6 more

Cause
The job logic is very complex, which makes the generated JobGraph too large. This can cause a validation error or cause the job to get stuck during startup or shutdown.
Solution
In the Additional Configurations section, add the following code. For more information, see How do I configure custom job running parameters?.
table.exec.operator-name.max-length: 1000
Error: Caused by: java.lang.NoSuchMethodError
Error details
Caused by: java.lang.NoSuchMethodError: org.apache.flink.table.planner.plan.metadata.FlinkRelMetadataQuery.getUpsertKeysInKeyGroupRange(Lorg/apache/calcite/rel/RelNode;[I)Ljava/util/Set;

Cause
The job depends on an internal API of the open source community edition. Because this API has been optimized in the Alibaba Cloud edition, exceptions such as package conflicts may occur.
Solution
In the Flink source code, only methods explicitly annotated with @Public or @PublicEvolving are public APIs that you can call. Alibaba Cloud guarantees compatibility only for these methods.
Error: java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory
Error details
Caused by: java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory
    at org.codehaus.commons.compiler.CompilerFactoryFactory.getCompilerFactory(CompilerFactoryFactory.java:129)
    at org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory(CompilerFactoryFactory.java:79)
    at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile(JaninoRelMetadataProvider.java:426)
    ... 66 more

Causes
The JAR package includes a conflicting janino dependency.
The user-defined function (UDF) JAR or connector JAR includes JARs whose names start with flink-, such as flink-table-planner and flink-table-runtime.
Solutions
Analyze whether the JAR package contains org.codehaus.janino.CompilerFactory. Class conflicts can occur because the class loading order differs on different machines. To resolve this issue, perform the following steps:
On the page, click the name of the target job.
On the Deployment Details tab, click Edit to the right of the Running Parameter Configuration section.
In the Additional Configurations text box, enter the following parameter and click Save:
classloader.parent-first-patterns.additional: org.codehaus.janino
Replace the value of the parameter with the conflicting class.
Set the scope of Flink-related dependencies to provided by adding <scope>provided</scope>. This applies mainly to non-connector dependencies in the org.apache.flink group whose names start with flink-.
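For example, a provided-scope declaration in a Maven pom.xml might look like the following. The artifact ID and version shown are illustrative; use the ones that match your job:

```xml
<!-- Illustrative example: the planner is supplied by the Flink runtime,
     so it must not be bundled into the job JAR. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner_2.12</artifactId>
  <version>1.13.6</version>
  <scope>provided</scope>
</dependency>
```

With provided scope, the dependency is available at compile time but excluded from the packaged fat JAR, which avoids class conflicts with the classes already shipped in the cluster.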