This topic provides answers to some frequently asked questions about job running errors.
What do I do if data in tasks of a job is not consumed after the job is run?
What do I do if the error message "akka.pattern.AskTimeoutException" appears?
What do I do if the error message "Task did not exit gracefully within 180 + seconds." appears?
What do I do if the error message "The GRPC call timed out in sqlserver" appears?
What do I do if the error message "Caused by: java.lang.NoSuchMethodError" appears?
What do I do if a job cannot be started?
Problem description
After I click Start in the Actions column, the status of the job changes from STARTING to FAILED after a period of time.
Solution
On the Events tab of the job details page, click the icon on the left side of the time at which the job failed to start, and identify the issue based on the error message. On the Startup Logs subtab of the Logs tab, check whether errors exist and identify the issue based on the error message.
If the JobManager is started as expected, you can view the detailed logs of the JobManager or TaskManager on the Job Manager or Running Task Managers subtab of the Logs tab.
Common errors and troubleshooting
Problem description: ERROR: exceeded quota: resourcequota
Cause: The resources in the current queue are insufficient.
Solution: Reconfigure resources in the current queue or reduce the resources used for job startup.

Problem description: ERROR: the vswitch ip is not enough
Cause: The number of available IP addresses in the current namespace is less than the number of TaskManagers that are created when the job starts.
Solution: Reduce the number of jobs that run in parallel, reconfigure the slots that can be allocated to a job, or change the vSwitch of the workspace.

Problem description: ERROR: pooler: ***: authentication failed
Cause: The AccessKey pair of the account specified in the code is invalid, or the account is not authorized to start the job.
Solution: Enter an AccessKey pair that is valid and has the required permissions.
What do I do if the error message indicating a database connection error appears on the right side of the development console of Realtime Compute for Apache Flink?
Problem description

Cause
The registered catalog is invalid and cannot be connected correctly.
Solution
View all catalogs on the Catalogs page, delete the catalogs that are dimmed, and then register the related catalogs again.
What do I do if data in tasks of a job is not consumed after the job is run?
Network connectivity troubleshooting
If no data is generated or consumed in the upstream and downstream storage systems, check whether an error message is reported on the Startup Logs tab. If a timeout error is reported, troubleshoot the network connection between Realtime Compute for Apache Flink and the storage systems.
Task execution status troubleshooting
On the Configuration tab, check whether data is read from the source and written to the sink to identify the location of the error.

Tasks troubleshooting
Add a print sink table to each task of the job for troubleshooting.
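For example, a print sink can be attached to the output of a suspect task. The following is a minimal sketch, assuming a hypothetical source table s1 with columns a and b:

-- Hypothetical print table used only for troubleshooting.
CREATE TEMPORARY TABLE debug_print (
  a INT,
  b INT
) WITH (
  'connector' = 'print'
);

-- Route the intermediate result of the suspect task to the print sink.
-- The printed rows appear in the TaskManager logs, which shows whether
-- the task actually produces data.
INSERT INTO debug_print
SELECT a, b FROM s1;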
What do I do if a job is restarted after the job is run?
Troubleshoot the error on the Logs tab of the desired job.
View the exception information.
On the JM Exceptions tab, view the reported error and troubleshoot the error.
View the logs of the JobManager and TaskManagers of the job.

View the logs of failed TaskManagers of the job.
Some exceptions may cause a TaskManager to fail. As a result, the logs of the currently scheduled TaskManagers are incomplete. In this case, view the logs of the most recently failed TaskManager for troubleshooting.

View the operational logs of a historical job instance.
View the operational logs of a historical job instance to check the reason for job failure.

Data output is suspended on the LocalGroupAggregate operator for a long period of time. No data output is generated. Why?
Code
CREATE TEMPORARY TABLE s1 (
  a INT,
  b INT,
  ts AS PROCTIME(),
  PRIMARY KEY (a) NOT ENFORCED
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1',
  'fields.b.kind' = 'random',
  'fields.b.min' = '0',
  'fields.b.max' = '10'
);

CREATE TEMPORARY TABLE sink (
  a BIGINT,
  b BIGINT
) WITH (
  'connector' = 'print'
);

CREATE TEMPORARY VIEW window_view AS
SELECT window_start, window_end, a, SUM(b) AS b_sum
FROM TABLE(TUMBLE(TABLE s1, DESCRIPTOR(ts), INTERVAL '2' SECONDS))
GROUP BY window_start, window_end, a;

INSERT INTO sink
SELECT COUNT(DISTINCT a), b_sum
FROM window_view
GROUP BY b_sum;

Problem description
Data output is suspended on the LocalGroupAggregate operator for a long period of time and the MiniBatchAssigner operator is not contained in the topology of the job.

Cause
The job includes both WindowAggregate and GroupAggregate operators, and the time column of the WindowAggregate operator is proctime, which indicates the processing time. In this case, the MiniBatchAssigner operator fails to be generated and cannot send the watermark messages that trigger final calculation and data output on the compute operators.
If the table.exec.mini-batch.size parameter is not configured for the job or is set to a negative value, the managed memory is used to cache data in miniBatch processing mode. Final calculation and data output are then triggered only when one of the following conditions is met: the managed memory is full, a CHECKPOINT command is received and checkpointing has not started, or the job is canceled. For more information, see table.exec.mini-batch.size.
If the checkpoint interval is set to an excessively large value, the LocalGroupAggregate operator does not trigger data output for a long period of time.
Solutions
Decrease the checkpoint interval. This way, the LocalGroupAggregate operator can automatically trigger data output before checkpointing is performed. For more information about the setting of the checkpoint interval, see Tuning Checkpointing.
Use the heap memory to cache data by setting the table.exec.mini-batch.size parameter to a positive value N. This way, data output is automatically triggered when the amount of data cached on the LocalGroupAggregate operator reaches N. For more information, see How do I configure custom runtime parameters for a job?
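For reference, both solutions map to entries in the Other Configuration field. The following values are illustrative starting points rather than recommendations; the first line corresponds to the first solution and the second line to the second:

execution.checkpointing.interval: 60s    # solution 1: shorter checkpoint interval
table.exec.mini-batch.size: 20000        # solution 2: cache rows in heap memory and flush at this count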
No data is entered in a partition of the upstream Kafka connector. As a result, the watermark cannot move forward and the window output is delayed. What do I do?
For example, five partitions exist in the upstream Kafka connector and two new data entries are entered into Kafka every minute. However, some of the partitions do not receive data entries in real time. If a partition does not receive any elements within the timeout period, the partition is marked as temporarily idle. As a result, the watermark cannot move forward, the window cannot end at the earliest opportunity, and the result cannot be generated in real time.
In this case, configure an idle timeout so that a partition that receives no data within the timeout period is temporarily excluded from the calculation of the watermark. When data arrives in the partition again, the partition is included in the calculation of the watermark again. For more information, see Configuration.
Add the following code to the Other Configuration field in the Parameters section of the Configuration tab. For more information, see How do I configure custom runtime parameters for a job?
table.exec.source.idle-timeout: 1s

How do I locate the error if the JobManager is not running?
The Flink UI page does not appear because the JobManager is not running as expected. To identify the cause of the error, perform the following steps:
In the development console of Realtime Compute for Apache Flink, go to the Deployments page, find the desired job, and click the name of the job.
Click the Events tab.
To search for errors and obtain error information, use the shortcut keys of the operating system.
Windows: Ctrl+F
macOS: Command+F

What do I do if the error message "INFO: org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss" appears?
Problem description

Cause
Data is stored in an OSS bucket. When OSS creates a directory, OSS checks whether the directory exists. If the directory does not exist, the error message "INFO: org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss" appears. Realtime Compute for Apache Flink jobs are not affected.
Solution
Add <Logger level="ERROR" name="org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss"/> to the log template. For more information, see Configure parameters to export logs of a deployment.
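A sketch of where this line sits in a Log4j 2 log template; everything except the added Logger element is illustrative and should match your existing template:

<Loggers>
  <!-- Added line: raise the level for the OSS client package so the
       informational directory-check messages are no longer logged. -->
  <Logger level="ERROR" name="org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss"/>
  <!-- Keep the Root logger and appender references from your own template;
       the appender name below is a placeholder. -->
  <Root level="INFO">
    <AppenderRef ref="StdOut"/>
  </Root>
</Loggers>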
What do I do if the error message "akka.pattern.AskTimeoutException" appears?
Causes
Garbage collection (GC) operations are frequently performed due to insufficient JobManager or TaskManager memory. As a result, the heartbeat and Remote Procedure Call (RPC) requests between the JobManager and a TaskManager time out.
A large number of RPC requests exist in the job but the JobManager resources are insufficient. As a result, an RPC request backlog occurs. This causes the heartbeat and RPC requests between the JobManager and a TaskManager to time out.
The timeout period for the job is set to an excessively small value. If Realtime Compute for Apache Flink fails to access a third-party service, it retries the access multiple times. As a result, the timeout period elapses before the connection failure is reported as an error.
Solutions
If the issue is caused by GC operations, we recommend that you check the GC frequency and the time that is consumed by GC operations based on the job memory and GC logs. If the GC frequency is high or GC operations consume a long period of time, you must increase the JobManager memory and the TaskManager memory.
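If memory is adjusted through Flink configuration entries rather than the deployment's resource settings, a sketch with the standard Flink memory options follows; the sizes are illustrative and should be derived from your GC logs:

jobmanager.memory.process.size: 4g     # illustrative value; size from GC behavior
taskmanager.memory.process.size: 8g    # illustrative value; size from GC behavior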
If the issue is caused by a large number of RPC requests in the job, we recommend that you increase the number of CPU cores and the memory size of the JobManager, and set the akka.ask.timeout and heartbeat.timeout parameters to larger values.
Important: We recommend that you adjust the values of the akka.ask.timeout and heartbeat.timeout parameters only when a large number of RPC requests exist in a job. For a job in which only a small number of RPC requests exist, small values of these parameters do not cause the issue.
We recommend that you configure the parameters based on your business requirements. If you set the parameters to excessively large values, the time that is required to resume the job increases if a TaskManager unexpectedly exits.
If the issue is caused by a connection failure of a third-party service, increase the values of the following parameters so that the connection failure can be reported as an error, and then resolve the issue based on the error message.
client.timeout: Default value: 60. Recommended value: 600. Unit: seconds.
akka.ask.timeout: Default value: 10. Recommended value: 600. Unit: seconds.
client.heartbeat.timeout: Default value: 180000. Recommended value: 600000. Unit: milliseconds.
heartbeat.timeout: Default value: 50000. Recommended value: 600000. Unit: milliseconds.
Note: To prevent errors, do not include the unit in the parameter value.
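For example, with the recommended values, the entries in the Other Configuration field look as follows (units are omitted from the values, as the note above requires):

client.timeout: 600
akka.ask.timeout: 600
client.heartbeat.timeout: 600000
heartbeat.timeout: 600000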
For example, if the error message "Caused by: java.sql.SQLTransientConnectionException: connection-pool-xxx.mysql.rds.aliyuncs.com:3306 - Connection is not available, request timed out after 30000ms" appears, the MySQL connection pool is full. In this case, increase the value of the connection.pool.size parameter that is described in the WITH parameters of MySQL. The default value is 20.
Note: You can determine the minimum values of the preceding parameters based on the timeout error message. The value in the error message indicates the baseline from which to increase the parameter values. For example, "60000 ms" in the error message "pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#1064915964]] after [60000 ms]." corresponds to the value of the client.timeout parameter.
What do I do if the error message "Task did not exit gracefully within 180 + seconds." appears?
Problem description
Task did not exit gracefully within 180 + seconds.
2022-04-22T17:32:25.852861506+08:00 stdout F org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
2022-04-22T17:32:25.852865065+08:00 stdout F at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1709) [flink-dist_2.11-1.12-vvr-3.0.4-SNAPSHOT.jar:1.12-vvr-3.0.4-SNAPSHOT]
2022-04-22T17:32:25.852867996+08:00 stdout F at java.lang.Thread.run(Thread.java:834) [?:1.8.0_102]
log_level:ERROR

Cause
The error message does not show the root cause of the job exception. The timeout period for a task to exit is specified by the task.cancellation.timeout parameter, whose default value is 180 seconds. When a job performs a failover or a task in the job attempts to exit, the task may be blocked and unable to exit for various reasons. If the time during which the task is blocked reaches the timeout period, Realtime Compute for Apache Flink determines that the task is stuck and cannot be resumed. Realtime Compute for Apache Flink then automatically stops the TaskManager to which the task belongs so that the job can complete the failover or the task can exit. As a result, the error message appears in the TaskManager logs.
The timeout issue may be caused by the user-defined functions (UDFs) that are used in your job. For example, the close method of a UDF may block for a long period of time or never return, so the task cannot exit.
Solution
Set the task.cancellation.timeout parameter to 0. For more information about how to configure this parameter, see How do I configure custom runtime parameters for a job? If you set this parameter to 0, a blocked task waits indefinitely to exit and no timeout occurs. If the job performs a failover again after you restart it, or a task is blocked for a long period of time when it attempts to exit, find the task that is in the CANCELLING state, check the stack of the task, identify the root cause of the issue, and then resolve it based on the root cause.
Important: The task.cancellation.timeout parameter is intended for job debugging. We recommend that you do not set this parameter to 0 for a job in a production environment.
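For reference, the debugging setting is a single line in the Other Configuration field; remove it again before running the job in production:

task.cancellation.timeout: 0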
What do I do if the error message "Can not retract a non-existent record. This should never happen." appears?
Problem description
java.lang.RuntimeException: Can not retract a non-existent record. This should never happen.
  at org.apache.flink.table.runtime.operators.rank.RetractableTopNFunction.processElement(RetractableTopNFunction.java:196)
  at org.apache.flink.table.runtime.operators.rank.RetractableTopNFunction.processElement(RetractableTopNFunction.java:55)
  at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
  at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
  at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:135)
  at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:106)
  at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:424)
  at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:685)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:640)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:651)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:624)
  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:799)
  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
  at java.lang.Thread.run(Thread.java:877)

Causes and solutions
Scenario 1
Cause: The issue is caused by the now() function in the code. The TopN algorithm does not allow a non-deterministic field in the ORDER BY or PARTITION BY clause. If such a field is used, the values returned by the now() function differ between the original record and its retraction message. As a result, the previous value cannot be found when the record is retracted.
Solution: Use a deterministic field in the ORDER BY or PARTITION BY clause, as shown in the example that follows.

Scenario 2
Cause: The table.exec.state.ttl parameter is set to an excessively small value, and the state data is deleted when it expires. As a result, the key state data cannot be found for the retraction message.
Solution: Increase the value of the table.exec.state.ttl parameter. For more information about how to configure this parameter, see How do I configure custom runtime parameters for a job?
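The following sketch illustrates the fix for Scenario 1, assuming a hypothetical table src with columns item, price, and event_time, where event_time is stored in the record rather than computed at query time:

-- Non-deterministic (problematic): values of now() differ between the
-- original record and its retraction, so the retraction cannot match.
--   ROW_NUMBER() OVER (PARTITION BY item ORDER BY now() DESC)

-- Deterministic alternative: order by a column stored in the record.
SELECT item, price
FROM (
  SELECT item, price,
         ROW_NUMBER() OVER (PARTITION BY item ORDER BY event_time DESC) AS rn
  FROM src
) WHERE rn = 1;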
What do I do if the error message "The GRPC call timed out in sqlserver" appears?
Problem description
org.apache.flink.table.sqlserver.utils.ExecutionTimeoutException: The GRPC call timed out in sqlserver, please check the thread stacktrace for root cause: Thread name: sqlserver-operation-pool-thread-4, thread state: TIMED_WAITING, thread stacktrace: at java.lang.Thread.sleep0(Native Method) at java.lang.Thread.sleep(Thread.java:360) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:130) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:107) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy195.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683) at org.apache.flink.connectors.hive.HiveSourceFileEnumerator.getNumFiles(HiveSourceFileEnumerator.java:118) at org.apache.flink.connectors.hive.HiveTableSource.lambda$getDataStream$0(HiveTableSource.java:209) at org.apache.flink.connectors.hive.HiveTableSource$$Lambda$972/1139330351.get(Unknown Source) at org.apache.flink.connectors.hive.HiveParallelismInference.logRunningTime(HiveParallelismInference.java:118) at org.apache.flink.connectors.hive.HiveParallelismInference.infer(HiveParallelismInference.java:100) at org.apache.flink.connectors.hive.HiveTableSource.getDataStream(HiveTableSource.java:207) at org.apache.flink.connectors.hive.HiveTableSource$1.produceDataStream(HiveTableSource.java:123) at org.apache.flink.table.planner.plan.nodes.exec.common.CommonExecTableSourceScan.translateToPlanInternal(CommonExecTableSourceScan.java:127) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226) at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241) at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecExchange.translateToPlanInternal(StreamExecExchange.java:87) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226) at 
org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241) at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecGroupAggregate.translateToPlanInternal(StreamExecGroupAggregate.java:148) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226) at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241) at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecSink.translateToPlanInternal(StreamExecSink.java:108) at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226) at org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$1.apply(StreamPlanner.scala:74) at org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$1.apply(StreamPlanner.scala:73) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
org.apache.flink.table.planner.delegation.StreamPlanner.translateToPlan(StreamPlanner.scala:73) at org.apache.flink.table.planner.delegation.StreamExecutor.createStreamGraph(StreamExecutor.java:52) at org.apache.flink.table.planner.delegation.PlannerBase.createStreamGraph(PlannerBase.scala:610) at org.apache.flink.table.planner.delegation.StreamPlanner.explainExecNodeGraphInternal(StreamPlanner.scala:166) at org.apache.flink.table.planner.delegation.StreamPlanner.explainExecNodeGraph(StreamPlanner.scala:159) at org.apache.flink.table.sqlserver.execution.OperationExecutorImpl.validate(OperationExecutorImpl.java:304) at org.apache.flink.table.sqlserver.execution.OperationExecutorImpl.validate(OperationExecutorImpl.java:288) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.lambda$validate$22(DelegateOperationExecutor.java:211) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor$$Lambda$394/1626790418.run(Unknown Source) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapClassLoader(DelegateOperationExecutor.java:250) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.lambda$wrapExecutor$26(DelegateOperationExecutor.java:275) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor$$Lambda$395/1157752141.run(Unknown Source) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) at java.lang.Thread.run(Thread.java:834) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapExecutor(DelegateOperationExecutor.java:281) at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.validate(DelegateOperationExecutor.java:211) at org.apache.flink.table.sqlserver.FlinkSqlServiceImpl.validate(FlinkSqlServiceImpl.java:786) at org.apache.flink.table.sqlserver.proto.FlinkSqlServiceGrpc$MethodHandlers.invoke(FlinkSqlServiceGrpc.java:2522) at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172) at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:820) at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) at java.lang.Thread.run(Thread.java:834)
Caused by: java.util.concurrent.TimeoutException
  at java.util.concurrent.FutureTask.get(FutureTask.java:205)
  at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapExecutor(DelegateOperationExecutor.java:277)
  ... 11 more

Cause
A complex SQL statement is used in the draft. As a result, the Remote Procedure Call (RPC) execution times out.
Solution
Add the following code to the Other Configuration field in the Parameters section of the Configuration tab to increase the timeout period for the RPC execution. The default timeout period is 120 seconds. For more information, see How do I configure custom parameters for deployment running?
flink.sqlserver.rpc.execution.timeout: 600s
What do I do if the error message "RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051" appears?
Problem description
Caused by: io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051
  at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:244)
  at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:225)
  at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
  at org.apache.flink.table.sqlserver.proto.FlinkSqlServiceGrpc$FlinkSqlServiceBlockingStub.generateJobGraph(FlinkSqlServiceGrpc.java:2478)
  at org.apache.flink.table.sqlserver.api.client.FlinkSqlServerProtoClientImpl.generateJobGraph(FlinkSqlServerProtoClientImpl.java:456)
  at org.apache.flink.table.sqlserver.api.client.ErrorHandlingProtoClient.lambda$generateJobGraph$25(ErrorHandlingProtoClient.java:251)
  at org.apache.flink.table.sqlserver.api.client.ErrorHandlingProtoClient.invokeRequest(ErrorHandlingProtoClient.java:335)
  ... 6 more
Cause: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051)

Cause
The JobGraph is excessively large because of complex draft logic. As a result, an error occurs during validation, or the job for the draft fails to start or to be canceled.
Solution
Add the following code to the Other Configuration field in the Parameters section of the Configuration tab. For more information, see How do I configure custom runtime parameters for a job?
table.exec.operator-name.max-length: 1000
What do I do if the error message "Caused by: java.lang.NoSuchMethodError" appears?
Problem description
Error message: Caused by: java.lang.NoSuchMethodError: org.apache.flink.table.planner.plan.metadata.FlinkRelMetadataQuery.getUpsertKeysInKeyGroupRange(Lorg/apache/calcite/rel/RelNode;[I)Ljava/util/Set;

Cause
If the draft that you develop calls an internal API provided by the Apache Flink community and the internal API is optimized by Alibaba Cloud, an exception such as a package conflict may occur.
Solution
Configure your draft to call only methods that are explicitly annotated with @Public or @PublicEvolving in the source code of Apache Flink. These are the only methods for which Alibaba Cloud ensures compatibility in Realtime Compute for Apache Flink.
What do I do if the error message "java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory" appears?
Problem description
Caused by: java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory
  at org.codehaus.commons.compiler.CompilerFactoryFactory.getCompilerFactory(CompilerFactoryFactory.java:129)
  at org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory(CompilerFactoryFactory.java:79)
  at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile(JaninoRelMetadataProvider.java:426)
  ... 66 more

Cause
The JAR package contains a Janino dependency that causes a conflict, or specific JAR packages whose names start with flink-, such as flink-table-planner and flink-table-runtime, are packaged into the JAR file of the user-defined function (UDF) or connector.
Solutions
Check whether the JAR package contains org.codehaus.janino.CompilerFactory. Class conflicts may occur because the class loading sequence on different machines is different. To resolve this issue, perform the following steps:
In the development console of Realtime Compute for Apache Flink, go to the Deployments page, find the desired job, and click the name of the job.
On the Configuration tab of the job details page, click Edit in the upper-right corner of the Parameters section.
Add the following code to the Other Configuration field and click Save.
classloader.parent-first-patterns.additional: org.codehaus.janino
Replace the value of the classloader.parent-first-patterns.additional parameter with the conflicting class.
Specify <scope>provided</scope> for the Apache Flink dependencies. This mainly applies to the non-connector dependencies whose names start with flink- in the org.apache.flink group.
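A sketch of the provided scope in a Maven POM, using flink-table-planner as an illustrative artifact; match the artifact ID and version to your environment:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner_2.12</artifactId>
  <version>${flink.version}</version>
  <!-- provided: the dependency is supplied by the Realtime Compute for
       Apache Flink runtime and must not be bundled into the job JAR. -->
  <scope>provided</scope>
</dependency>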