Realtime Compute for Apache Flink:FAQ about job running errors

Last Updated: Aug 04, 2025

This topic provides answers to some frequently asked questions about job running errors.

What do I do if a job cannot be started?

  • Problem description

    After I click Start in the Actions column, the status of the job changes from STARTING to FAILED after a period of time.

  • Solution

    • On the Events tab of the job details page, click the icon on the left side of the time at which the job failed to start. Then, identify the issue based on the error message.

    • On the Startup Logs subtab of the Logs tab, check whether errors exist. Then, identify the issue based on the error message.

    • If the JobManager is started as expected, you can view the detailed logs of the JobManager or TaskManager on the Job Manager or Running Task Managers subtab of the Logs tab.

  • Common errors and troubleshooting

    • ERROR: exceeded quota: resourcequota

      Cause: The resources in the current queue are insufficient.

      Solution: Reconfigure resources in the current queue or reduce the resources used for job startup.

    • ERROR: the vswitch ip is not enough

      Cause: The number of available IP addresses in the current namespace is less than the number of TaskManagers that are generated when the job starts.

      Solution: Reduce the number of jobs that run in parallel, reconfigure the slots that can be allocated to a job, or change the vSwitch of the workspace.

    • ERROR: pooler: ***: authentication failed

      Cause: The AccessKey pair of the account specified in the code is invalid, or the account is not authorized to start the job.

      Solution: Enter an AccessKey pair that is valid and has the required permissions.

What do I do if the error message indicating a database connection error appears on the right side of the development console of Realtime Compute for Apache Flink?

  • Cause

    The registered catalog is invalid and cannot be connected correctly.

  • Solution

    View all catalogs on the Catalogs page, delete the catalogs that are dimmed, and then register the related catalogs again.

What do I do if data in tasks of a job is not consumed after the job is run?

  • Network connectivity troubleshooting

    If data is not generated in the upstream storage or consumed in the downstream storage, check whether an error message is reported on the Startup Logs subtab of the Logs tab. If a timeout error is reported, troubleshoot the network connection between Realtime Compute for Apache Flink and the storage systems.

  • Task execution status troubleshooting

    On the Configuration tab, check whether data is read from the source and written to the sink to identify the location of the error.

  • Tasks troubleshooting

    Add a print sink table to each task of the job for troubleshooting, as shown in the following example.
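
    The following SQL sketch shows the approach. The sink schema and the intermediate view name are illustrative placeholders; adapt them to the task that you want to inspect.

      -- Records written to a print sink table appear in the TaskManager logs.
      CREATE TEMPORARY TABLE debug_print (
        a BIGINT,
        b BIGINT
      ) WITH (
        'connector' = 'print'
      );

      -- Route the intermediate result of the task under investigation to the print sink.
      INSERT INTO debug_print SELECT a, b FROM intermediate_view;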

What do I do if a job is restarted after the job is run?

Troubleshoot the error on the Logs tab of the desired job.

  • View the exception information.

    On the JM Exceptions tab, view the reported error and troubleshoot the error.

  • View the logs of the JobManager and TaskManagers of the job.

  • View the logs of failed TaskManagers of the job.

    Some exceptions may cause a TaskManager to fail. As a result, the logs of the current TaskManager are incomplete. In this case, you can view the logs of the last failed TaskManager for troubleshooting.

  • View the operational logs of a historical job instance.

    View the operational logs of a historical job instance to check the reason for job failure.

Data output is suspended on the LocalGroupAggregate operator for a long period of time. No data output is generated. Why?

  • Code

    CREATE TEMPORARY TABLE s1 (
      a INT,
      b INT,
      ts as PROCTIME(),
      PRIMARY KEY (a) NOT ENFORCED
    ) WITH (
      'connector'='datagen',
      'rows-per-second'='1',
      'fields.b.kind'='random',
      'fields.b.min'='0',
      'fields.b.max'='10'
    );
    
    CREATE TEMPORARY TABLE sink (
      a BIGINT,
      b BIGINT
    ) WITH (
      'connector'='print'
    );
    
    CREATE TEMPORARY VIEW window_view AS
    SELECT window_start, window_end, a, SUM(b) AS b_sum
    FROM TABLE(TUMBLE(TABLE s1, DESCRIPTOR(ts), INTERVAL '2' SECONDS))
    GROUP BY window_start, window_end, a;
    
    INSERT INTO sink SELECT count(distinct a), b_sum FROM window_view GROUP BY b_sum;
  • Problem description

    Data output is suspended on the LocalGroupAggregate operator for a long period of time and the MiniBatchAssigner operator is not contained in the topology of the job.

  • Cause

    The job includes both WindowAggregate and GroupAggregate operators, and the time column of the WindowAggregate operator is proctime, which indicates the processing time. If the table.exec.mini-batch.size parameter is not configured for the job or is set to a negative value, the managed memory is used to cache data in miniBatch processing mode.

    In this case, the MiniBatchAssigner operator is not generated and cannot send watermark messages to compute operators to trigger final calculation and data output. Final calculation and data output are triggered only when one of the following conditions is met: the managed memory is full, a CHECKPOINT command is received and checkpointing has not yet started, or the job is canceled. For more information, see table.exec.mini-batch.size. If the checkpoint interval is set to an excessively large value, the LocalGroupAggregate operator does not trigger data output for a long period of time.

  • Solutions

    • Decrease the checkpoint interval. This way, the LocalGroupAggregate operator can automatically trigger data output before checkpointing is performed. For more information about the setting of the checkpoint interval, see Tuning Checkpointing.

    • Use the heap memory to cache data: set the table.exec.mini-batch.size parameter to a positive value N. This way, data output is automatically triggered when the number of records cached on the LocalGroupAggregate operator reaches N. For more information, see How do I configure custom runtime parameters for a job?
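
    For example, the following Other Configuration sketch applies both solutions. The checkpoint interval of 60s and the batch size of 5000 are illustrative values; choose values that match your latency and throughput requirements.

      # Trigger checkpointing, and therefore final calculation and data output, more frequently.
      execution.checkpointing.interval: 60s
      # Cache data on the heap and flush when 5000 records are buffered.
      table.exec.mini-batch.size: 5000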

No data is entered in a partition of the upstream Kafka connector. As a result, the watermark cannot move forward and the window output is delayed. What do I do?

For example, five partitions exist in the upstream Kafka connector and two new data entries are entered into Kafka every minute. However, some partitions do not receive data entries in real time. The watermark of the job is determined by the partition with the slowest progress. Therefore, if a partition receives no data, the watermark cannot move forward, the window cannot close at the earliest opportunity, and the result cannot be generated in real time.

In this case, configure a timeout period. If a partition does not receive any data within the timeout period, the partition is marked as temporarily idle and is excluded from the calculation of the watermark. When data enters the partition again, the partition is included in the calculation of the watermark again. For more information, see Configuration.

Add the following code to the Other Configuration field in the Parameters section of the Configuration tab. For more information, see How do I configure custom runtime parameters for a job?

table.exec.source.idle-timeout: 1s

How do I locate the error if the JobManager is not running?

The Flink UI page does not appear because the JobManager is not running as expected. To identify the cause of the error, perform the following steps:

  1. In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired job and click its name.

  2. Click the Events tab.

  3. To search for errors and obtain error information, use the shortcut keys of the operating system.

    • Windows: Ctrl+F

    • macOS: Command+F

What do I do if the error message "INFO: org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss" appears?

  • Cause

    Data is stored in an OSS bucket. When OSS creates a directory, OSS checks whether the directory exists. If the directory does not exist, the error message "INFO: org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss" appears. Realtime Compute for Apache Flink jobs are not affected.

  • Solution

    Add <Logger level="ERROR" name="org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss"/> to the log template. For more information, see Configure parameters to export logs of a deployment.
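
    The following sketch shows where the line goes in a Log4j2-style log template. The Root logger and the appender name StdOut are assumptions based on a default template and may differ in your workspace:

      <Loggers>
        <!-- Suppress the INFO-level directory-check messages from the shaded OSS client. -->
        <Logger level="ERROR" name="org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss"/>
        <Root level="INFO">
          <AppenderRef ref="StdOut"/>
        </Root>
      </Loggers>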

What do I do if the error message "akka.pattern.AskTimeoutException" appears?

  • Causes

    • Garbage collection (GC) operations are frequently performed due to insufficient JobManager or TaskManager memory. As a result, the heartbeat and Remote Procedure Call (RPC) requests between the JobManager and a TaskManager time out.

    • A large number of RPC requests exist in the job but the JobManager resources are insufficient. As a result, an RPC request backlog occurs. This causes the heartbeat and RPC requests between the JobManager and a TaskManager to time out.

    • The timeout period for the job is set to an excessively small value. If Realtime Compute for Apache Flink fails to access a third-party service, Realtime Compute for Apache Flink performs multiple retries to access the service. As a result, the timeout period elapses and this error is reported before the actual error for the connection failure is reported.

  • Solutions

    • If the issue is caused by GC operations, we recommend that you check the GC frequency and the time that is consumed by GC operations based on the job memory and GC logs. If the GC frequency is high or GC operations consume a long period of time, you must increase the JobManager memory and the TaskManager memory.

    • If the issue is caused by a large number of RPC requests in the job, we recommend that you increase the number of CPU cores and the memory size of the JobManager, and set the akka.ask.timeout and heartbeat.timeout parameters to larger values.

      Important
      • We recommend that you adjust the values of the akka.ask.timeout and heartbeat.timeout parameters only when a large number of RPC requests exist in a job. For a job in which a small number of RPC requests exist, small values of the parameters do not cause the issue.

      • We recommend that you configure the parameters based on your business requirements. If you set the parameters to excessively large values, the time that is required to resume the job increases if a TaskManager unexpectedly exits.

    • If the issue is caused by a connection failure of a third-party service, increase the values of the following parameters to allow the error of the connection failure to be reported, and resolve the issue.

      • client.timeout: Default value: 60. Recommended value: 600. Unit: seconds.

      • akka.ask.timeout: Default value: 10. Recommended value: 600. Unit: seconds.

      • client.heartbeat.timeout: Default value: 180000. Recommended value: 600000. Unit: milliseconds.

        Note

        To prevent errors, do not include the unit in the value.

      • heartbeat.timeout: Default value: 50000. Recommended value: 600000. Unit: milliseconds.

        Note

        To prevent errors, do not include the unit in the value.

      For example, if the error message "Caused by: java.sql.SQLTransientConnectionException: connection-pool-xxx.mysql.rds.aliyuncs.com:3306 - Connection is not available, request timed out after 30000ms" appears, the MySQL connection pool is full. In this case, you must increase the value of the connection.pool.size parameter that is described in the WITH parameters of MySQL. Default value: 20.

      Note

      You can determine the minimum values of the preceding parameters based on the timeout error message. The value in the error message shows you the value from which you can adjust the values of the preceding parameters. For example, "60000 ms" in the error message "pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#1064915964]] after [60000 ms]." is the value of the client.timeout parameter.
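
      The following sketch shows the corresponding Other Configuration entries with the recommended values. It assumes that client.timeout and akka.ask.timeout accept a duration suffix, and that the two heartbeat parameters take a plain number without a unit:

        client.timeout: 600s
        akka.ask.timeout: 600s
        client.heartbeat.timeout: 600000
        heartbeat.timeout: 600000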

What do I do if the error message "Task did not exit gracefully within 180 + seconds." appears?

  • Problem description

    Task did not exit gracefully within 180 + seconds.
    2022-04-22T17:32:25.852861506+08:00 stdout F org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
    2022-04-22T17:32:25.852865065+08:00 stdout F at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1709) [flink-dist_2.11-1.12-vvr-3.0.4-SNAPSHOT.jar:1.12-vvr-3.0.4-SNAPSHOT]
    2022-04-22T17:32:25.852867996+08:00 stdout F at java.lang.Thread.run(Thread.java:834) [?:1.8.0_102]
    log_level:ERROR
  • Cause

    The error message does not show the root cause of the job exception. The timeout period for a task to exit is specified by the task.cancellation.timeout parameter. The default value of this parameter is 180, in seconds. When a job performs a failover or a task in the job attempts to exit, the task may be blocked and cannot exit due to specific reasons. If the time period during which the task is blocked reaches the timeout period, Realtime Compute for Apache Flink determines that the task is stuck and cannot be resumed. Then, Realtime Compute for Apache Flink automatically stops the TaskManager to which the task belongs to allow the job to perform the failover or the task to exit. As a result, the error message appears in the logs of the job.

    The timeout issue may be caused by the user-defined functions (UDFs) that are used in your job. For example, if the close method of a UDF blocks for a long period of time or does not return, the task in the job cannot exit in time.

  • Solution

    Set the task.cancellation.timeout parameter to 0. For more information about how to configure this parameter, see How do I configure custom runtime parameters for a job? If you set this parameter to 0, a blocked task waits to exit and no timeout occurs. If the job performs a failover again after you restart it, or a task is blocked for a long period of time when the task attempts to exit, find the task that is in the CANCELLING state, check the stack of the task, identify the root cause of the issue, and then resolve the issue based on the root cause.

    Important

    The task.cancellation.timeout parameter is used for job debugging. We recommend that you do not set this parameter to 0 for a job in a production environment.
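
    If you apply this setting for debugging, the corresponding Other Configuration entry is:

    task.cancellation.timeout: 0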

What do I do if the error message "Can not retract a non-existent record. This should never happen." appears?

  • Problem description

    java.lang.RuntimeException: Can not retract a non-existent record. This should never happen.
        at org.apache.flink.table.runtime.operators.rank.RetractableTopNFunction.processElement(RetractableTopNFunction.java:196)
        at org.apache.flink.table.runtime.operators.rank.RetractableTopNFunction.processElement(RetractableTopNFunction.java:55)
        at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:135)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:106)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:424)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:685)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:640)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:651)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:624)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:799)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
        at java.lang.Thread.run(Thread.java:877)
                        
  • Causes and solutions

    • Scenario 1: The now() function is used in the code.

      Cause: The TopN algorithm does not allow you to use a non-deterministic field in the ORDER BY or PARTITION BY clause. The now() function returns a different value each time it is called. As a result, the record to retract cannot be found based on the previous value when a retraction message arrives.

      Solution: Use a deterministic field in the ORDER BY or PARTITION BY clause, as shown in the sketch after this list.

    • Scenario 2: The table.exec.state.ttl parameter is set to an excessively small value.

      Cause: The state data is deleted due to expiration. As a result, the state data of the key cannot be found when a retraction message arrives.

      Solution: Increase the value of the table.exec.state.ttl parameter. For more information about how to configure this parameter, see How do I configure custom runtime parameters for a job?
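
    The following SQL sketch illustrates the fix for Scenario 1. The table name src and the fields a and ts are illustrative placeholders.

      -- Anti-pattern: ORDER BY now() is non-deterministic and breaks retraction.
      -- Deterministic alternative: order by a column that is stored in the table,
      -- such as an event-time field ts.
      SELECT *
      FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY ts DESC) AS rownum
        FROM src
      )
      WHERE rownum <= 10;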

What do I do if the error message "The GRPC call timed out in sqlserver" appears?

  • Problem description

    org.apache.flink.table.sqlserver.utils.ExecutionTimeoutException: The GRPC call timed out in sqlserver, please check the thread stacktrace for root cause:
    
    Thread name: sqlserver-operation-pool-thread-4, thread state: TIMED_WAITING, thread stacktrace:
        at java.lang.Thread.sleep0(Native Method)
        at java.lang.Thread.sleep(Thread.java:360)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:130)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:107)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy195.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)
        at org.apache.flink.connectors.hive.HiveSourceFileEnumerator.getNumFiles(HiveSourceFileEnumerator.java:118)
        at org.apache.flink.connectors.hive.HiveTableSource.lambda$getDataStream$0(HiveTableSource.java:209)
        at org.apache.flink.connectors.hive.HiveTableSource$$Lambda$972/1139330351.get(Unknown Source)
        at org.apache.flink.connectors.hive.HiveParallelismInference.logRunningTime(HiveParallelismInference.java:118)
        at org.apache.flink.connectors.hive.HiveParallelismInference.infer(HiveParallelismInference.java:100)
        at org.apache.flink.connectors.hive.HiveTableSource.getDataStream(HiveTableSource.java:207)
        at org.apache.flink.connectors.hive.HiveTableSource$1.produceDataStream(HiveTableSource.java:123)
        at org.apache.flink.table.planner.plan.nodes.exec.common.CommonExecTableSourceScan.translateToPlanInternal(CommonExecTableSourceScan.java:127)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241)
        at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecExchange.translateToPlanInternal(StreamExecExchange.java:87)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241)
        at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecGroupAggregate.translateToPlanInternal(StreamExecGroupAggregate.java:148)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241)
        at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecSink.translateToPlanInternal(StreamExecSink.java:108)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226)
        at org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$1.apply(StreamPlanner.scala:74)
        at org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$1.apply(StreamPlanner.scala:73)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.flink.table.planner.delegation.StreamPlanner.translateToPlan(StreamPlanner.scala:73)
        at org.apache.flink.table.planner.delegation.StreamExecutor.createStreamGraph(StreamExecutor.java:52)
        at org.apache.flink.table.planner.delegation.PlannerBase.createStreamGraph(PlannerBase.scala:610)
        at org.apache.flink.table.planner.delegation.StreamPlanner.explainExecNodeGraphInternal(StreamPlanner.scala:166)
        at org.apache.flink.table.planner.delegation.StreamPlanner.explainExecNodeGraph(StreamPlanner.scala:159)
        at org.apache.flink.table.sqlserver.execution.OperationExecutorImpl.validate(OperationExecutorImpl.java:304)
        at org.apache.flink.table.sqlserver.execution.OperationExecutorImpl.validate(OperationExecutorImpl.java:288)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.lambda$validate$22(DelegateOperationExecutor.java:211)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor$$Lambda$394/1626790418.run(Unknown Source)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapClassLoader(DelegateOperationExecutor.java:250)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.lambda$wrapExecutor$26(DelegateOperationExecutor.java:275)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor$$Lambda$395/1157752141.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
        at java.lang.Thread.run(Thread.java:834)
    
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapExecutor(DelegateOperationExecutor.java:281)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.validate(DelegateOperationExecutor.java:211)
        at org.apache.flink.table.sqlserver.FlinkSqlServiceImpl.validate(FlinkSqlServiceImpl.java:786)
        at org.apache.flink.table.sqlserver.proto.FlinkSqlServiceGrpc$MethodHandlers.invoke(FlinkSqlServiceGrpc.java:2522)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:820)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
        at java.lang.Thread.run(Thread.java:834)
    Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapExecutor(DelegateOperationExecutor.java:277)
        ... 11 more
                        
  • Cause

    A complex SQL statement is used in the draft. As a result, the Remote Procedure Call (RPC) execution times out.

  • Solution

    Add the following code to the Other Configuration field in the Parameters section of the Configuration tab to increase the timeout period for the RPC execution. The default timeout period is 120 seconds. For more information, see How do I configure custom runtime parameters for a job?

    flink.sqlserver.rpc.execution.timeout: 600s

What do I do if the error message "RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051" appears?

  • Problem description

    Caused by: io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:244)
        at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:225)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
        at org.apache.flink.table.sqlserver.proto.FlinkSqlServiceGrpc$FlinkSqlServiceBlockingStub.generateJobGraph(FlinkSqlServiceGrpc.java:2478)
        at org.apache.flink.table.sqlserver.api.client.FlinkSqlServerProtoClientImpl.generateJobGraph(FlinkSqlServerProtoClientImpl.java:456)
        at org.apache.flink.table.sqlserver.api.client.ErrorHandlingProtoClient.lambda$generateJobGraph$25(ErrorHandlingProtoClient.java:251)
        at org.apache.flink.table.sqlserver.api.client.ErrorHandlingProtoClient.invokeRequest(ErrorHandlingProtoClient.java:335)
        ... 6 more
    Cause: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051
  • Cause

    The size of the JobGraph is excessively large because the draft logic is complex. As a result, an error occurs during verification, or the job for the draft fails to start or cannot be canceled.

  • Solution

    Add the following code to the Other Configuration field in the Parameters section of the Configuration tab. For more information, see How do I configure custom runtime parameters for a job?

    table.exec.operator-name.max-length: 1000

What do I do if the error message "Caused by: java.lang.NoSuchMethodError" appears?

  • Problem description

    Error message: Caused by: java.lang.NoSuchMethodError: org.apache.flink.table.planner.plan.metadata.FlinkRelMetadataQuery.getUpsertKeysInKeyGroupRange(Lorg/apache/calcite/rel/RelNode;[I)Ljava/util/Set;
  • Cause

    If the draft that you develop calls an internal API provided by the Apache Flink community and the internal API is optimized by Alibaba Cloud, an exception such as a package conflict may occur.

  • Solution

    Configure your draft to call only methods that are explicitly marked with @Public or @PublicEvolving in the source code of Apache Flink. Alibaba Cloud ensures that Realtime Compute for Apache Flink is compatible only with these methods.

What do I do if the error message "java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory" appears?

  • Problem description

    Caused by: java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory
        at org.codehaus.commons.compiler.CompilerFactoryFactory.getCompilerFactory(CompilerFactoryFactory.java:129)
        at org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory(CompilerFactoryFactory.java:79)
        at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile(JaninoRelMetadataProvider.java:426)
        ... 66 more
  • Cause

    • The JAR package contains a Janino dependency that causes a conflict.

    • JAR packages whose names start with flink-, such as flink-table-planner and flink-table-runtime, are bundled into the JAR package of the user-defined function (UDF) or connector.

  • Solutions

    • Check whether the JAR package contains org.codehaus.janino.CompilerFactory. Class conflicts may occur because the class loading sequence on different machines is different. To resolve this issue, perform the following steps:

      1. In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired job and click the name of the job.

      2. On the Configuration tab of the job details page, click Edit in the upper-right corner of the Parameters section.

      3. Add the following code to the Other Configuration field and click Save.

        classloader.parent-first-patterns.additional: org.codehaus.janino

        Replace the value of the classloader.parent-first-patterns.additional parameter with the conflicting class.

    • Specify <scope>provided</scope> for Apache Flink dependencies, as shown in the sketch that follows. This mainly applies to non-connector dependencies whose names start with flink- in the org.apache.flink group.
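
      A minimal sketch of the corresponding pom.xml fragment. The artifact flink-table-planner_2.12 and the version are illustrative placeholders; apply the same scope to the Flink dependencies and version that your project actually uses:

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_2.12</artifactId>
            <version>1.15.0</version>
            <!-- provided: supplied by the Flink runtime, excluded from the packaged UDF or connector JAR -->
            <scope>provided</scope>
        </dependency>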