Realtime Compute for Apache Flink:FAQ about deployment running errors

Last Updated:Feb 27, 2025

This topic provides answers to some frequently asked questions about deployment running errors.

Data output is suspended on the LocalGroupAggregate operator for a long period of time. No data output is generated. Why?

  • Code

    CREATE TEMPORARY TABLE s1 (
      a INT,
      b INT,
      ts as PROCTIME(),
      PRIMARY KEY (a) NOT ENFORCED
    ) WITH (
      'connector'='datagen',
      'rows-per-second'='1',
      'fields.b.kind'='random',
      'fields.b.min'='0',
      'fields.b.max'='10'
    );
    
    CREATE TEMPORARY TABLE sink (
      a BIGINT,
      b BIGINT
    ) WITH (
      'connector'='print'
    );
    
    CREATE TEMPORARY VIEW window_view AS
    SELECT window_start, window_end, a, sum(b) as b_sum FROM TABLE(TUMBLE(TABLE s1, DESCRIPTOR(ts), INTERVAL '2' SECONDS)) GROUP BY window_start, window_end, a;
    
    INSERT INTO sink SELECT count(distinct a), b_sum FROM window_view GROUP BY b_sum;
  • Problem description

    Data output is suspended on the LocalGroupAggregate operator for a long period of time and the MiniBatchAssigner operator is not contained in the topology of the deployment.

  • Cause

    The deployment includes both WindowAggregate and GroupAggregate operators, and the time column of the WindowAggregate operator is proctime, which indicates the processing time. If the table.exec.mini-batch.size parameter is not configured for the deployment or is set to a negative value, the managed memory is used to cache data in miniBatch processing mode.

    In this case, the MiniBatchAssigner operator is not generated and cannot send watermark messages to the compute operators to trigger the final calculation and data output. The final calculation and data output are triggered only when one of the following conditions is met: the managed memory becomes full, a CHECKPOINT command is received and checkpointing begins, or the deployment is canceled. For more information, see table.exec.mini-batch.size. If the checkpoint interval is set to an excessively large value, the LocalGroupAggregate operator does not trigger data output for a long period of time.

  • Solutions

    • Decrease the checkpoint interval. This way, the LocalGroupAggregate operator can automatically trigger data output before checkpointing is performed. For more information about the setting of the checkpoint interval, see Tuning Checkpointing.

    • Use the heap memory to cache data. Set the table.exec.mini-batch.size parameter to a positive value N, as shown in the sketch after this list. This way, data output is automatically triggered when the amount of data cached on the LocalGroupAggregate operator reaches N. For more information, see How do I configure custom runtime parameters for a deployment?
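
      For example, the following setting triggers output once 20,000 records are cached. The value 20000 is illustrative; choose N based on the heap memory available to your TaskManagers:

        table.exec.mini-batch.size: 20000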

No data is entered in a partition of the upstream Kafka connector. As a result, the watermark cannot move forward and the window output is delayed. What do I do?

For example, five partitions exist in the upstream Kafka topic and two new data entries are written to Kafka every minute, but some partitions do not receive data in real time. Because the watermark is computed across all partitions, a partition that receives no data prevents the watermark from moving forward. As a result, the window cannot close at the earliest opportunity and the result cannot be generated in real time.

In this case, configure an idle timeout. If a partition receives no elements within the timeout period, the partition is marked as temporarily idle and excluded from the calculation of the watermark. When data arrives in the partition again, the partition is included in the calculation of the watermark again. For more information, see Configuration.

Add the following code to the Other Configuration field in the Parameters section of the Configuration tab. For more information, see How do I configure custom runtime parameters for a deployment?

table.exec.source.idle-timeout: 1s
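
For context, the watermark that stalls is defined on the source table. The following is a minimal sketch of a Kafka source with an event-time watermark; the table schema and connector options are illustrative:

CREATE TEMPORARY TABLE kafka_source (
  id BIGINT,
  ts TIMESTAMP(3),
  -- Event-time watermark; it is held back if any partition receives no data
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'my_topic',                          -- illustrative
  'properties.bootstrap.servers' = 'host:9092',  -- illustrative
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset'
);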

How do I locate the error if the JobManager is not running?

The Flink UI page does not appear because the JobManager is not running as expected. To identify the cause of the error, perform the following steps:

  1. In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired deployment and click the name of the deployment.

  2. Click the Events tab.

  3. Use the search shortcut of your operating system to find and obtain error information.

    • Windows: Ctrl+F

    • macOS: Command+F


What do I do if the error message "INFO: org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss" appears?

  • Problem description

    INFO-level log entries from org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss appear in the deployment logs.

  • Cause

    Data is stored in an OSS bucket. When OSS creates a directory, it first checks whether the directory already exists. If the directory does not exist, the INFO-level message "org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss" is logged. Realtime Compute for Apache Flink deployments are not affected.

  • Solutions

    Add the following entry to the log template. For more information, see Configure parameters to export logs of a deployment.

      <Logger level="ERROR" name="org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss"/>
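
    In the Log4j template, the entry sits alongside the other Logger definitions, for example (a minimal sketch; the rest of the template is represented by an ellipsis):

      <Loggers>
        ...
        <Logger level="ERROR" name="org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss"/>
      </Loggers>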

What do I do if the error message "akka.pattern.AskTimeoutException" appears?

  • Causes

    • Garbage collection (GC) operations are frequently performed due to insufficient JobManager or TaskManager memory. As a result, the heartbeat and Remote Procedure Call (RPC) requests between the JobManager and a TaskManager time out.

    • A large number of RPC requests exist in the deployment but the JobManager resources are insufficient. As a result, an RPC request backlog occurs. This causes the heartbeat and RPC requests between the JobManager and a TaskManager to time out.

    • The timeout period for the deployment is set to an excessively small value. If Realtime Compute for Apache Flink fails to access a third-party service, it retries the access multiple times. The retries can take longer than the configured timeout period, so the timeout is reported before the underlying connection error surfaces.

  • Solutions

    • If the issue is caused by GC operations, we recommend that you check the GC frequency and the time that is consumed by GC operations based on the deployment memory and GC logs. If the GC frequency is high or GC operations consume a long period of time, you must increase the JobManager memory and the TaskManager memory.

    • If the issue is caused by a large number of RPC requests in the deployment, we recommend that you increase the number of CPU cores and the memory size of the JobManager, and set the akka.ask.timeout and heartbeat.timeout parameters to larger values.

      Important
      • We recommend that you adjust the values of the akka.ask.timeout and heartbeat.timeout parameters only when a large number of RPC requests exist in a deployment. For a deployment with a small number of RPC requests, the default values do not cause this issue.

      • We recommend that you configure the parameters based on your business requirements. If you set the parameters to excessively large values, the time that is required to resume the deployment increases if a TaskManager unexpectedly exits.

    • If the issue is caused by a connection failure of a third-party service, increase the values of the following parameters to allow the error of the connection failure to be reported, and resolve the issue.

      • client.timeout: Default value: 60. Unit: seconds. We recommend that you set this parameter to 600.

      • akka.ask.timeout: Default value: 10. Unit: seconds. We recommend that you set this parameter to 600.

      • client.heartbeat.timeout: Default value: 180000. Unit: milliseconds. We recommend that you set this parameter to 600000.

      • heartbeat.timeout: Default value: 50000. Unit: milliseconds. We recommend that you set this parameter to 600000.
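
      Put together, the settings can be added to the Other Configuration field as follows. The values follow the recommendations above; this is a sketch, and the exact value syntax may vary with the engine version:

        client.timeout: 600s
        akka.ask.timeout: 600s
        client.heartbeat.timeout: 600000
        heartbeat.timeout: 600000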

      For example, if the error message "Caused by: java.sql.SQLTransientConnectionException: connection-pool-xxx.mysql.rds.aliyuncs.com:3306 - Connection is not available, request timed out after 30000ms" appears, the MySQL connection pool is full. In this case, you must increase the value of the connection.pool.size parameter that is described in the WITH parameters of MySQL. Default value: 20.

      Note

      You can determine the minimum values of the preceding parameters from the timeout error message, which shows the current value from which to start increasing. For example, "60000 ms" in the error message "akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#1064915964]] after [60000 ms]." is the current value of the client.timeout parameter.

What do I do if the error message "Task did not exit gracefully within 180 + seconds." appears?

  • Problem description

    Task did not exit gracefully within 180 + seconds.
    2022-04-22T17:32:25.852861506+08:00 stdout F org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
    2022-04-22T17:32:25.852865065+08:00 stdout F at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1709) [flink-dist_2.11-1.12-vvr-3.0.4-SNAPSHOT.jar:1.12-vvr-3.0.4-SNAPSHOT]
    2022-04-22T17:32:25.852867996+08:00 stdout F at java.lang.Thread.run(Thread.java:834) [?:1.8.0_102]
    log_level:ERROR
  • Cause

    The error message does not show the root cause of the deployment exception. The timeout period for a task to exit is specified by the task.cancellation.timeout parameter, whose default value is 180 seconds. When a deployment performs a failover or a task attempts to exit, the task may be blocked and unable to exit. If the task remains blocked for the full timeout period, Realtime Compute for Apache Flink determines that the task is stuck and cannot be resumed, and automatically stops the TaskManager to which the task belongs so that the failover or exit can proceed. As a result, this error message appears in the deployment logs.

    The timeout issue may be caused by the user-defined functions (UDFs) that are used in your deployment. For example, if the close method of a UDF blocks for a long period of time or never returns, the task cannot exit.

  • Solutions

    Set the task.cancellation.timeout parameter to 0, as shown below. For more information about how to configure this parameter, see How do I configure custom runtime parameters for a deployment? If you set this parameter to 0, a blocked task waits indefinitely to exit and no timeout occurs. If the deployment performs a failover again after you restart it, or a task is blocked for a long period of time while attempting to exit, find the task that is in the CANCELLING state, check its stack, identify the root cause, and resolve the issue based on the root cause.
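
    Add the setting to the Other Configuration field as follows:

      task.cancellation.timeout: 0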

    Important

    The task.cancellation.timeout parameter is used for deployment debugging. We recommend that you do not set this parameter to 0 for a deployment in a production environment.

What do I do if the error message "Can not retract a non-existent record. This should never happen." appears?

  • Problem description

    java.lang.RuntimeException: Can not retract a non-existent record. This should never happen.
        at org.apache.flink.table.runtime.operators.rank.RetractableTopNFunction.processElement(RetractableTopNFunction.java:196)
        at org.apache.flink.table.runtime.operators.rank.RetractableTopNFunction.processElement(RetractableTopNFunction.java:55)
        at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:135)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:106)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:424)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:685)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:640)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:651)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:624)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:799)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
        at java.lang.Thread.run(Thread.java:877)
                        
  • Causes and solutions

    • Scenario 1: The now() function is used in the code.

      Cause: The TopN algorithm does not allow a non-deterministic field in the ORDER BY or PARTITION BY clause. If such a field is used, the now() function returns different values for the insert and the retraction, so the previous record cannot be found when the retraction message is processed.

      Solution: Use a deterministic field in the ORDER BY or PARTITION BY clause.

    • Scenario 2: The table.exec.state.ttl parameter is set to an excessively small value.

      Cause: The state data is deleted after it expires, so the state data for the key cannot be found when the retraction message is processed.

      Solution: Increase the value of the table.exec.state.ttl parameter, as shown below. For more information about how to configure this parameter, see How do I configure custom runtime parameters for a deployment?
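
      For example, the following setting keeps state for 36 hours. The value is illustrative; choose a TTL longer than the maximum interval between a record and its retraction in your data:

        table.exec.state.ttl: 36 h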

What do I do if the error message "The GRPC call timed out in sqlserver" appears?

  • Problem description

    org.apache.flink.table.sqlserver.utils.ExecutionTimeoutException: The GRPC call timed out in sqlserver, please check the thread stacktrace for root cause:
    
    Thread name: sqlserver-operation-pool-thread-4, thread state: TIMED_WAITING, thread stacktrace:
        at java.lang.Thread.sleep0(Native Method)
        at java.lang.Thread.sleep(Thread.java:360)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:130)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:107)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy195.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)
        at org.apache.flink.connectors.hive.HiveSourceFileEnumerator.getNumFiles(HiveSourceFileEnumerator.java:118)
        at org.apache.flink.connectors.hive.HiveTableSource.lambda$getDataStream$0(HiveTableSource.java:209)
        at org.apache.flink.connectors.hive.HiveTableSource$$Lambda$972/1139330351.get(Unknown Source)
        at org.apache.flink.connectors.hive.HiveParallelismInference.logRunningTime(HiveParallelismInference.java:118)
        at org.apache.flink.connectors.hive.HiveParallelismInference.infer(HiveParallelismInference.java:100)
        at org.apache.flink.connectors.hive.HiveTableSource.getDataStream(HiveTableSource.java:207)
        at org.apache.flink.connectors.hive.HiveTableSource$1.produceDataStream(HiveTableSource.java:123)
        at org.apache.flink.table.planner.plan.nodes.exec.common.CommonExecTableSourceScan.translateToPlanInternal(CommonExecTableSourceScan.java:127)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241)
        at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecExchange.translateToPlanInternal(StreamExecExchange.java:87)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241)
        at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecGroupAggregate.translateToPlanInternal(StreamExecGroupAggregate.java:148)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecEdge.translateToPlan(ExecEdge.java:290)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.lambda$translateInputToPlan$5(ExecNodeBase.java:267)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase$$Lambda$949/77002396.apply(Unknown Source)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:268)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateInputToPlan(ExecNodeBase.java:241)
        at org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecSink.translateToPlanInternal(StreamExecSink.java:108)
        at org.apache.flink.table.planner.plan.nodes.exec.ExecNodeBase.translateToPlan(ExecNodeBase.java:226)
        at org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$1.apply(StreamPlanner.scala:74)
        at org.apache.flink.table.planner.delegation.StreamPlanner$$anonfun$1.apply(StreamPlanner.scala:73)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.flink.table.planner.delegation.StreamPlanner.translateToPlan(StreamPlanner.scala:73)
        at org.apache.flink.table.planner.delegation.StreamExecutor.createStreamGraph(StreamExecutor.java:52)
        at org.apache.flink.table.planner.delegation.PlannerBase.createStreamGraph(PlannerBase.scala:610)
        at org.apache.flink.table.planner.delegation.StreamPlanner.explainExecNodeGraphInternal(StreamPlanner.scala:166)
        at org.apache.flink.table.planner.delegation.StreamPlanner.explainExecNodeGraph(StreamPlanner.scala:159)
        at org.apache.flink.table.sqlserver.execution.OperationExecutorImpl.validate(OperationExecutorImpl.java:304)
        at org.apache.flink.table.sqlserver.execution.OperationExecutorImpl.validate(OperationExecutorImpl.java:288)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.lambda$validate$22(DelegateOperationExecutor.java:211)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor$$Lambda$394/1626790418.run(Unknown Source)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapClassLoader(DelegateOperationExecutor.java:250)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.lambda$wrapExecutor$26(DelegateOperationExecutor.java:275)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor$$Lambda$395/1157752141.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
        at java.lang.Thread.run(Thread.java:834)
    
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapExecutor(DelegateOperationExecutor.java:281)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.validate(DelegateOperationExecutor.java:211)
        at org.apache.flink.table.sqlserver.FlinkSqlServiceImpl.validate(FlinkSqlServiceImpl.java:786)
        at org.apache.flink.table.sqlserver.proto.FlinkSqlServiceGrpc$MethodHandlers.invoke(FlinkSqlServiceGrpc.java:2522)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:820)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
        at java.lang.Thread.run(Thread.java:834)
    Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at org.apache.flink.table.sqlserver.execution.DelegateOperationExecutor.wrapExecutor(DelegateOperationExecutor.java:277)
        ... 11 more
                        
  • Cause

    A complex SQL statement is used in the draft. As a result, the Remote Procedure Call (RPC) execution times out.

  • Solution

    Add the following code to the Other Configuration field in the Parameters section of the Configuration tab to increase the timeout period for the RPC execution. The default timeout period is 120 seconds. For more information, see How do I configure custom runtime parameters for a deployment?

    flink.sqlserver.rpc.execution.timeout: 600s

What do I do if the error message "RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051" appears?

  • Problem description

    Caused by: io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 41943040: 58384051
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:244)
        at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:225)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
        at org.apache.flink.table.sqlserver.proto.FlinkSqlServiceGrpc$FlinkSqlServiceBlockingStub.generateJobGraph(FlinkSqlServiceGrpc.java:2478)
        at org.apache.flink.table.sqlserver.api.client.FlinkSqlServerProtoClientImpl.generateJobGraph(FlinkSqlServerProtoClientImpl.java:456)
        at org.apache.flink.table.sqlserver.api.client.ErrorHandlingProtoClient.lambda$generateJobGraph$25(ErrorHandlingProtoClient.java:251)
        at org.apache.flink.table.sqlserver.api.client.ErrorHandlingProtoClient.invokeRequest(ErrorHandlingProtoClient.java:335)
        ... 6 more
  • Cause

    The JobGraph is excessively large because the draft logic is complex, and it exceeds the maximum gRPC message size. As a result, an error occurs during validation, or the deployment for the draft fails to start or to be canceled.

  • Solution

    Add the following code to the Other Configuration field in the Parameters section of the Configuration tab. This setting limits the length of operator names, which reduces the size of the JobGraph. For more information, see How do I configure custom runtime parameters for a deployment?

    table.exec.operator-name.max-length: 1000

What do I do if the error message "Caused by: java.lang.NoSuchMethodError" appears?

  • Problem description

    Error message: Caused by: java.lang.NoSuchMethodError: org.apache.flink.table.planner.plan.metadata.FlinkRelMetadataQuery.getUpsertKeysInKeyGroupRange(Lorg/apache/calcite/rel/RelNode;[I)Ljava/util/Set;
  • Cause

    If the draft that you develop calls an internal API of Apache Flink that has been optimized by Alibaba Cloud, the two implementations may conflict, and exceptions such as java.lang.NoSuchMethodError can occur.

  • Solution

    Configure your draft to call only methods that are explicitly annotated with @Public or @PublicEvolving in the source code of Apache Flink. These are the only methods for which Alibaba Cloud guarantees compatibility in Realtime Compute for Apache Flink.

What do I do if the error message "java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory" appears?

  • Problem description

    Caused by: java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory
        at org.codehaus.commons.compiler.CompilerFactoryFactory.getCompilerFactory(CompilerFactoryFactory.java:129)
        at org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory(CompilerFactoryFactory.java:79)
        at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile(JaninoRelMetadataProvider.java:426)
        ... 66 more
  • Cause

    • The JAR package contains a Janino dependency that causes a conflict.

    • JAR packages whose names start with flink-, such as flink-table-planner and flink-table-runtime, are bundled into the JAR package of the user-defined function (UDF) or connector.

  • Solutions

    • Check whether the JAR package contains org.codehaus.janino.CompilerFactory. Class conflicts may occur because the class loading sequence on different machines is different. To resolve this issue, perform the following steps:

      1. In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired deployment and click the name of the deployment.

      2. On the Configuration tab of the deployment details page, click Edit in the upper-right corner of the Parameters section.

      3. Add the following code to the Other Configuration field and click Save.

        classloader.parent-first-patterns.additional: org.codehaus.janino

        Replace the value of the classloader.parent-first-patterns.additional parameter with the class or package that actually conflicts.

    • Specify <scope>provided</scope> for Apache Flink dependencies. This mainly applies to non-connector dependencies in the org.apache.flink group whose names start with flink-. An example is shown below.
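
      For example, in a Maven pom.xml (the artifact ID and version are illustrative; use the ones that match your development environment):

        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-table-planner_2.12</artifactId>
          <version>1.15.0</version>
          <scope>provided</scope>
        </dependency>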