This topic provides answers to some frequently asked questions about how to troubleshoot permission issues, operations and maintenance (O&M) issues, and data exceptions of fully managed Flink.

What do I do if fully managed Flink becomes unavailable after I delete a role or change authorization policies?

  • Method 1: Create the AliyunStreamAsiDefaultRole role and attach the following custom authorization policies to the role for re-authorization: AliyunStreamAsiDefaultRolePolicy0, AliyunStreamAsiDefaultRolePolicy1, and FlinkServerlessPolicy. For more information about this method, see Manual authorization (method 1).
  • Method 2: Delete the Resource Orchestration Service (ROS) stack, the RAM roles, and the policies that are attached to the RAM roles. Then, log on to the Realtime Compute for Apache Flink console to trigger automatic re-authorization. For more information about this method, see Manual authorization (method 2).

How do I locate the error if the JobManager is not running?

The Flink UI page does not appear because the JobManager is not running as expected. To identify the cause of the error, perform the following steps:
  1. On the Deployments page, click the name of the job whose error you want to identify.
  2. Click the Events tab.
  3. Use the search shortcut of your operating system to search the events for errors and obtain the error information.
    • Windows: Ctrl+F
    • macOS: Command+F

What do I do if checkpoints for Python jobs are created at a low speed?

  • Cause

    Python operators contain a cache. When the system creates a checkpoint, the system must process all data in the cache. If the performance of Python user-defined functions (UDFs) is poor, the time that is required to create a checkpoint increases. This affects the execution of Python jobs.

  • Solution
    On the right side of the Draft Editor page in the console of fully managed Flink, click the Advanced tab. In the panel that appears, configure the following parameters in the Additional Configuration section to reduce the amount of data in the cache, as shown in the example that follows the parameter descriptions:
    • python.fn-execution.bundle.size: The maximum number of elements that can be buffered in a bundle before they are processed. Default value: 100000.
    • python.fn-execution.bundle.time: The maximum duration for which elements can be buffered before they are processed. Default value: 1000. Unit: milliseconds.
    For more information about the parameters, see Flink Python Configuration.
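
    The following settings are an illustrative example. The values are assumptions; tune them based on the throughput and checkpoint duration of your job. Smaller values shorten the time that is required to process the cached data when a checkpoint is created, but excessively small values may reduce throughput.
      python.fn-execution.bundle.size: 10000
      python.fn-execution.bundle.time: 500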

What do I do if the error message "Invalid versionName string" is reported?

  • Problem description
    When a job is started, the error message "Invalid versionName string" is reported.
    Note SQL jobs are not affected. Java Archive (JAR) or Python jobs of the following Flink engine versions are affected: Flink 1.10.4 or an earlier minor version, and Flink 1.11.3 or an earlier minor version.
  • Cause

    When you create a JAR or Python job to be deployed in a session cluster, the Flink engine version is not configured for the job.

  • Solution

    On the Draft Editor page, configure the Flink engine version for the job. Then, publish and start the job again.

    If a job that is deployed in a session cluster is created by using an SDK, we recommend that you update the version of the common dependency package of the SDK to 1.0.21 and configure the Flink engine version for the job when you create an artifact. Examples:
    • Configuration of the common dependency package
      <dependency>
          <groupId>com.aliyun</groupId>
          <artifactId>ververica-common</artifactId>
          <version>1.0.21</version>
      </dependency>
    • Configuration of the Flink engine version when you create an artifact
      com.ververica.common.model.deployment.Artifact.SqlScriptArtifact#setVersionName
      com.ververica.common.model.deployment.Artifact.JarArtifact#setVersionName
      Note The value of versionName in the configuration must be the same as the Flink engine version that you configured for the session cluster in which the job is deployed.

How do I troubleshoot the issue that fully managed Flink cannot read source data?

If fully managed Flink cannot read source data, we recommend that you perform the following steps to troubleshoot this issue:
  • Check the network connectivity between the upstream storage service and fully managed Flink.
    Fully managed Flink can access only storage services that are deployed in the same virtual private cloud (VPC) or the same region as fully managed Flink. If you need to access storage resources across multiple VPCs or access fully managed Flink over the Internet, you must first establish the required network connection between the storage service and fully managed Flink.
  • Check whether whitelists are configured for the upstream storage services.
    You must configure whitelists for the following upstream storage services: Elasticsearch and Message Queue for Apache Kafka. To configure a whitelist, perform the following steps:
    1. Obtain the CIDR blocks of the vSwitch to which fully managed Flink belongs.

      For more information, see Configure a whitelist.

    2. Add the CIDR blocks of fully managed Flink to the whitelists of the upstream storage services.

      For more information about how to configure whitelists for the upstream storage services, see the topics that are mentioned in the prerequisites of the related DDL documentation, such as the topic that is mentioned in the prerequisites of Create a Message Queue for Apache Kafka source table.

  • Check whether the field type, field sequence, and field letter case defined in DDL statements are consistent with those of the physical table.

    The field types that are supported by the upstream storage services and the field types that are supported by fully managed Flink may not be completely consistent, but mapping relationships exist between them. You must declare the fields in the DDL statement based on these field type mappings, as shown in the following example. For more information, see the field type mappings in the related DDL documentation, such as Data type mapping in Create a Log Service source table.
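
    For example, the following DDL statement declares a hypothetical Message Queue for Apache Kafka source table. The table name, field names, and connector options are assumptions for illustration. The field names, field order, and field letter case match the physical data, and the Flink field types follow the documented type mappings:
      -- Hypothetical example: the physical data contains the fields id (long),
      -- name (string), and order_time (timestamp). Declare the same fields in the
      -- same order and with the mapped Flink data types.
      CREATE TEMPORARY TABLE kafka_source (
        id BIGINT,
        `name` STRING,
        order_time TIMESTAMP(3)
      ) WITH (
        'connector' = 'kafka',
        'topic' = 'your_topic',
        'properties.bootstrap.servers' = 'your_kafka_endpoint:9092',
        'properties.group.id' = 'your_group',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
      );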

  • Check whether the Taskmanager.log file of the source table contains error information.
    If the file contains error information, troubleshoot the error based on the error information. To view the Taskmanager.log file of the source table, perform the following steps:
    1. On the Deployments page, click the name of the job whose error you want to troubleshoot.
    2. On the Overview tab, click the box of the source node.
    3. On the SubTasks tab, find the subtask whose logs you want to view and click Open TaskManager Log Page in the Actions column.
    4. On the Logs tab of the page that appears, view the log information.

      Find the last Caused by entry on this tab, which is the Caused by of the first failover. In most cases, this entry indicates the root cause of the issue and helps you quickly locate the cause.

Note If this issue persists after you perform the preceding operations, submit a ticket.

How do I troubleshoot the issue that fully managed Flink cannot write data to the result table?

If fully managed Flink cannot write data to the result table, we recommend that you perform the following operations to troubleshoot this issue:
  • Check the network connectivity between the downstream storage service and fully managed Flink.
    Fully managed Flink can access only storage services that are deployed in the same virtual private cloud (VPC) or the same region as fully managed Flink. If you need to access storage resources across multiple VPCs or access fully managed Flink over the Internet, you must first establish the required network connection between the storage service and fully managed Flink.
  • Check whether whitelists are configured for the downstream storage service.
    You must configure whitelists for the following downstream storage services: ApsaraDB RDS for MySQL, Message Queue for Apache Kafka, Elasticsearch, AnalyticDB for MySQL V3.0, ApsaraDB for HBase, ApsaraDB for Redis, and ApsaraDB for ClickHouse. To configure a whitelist, perform the following steps:
    1. Obtain the CIDR blocks of the vSwitch to which fully managed Flink belongs.

      For more information, see Configure a whitelist.

    2. Add the CIDR blocks of fully managed Flink to the whitelists of the downstream storage services.

      For more information about how to configure whitelists for the downstream storage services, see the topics that are mentioned in the prerequisites of the related DDL documentation, such as the topic that is mentioned in the prerequisites of Create an ApsaraDB RDS for MySQL result table.

  • Check whether the field type, field sequence, and field letter case defined in DDL statements are consistent with those of the physical table.

    The field types that are supported by the downstream storage services and the field types that are supported by fully managed Flink may not be completely consistent, but mapping relationships exist between them. You must declare the fields in the DDL statement based on these field type mappings, as shown in the following example. For more information, see the field type mappings in the related DDL documentation, such as Data type mapping in Create a Log Service result table.
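
    For example, the following DDL statement declares a hypothetical ApsaraDB RDS for MySQL result table. The table name, field names, and connector options are assumptions for illustration; use the options that are documented for your result table type. The field names, field order, and field letter case match the physical table, and the Flink field types follow the documented type mappings:
      -- Hypothetical example: the physical MySQL table has the columns id (BIGINT),
      -- name (VARCHAR), and amount (DECIMAL). Declare the same fields in the same
      -- order and with the mapped Flink data types.
      CREATE TEMPORARY TABLE rds_sink (
        id BIGINT,
        `name` STRING,
        amount DECIMAL(10, 2),
        PRIMARY KEY (id) NOT ENFORCED
      ) WITH (
        'connector' = 'rds',
        'url' = 'jdbc:mysql://your_rds_endpoint:3306/your_database',
        'tableName' = 'your_physical_table',
        'userName' = 'your_user',
        'password' = 'your_password'
      );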

  • Check whether data is filtered by intermediate nodes, such as a WHERE, JOIN, or window node.

    You can check the numbers of input and output records of each compute node in the vertex topology. For example, if the number of input records of a WHERE node is 5 and the number of output records is 0, all data is filtered out by the WHERE node. Therefore, no data is written to the downstream storage service.

  • Check whether the default values of output condition parameters are configured appropriately in the downstream storage service.
    If the amount of data in your data source is small but the output condition parameters configured in the DDL statement of the result table use large default values, the output conditions cannot be met. As a result, data cannot be written to the downstream storage service. In this scenario, set the output condition parameters to smaller values, as shown in the example after the parameter list. The following list describes the output condition parameters of common downstream storage services.
    • batchSize: The number of data records that are written at a time. Supported downstream storage services: DataHub, Tablestore, MongoDB, Phoenix5, ApsaraDB RDS for MySQL, AnalyticDB for MySQL V3.0, ClickHouse, and InfluxDB.
    • batchCount: The maximum number of data records that are written at a time. Supported downstream storage service: DataHub.
    • flushIntervalMs: The flush interval for the buffer of a writer in MaxCompute Tunnel. Supported downstream storage service: MaxCompute.
    • sink.buffer-flush.max-size: The maximum size, in bytes, of data that is cached in the memory before the data is written to the ApsaraDB for HBase database. Supported downstream storage service: ApsaraDB for HBase.
    • sink.buffer-flush.max-rows: The maximum number of data records that are cached in the memory before the data is written to the ApsaraDB for HBase database. Supported downstream storage service: ApsaraDB for HBase.
    • sink.buffer-flush.interval: The interval at which cached data is written to the ApsaraDB for HBase database. This parameter controls the latency of data writes to the ApsaraDB for HBase database. Supported downstream storage service: ApsaraDB for HBase.
    • jdbcWriteBatchSize: The maximum number of rows that a Hologres streaming sink node writes to Hologres at a time. Supported downstream storage service: Hologres.
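
    For example, the following hypothetical DDL statement declares an ApsaraDB RDS for MySQL result table with a reduced batchSize value so that data is flushed to the database after fewer records are received. The value and the connector options are assumptions for illustration:
      -- Hypothetical example: flush data to the database after every 50 records
      -- instead of the default batch size.
      CREATE TEMPORARY TABLE rds_sink (
        id BIGINT,
        `name` STRING,
        amount DECIMAL(10, 2),
        PRIMARY KEY (id) NOT ENFORCED
      ) WITH (
        'connector' = 'rds',
        'url' = 'jdbc:mysql://your_rds_endpoint:3306/your_database',
        'tableName' = 'your_physical_table',
        'userName' = 'your_user',
        'password' = 'your_password',
        'batchSize' = '50'
      );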
  • Check whether data output in a window fails due to out-of-order data.

    For example, if fully managed Flink first receives an abnormal record whose event time is in the year 2100, the watermark advances to the year 2100 and fully managed Flink considers that all data before the year 2100 has been processed. The subsequent normal records whose event time is in the year 2021 are discarded because their timestamps are earlier than the watermark. The window is closed and data output is triggered only when fully managed Flink receives data whose timestamp is later than the year 2100. Otherwise, no data is written to the result table.

    You can use a print result table or the Log4j configuration to check whether out-of-order data exists in the data source, as shown in the following example. For more information, see Create a print result table and Configure the logs of a historical job instance to be exported. You can filter out the out-of-order data, or process the out-of-order data by using a watermark offset that delays the triggering of window calculation.
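
    The following statements are a hypothetical sketch that prints the event-time field of the source data so that you can check the TaskManager logs for timestamps that are abnormally far in the future or out of order. The table and field names are assumptions for illustration:
      -- Hypothetical example: print the key and the event time of each record.
      -- Inspect the printed timestamps for values that are far in the future.
      CREATE TEMPORARY TABLE event_time_print (
        id BIGINT,
        order_time TIMESTAMP(3)
      ) WITH (
        'connector' = 'print'
      );

      INSERT INTO event_time_print
      SELECT id, order_time
      FROM kafka_source;  -- kafka_source is a hypothetical source table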

  • Check whether data output fails because some parallel subtasks have no data.

    If a job runs with multiple parallel subtasks but no data flows into some of the subtasks, the watermark of those subtasks remains January 1, 1970, 00:00:00 UTC. Because the smallest watermark among all parallel subtasks is used as the watermark of the job, no watermark that can close the window is generated. As a result, the window cannot be closed and no data output occurs.

    To troubleshoot the issue, check whether data flows into each parallel subtask of the source node in the vertex topology. If some subtasks have no data, we recommend that you set the parallelism of the job to a value that is less than or equal to the number of shards of the source table. This ensures that all parallel subtasks receive data.

  • Check whether data output fails because a partition of Message Queue for Apache Kafka has no data.

    If a partition of Message Queue for Apache Kafka has no data, no watermark can be generated for the partition. As a result, no data output is returned after the data of the Message Queue for Apache Kafka source table is calculated based on event time-based windows. For the solution, see After the data of a Message Queue for Apache Kafka source table is calculated by using event time-based window functions, no data output is returned. Why?. A common mitigation is shown in the following example.
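
    For example, if your Flink engine version supports the table.exec.source.idle-timeout option, you can configure a source idle timeout so that partitions that temporarily have no data do not hold back the watermark. The value is an assumption for illustration; add the setting in the Additional Configuration section:
      table.exec.source.idle-timeout: 30s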

Note If this issue persists after you perform the preceding operations, submit a ticket.

How do I troubleshoot data loss?

In most cases, when data is processed by intermediate nodes, such as WHERE, JOIN, or window nodes, the volume of data is reduced because records are filtered out or no matching records are found in the JOIN operation. If your data is missing, we recommend that you perform the following steps to troubleshoot the issue:
  • Check whether the cache policy in the dimension table is configured as expected.

    If the cache policy defined in the DDL statement of the dimension table is configured inappropriately, the system cannot obtain the expected data from the dimension table. As a result, data loss occurs. If data loss occurs, we recommend that you check and modify the cache policy settings, as shown in the following example. You can query the cache policy based on the type of the dimension table. For example, if you want to query the cache policy of ApsaraDB for HBase dimension tables, see Create an ApsaraDB for HBase dimension table.
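
    For example, the following hypothetical DDL statement declares an ApsaraDB RDS for MySQL dimension table with an LRU cache. The table name, field names, connector options, and cache parameter names are assumptions for illustration; use the cache parameters that are documented for your dimension table type:
      -- Hypothetical example: cache up to 100,000 rows and expire cached rows
      -- after 60 seconds so that joins obtain reasonably fresh data.
      CREATE TEMPORARY TABLE rds_dim (
        id BIGINT,
        city STRING,
        PRIMARY KEY (id) NOT ENFORCED
      ) WITH (
        'connector' = 'rds',
        'url' = 'jdbc:mysql://your_rds_endpoint:3306/your_database',
        'tableName' = 'your_dim_table',
        'userName' = 'your_user',
        'password' = 'your_password',
        'cache' = 'LRU',
        'cacheSize' = '100000',
        'cacheTTLMs' = '60000'
      );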

  • Check whether functions are correctly used.

    If you incorrectly use functions such as TO_TIMESTAMP_TZ and DATE_FORMAT in your job, an error occurs during data conversion. As a result, data loss occurs.

    If data loss occurs, you can print the input and output of the function that you use to logs by using a print result table or the Log4j configuration, as shown in the following example. This helps you check whether the function is correctly used. For more information, see Create a print result table and Configure the logs of a historical job instance to be exported.
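
    The following statements are a hypothetical sketch that prints the raw value and the converted value side by side. The example uses TO_TIMESTAMP for illustration; the table and field names are assumptions. A NULL value in the converted column indicates that the function call does not match the input data:
      -- Hypothetical example: print the original string and the parsed timestamp.
      -- If parsed_time is NULL, the format string does not match the input data.
      CREATE TEMPORARY TABLE fn_check_print (
        raw_time STRING,
        parsed_time TIMESTAMP(3)
      ) WITH (
        'connector' = 'print'
      );

      INSERT INTO fn_check_print
      SELECT order_time_str, TO_TIMESTAMP(order_time_str, 'yyyy-MM-dd HH:mm:ss')
      FROM kafka_source;  -- kafka_source and order_time_str are hypothetical names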

  • Check whether out-of-order data exists in the data source.
    If out-of-order data exists in your job and the event time of the out-of-order data falls outside the time range between window opening and window closing, the data is lost. For example, a record whose event time is the 11th second does not arrive until the 16th second, when the window for the 15th to 20th second is already open and the watermark has passed the 11th second. In this case, the record is dropped because the system considers it late data.

    In most cases, data is lost in a single window. You can check whether out-of-order data exists in the data source by using a print result table or the Log4j configuration. For more information, see Create a print result table and Configure the logs of a historical job instance to be exported.

    After you find the out-of-order data, you can set the watermark offset based on the degree to which the data is out of order, and process the out-of-order data by using a watermark offset that delays the triggering of window calculation. In this example, you can set the time at which a watermark is generated by using the following formula: Watermark = Event time - 5s, as shown in the example below. This way, the out-of-order data can be processed as expected. We recommend that window aggregation be performed at the exact day, hour, or minute. Otherwise, data loss may still occur even if you increase the offset.
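
    For example, the following hypothetical DDL statement generates the watermark 5 seconds behind the observed event time, which corresponds to the formula Watermark = Event time - 5s. The table name, field names, and connector options are assumptions for illustration:
      -- Hypothetical example: records that arrive up to 5 seconds late can still
      -- enter their windows because the watermark lags the event time by 5 seconds.
      CREATE TEMPORARY TABLE kafka_source (
        id BIGINT,
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
      ) WITH (
        'connector' = 'kafka',
        'topic' = 'your_topic',
        'properties.bootstrap.servers' = 'your_kafka_endpoint:9092',
        'properties.group.id' = 'your_group',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
      );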

Note If this issue persists after you perform the preceding operations, submit a ticket.