This topic provides answers to some frequently asked questions about how to troubleshoot permission issues, operations and maintenance (O&M) issues, and data exceptions of fully managed Flink.

What do I do if fully managed Flink becomes unavailable after I accidentally delete a role or change authorization policies?

  • Method 1: Create the AliyunStreamAsiDefaultRole role and attach the following custom authorization policies to the role for re-authorization: AliyunStreamAsiDefaultRolePolicy0, AliyunStreamAsiDefaultRolePolicy1, and FlinkServerlessPolicy. For more information about this method, see Manual authorization (method 1).
  • Method 2: Delete the Resource Orchestration Service (ROS) stack, the RAM roles, and the policies that are attached to the RAM roles. Then, log on to the Realtime Compute for Apache Flink console to trigger automatic re-authorization. For more information about this method, see Manual authorization (method 2).

How do I locate the error if the JobManager is not running?

The Flink UI page cannot be displayed if the JobManager is not running. To locate the error, perform the following steps:
  1. On the Deployments page, click the name of the job whose error you want to locate.
  2. Click the Events tab.
  3. Use the search shortcut of your operating system to search for error information:
    • Windows: Ctrl+F
    • macOS: Command+F

What do I do if checkpoints for Python jobs are created at a low speed?

  • Cause

    Python operators cache data before they process it. When the system creates a checkpoint, it must first process all data in the cache. Therefore, if the performance of your Python user-defined functions (UDFs) is poor, the time that is required to create a checkpoint increases, which affects the execution of Python jobs.

  • Solution
    On the right side of the Draft Editor page in the console of fully managed Flink, click the Advanced tab. In the panel that appears, configure the following parameters in the Additional Configuration section to reduce the amount of data in the cache:
    • python.fn-execution.bundle.size: the maximum number of elements that can be cached in a bundle. Default value: 100000.
    • python.fn-execution.bundle.time: the maximum duration for which the elements of a bundle are cached before they are processed. Default value: 1000. Unit: milliseconds.
    For more information about the parameters, see Flink Python Configuration.
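
The following minimal sketch shows how these keys can be set. The values are illustrative, not recommendations. You can enter the same keys as key: value lines in the Additional Configuration section, or use SET statements in an SQL draft if your console version supports them.

```sql
-- Reduce the amount of data that Python operators cache between checkpoints.
-- Illustrative values: smaller bundles shorten checkpoint times but increase
-- framework overhead, so avoid setting them too low.
SET 'python.fn-execution.bundle.size' = '10000'; -- default: 100000 elements
SET 'python.fn-execution.bundle.time' = '500';   -- default: 1000 milliseconds
```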

How do I troubleshoot the issue that fully managed Flink cannot read source data?

If fully managed Flink cannot read source data, we recommend that you perform the following operations to troubleshoot this issue:
  • Check the network connectivity between the upstream storage service and fully managed Flink.
    Fully managed Flink can access only storage services that are deployed in the same virtual private cloud (VPC) and the same region as fully managed Flink. To access storage resources across VPCs or to access fully managed Flink over the Internet, you must first establish network connectivity, for example, by using Express Connect, VPN Gateway, or an Internet NAT gateway.
  • Check whether whitelists are configured for the upstream storage services.
    You must configure whitelists for the following upstream storage services: Elasticsearch and Message Queue for Apache Kafka. To configure a whitelist, perform the following steps:
    1. Obtain the Classless Inter-Domain Routing (CIDR) blocks of the vSwitch to which fully managed Flink belongs.

      For more information, see Configure a whitelist.

    2. Configure whitelists of fully managed Flink for the upstream storage services.

      For more information about how to configure whitelists for the upstream storage services, see the topics that are mentioned in the prerequisites of the related DDL documentation, such as the topic that is mentioned in the prerequisites of Create a Message Queue for Apache Kafka source table.

  • Check whether the field types, the field order, and the letter case of field names defined in the DDL statement are consistent with those of the physical table.

    The field types supported by an upstream storage service may not exactly match the field types supported by fully managed Flink, but defined mappings exist between them. Declare the fields in your DDL statement based on these mappings (see the example DDL after this list). For more information, see the field type mappings in the related DDL documentation, such as Data type mapping in Create a Log Service source table.

  • Check whether the Taskmanager.log file of the source table contains error information.
    If the file contains error information, troubleshoot the issue based on the error messages. To view the Taskmanager.log file of the source table, perform the following steps:
    1. On the Deployments page, click the name of the job whose error you want to troubleshoot.
    2. On the Overview tab, click the box of the source node.
    3. On the SubTasks tab, find the subtask whose logs you want to view and click Open TaskManager Log Page in the Actions column.
    4. On the Logs tab of the page that appears, view the log information.

      On this tab, find the last Caused by entry, which is the Caused by of the first failover. In most cases, this entry indicates the root cause of the issue, and you can use it to quickly locate the cause.
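
For the field consistency check above, the following Flink SQL sketch declares a source table whose field names, order, and types match a hypothetical upstream Kafka topic. The topic, endpoint, and field names are placeholders for illustration; see Create a Message Queue for Apache Kafka source table for the options that your connector version supports.

```sql
CREATE TEMPORARY TABLE kafka_source (
  `user_id`  BIGINT,       -- must follow the type mapping of the physical field
  `behavior` STRING,       -- field names are case-sensitive
  `ts`       TIMESTAMP(3)  -- declared in the same order as the physical schema
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_behavior',                            -- hypothetical topic
  'properties.bootstrap.servers' = '192.168.0.1:9092',  -- hypothetical endpoint
  'properties.group.id' = 'flink-consumer',             -- hypothetical group ID
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);
```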

Note If you cannot troubleshoot this issue after you perform the preceding operations, submit a ticket.

How do I troubleshoot the issue that fully managed Flink cannot write data to the result table?

If fully managed Flink cannot write data to the result table, we recommend that you perform the following operations to troubleshoot this issue:
  • Check the network connectivity between the downstream storage service and fully managed Flink.
    Fully managed Flink can access only storage services that are deployed in the same VPC and the same region as fully managed Flink. To access storage resources across VPCs or to access fully managed Flink over the Internet, you must first establish network connectivity, for example, by using Express Connect, VPN Gateway, or an Internet NAT gateway.
  • Check whether whitelists are configured for the downstream storage service.
    You must configure whitelists for the following downstream storage services: ApsaraDB RDS for MySQL, Message Queue for Apache Kafka, Elasticsearch, AnalyticDB for MySQL V3.0, ApsaraDB for HBase, ApsaraDB for Redis, and ApsaraDB for ClickHouse. To configure a whitelist, perform the following steps:
    1. Obtain the CIDR blocks of the vSwitch to which fully managed Flink belongs.

      For more information, see Configure a whitelist.

    2. Configure whitelists of fully managed Flink for the downstream storage services.

      For more information about how to configure whitelists for the downstream storage services, see the topics that are mentioned in the prerequisites of the related DDL documentation, such as the topic that is mentioned in the prerequisites of Create an ApsaraDB RDS for MySQL result table.

  • Check whether the field types, the field order, and the letter case of field names defined in the DDL statement are consistent with those of the physical table.

    The field types supported by a downstream storage service may not exactly match the field types supported by fully managed Flink, but defined mappings exist between them. Declare the fields in your DDL statement based on these mappings (see the example DDL after this list). For more information, see the field type mappings in the related DDL documentation, such as Data type mapping in Create a Log Service result table.

  • Check whether data is filtered by intermediate nodes, such as a WHERE, JOIN, or window node.

    You can check the number of input and output data records of each compute node in the vertex topology. For example, if a WHERE node has 5 input records and 0 output records, all data is filtered out by the WHERE node. Therefore, no data is written to the downstream storage service.

  • Check whether the output condition parameters of the downstream storage service are set to appropriate values.
    If the amount of data in your data source is small but the output condition parameters configured in the DDL statement of the result table have large values, the output conditions cannot be met. As a result, data cannot be written to the downstream storage service. In this scenario, set the output condition parameters to smaller values (see the example DDL after this list). The following list describes the output condition parameters of common downstream storage services:
    • batchSize: the number of data records that are written at a time. Applies to DataHub, Tablestore, ApsaraDB for MongoDB, Phoenix5, ApsaraDB RDS for MySQL, AnalyticDB for MySQL V3.0, ApsaraDB for ClickHouse, and InfluxDB.
    • batchCount: the maximum number of data records that are written at a time. Applies to DataHub.
    • flushIntervalMs: the flush interval for the buffer of a writer in MaxCompute Tunnel. Applies to MaxCompute.
    • sink.buffer-flush.max-size: the maximum size, in bytes, of data that is cached in memory before the data is written to the ApsaraDB for HBase database. Applies to ApsaraDB for HBase.
    • sink.buffer-flush.max-rows: the maximum number of data records that are cached in memory before the data is written to the ApsaraDB for HBase database. Applies to ApsaraDB for HBase.
    • sink.buffer-flush.interval: the interval at which cached data is written to the ApsaraDB for HBase database. This parameter controls the latency of data writing. Applies to ApsaraDB for HBase.
    • jdbcWriteBatchSize: the maximum number of rows that a Hologres streaming sink node can write to Hologres at a time. Applies to Hologres.
  • Check whether data output in a window fails due to out-of-order data.

    For example, if fully managed Flink receives an out-of-order record whose event time is in the year 2100, the watermark advances to the year 2100, and fully managed Flink considers all data whose event time is earlier than the year 2100 to be complete. The subsequent normal records of the year 2021 are discarded because their event times are earlier than the watermark. The window is closed and data output is triggered only when fully managed Flink receives data whose event time is later than the year 2100. Otherwise, no data output appears in the result table.

    You can use a print result table or the Log4j configuration to check whether out-of-order data exists in the data source. For more information, see Create a print result table and Configure job logs. You can then filter out the out-of-order data, or tolerate it by defining a watermark with an out-of-orderness delay (see the example after this list).

  • Check whether data output fails because some parallel subtasks have no data.

    If a source runs with multiple parallel subtasks and some of the subtasks receive no data, the watermarks of those subtasks remain at January 1, 1970, 00:00:00 UTC. Because the overall watermark is the minimum of the watermarks of all parallel subtasks, no watermark that can close the window is generated. As a result, the window cannot be closed and no data output occurs.

    To troubleshoot this issue, check in the vertex topology whether each parallel subtask of the source receives data. If some subtasks have no data, we recommend that you set the parallelism to a value that is less than or equal to the number of shards of the source table. This ensures that every parallel subtask receives data.

  • Check whether data output fails because a partition of Message Queue for Apache Kafka has no data.

    If a partition of Message Queue for Apache Kafka has no data, no watermark can be generated for that partition. As a result, no data output is returned after the data of the Message Queue for Apache Kafka source table is calculated by using event time-based window functions. For the solution, see After the data of a Message Queue for Apache Kafka source table is calculated by using event time-based window functions, no data output is returned. Why?.
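
To illustrate the field mapping and output condition checks above, the following sketch declares a hypothetical ApsaraDB RDS for MySQL result table and sets batchSize to a small value so that a low-volume source still triggers writes. The connection values are placeholders, and option names can differ between connector versions; see Create an ApsaraDB RDS for MySQL result table for the authoritative parameters.

```sql
CREATE TEMPORARY TABLE rds_sink (
  `user_id` BIGINT,                     -- types must follow the mappings
  `cnt`     BIGINT,                     -- defined for the physical MySQL table
  PRIMARY KEY (`user_id`) NOT ENFORCED
) WITH (
  'connector' = 'rds',
  'url' = 'jdbc:mysql://<host>:3306/<database>',  -- placeholder endpoint
  'tableName' = '<table>',                        -- placeholder table name
  'userName' = '<user>',
  'password' = '<password>',
  'batchSize' = '50'  -- flush every 50 records instead of the larger default
);
```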

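To tolerate out-of-order data instead of discarding it, you can subtract an out-of-orderness delay when you define the watermark. The following sketch uses hypothetical field names and the built-in datagen connector so that it is self-contained; in practice, declare the watermark on your real source table.

```sql
CREATE TEMPORARY TABLE orders (
  `order_id`   BIGINT,
  `order_time` TIMESTAMP(3),
  -- The watermark trails the maximum observed event time by 5 seconds, so
  -- records that arrive up to 5 seconds out of order still enter their windows.
  WATERMARK FOR `order_time` AS `order_time` - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen'  -- hypothetical self-contained source for this sketch
);
```

The delay trades latency for completeness: a larger delay waits longer before windows close but drops fewer late records.
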
Note If you cannot troubleshoot this issue after you perform the preceding operations, submit a ticket.