This topic provides answers to some frequently asked questions about how to troubleshoot permission issues, operations and maintenance (O&M) issues, and data exceptions of fully managed Flink.

What do I do if fully managed Flink becomes unavailable after I delete a role or change authorization policies?

  • Method 1: Create the AliyunStreamAsiDefaultRole role and attach the following custom authorization policies to the role for re-authorization: AliyunStreamAsiDefaultRolePolicy0, AliyunStreamAsiDefaultRolePolicy1, and FlinkServerlessPolicy. For more information about this method, see Manual authorization (method 1).
  • Method 2: Delete the Resource Orchestration Service (ROS) stack, the RAM roles, and the policies that are attached to the RAM roles. Then, log on to the Realtime Compute for Apache Flink console to trigger automatic re-authorization. For more information about this method, see Manual authorization (method 2).

How do I locate the error if the JobManager is not running?

The Flink UI page does not appear because the JobManager is not running as expected. To identify the cause of the error, perform the following steps:
  1. On the Deployments page, click the name of the job whose error you want to identify.
  2. Click the Events tab.
  3. Use the search shortcut of your operating system to search the events for errors and obtain the error information.
    • Windows: Ctrl+F
    • macOS: Command+F

What do I do if checkpoints for Python jobs are created at a low speed?

  • Cause

    Python operators contain a cache. When the system creates a checkpoint, the system must process all data in the cache. If the performance of Python user-defined functions (UDFs) is poor, the time that is required to create a checkpoint increases. This affects the execution of Python jobs.

  • Solution
    On the right side of the Draft Editor page in the console of fully managed Flink, click the Advanced tab. In the panel that appears, configure the following parameters in the Additional Configuration section to reduce the amount of data in the cache, as shown in the example that follows the parameter descriptions:
    • python.fn-execution.bundle.size: The maximum number of elements that can be buffered in a bundle before they are processed. Default value: 100000.
    • python.fn-execution.bundle.time: The maximum duration for which elements can be buffered before they are processed. Default value: 1000. Unit: milliseconds.
    For more information about the parameters, see Flink Python Configuration.
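
    The following settings are an illustrative example. The values are assumptions; tune them based on the throughput and checkpoint duration of your job. Smaller values shorten the time that is required to process the cached data when a checkpoint is created, but excessively small values may reduce throughput.
      python.fn-execution.bundle.size: 10000
      python.fn-execution.bundle.time: 500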

What do I do if the error message "Invalid versionName string" is reported?

  • Problem description
    When a job is started, the error message "Invalid versionName string" is reported.
    Note SQL jobs are not affected. Java Archive (JAR) or Python jobs of the following Flink engine versions are affected: Flink 1.10.4 or an earlier minor version, and Flink 1.11.3 or an earlier minor version.
  • Cause

    When you create a JAR or Python job to be deployed in a session cluster, the Flink engine version is not configured for the job.

  • Solution

    On the Draft Editor page, configure the Flink engine version for the job. Then, publish and start the job again.

    If a job that is deployed in a session cluster is created by using an SDK, we recommend that you update the version of the common dependency package of the SDK to 1.0.21 and configure the Flink engine version for the job when you create an artifact. Examples:
    • Configuration of the common dependency package
      <dependency>
          <groupId>com.aliyun</groupId>
          <artifactId>ververica-common</artifactId>
          <version>1.0.21</version>
      </dependency>
    • Configuration of the Flink engine version when you create an artifact
      com.ververica.common.model.deployment.Artifact.SqlScriptArtifact#setVersionName
      com.ververica.common.model.deployment.Artifact.JarArtifact#setVersionName
      Note The value of versionName in the configuration must be the same as the Flink engine version that you configured for the session cluster in which the job is deployed.

How do I troubleshoot the issue that fully managed Flink cannot read source data?

If fully managed Flink cannot read source data, we recommend that you perform the following steps to troubleshoot this issue:
  • Check the network connectivity between the upstream storage service and fully managed Flink.
    Fully managed Flink can access only storage services that are deployed in the same virtual private cloud (VPC) or the same region as fully managed Flink. If you need to access storage resources across multiple VPCs or access fully managed Flink over the Internet, you must first establish the required network connection between the storage service and fully managed Flink.
  • Check whether whitelists are configured for the upstream storage services.
    You must configure whitelists for the following upstream storage services: Elasticsearch and Message Queue for Apache Kafka. To configure a whitelist, perform the following steps:
    1. Obtain the CIDR blocks of the vSwitch to which fully managed Flink belongs.

      For more information, see Configure a whitelist.

    2. Add the CIDR blocks of fully managed Flink to the whitelists of the upstream storage services.

      For more information about how to configure whitelists for the upstream storage services, see the topics that are mentioned in the prerequisites of the related DDL documentation, such as the topic that is mentioned in the prerequisites of Create a Message Queue for Apache Kafka source table.

  • Check whether the field type, field sequence, and field letter case defined in DDL statements are consistent with those of the physical table.

    The field types that are supported by the upstream storage services and the field types that are supported by fully managed Flink may not be completely consistent, but mapping relationships exist between them. You must declare the fields in the DDL statement based on these field type mappings, as shown in the following example. For more information, see the field type mappings in the related DDL documentation, such as Data type mapping in Create a Log Service source table.
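
    For example, the following DDL statement declares a hypothetical Message Queue for Apache Kafka source table. The table name, field names, and connector options are assumptions for illustration. The field names, field order, and field letter case match the physical data, and the Flink field types follow the documented type mappings:
      -- Hypothetical example: the physical data contains the fields id (long),
      -- name (string), and order_time (timestamp). Declare the same fields in the
      -- same order and with the mapped Flink data types.
      CREATE TEMPORARY TABLE kafka_source (
        id BIGINT,
        `name` STRING,
        order_time TIMESTAMP(3)
      ) WITH (
        'connector' = 'kafka',
        'topic' = 'your_topic',
        'properties.bootstrap.servers' = 'your_kafka_endpoint:9092',
        'properties.group.id' = 'your_group',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
      );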

  • Check whether the Taskmanager.log file of the source table contains error information.
    If the file contains error information, troubleshoot the error based on the error information. To view the Taskmanager.log file of the source table, perform the following steps:
    1. On the Deployments page, click the name of the job whose error you want to troubleshoot.
    2. On the Overview tab, click the box of the source node.
    3. On the SubTasks tab, find the subtask whose logs you want to view and click Open TaskManager Log Page in the Actions column.
    4. On the Logs tab of the page that appears, view the log information.

      Find the last Caused by entry on this tab, which is the Caused by of the first failover. In most cases, this entry indicates the root cause of the issue and helps you quickly locate the cause.

Note If this issue persists after you perform the preceding operations, submit a ticket.

How do I troubleshoot the issue that fully managed Flink cannot write data to the result table?

If fully managed Flink cannot write data to the result table, we recommend that you perform the following operations to troubleshoot this issue:
  • Check the network connectivity between the downstream storage service and fully managed Flink.
    Fully managed Flink can access only storage services that are deployed in the same virtual private cloud (VPC) or the same region as fully managed Flink. If you need to access storage resources across multiple VPCs or access fully managed Flink over the Internet, you must first establish the required network connection between the storage service and fully managed Flink.
  • Check whether whitelists are configured for the downstream storage service.
    You must configure whitelists for the following downstream storage services: ApsaraDB RDS for MySQL, Message Queue for Apache Kafka, Elasticsearch, AnalyticDB for MySQL V3.0, ApsaraDB for HBase, ApsaraDB for Redis, and ApsaraDB for ClickHouse. To configure a whitelist, perform the following steps:
    1. Obtain the CIDR blocks of the vSwitch to which fully managed Flink belongs.

      For more information, see Configure a whitelist.

    2. Add the CIDR blocks of fully managed Flink to the whitelists of the downstream storage services.

      For more information about how to configure whitelists for the downstream storage services, see the topics that are mentioned in the prerequisites of the related DDL documentation, such as the topic that is mentioned in the prerequisites of Create an ApsaraDB RDS for MySQL result table.

  • Check whether the field type, field sequence, and field letter case defined in DDL statements are consistent with those of the physical table.

    The field types that are supported by the downstream storage services and the field types that are supported by fully managed Flink may not be completely consistent, but mapping relationships exist between them. You must declare the fields in the DDL statement based on these field type mappings, as shown in the following example. For more information, see the field type mappings in the related DDL documentation, such as Data type mapping in Create a Log Service result table.
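
    For example, the following DDL statement declares a hypothetical ApsaraDB RDS for MySQL result table. The table name, field names, and connector options are assumptions for illustration; use the options that are documented for your result table type. The field names, field order, and field letter case match the physical table, and the Flink field types follow the documented type mappings:
      -- Hypothetical example: the physical MySQL table has the columns id (BIGINT),
      -- name (VARCHAR), and amount (DECIMAL). Declare the same fields in the same
      -- order and with the mapped Flink data types.
      CREATE TEMPORARY TABLE rds_sink (
        id BIGINT,
        `name` STRING,
        amount DECIMAL(10, 2),
        PRIMARY KEY (id) NOT ENFORCED
      ) WITH (
        'connector' = 'rds',
        'url' = 'jdbc:mysql://your_rds_endpoint:3306/your_database',
        'tableName' = 'your_physical_table',
        'userName' = 'your_user',
        'password' = 'your_password'
      );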

  • Check whether data is filtered by intermediate nodes, such as a WHERE, JOIN, or window node.

    You can check the numbers of input and output records of each compute node in the vertex topology. For example, if the number of input records of a WHERE node is 5 and the number of output records is 0, all data is filtered out by the WHERE node. Therefore, no data is written to the downstream storage service.

  • Check whether the default values of output condition parameters are configured appropriately in the downstream storage service.
    If the amount of data in your data source is small but the output condition parameters configured in the DDL statement of the result table use large default values, the output conditions cannot be met. As a result, data cannot be written to the downstream storage service. In this scenario, set the output condition parameters to smaller values, as shown in the example after the parameter list. The following list describes the output condition parameters of common downstream storage services.
    • batchSize: The number of data records that are written at a time. Supported downstream storage services: DataHub, Tablestore, MongoDB, Phoenix5, ApsaraDB RDS for MySQL, AnalyticDB for MySQL V3.0, ClickHouse, and InfluxDB.
    • batchCount: The maximum number of data records that are written at a time. Supported downstream storage service: DataHub.
    • flushIntervalMs: The flush interval for the buffer of a writer in MaxCompute Tunnel. Supported downstream storage service: MaxCompute.
    • sink.buffer-flush.max-size: The maximum size, in bytes, of data that is cached in the memory before the data is written to the ApsaraDB for HBase database. Supported downstream storage service: ApsaraDB for HBase.
    • sink.buffer-flush.max-rows: The maximum number of data records that are cached in the memory before the data is written to the ApsaraDB for HBase database. Supported downstream storage service: ApsaraDB for HBase.
    • sink.buffer-flush.interval: The interval at which cached data is written to the ApsaraDB for HBase database. This parameter controls the latency of data writes to the ApsaraDB for HBase database. Supported downstream storage service: ApsaraDB for HBase.
    • jdbcWriteBatchSize: The maximum number of rows that a Hologres streaming sink node writes to Hologres at a time. Supported downstream storage service: Hologres.
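
    For example, the following hypothetical DDL statement declares an ApsaraDB RDS for MySQL result table with a reduced batchSize value so that data is flushed to the database after fewer records are received. The value and the connector options are assumptions for illustration:
      -- Hypothetical example: flush data to the database after every 50 records
      -- instead of the default batch size.
      CREATE TEMPORARY TABLE rds_sink (
        id BIGINT,
        `name` STRING,
        amount DECIMAL(10, 2),
        PRIMARY KEY (id) NOT ENFORCED
      ) WITH (
        'connector' = 'rds',
        'url' = 'jdbc:mysql://your_rds_endpoint:3306/your_database',
        'tableName' = 'your_physical_table',
        'userName' = 'your_user',
        'password' = 'your_password',
        'batchSize' = '50'
      );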
  • Check whether data output in a window fails due to out-of-order data.

    For example, if fully managed Flink first receives an abnormal record whose event time is in the year 2100, the watermark advances to the year 2100 and fully managed Flink considers that all data before the year 2100 has been processed. The subsequent normal records whose event time is in the year 2021 are discarded because their timestamps are earlier than the watermark. The window is closed and data output is triggered only when fully managed Flink receives data whose timestamp is later than the year 2100. Otherwise, no data is written to the result table.

    You can use a print result table or the Log4j configuration to check whether out-of-order data exists in the data source, as shown in the following example. For more information, see Create a print result table and Configure the logs of a historical job instance to be exported. You can filter out the out-of-order data, or process the out-of-order data by using a watermark offset that delays the triggering of window calculation.
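
    The following statements are a hypothetical sketch that prints the event-time field of the source data so that you can check the TaskManager logs for timestamps that are abnormally far in the future or out of order. The table and field names are assumptions for illustration:
      -- Hypothetical example: print the key and the event time of each record.
      -- Inspect the printed timestamps for values that are far in the future.
      CREATE TEMPORARY TABLE event_time_print (
        id BIGINT,
        order_time TIMESTAMP(3)
      ) WITH (
        'connector' = 'print'
      );

      INSERT INTO event_time_print
      SELECT id, order_time
      FROM kafka_source;  -- kafka_source is a hypothetical source table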

  • Check whether data output fails because some parallel subtasks have no data.

    If a job runs with multiple parallel subtasks but no data flows into some of the subtasks, the watermark of those subtasks remains January 1, 1970, 00:00:00 UTC. Because the smallest watermark among all parallel subtasks is used as the watermark of the job, no watermark that can close the window is generated. As a result, the window cannot be closed and no data output occurs.

    To troubleshoot the issue, check whether data flows into each parallel subtask of the source node in the vertex topology. If some subtasks have no data, we recommend that you set the parallelism of the job to a value that is less than or equal to the number of shards of the source table. This ensures that all parallel subtasks receive data.

  • Check whether data output fails because a partition of Message Queue for Apache Kafka has no data.

    If a partition of Message Queue for Apache Kafka has no data, no watermark can be generated for the partition. As a result, no data output is returned after the data of the Message Queue for Apache Kafka source table is calculated based on event time-based windows. For the solution, see After the data of a Message Queue for Apache Kafka source table is calculated by using event time-based window functions, no data output is returned. Why?. A common mitigation is shown in the following example.
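
    For example, if your Flink engine version supports the table.exec.source.idle-timeout option, you can configure a source idle timeout so that partitions that temporarily have no data do not hold back the watermark. The value is an assumption for illustration; add the setting in the Additional Configuration section:
      table.exec.source.idle-timeout: 30s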

Note If this issue persists after you perform the preceding operations, submit a ticket.

How do I troubleshoot data loss?

In most cases, when data is processed by intermediate nodes, such as WHERE, JOIN, or window nodes, the volume of data is reduced because records are filtered out or no matching records are found in the JOIN operation. If your data is missing, we recommend that you perform the following steps to troubleshoot the issue:
  • Check whether the cache policy in the dimension table is configured as expected.

    If the cache policy defined in the DDL statement of the dimension table is configured inappropriately, the system cannot obtain the expected data from the dimension table. As a result, data loss occurs. If data loss occurs, we recommend that you check and modify the cache policy settings, as shown in the following example. You can query the cache policy based on the type of the dimension table. For example, if you want to query the cache policy of ApsaraDB for HBase dimension tables, see Create an ApsaraDB for HBase dimension table.
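
    For example, the following hypothetical DDL statement declares an ApsaraDB RDS for MySQL dimension table with an LRU cache. The table name, field names, connector options, and cache parameter names are assumptions for illustration; use the cache parameters that are documented for your dimension table type:
      -- Hypothetical example: cache up to 100,000 rows and expire cached rows
      -- after 60 seconds so that joins obtain reasonably fresh data.
      CREATE TEMPORARY TABLE rds_dim (
        id BIGINT,
        city STRING,
        PRIMARY KEY (id) NOT ENFORCED
      ) WITH (
        'connector' = 'rds',
        'url' = 'jdbc:mysql://your_rds_endpoint:3306/your_database',
        'tableName' = 'your_dim_table',
        'userName' = 'your_user',
        'password' = 'your_password',
        'cache' = 'LRU',
        'cacheSize' = '100000',
        'cacheTTLMs' = '60000'
      );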

  • Check whether functions are correctly used.

    If you incorrectly use functions such as TO_TIMESTAMP_TZ and DATE_FORMAT in your job, an error occurs during data conversion. As a result, data loss occurs.

    If data loss occurs, you can print the input and output of the function that you use to logs by using a print result table or the Log4j configuration, as shown in the following example. This helps you check whether the function is correctly used. For more information, see Create a print result table and Configure the logs of a historical job instance to be exported.
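
    The following statements are a hypothetical sketch that prints the raw value and the converted value side by side. The example uses TO_TIMESTAMP for illustration; the table and field names are assumptions. A NULL value in the converted column indicates that the function call does not match the input data:
      -- Hypothetical example: print the original string and the parsed timestamp.
      -- If parsed_time is NULL, the format string does not match the input data.
      CREATE TEMPORARY TABLE fn_check_print (
        raw_time STRING,
        parsed_time TIMESTAMP(3)
      ) WITH (
        'connector' = 'print'
      );

      INSERT INTO fn_check_print
      SELECT order_time_str, TO_TIMESTAMP(order_time_str, 'yyyy-MM-dd HH:mm:ss')
      FROM kafka_source;  -- kafka_source and order_time_str are hypothetical names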

  • Check whether out-of-order data exists in the data source.
    If out-of-order data exists in your job and the event time of the out-of-order data falls outside the time range between window opening and window closing, the data is lost. For example, a record whose event time is the 11th second does not arrive until the 16th second, when the window for the 15th to 20th second is already open and the watermark has passed the 11th second. In this case, the record is dropped because the system considers it late data.

    In most cases, data is lost in a single window. You can check whether out-of-order data exists in the data source by using a print result table or the Log4j configuration. For more information, see Create a print result table and Configure the logs of a historical job instance to be exported.

    After you find the out-of-order data, you can set the watermark offset based on the degree to which the data is out of order, and process the out-of-order data by using a watermark offset that delays the triggering of window calculation. In this example, you can set the time at which a watermark is generated by using the following formula: Watermark = Event time - 5s, as shown in the example below. This way, the out-of-order data can be processed as expected. We recommend that window aggregation be performed at the exact day, hour, or minute. Otherwise, data loss may still occur even if you increase the offset.
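
    For example, the following hypothetical DDL statement generates the watermark 5 seconds behind the observed event time, which corresponds to the formula Watermark = Event time - 5s. The table name, field names, and connector options are assumptions for illustration:
      -- Hypothetical example: records that arrive up to 5 seconds late can still
      -- enter their windows because the watermark lags the event time by 5 seconds.
      CREATE TEMPORARY TABLE kafka_source (
        id BIGINT,
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
      ) WITH (
        'connector' = 'kafka',
        'topic' = 'your_topic',
        'properties.bootstrap.servers' = 'your_kafka_endpoint:9092',
        'properties.group.id' = 'your_group',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
      );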

Note If this issue persists after you perform the preceding operations, submit a ticket.