This topic describes connector behavior and actions to take when an upstream or downstream system scales, along with necessary precautions.
Connector behavior and recommended actions
When a job encounters a fault, such as a task failure or vertex downtime, Realtime Compute for Apache Flink performs a failover to recover the job automatically. The failover aims to restore the job to a normal state and guarantees accurate and consistent data processing results.
Connector | Connector behavior | Recommended action | Writing to sink dependent on checkpointing? |
When Kafka partitions are added, Realtime Compute for Apache Flink dynamically detects the change. However, if the new partition count is not an integer multiple of the job parallelism, data is not distributed evenly. | After adding Kafka partitions, set the job parallelism to a divisor of the partition count to ensure even data distribution. For example, if the Kafka partition count increases from three to eight, change the parallelism to four or eight. | Yes, for exactly-once delivery | |
During instance scaling or restart, connections may be disrupted. In this case, Realtime Compute for Apache Flink tries to re-establish the connection until the attempt times out. After that, the job keeps failing over until the Hologres instance has restarted. | Perform a stateless startup, because Realtime Compute for Apache Flink reads a Hologres table based on its table name. | No | |
| After a failover, your job adapts to the new partition count. To prevent a job failover, you can manually restart the job after changing the partition count. | No | |
The connection may be disrupted due to instance scaling or restart. If this happens, Realtime Compute for Apache Flink detects the disconnection and restarts the task. If the database endpoint is unchanged and the database service is still available, the connector attempts to recreate the connection to recover the job. Note (how this works): Generally, when an external system fails, these connectors first attempt to reconnect. If the attempts succeed, the job continues without triggering a failover. However, if all attempts fail because the external system is unavailable for a prolonged period, the connectors throw exceptions, which causes the task to fail. Realtime Compute for Apache Flink then fails over and recovers your job based on predefined strategies. After recovery, tasks are re-orchestrated, and the connectors try to connect to the external systems again. | Evaluate the impact of a job restart before performing any scaling operation.
Note: A primary-secondary switchover or cluster restart can cause a temporary connection disruption. If the connection is not recovered for a prolonged period, a failover is triggered. To avoid this, cancel the job and restart it after the configuration change is complete. | No | |
No | |||
No | |||
No | |||
No | |||
No | |||
No | |||
No | |||
No | |||
No | |||
N/A | |||
Yes | |||
Yes, for exactly-once delivery | |||
Following a scale-down operation, if MaxCompute lacks sufficient compute resources to write data to or read data from Realtime Compute for Apache Flink at the current job parallelism, the affected subtasks throw errors until resources become available. | Before performing a scale-down, carefully evaluate data traffic. Alternatively, reduce the job parallelism. | Batch Tunnel mode: Yes | |
The connector cannot automatically detect partition count changes. | Manually restart your job so it can adapt to the changes. | No | |
|
| No | |
No | |||
If If it is set to true, perform actions based on the value of the | Perform actions based on the value of the
| No | |
| Automatic adaptation may cause repeated data consumption. If this is unacceptable, cancel the job before changing the partition count. Once the changes are complete, restart the job from the last checkpoint. | Yes | |
No data exists in the connector or is sent by the buffer. | N/A | Yes | |
Data writing is not affected. | N/A | No | |
During the data reading stage, topology changes will cause the | If your actions cause a topology change, we recommend that you cancel the job first and restart the job when your operations are complete and the cluster returns to normal. | Yes | |
Not applicable. This is because an independent metadata layer is maintained to describe data structures and status, and no scaling is involved for these systems. | N/A | Yes | |
N/A | Yes | ||
N/A | Yes | ||
N/A | Yes | ||
Not applicable, because these connectors are used for testing only. | N/A | N/A | |
N/A | N/A | ||
N/A | N/A | ||
N/A | N/A |
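
The Kafka recommendation in the table (choose a parallelism that divides the partition count) can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the product: the function names `even_parallelism_options` and `partitions_per_subtask` are hypothetical, and the sketch assumes that partitions are spread across subtasks as evenly as possible, which may differ from the connector's actual assignment strategy.

```python
from typing import List, Optional


def even_parallelism_options(partition_count: int,
                             max_parallelism: Optional[int] = None) -> List[int]:
    """Return every parallelism value that divides partition_count evenly.

    Hypothetical helper for illustration only.
    """
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    limit = partition_count if max_parallelism is None else min(partition_count, max_parallelism)
    return [p for p in range(1, limit + 1) if partition_count % p == 0]


def partitions_per_subtask(partition_count: int, parallelism: int) -> List[int]:
    """Return how many partitions each subtask would read.

    Assumption: partitions are spread as evenly as possible, so the first
    (partition_count % parallelism) subtasks receive one extra partition.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    base, extra = divmod(partition_count, parallelism)
    return [base + 1 if i < extra else base for i in range(parallelism)]


if __name__ == "__main__":
    # Partition count increased from three to eight, as in the table's example.
    print(even_parallelism_options(8))    # candidate parallelism values
    print(partitions_per_subtask(8, 3))   # old parallelism: uneven load
    print(partitions_per_subtask(8, 4))   # divisor of 8: even load
```

With eight partitions, keeping the old parallelism of three leaves one subtask with fewer partitions than the others, whereas four or eight gives every subtask the same number.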