
Realtime Compute for Apache Flink: Handle the scaling of upstream and downstream systems

Last Updated: May 23, 2025

This topic describes connector behavior and actions to take when an upstream or downstream system scales, along with necessary precautions.

Connector behavior and recommended actions

Note

When a job encounters faults, such as task failures or vertex downtime, Realtime Compute for Apache Flink performs a failover to automatically recover the job to a normal state while guaranteeing accurate and consistent data processing results.

The following entries describe, for each connector, its behavior when the upstream or downstream system scales, the recommended action, and whether writing to the sink depends on checkpointing.

Kafka connector and Upsert Kafka connector

  • Connector behavior: When Kafka partitions are added, Realtime Compute for Apache Flink detects the change dynamically. However, if the new partition count is not an integer multiple of the job parallelism, data is not distributed evenly across subtasks.

  • Recommended action: After adding Kafka partitions, adjust the job parallelism to a divisor of the partition count to ensure even data distribution, as shown in the sketch below. For example, when the Kafka partition count is increased from three to eight, change the parallelism to four or eight.

  • Writing to sink dependent on checkpointing: Yes, for exactly-once delivery.
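To illustrate the divisor rule above, the following Java sketch picks the largest divisor of the partition count that fits within an available-slot cap. The class and method names and the cap are hypothetical illustrations, not part of any connector API.

```java
// Minimal sketch: choose a job parallelism that divides the Kafka
// partition count evenly, so that every subtask reads the same number
// of partitions. Names and values are illustrative only.
public final class ParallelismPlanner {

    /** Returns the largest divisor of partitionCount that is <= maxParallelism. */
    static int evenParallelism(int partitionCount, int maxParallelism) {
        for (int p = Math.min(partitionCount, maxParallelism); p >= 1; p--) {
            if (partitionCount % p == 0) {
                return p; // each subtask then reads partitionCount / p partitions
            }
        }
        return 1; // unreachable for positive inputs; kept for safety
    }

    public static void main(String[] args) {
        // Partition count raised from three to eight with six slots available:
        // the largest divisor of 8 that does not exceed 6 is 4.
        System.out.println(evenParallelism(8, 6)); // prints 4
    }
}
```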

Hologres connector

  • Connector behavior: During instance scaling or restart, connections may be disrupted. In this case, Realtime Compute for Apache Flink tries to re-establish the connection until the timeout is reached, and then makes failover attempts until the Hologres instance restart is complete.

  • Recommended action: Perform a stateless startup, because Realtime Compute for Apache Flink reads a Hologres table based on its table name.

  • Writing to sink dependent on checkpointing: No.

Simple Log Service connector

  • Connector behavior:

    • VVR 8.0.8 or earlier: Realtime Compute for Apache Flink fails over your job to adapt to partition changes.

    • VVR 8.0.9 or later: If the enableNewSource option is set to true, Realtime Compute for Apache Flink does not fail over the job. Additionally, if the shardDiscoveryIntervalMs option is set, Realtime Compute for Apache Flink periodically detects partition count changes.

  • Recommended action: After a failover, your job adapts to the new partition count. To prevent job failover, manually restart the job after changing the partition count. A configuration sketch follows this entry.

  • Writing to sink dependent on checkpointing: No.
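The following Java sketch shows how the two options above could be set on a Simple Log Service source table. Only enableNewSource and shardDiscoveryIntervalMs come from this topic; the connector identifier, the connection options, and all placeholder values are assumptions for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Minimal sketch: a Simple Log Service source that opts into the new
// source (no failover on shard changes, VVR 8.0.9 or later) and
// periodic shard discovery. Connection options are placeholders.
public class SlsSourceSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE sls_source (" +
                "  `message` STRING" +
                ") WITH (" +
                "  'connector' = 'sls'," +                 // assumed connector id
                "  'endpoint' = '<yourEndpoint>'," +       // placeholder
                "  'project' = '<yourProject>'," +         // placeholder
                "  'logstore' = '<yourLogstore>'," +       // placeholder
                "  'enableNewSource' = 'true'," +          // do not fail over on shard changes
                "  'shardDiscoveryIntervalMs' = '60000'" + // check shard count every 60s
                ")");
    }
}
```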

MySQL connector and other database connectors

The following behavior and recommended action apply to the MySQL, ApsaraDB RDS for MySQL, JDBC, AnalyticDB for PostgreSQL, AnalyticDB for MySQL V3.0, TSDB for InfluxDB, OceanBase (public preview), PolarDB for PostgreSQL, Lindorm, ApsaraDB for HBase, PostgreSQL CDC (public preview), Elasticsearch, and StarRocks connectors.

  • Connector behavior: The connection may be disrupted due to instance scaling or restart. If this happens, Realtime Compute for Apache Flink detects the disconnection and restarts the task. If the database endpoint is unchanged and the database service is still available, the connector attempts to re-create the connection to recover the job.

    Note: How this works. Generally, when an external system fails, these connectors first attempt to reconnect. If these attempts succeed, the job continues without triggering a failover. However, if all attempts fail because the external system is unavailable for a prolonged period, the connectors throw exceptions, causing task failure. Realtime Compute for Apache Flink then fails over and recovers your job based on predefined strategies (see the sketch after this entry). After recovery, tasks are re-orchestrated, and the connectors try to create connections with the external systems again.

  • Recommended action: Evaluate the impact of a job restart before performing any scaling operation.

    • If the endpoint changes, update your code accordingly, re-deploy the program, and start your job.

    • If the endpoint does not change, no restart is needed.

    Note: A primary-secondary switchover or a cluster restart can cause temporary connection disruption. If the connection is not recovered for a prolonged period of time, a failover is triggered. To avoid this, cancel the job and restart it after the configuration change is complete.

  • Writing to sink dependent on checkpointing:

    • No for the MySQL, ApsaraDB RDS for MySQL, JDBC, AnalyticDB for PostgreSQL, AnalyticDB for MySQL V3.0, TSDB for InfluxDB, OceanBase (public preview), PolarDB for PostgreSQL, Lindorm, and ApsaraDB for HBase connectors.

    • N/A for the PostgreSQL CDC connector (public preview).

    • Yes for the Elasticsearch connector, and yes, for exactly-once delivery, for the StarRocks connector.
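The predefined recovery strategies mentioned in the note above are Flink restart strategies. The following Java sketch bounds how long reconnect-and-restart cycles can continue by configuring a fixed-delay strategy; the attempt count and delay are illustrative values, not recommendations from this topic.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Minimal sketch: a fixed-delay restart strategy that rides out a short
// primary-secondary switchover without the job failing terminally.
public class RestartStrategySketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Restart the job up to 10 times, waiting 30 seconds between
        // attempts, before giving up and failing the job.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(10, Time.seconds(30)));
        // Define sources and sinks here, then:
        // env.execute("restart-strategy-sketch");
    }
}
```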

MaxCompute connector

  • Connector behavior: After a scale-down, if MaxCompute lacks sufficient compute resources to serve reads and writes from Realtime Compute for Apache Flink at the current job parallelism, the affected subtasks throw errors until resources become available.

  • Recommended action: Before performing a scale-down, carefully evaluate data traffic, or reduce the job parallelism.

  • Writing to sink dependent on checkpointing: Yes, in Batch Tunnel mode.

DataHub connector

  • Connector behavior: The connector cannot automatically detect partition count changes.

  • Recommended action: Manually restart your job so that it adapts to the changes.

  • Writing to sink dependent on checkpointing: No.

Tair (Redis OSS-compatible) connector and Tair connector

  • Connector behavior:

    • Standard master-replica instances: Scaling, including adding and removing shards, is imperceptible to the job.

    • Cluster instances: Modifying instance configurations can cause transient disconnections. To handle this, Realtime Compute for Apache Flink tries to re-establish the connection. If these attempts fail, a failover is triggered.

  • Recommended action:

    • Standard master-replica instances: N/A.

    • Cluster instances: To prevent failovers and speed up adaptation, manually restart the job after the configuration change is complete.

  • Writing to sink dependent on checkpointing: No.

ClickHouse connector

If shardWrite is set to false, your job is not restarted.

If it is set to true, perform actions based on the value of the inferLocalTable option.

Perform actions based on the value of the inferLocalTable option:

  • If it is set to false (the default value), you can add a node's IP address to the url and then restart the job.

  • If it is set to true, you can manually restart the job. The node of the local table is automatically inferred.

No
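The following Java sketch illustrates the inferLocalTable=false case: a new node's address is appended to the url option before the job is restarted. The option names url, shardWrite, and inferLocalTable come from this topic; the connector identifier, the url format, and the remaining options are assumptions for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Minimal sketch: a ClickHouse sink whose url lists cluster nodes
// explicitly. After the cluster gains node 10.0.0.3, its address is
// appended to 'url' and the job is restarted.
public class ClickHouseSinkSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE ch_sink (" +
                "  id BIGINT," +
                "  payload STRING" +
                ") WITH (" +
                "  'connector' = 'clickhouse'," + // assumed connector id
                // New node 10.0.0.3 appended alongside the existing nodes:
                "  'url' = 'clickhouse://10.0.0.1:8123,10.0.0.2:8123,10.0.0.3:8123'," +
                "  'tableName' = '<yourTable>'," + // placeholder option name
                "  'shardWrite' = 'true'," +
                "  'inferLocalTable' = 'false'" +
                ")");
    }
}
```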

ApsaraMQ for RocketMQ connector

  • Connector behavior:

    • ApsaraMQ for RocketMQ 4.x: Realtime Compute for Apache Flink performs a failover so that the job adapts to partition count changes.

    • ApsaraMQ for RocketMQ 5.x:

      • VVR 8.0.6 or earlier: Realtime Compute for Apache Flink does not automatically adapt to the change of partition count. A manual restart is needed for adaptation.

      • VVR 8.0.7 or later: Realtime Compute for Apache Flink automatically adapts to the change of partition count.

  • Recommended action: Automatic adaptation may cause repeated data consumption. If this is unacceptable, cancel the job before changing the partition count. Once the changes are complete, restart the job from the last checkpoint.

  • Writing to sink dependent on checkpointing: Yes.

Tablestore connector

  • Connector behavior: The connector does not retain data; any data in the buffer is sent to the sink, so scaling has no impact.

  • Recommended action: N/A.

  • Writing to sink dependent on checkpointing: Yes.

SelectDB connector

  • Connector behavior: Data writing is not affected.

  • Recommended action: N/A.

  • Writing to sink dependent on checkpointing: No.

MongoDB connector

  • Connector behavior: During the data reading stage, a topology change causes the 134 - ReadConcernMajorityNotAvailableYet error, which is non-retryable.

  • Recommended action: If your operations cause a topology change, we recommend that you cancel the job first and restart it after the operations are complete and the cluster returns to normal.

  • Writing to sink dependent on checkpointing: Yes.

OSS connector, Iceberg connector, Apache Paimon connector, and Hudi connector (to be retired)

  • Connector behavior: Not applicable. These systems maintain an independent metadata layer that describes data structures and status, so no scaling is involved.

  • Recommended action: N/A.

  • Writing to sink dependent on checkpointing: Yes.

Print connector, Blackhole connector, Datagen connector, and Faker connector

  • Connector behavior: Not applicable, because these connectors are used for testing only.

  • Recommended action: N/A.

  • Writing to sink dependent on checkpointing: N/A.