
Realtime Compute for Apache Flink: Handle the scaling of upstream and downstream systems

Last Updated: May 23, 2025

This topic describes connector behavior and actions to take when an upstream or downstream system scales, along with necessary precautions.

Connector behavior and recommended actions

Note

When a job encounters faults, such as task failures or vertex downtime, Realtime Compute for Apache Flink performs a failover to automatically recover the job to a normal state while guaranteeing accurate and consistent data processing results.

The following entries describe, for each connector, its behavior when the upstream or downstream system scales, the recommended action, and whether writing to the sink depends on checkpointing.

Kafka connector and Upsert Kafka connector

  • Connector behavior: When Kafka partitions are added, Realtime Compute for Apache Flink detects the change dynamically. However, if the new partition count is not an integer multiple of the job parallelism, data is not distributed evenly across subtasks.

  • Recommended action: After adding Kafka partitions, adjust the job parallelism to a divisor of the partition count to ensure even data distribution, as shown in the sketch below. For example, when the Kafka partition count is increased from three to eight, change the parallelism to four or eight.

  • Writing to sink dependent on checkpointing: Yes, for exactly-once delivery.
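To illustrate the divisor rule above, the following Java sketch picks the largest divisor of the partition count that fits within an available-slot cap. The class and method names and the cap are hypothetical illustrations, not part of any connector API.

```java
// Minimal sketch: choose a job parallelism that divides the Kafka
// partition count evenly, so that every subtask reads the same number
// of partitions. Names and values are illustrative only.
public final class ParallelismPlanner {

    /** Returns the largest divisor of partitionCount that is <= maxParallelism. */
    static int evenParallelism(int partitionCount, int maxParallelism) {
        for (int p = Math.min(partitionCount, maxParallelism); p >= 1; p--) {
            if (partitionCount % p == 0) {
                return p; // each subtask then reads partitionCount / p partitions
            }
        }
        return 1; // unreachable for positive inputs; kept for safety
    }

    public static void main(String[] args) {
        // Partition count raised from three to eight with six slots available:
        // the largest divisor of 8 that does not exceed 6 is 4.
        System.out.println(evenParallelism(8, 6)); // prints 4
    }
}
```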

Hologres connector

  • Connector behavior: During instance scaling or restart, connections may be disrupted. In this case, Realtime Compute for Apache Flink tries to re-establish the connection until the timeout is reached, and then makes failover attempts until the Hologres instance restart is complete.

  • Recommended action: Perform a stateless startup, because Realtime Compute for Apache Flink reads a Hologres table based on its table name.

  • Writing to sink dependent on checkpointing: No.

Simple Log Service connector

  • Connector behavior:

    • VVR 8.0.8 or earlier: Realtime Compute for Apache Flink fails over your job to adapt to partition changes.

    • VVR 8.0.9 or later: If the enableNewSource option is set to true, Realtime Compute for Apache Flink does not fail over the job. Additionally, if the shardDiscoveryIntervalMs option is set, Realtime Compute for Apache Flink periodically detects partition count changes.

  • Recommended action: After a failover, your job adapts to the new partition count. To prevent job failover, manually restart the job after changing the partition count. A configuration sketch follows this entry.

  • Writing to sink dependent on checkpointing: No.
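The following Java sketch shows how the two options above could be set on a Simple Log Service source table. Only enableNewSource and shardDiscoveryIntervalMs come from this topic; the connector identifier, the connection options, and all placeholder values are assumptions for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Minimal sketch: a Simple Log Service source that opts into the new
// source (no failover on shard changes, VVR 8.0.9 or later) and
// periodic shard discovery. Connection options are placeholders.
public class SlsSourceSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE sls_source (" +
                "  `message` STRING" +
                ") WITH (" +
                "  'connector' = 'sls'," +                 // assumed connector id
                "  'endpoint' = '<yourEndpoint>'," +       // placeholder
                "  'project' = '<yourProject>'," +         // placeholder
                "  'logstore' = '<yourLogstore>'," +       // placeholder
                "  'enableNewSource' = 'true'," +          // do not fail over on shard changes
                "  'shardDiscoveryIntervalMs' = '60000'" + // check shard count every 60s
                ")");
    }
}
```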

MySQL connector and other database connectors

The following behavior and recommended action apply to the MySQL, ApsaraDB RDS for MySQL, JDBC, AnalyticDB for PostgreSQL, AnalyticDB for MySQL V3.0, TSDB for InfluxDB, OceanBase (public preview), PolarDB for PostgreSQL, Lindorm, ApsaraDB for HBase, PostgreSQL CDC (public preview), Elasticsearch, and StarRocks connectors.

  • Connector behavior: The connection may be disrupted due to instance scaling or restart. If this happens, Realtime Compute for Apache Flink detects the disconnection and restarts the task. If the database endpoint is unchanged and the database service is still available, the connector attempts to re-create the connection to recover the job.

    Note: How this works. Generally, when an external system fails, these connectors first attempt to reconnect. If these attempts succeed, the job continues without triggering a failover. However, if all attempts fail because the external system is unavailable for a prolonged period, the connectors throw exceptions, causing task failure. Realtime Compute for Apache Flink then fails over and recovers your job based on predefined strategies (see the sketch after this entry). After recovery, tasks are re-orchestrated, and the connectors try to create connections with the external systems again.

  • Recommended action: Evaluate the impact of a job restart before performing any scaling operation.

    • If the endpoint changes, update your code accordingly, re-deploy the program, and start your job.

    • If the endpoint does not change, no restart is needed.

    Note: A primary-secondary switchover or a cluster restart can cause temporary connection disruption. If the connection is not recovered for a prolonged period of time, a failover is triggered. To avoid this, cancel the job and restart it after the configuration change is complete.

  • Writing to sink dependent on checkpointing:

    • No for the MySQL, ApsaraDB RDS for MySQL, JDBC, AnalyticDB for PostgreSQL, AnalyticDB for MySQL V3.0, TSDB for InfluxDB, OceanBase (public preview), PolarDB for PostgreSQL, Lindorm, and ApsaraDB for HBase connectors.

    • N/A for the PostgreSQL CDC connector (public preview).

    • Yes for the Elasticsearch connector, and yes, for exactly-once delivery, for the StarRocks connector.
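The predefined recovery strategies mentioned in the note above are Flink restart strategies. The following Java sketch bounds how long reconnect-and-restart cycles can continue by configuring a fixed-delay strategy; the attempt count and delay are illustrative values, not recommendations from this topic.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Minimal sketch: a fixed-delay restart strategy that rides out a short
// primary-secondary switchover without the job failing terminally.
public class RestartStrategySketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Restart the job up to 10 times, waiting 30 seconds between
        // attempts, before giving up and failing the job.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(10, Time.seconds(30)));
        // Define sources and sinks here, then:
        // env.execute("restart-strategy-sketch");
    }
}
```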

MaxCompute connector

  • Connector behavior: After a scale-down, if MaxCompute lacks sufficient compute resources to serve reads and writes from Realtime Compute for Apache Flink at the current job parallelism, the affected subtasks throw errors until resources become available.

  • Recommended action: Before performing a scale-down, carefully evaluate data traffic, or reduce the job parallelism.

  • Writing to sink dependent on checkpointing: Yes, in Batch Tunnel mode.

DataHub connector

  • Connector behavior: The connector cannot automatically detect partition count changes.

  • Recommended action: Manually restart your job so that it adapts to the changes.

  • Writing to sink dependent on checkpointing: No.

Tair (Redis OSS-compatible) connector and Tair connector

  • Connector behavior:

    • Standard master-replica instances: Scaling, including adding and removing shards, is imperceptible to the job.

    • Cluster instances: Modifying instance configurations can cause transient disconnections. To handle this, Realtime Compute for Apache Flink tries to re-establish the connection. If these attempts fail, a failover is triggered.

  • Recommended action:

    • Standard master-replica instances: N/A.

    • Cluster instances: To prevent failovers and speed up adaptation, manually restart the job after the configuration change is complete.

  • Writing to sink dependent on checkpointing: No.

ClickHouse connector

If shardWrite is set to false, your job is not restarted.

If it is set to true, perform actions based on the value of the inferLocalTable option.

Perform actions based on the value of the inferLocalTable option:

  • If it is set to false (the default value), you can add a node's IP address to the url and then restart the job.

  • If it is set to true, you can manually restart the job. The node of the local table is automatically inferred.

No
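The following Java sketch illustrates the inferLocalTable=false case: a new node's address is appended to the url option before the job is restarted. The option names url, shardWrite, and inferLocalTable come from this topic; the connector identifier, the url format, and the remaining options are assumptions for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Minimal sketch: a ClickHouse sink whose url lists cluster nodes
// explicitly. After the cluster gains node 10.0.0.3, its address is
// appended to 'url' and the job is restarted.
public class ClickHouseSinkSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE ch_sink (" +
                "  id BIGINT," +
                "  payload STRING" +
                ") WITH (" +
                "  'connector' = 'clickhouse'," + // assumed connector id
                // New node 10.0.0.3 appended alongside the existing nodes:
                "  'url' = 'clickhouse://10.0.0.1:8123,10.0.0.2:8123,10.0.0.3:8123'," +
                "  'tableName' = '<yourTable>'," + // placeholder option name
                "  'shardWrite' = 'true'," +
                "  'inferLocalTable' = 'false'" +
                ")");
    }
}
```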

ApsaraMQ for RocketMQ connector

  • Connector behavior:

    • ApsaraMQ for RocketMQ 4.x: Realtime Compute for Apache Flink performs a failover so that the job adapts to partition count changes.

    • ApsaraMQ for RocketMQ 5.x:

      • VVR 8.0.6 or earlier: Realtime Compute for Apache Flink does not automatically adapt to the change of partition count. A manual restart is needed for adaptation.

      • VVR 8.0.7 or later: Realtime Compute for Apache Flink automatically adapts to the change of partition count.

  • Recommended action: Automatic adaptation may cause repeated data consumption. If this is unacceptable, cancel the job before changing the partition count. Once the changes are complete, restart the job from the last checkpoint.

  • Writing to sink dependent on checkpointing: Yes.

Tablestore connector

  • Connector behavior: The connector does not retain data; any data in the buffer is sent to the sink, so scaling has no impact.

  • Recommended action: N/A.

  • Writing to sink dependent on checkpointing: Yes.

SelectDB connector

  • Connector behavior: Data writing is not affected.

  • Recommended action: N/A.

  • Writing to sink dependent on checkpointing: No.

MongoDB connector

  • Connector behavior: During the data reading stage, a topology change causes the 134 - ReadConcernMajorityNotAvailableYet error, which is non-retryable.

  • Recommended action: If your operations cause a topology change, we recommend that you cancel the job first and restart it after the operations are complete and the cluster returns to normal.

  • Writing to sink dependent on checkpointing: Yes.

OSS connector, Iceberg connector, Apache Paimon connector, and Hudi connector (to be retired)

  • Connector behavior: Not applicable. These systems maintain an independent metadata layer that describes data structures and status, so no scaling is involved.

  • Recommended action: N/A.

  • Writing to sink dependent on checkpointing: Yes.

Print connector, Blackhole connector, Datagen connector, and Faker connector

  • Connector behavior: Not applicable, because these connectors are used for testing only.

  • Recommended action: N/A.

  • Writing to sink dependent on checkpointing: N/A.