Realtime Compute for Apache Flink: FAQ about Data Ingestion with Flink CDC

Last Updated: Nov 06, 2025

This topic answers frequently asked questions (FAQs) about data ingestion jobs powered by Flink CDC and provides solutions.

My Flink job restarts during the snapshot phase with a JobManager OOM error. What's happening?

Scope of impact

  • Affected jobs: Data Ingestion jobs using a MySQL source.

  • Affected versions: Can occur with any Ververica Runtime (VVR) engine version.

Description

  • You might observe frequent job restarts during the snapshot phase. The JobManager logs will show an OutOfMemoryError (OOM) stack trace.

  • On the Alarm tab, the Num of remaining SnapshotSplits and Num of processed SnapshotSplits metrics show exceptionally high values.

Cause

During the snapshot phase, the MySQL source persists the metadata of every table shard in the Flink job's state. If your job handles a large volume of data or uses a very small shard size, the JobManager may create an excessive number of shards to read. The metadata for these shards consumes too much memory, and the JobManager fails with an OOM error.

Solution

  • Increase the memory resources allocated to the JobManager.

  • Adjust the jobmanager.memory.heap.size and jobmanager.memory.off-heap.size parameters to increase the JobManager's heap and off-heap memory.
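
As an illustration, the two parameters might be raised as in the sketch below. The values 4g and 512m are assumptions chosen for the example, not recommendations. The sketch also assumes that the keys are entered wherever your deployment accepts additional Flink configuration. Size both values according to the number of shards your job produces.

```yaml
# Illustrative values only (assumptions); tune them to your workload.
jobmanager.memory.heap.size: 4g        # JVM heap of the JobManager
jobmanager.memory.off-heap.size: 512m  # off-heap memory of the JobManager
```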

Why does my Flink job fail with a JobManager OOM error when it restores from state during the incremental phase?

Scope of impact

  • Affected jobs: Data Ingestion jobs using a MySQL source.

  • Affected versions: Can occur with VVR 11.1 or earlier.

Description

  • Your Flink job enters the incremental phase but fails during state restoration. The JobManager logs will show an OOM.

Cause

VVR 11.1 or earlier versions may not properly clean up persisted table schema information from the job's state after the job transitions from the snapshot phase to the incremental phase. This leftover schema information accumulates and causes an OOM error when the job attempts to restore its state from a checkpoint.

Solution

  • Upgrade to VVR 11.2 or later.

My Flink job fails with a TaskManager OOM near the end of the snapshot phase, with only a few shards left. What's causing this?

Scope of impact

  • Affected jobs: Data Ingestion jobs using a MySQL source.

  • Affected versions: Can occur with any Ververica Runtime (VVR) engine version.

Description

  • You observe a TaskManager OOM occurring late in the snapshot phase, typically when only a small number of shards remain to be processed.

  • Searching the TaskManager logs for the keyword "using select statement" might reveal that the last query, which has no upper bound, reads a very large volume of data.

Cause

The issue stems from prolonged data reading during the snapshot phase. The final shard has no upper bound, so rows written to the table while the snapshot is running accumulate in it. When the TaskManager attempts to process this large shard, it runs out of memory and fails with an OOM error.

Solution

  • Set the option scan.incremental.snapshot.unbounded-chunk-first.enabled to true in the MySQL source configuration, and then re-run the snapshot.
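
A minimal sketch of where this option might sit in a data ingestion job, assuming the usual YAML layout with a source block of type mysql. The hostname, credentials, and table name are hypothetical placeholders, and the sink and pipeline blocks are omitted.

```yaml
source:
  type: mysql
  # Connection settings are hypothetical placeholders.
  hostname: "mysql.example.com"
  port: 3306
  username: "flink_user"
  password: "<password>"
  tables: app_db.orders
  # Read the unbounded (final) shard first so that it does not
  # accumulate data while the other shards are being read.
  scan.incremental.snapshot.unbounded-chunk-first.enabled: true
```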

After performing a schema change on my MySQL table using a lock-free tool (like pt-osc), my Flink job stops receiving new data. What's the problem?

Scope of impact

  • Affected jobs: Data Ingestion jobs using a MySQL source.

  • Affected versions: Can occur with VVR 11.1 or earlier.

Description

  • Your Flink job continues to run without restarting after a lock-free table schema change.

  • The CurrentFetchTimeLag metric progresses as expected, indicating data is being fetched.

  • However, the MySQL source stops producing new data, and the CurrentEmitTimeLag metric stops updating, signifying a halt in data processing or emission.

Cause

VVR 11.1 or earlier versions cannot correctly handle the Data Definition Language (DDL) change events generated by lock-free schema change tools such as pt-osc. As a result, the data pipeline stalls after the schema change.

Solution

  • Upgrade to VVR 11.2 or later.

  • Set the option scan.parse.online.schema.changes.enabled: true.
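
The option belongs to the MySQL source. The fragment below is a sketch under the assumption that the job uses the usual YAML layout with a source block of type mysql. Connection options and the sink and pipeline blocks are left out.

```yaml
source:
  type: mysql
  # Connection options (hostname, port, username, password, tables)
  # are omitted from this fragment.
  # Parse the DDL events emitted by lock-free schema change tools
  # such as pt-osc.
  scan.parse.online.schema.changes.enabled: true
```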

My Flink job restarts unexpectedly after a lock-free MySQL schema change, and the Transform operator reports a column type mismatch. What is the cause and solution?

Scope of impact

  • Affected jobs: Data Ingestion jobs using a MySQL source.

  • Affected versions: Can occur with VVR 11.1 or earlier.

Description

  • Your Flink job unexpectedly restarts following a lock-free table schema change (e.g., using pt-osc). The Transform operator logs indicate a column type mismatch error.

Cause

VVR 11.1 or earlier versions have a known issue: if a significant volume of data is inserted into a table during a lock-free schema change, the engine might generate an event that cannot be parsed.

Solution

  • Upgrade to VVR 11.2 or later, and then perform a stateful restart from a savepoint created before the lock-free schema change.

Why does my Flink job fail to restore from a savepoint created before a table schema change?

Scope of impact

  • Affected jobs: Data Ingestion jobs using a MySQL source.

  • Affected versions: Can occur with VVR 11.1 or earlier.

Description

  • When you attempt a stateful restart from a savepoint created before a table schema change, your job fails. The error message typically indicates a table schema mismatch exception while the job is consuming binary logs.

Cause

VVR 11.1 or earlier versions do not support stateful restarts from savepoints that contain an incompatible table schema.

Solution

  • Upgrade to VVR 11.2 or later. After the upgrade, restart the job from the savepoint created before the schema change.