This topic describes frequently asked questions (FAQs) and solutions for data ingestion jobs powered by Flink CDC.
My Flink job restarts during the snapshot phase with a JobManager OOM error. What's happening?
Scope of impact
Affected jobs: Data Ingestion jobs using a MySQL source.
Affected versions: Can occur with any Ververica Runtime (VVR) engine version.
Description
You might observe frequent job restarts during the snapshot phase. The JobManager logs will show an OutOfMemoryError (OOM) stack trace.
Additionally, on the Alarm tab, exceptionally high values for
Num of remaining SnapshotSplitsandNum of processed SnapshotSplitsmetrics indicate an issue.

Cause
This problem arises during the snapshot phase. The MySQL source persists all table shard metadata to the Flink job's state. If your job handles a large volume of data or uses very small shard sizes, the JobManager may create an excessive number of shards to read. This large number of shards consumes too much memory, leading to the JobManager experiencing an OOM.
Solution
Increase the memory resources allocated to the JobManager.
Adjust the
jobmanager.memory.heap.sizeandjobmanager.memory.off-heap.sizeparameters to increase the JobManager's heap and off-heap memory.
Why does my Flink job fail to restore from a state during incremental reading with a JobManager OOM?
Scope of impact
Affected jobs: Data Ingestion jobs using a MySQL source.
Affected versions: Can occur with VVR 11.1 or earlier.
Description
Your Flink job enters the incremental phase but fails during state restoration. The JobManager logs will show an OOM.
Cause
VVR 11.1 or earlier versions may not properly clean up persisted table schema information from the job's state after transitioning from snapshot to incremental phases. This leftover, unmanaged schema information accumulates, leading to an OOM when the job attempts to restore its state from a checkpoint.
Solution
Upgrade to VVR 11.2 or later.
My Flink job fails with a TaskManager OOM near the end of the snapshot phase, with only a few shards left. What's causing this?
Scope of impact
Affected jobs: Data Ingestion jobs using a MySQL source.
Affected versions: Can occur with any Ververica Runtime (VVR) engine version.
Description
You observe a TaskManager OOM occurring late in the snapshot phase, typically when only a small number of shards remain to be processed.
Examining TaskManager logs for the keyword
using select statementmight reveal that the last unbounded query involves a very large volume of data.
Cause
The issue stems from prolonged data reading during the snapshot phase. This causes a significant amount of incremental data to accumulate for the final shard(s). When the TaskManager attempts to process this large, accumulated shard, it runs out of memory, resulting in an OOM error.
Solution
Set the option
scan.incremental.snapshot.unbounded-chunk-first.enabled: true. After setting this parameter, re-run the snapshot.
After performing a schema change on my MySQL table using a lock-free tool (like pt-osc), my Flink job stops receiving new data. What's the problem?
Scope of impact
Affected jobs: Data Ingestion jobs using a MySQL source.
Affected versions: Can occur with VVR 11.1 or earlier.
Description
Your Flink job continues to run without restarting after a lock-free table schema change.
The
CurrentFetchTimeLagmetric progresses as expected, indicating data is being fetched.However, the MySQL source stops producing new data, and the
CurrentEmitTimeLagmetric stops updating, signifying a halt in data processing or emission.
Cause
VVR 11.1 or earlier versions could not correctly handle Data Definition Language (DDL) change events generated by lock-free schema change tools such as pt-osc. This inability to process these specific DDL events causes the data pipeline to stall after the schema change.
Solution
Upgrade to VVR 11.2 or later.
Set the option
scan.parse.online.schema.changes.enabled: true.
My Flink job restarts unexpectedly after a lock-free MySQL schema change, and the Transform operator reports a column type mismatch. What is the cause and solution?
Scope of impact
Affected jobs: Data Ingestion jobs using a MySQL source.
Affected versions: Can occur with VVR 11.1 or earlier.
Description
Your Flink job unexpectedly restarts following a lock-free table schema change (e.g., using
pt-osc). The Transform operator logs indicate a column type mismatch error.
Cause
In VVR 11.1, there's a known issue. If a significant volume of data is inserted into a table during a lock-free schema change operation, the engine might generate an unparsable event.
Solution
Upgrade to VVR version 11.2 or later and perform a stateful restart from a savepoint before the lock-free schema change.
Why does my Flink job fail to restore from a savepoint created before a table schema change?
Scope of impact
Affected jobs: Data Ingestion jobs using a MySQL source.
Affected versions: Can occur with VVR 11.1 or earlier.
Description
When you attempt a stateful restart from a savepoint created before a table schema change, your job fails. The error message typically indicates a table schema mismatch exception while the job is consuming binary logs.
Cause
VVR 11.1 or earlier versions do not support stateful restarts from savepoints that contain an incompatible table schema.
Solution
Upgrade to VVR 11.2 or later. After the upgrade, restart the job from a pre-schema change savepoint.