FAQ and solutions for ingestion - Realtime Compute for Apache Flink

Quick reference

Symptom	Phase	Severity	Link
JobManager OOM with high `SnapshotSplits` metric values	Snapshot	Critical	FAQ 1
TaskManager OOM when few shards remain	Snapshot	Critical	FAQ 3
JobManager OOM on state restoration during incremental reading	Incremental	Critical	FAQ 2
No new data after a lock-free schema change with `pt-osc`	Schema change	High	FAQ 4
Transform column type mismatch after a lock-free schema change	Schema change	High	FAQ 5
Job fails to restore from a pre-schema-change savepoint	State recovery	High	FAQ 6

Snapshot phase

FAQ 1: JobManager OOM during the snapshot phase

Severity: Critical | Phase: Snapshot | Affected versions: All VVR engine versions

Symptom

The job restarts repeatedly during the snapshot phase.
The JobManager logs contain an OutOfMemoryError (OOM) stack trace.
On the Alarm tab, the Num of remaining SnapshotSplits and Num of processed SnapshotSplits metrics show exceptionally high values.

[Figure: Alarm tab showing high SnapshotSplits metrics]

Cause

During the snapshot phase, the MySQL source persists the metadata of every table shard in the Flink job's state. If the job reads a large volume of data or uses a very small shard size, the JobManager creates an excessive number of shards, and the metadata for those shards exhausts its memory.
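To make the shard-size factor concrete: in open source Flink CDC, the MySQL source sizes snapshot shards (called chunks there) by row count through `scan.incremental.snapshot.chunk.size`, which defaults to 8096 rows. Assuming the same option governs your job, the fragment below sketches how a small value inflates the shard count. The table size is a hypothetical figure and the counts are rough estimates.

```yaml
# Illustrative only: the row count below is hypothetical.
# For a table of roughly 1,000,000,000 rows:
#   chunk size 8096 (open source default) -> about 124 thousand shards
#   chunk size 100                        -> about 10 million shards
scan.incremental.snapshot.chunk.size: 8096
```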

Solution

Increase the memory allocated to the JobManager by raising the following parameters, which set its heap and off-heap memory:
- jobmanager.memory.heap.size
- jobmanager.memory.off-heap.size
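
As a minimal sketch, the two parameters could be written as follows in the job's Flink configuration. The values are assumptions chosen for illustration, not recommendations. Size them against the memory usage you actually observe.

```yaml
# Hypothetical values; adjust them to what your JobManager needs.
jobmanager.memory.heap.size: 4g
jobmanager.memory.off-heap.size: 512m
```

If a total JobManager memory size is also configured for the job, the heap and off-heap values need to fit within it.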

FAQ 3: TaskManager OOM near the end of the snapshot phase

Severity: Critical | Phase: Snapshot | Affected versions: All VVR engine versions

Symptom

The TaskManager runs out of memory late in the snapshot phase, typically when only a small number of shards remain.
Searching the TaskManager logs for the string `using select statement` shows that the last unbounded query covers a very large volume of data.

Cause

Prolonged data reading during the snapshot phase causes a significant amount of incremental data to accumulate for the final shard or shards. When the TaskManager processes this large accumulated shard, it runs out of memory.

Solution

Set the following option:

   scan.incremental.snapshot.unbounded-chunk-first.enabled: true

Re-run the snapshot.
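
The following sketch shows where the option might sit, assuming the job is defined as a YAML ingestion job whose `source` block follows the open source Flink CDC pipeline layout. The connection values are hypothetical placeholders. Only the last option comes from this FAQ.

```yaml
source:
  type: mysql
  # Hypothetical placeholder values; replace them with your own.
  hostname: "<mysql-host>"
  port: 3306
  username: "<username>"
  password: "<password>"
  tables: "<database>.<table>"
  # Option from this FAQ: assign the unbounded shard first, so that it is
  # read before a large backlog of incremental data builds up behind it.
  scan.incremental.snapshot.unbounded-chunk-first.enabled: true
```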

Incremental phase

FAQ 2: JobManager OOM during state restoration in the incremental phase

Severity: Critical | Phase: Incremental | Affected versions: VVR 11.1 or earlier

Symptom

The job enters the incremental phase but fails during state restoration.
The JobManager logs show an OOM.

Cause

VVR 11.1 and earlier versions may not properly clean up persisted table schema information from the job's state after transitioning from the snapshot phase to the incremental phase. This leftover schema information accumulates, causing an OOM when the job restores its state from a checkpoint.

Solution

Upgrade to VVR 11.2 or later.

Schema change

FAQ 4: No new data after a lock-free schema change with pt-osc

Severity: High | Phase: Schema change | Affected versions: VVR 11.1 or earlier

Symptom

The job continues running without restarting after a lock-free table schema change.
The CurrentFetchTimeLag metric progresses as expected, indicating that data is being fetched.
The MySQL source stops producing new data and the CurrentEmitTimeLag metric stops updating.

Cause

VVR 11.1 and earlier versions cannot correctly handle DDL change events generated by lock-free schema change tools such as pt-osc. This causes the data pipeline to stall after the schema change.

Solution

Upgrade to VVR 11.2 or later.

After the upgrade, set the following option:

   scan.parse.online.schema.changes.enabled: true
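
As a sketch, and assuming the job is a YAML ingestion job with a MySQL `source` block in the open source Flink CDC pipeline layout, the option would be added next to the existing connection options:

```yaml
source:
  type: mysql
  # ... existing connection options (hostname, username, tables, and so on) ...
  # Option from this FAQ: parse the schema change events that lock-free
  # tools such as pt-osc generate.
  scan.parse.online.schema.changes.enabled: true
```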

FAQ 5: Transform column type mismatch after a lock-free schema change

Severity: High | Phase: Schema change | Affected versions: VVR 11.1 or earlier

Symptom

The job unexpectedly restarts following a lock-free table schema change (for example, using pt-osc).
The Transform operator logs indicate a column type mismatch error.

Cause

This is a known issue in VVR 11.1 and earlier versions. If a significant volume of data is inserted into a table during a lock-free schema change operation, the engine may generate an unparsable event.

Solution

Upgrade to VVR 11.2 or later.
Perform a stateful restart from a savepoint that was created before the lock-free schema change.

State recovery

FAQ 6: Job fails to restore from a pre-schema-change savepoint

Severity: High | Phase: State recovery | Affected versions: VVR 11.1 or earlier

Symptom

A stateful restart from a savepoint created before a table schema change fails.
The error message indicates a table schema mismatch exception while consuming binary logs.

Cause

VVR 11.1 and earlier versions do not support stateful restarts from savepoints that contain an incompatible table schema.

Solution

Upgrade to VVR 11.2 or later.
After the upgrade, restart the job from a pre-schema-change savepoint.