This topic describes how to use the dirty data collector in Flink CDC data ingestion jobs.
Function overview
In real-time data synchronization scenarios, source data may fail to parse because of format errors, encoding issues, or incompatible schemas. This type of data that cannot be processed is called dirty data.
The data ingestion feature has supported dirty data collection since Ververica Runtime (VVR) 11.5. This feature is currently available only for the Kafka data source. When a connector encounters data that it cannot parse, the system automatically catches the raw message and the exception information and writes them to a specified collector. With configuration policies, you can:
Tolerate a small amount of dirty data to prevent the entire pipeline from breaking.
Record the full context for easier troubleshooting and resolution.
Set a tolerance threshold so that the job fails instead of silently skipping an unbounded number of exceptions.
Typical use scenarios
Scenario | Objective |
Log collection pipelines (for example, unstructured data from sources such as app logs) | Handle inconsistent data quality by skipping small amounts of bad data to ensure that the main process continues to run. |
Core business table synchronization (for example, key systems such as those for orders or account changes) | Data consistency requirements are high. The goal is to trigger an alert immediately when dirty data is found so that engineers can intervene promptly. |
Data exploration and investigation phase | Quickly process all data to understand the overall distribution, and then handle the dirty data later. |
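As a rough illustration of how these objectives map to configuration, the following sketch uses the fault tolerance parameters that are described later in this topic. The threshold value of 100 is only an example.

```yaml
# Log collection pipeline: tolerate a bounded amount of dirty data and keep running.
source:
  type: kafka
  ingestion.ignore-errors: true
  ingestion.error-tolerance.max-count: 100
```

For core business tables, leaving ingestion.ignore-errors unset means that the first parsing error fails the job, which surfaces the problem immediately through the normal job failure path.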
Limits and notes
Before you use this feature, understand its capabilities and potential risks:
Connector support: Currently, only the Kafka data source supports this feature. Support for other data sources is being gradually added.
Supported collector types: Currently, only the logger type is supported. This type writes dirty data to a log file.
This feature is suitable for debugging and early production stages. If a large amount of dirty data persists, perform data governance on the upstream system.
Syntax structure
Enable the dirty data collector
The dirty data collector is defined in the Pipeline module. The syntax is as follows:
```yaml
pipeline:
  dirty-data.collector:
    name: Logger Dirty Data Collector
    type: logger
```

Parameter | Description |
name | The name of the collector. Use a meaningful name, such as Logger Dirty Data Collector. |
type | The type of the collector. Only the value logger is available. This type writes dirty data to a log file. |
If you do not define this configuration item, dirty data is not recorded, even if you enable fault tolerance.
Configure a fault tolerance policy in the data source
Enabling the dirty data collector does not automatically skip parsing errors. To skip errors, you must use this feature with the Kafka fault tolerance policy. For more information, see the Kafka connector documentation. The following example shows the configuration:
```yaml
source:
  type: kafka
  # Skip the first 100 parsing exceptions. If the number of exceptions exceeds 100, the job fails.
  ingestion.ignore-errors: true
  ingestion.error-tolerance.max-count: 100
```

Parameter | Default value | Description |
ingestion.ignore-errors | | Specifies whether to ignore parsing errors. If you set this parameter to true, the job skips records that fail to parse and continues running. |
ingestion.error-tolerance.max-count | | The maximum number of dirty data records to tolerate. If the number of dirty data records exceeds this value, the job fails. |
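Putting the two pieces together, a complete ingestion job might look like the following sketch. The sink block and the job name are illustrative placeholders, and the required Kafka connection parameters (brokers, topic, format, and so on) are omitted.

```yaml
source:
  type: kafka
  name: Kafka Source
  # Required Kafka connection parameters (brokers, topic, format) are omitted here.
  # Fault tolerance: skip up to 100 records that fail to parse, then fail the job.
  ingestion.ignore-errors: true
  ingestion.error-tolerance.max-count: 100

sink:
  # Illustrative placeholder; replace with your actual sink type and parameters.
  type: hologres
  name: Hologres Sink

pipeline:
  name: Kafka ingestion job with dirty data collection
  # Record every skipped record and its exception in the log file.
  dirty-data.collector:
    name: Logger Dirty Data Collector
    type: logger
```

Without the pipeline block, records that fail to parse are still skipped, but nothing is recorded, as noted above.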
Logger dirty data collector
The Logger dirty data collector stores dirty data in a separate log file. To view the dirty data log file, follow these steps:
Go to the Job O&M page and click the Job Logs tab.
Click Operational Logs, click the Running Task Managers subtab, and select the TaskManager (TM) node of the corresponding operator.
Click Log List, and then click the log file named dirty-data.out in the list to query and save the collected dirty data records.
The following metadata is recorded for dirty data:
The timestamp when the dirty data was processed
The operator and Subtask Index that generated the dirty data record
The content of the raw dirty data
The exception information that caused the processing failure
Dirty data record format example
The following example shows a dirty data record and the fields that it contains:
```text
[2025-04-05 10:23:45] [Operator: SourceKafka -> Subtask: 2]
Raw Data: {"id": "abc", "ts": "invalid-timestamp"}
Exception: java.time.format.DateTimeParseException: Text 'invalid-timestamp' could not be parsed at index 0
---
```

Field | Description |
Timestamp | The time when the dirty data was caught. |
Operator & Subtask | The specific operator and parallel instance number that caused the error. |
Raw Data | The content of the raw, unparsed message (in Base64 or string format). |
Exception | The exception type and stack summary for the parsing failure. |
FAQ
Does dirty data affect checkpoints?
No, it does not. Dirty data is intercepted before the state is updated, so it does not affect the success of checkpoints.
What is the difference between this feature and the side output stream in Flink SQL?
Dirty data collector: Processes data that fails to deserialize or parse.
Side Output: Processes data that can be parsed but does not meet business rules.
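As a rough illustration with made-up messages (the field names are hypothetical):

```text
# Cannot be deserialized at all (truncated JSON): intercepted by the dirty data collector.
{"order_id": 1001, "amount": 9.9

# Deserializes successfully but violates a business rule (negative amount):
# this is a case for a side output or a filter in Flink SQL, not for this feature.
{"order_id": 1002, "amount": -3.5}
```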