Guide to exception handling - Realtime Compute for Apache Flink

This guide provides clear, actionable exception handling guidelines for developers. These guidelines ensure your code works effectively with Flink's fault tolerance, improving job stability and observability.

Background

Apache Flink offers robust fault tolerance through checkpointing, automatic restarts, and exactly-once semantics.

However, improper exception handling in user code, such as TaskCancelledException, OutOfMemoryError, or ClassCastException in DataStream jobs and UDFs in SQL jobs, can compromise these mechanisms. Poor exception handling can disrupt Flink's recovery, leading to state inconsistency or data loss.

Core principle

Delegate system-level fault recovery to Flink and focus on handling business-specific exceptions.

Exception types and recommended handling strategies

Exception type	Examples	Recommended handling strategy
Business exceptions (recoverable)	Failures in JSON parsing, missing data fields, or business rule violations.	Catch, log, and output erroneous records to Side Outputs. This preserves the main data flow's continuity.
External dependency exceptions (partially recoverable)	Issues with HTTP timeouts, database connectivity, or third-party API errors (e.g., 5xx).	Employ a strategy of limited retries with a backoff policy, and escalate to an exception if retries fail.
System exceptions (non-recoverable)	`TaskCancelledException`, `OutOfMemoryError`, `ClassCastException`, and state access issues.	Do not catch these. Allow Flink's built-in fault recovery system to manage and address these critical failures.

Best practices

1. Avoid catching generic exceptions

Don't:

Catching broad Exception types can mask critical Flink internal errors (like checkpointing issues or task cancellations). This can result in jobs that appear operational but fail to process data.

try {
    // user logic
} catch (Exception e) {
    LOG.warn("Something went wrong", e);
}

Do:

Catch only specific, recoverable business exceptions and use Side Outputs to emit error records.

try {
    processRecord(value);
} catch (JsonParseException e) {
    LOG.warn("Invalid input record: {}", value, e);
    ctx.output(ERROR_TAG, new ErrorRecord(value, e.getMessage()));
}

2. Delegate system-level errors to Flink

If you encounter a non-recoverable state error (e.g., uninitialized state, deserialization failure), throw an exception. This triggers Flink's failover mechanism.

if (state.value() == null) {
    // State is not properly initialized
    throw new IllegalStateException("State not properly initialized");
}

Flink will then automatically restore the job using the configured restart strategy and the last successful checkpoint.

3. Limit retries for external calls

When interacting with external systems (e.g., databases, HTTP services), avoid implementing infinite retry mechanisms.

int maxRetries = 3;
for (int i = 0; i < maxRetries; i++) {
    try {
        callExternalService(record);
        return;
    } catch (IOException e) {
        if (i == maxRetries - 1) {
            throw new RuntimeException("Failed after " + maxRetries + " attempts", e);
        }
        Thread.sleep(1000 * (i + 1)); // Use an exponential backoff.
    }
}

4. Preserve full exception context

Logging only e.getMessage() is often insufficient for troubleshooting. Instead:

Log the full stack trace.
Preserve original input data.
Use Side Outputs to emit structured error events for downstream monitoring or replay.

ctx.output(ERROR_TAG, new ErrorRecord(
    originalInput,
    Instant.now(),
    e.getClass().getSimpleName(),
    e.getMessage(),
    ExceptionUtils.stringifyException(e)
));

5. Validate exception handling paths

Include exception scenarios in your unit and integration tests.
During code reviews, specifically scrutinize logic that uses broad catch (Exception) or silently ignores exceptions.
For every exception handling block, ask: "Is this error truly recoverable?" and "If recoverable, does the recovery logic impact state consistency?"

Effective exception handling is not just about preventing job crashes. Proper exception handling ensures errors are trackable, recoverable, and maintain state consistency. Always trust Flink's fault-tolerant capabilities and be cautious about interfering with its recovery process. Integrate these practices into your CI/CD pipelines by using static code analysis tools like SonarQube rules to automatically flag improper exception handling.