Prevent Backpressure and Control State Growth with DataStream API - Realtime Compute for Apache Flink

Unbounded state growth is one of the most common causes of backpressure in Apache Flink. When state exceeds the memory allocated to the state backend, Flink spills infrequently accessed data to disk. Every subsequent disk read adds latency to operator processing, which slows downstream operators and propagates backpressure upstream. This topic explains how to confirm that state size is the bottleneck and how to bring it under control using the DataStream API.

Background

Apache Flink supports two types of state: operator state and keyed state. Keyed state is more likely to grow large because it is partitioned by key and accumulates a separate entry for each distinct key in the data stream. The DataStream API provides ValueState, ListState, and MapState interfaces for managing keyed state, and supports time-to-live (TTL) configuration to expire stale entries automatically. For the full state API reference, see Using Keyed State.

Confirm that state is the bottleneck

Before tuning, verify that backpressure is caused by state access rather than by other factors such as slow sinks or data skew.

Symptoms to look for:

LastCheckPointFullSize is increasing steadily or spiking — this indicates that the total state in the deployment is growing.
State access latency metrics reach the millisecond range — this indicates that the state backend is reading from disk rather than from memory.
Thread dumps or flame graphs show operator threads spending most of their time on state access rather than computation.

Once you identify the slow operator, use the diagnostic tools on the Realtime Compute for Apache Flink development console to confirm that state access is the cause.

Diagnostic tools

The following tools are available on the development console. Use them together with intelligent deployment diagnostics and automatic tuning to accelerate the investigation.

Tool	What it shows
Thread dumps	Whether the operator thread is primarily accessing state at the current moment
Thread activity	Whether the operator thread was primarily accessing state during a specific time window
Flame graphs	Whether most CPU time was spent on state access during a specific period
Runtime metrics monitoring	State size per subtask, checkpoint size, and state access latency

To view runtime metrics:

On the Deployments page, click the deployment you want to inspect.
Click the Status tab.
In the Actions section, click Metrics.

The following metrics are most relevant for state size diagnosis:

State Size: the state size of each subtask. This metric is available only when using GeminiStateBackend.
LastCheckPointFullSize: the total size of the most recent checkpoint. Use this to estimate overall deployment state size.
State access latency metrics: if any of these reach the millisecond range, state backend disk I/O is likely throttling performance. For the full list of metrics, see State.

Reduce state size

Redesign application state

Storing only what is necessary is the most effective long-term fix. To prevent unbounded state growth, store only the required information in the state.

Configure TTL to expire state automatically

TTL causes the state backend to expire and delete state entries that have not been accessed within the configured duration. This is the most direct way to prevent unbounded growth for keyed state.

Apache Flink provides various methods to help you manage the lifecycle of state data. For example, you can invoke the setTTL() method of the ValueStateDescriptor interface to automatically expire and clear state data. You can also call clear() or remove() explicitly to delete state for records that you know are no longer needed.

Use timers for fine-grained cleanup

For cleanup logic that TTL cannot express — such as deleting state only after a business event occurs — use timers. Timers let you schedule a callback at a future processing time or event time, at which point you can inspect the state and delete it if appropriate.

This approach gives you precise control over the state lifecycle and ensures expired data is removed promptly without relying on background compaction.

Minimize disk reads

If reducing state volume is not sufficient, reduce how often the state backend reads from disk.

Choose the right approach for your situation:

Scenario	Approach	How to apply	Precautions
Heap memory and other memory resources are adequate	Increase managed memory proportion	Set `taskmanager.memory.managed.fraction` (default: `0.4`) to a higher value. See Memory Configuration and How do I configure parameters for deployment running?	If heap memory is insufficient, this may trigger frequent Full GC, which degrades performance.
All scenarios	Add memory to the deployment	See Configure resources for a deployment. Use expert mode for fine-grained allocation.	N/A
All scenarios	Increase parallelism	Raise the deployment parallelism. Each subtask holds a smaller partition of state, reducing how much any single subtask spills to disk.	—

Monitor state health continuously

Set up ongoing monitoring to catch state growth early:

Track LastCheckPointFullSize over time. A steady upward trend indicates unbounded growth.
Watch state access latency metrics. A rise toward the millisecond range signals that the state backend is hitting disk frequently.
Review checkpoint details periodically to detect subtask-level state skew before it affects performance.
Generate detailed logs during peak load to correlate state access patterns with throughput degradation.

What's next

For an end-to-end tuning workflow covering large-state deployments, see Performance tuning for large-state deployments.
For state management in SQL deployments, see Control state size to reduce backpressure in SQL deployments.
For checkpoint and savepoint timeout diagnosis, see Diagnose and prevent checkpoint and savepoint timeout.
For startup and scaling performance, see Improve startup and scaling speed.