Event streams in EventBridge use retry policies, fault tolerance policies, and dead-letter queues (DLQs) to handle event delivery failures. When delivery to a target fails, EventBridge retries based on the configured retry policy. If all retries are exhausted, the fault tolerance policy determines whether to skip the failed event or block the stream. Route undeliverable events to a DLQ to preserve them for later inspection.
The following diagram shows how these three mechanisms interact:
                 Event delivery fails
                          |
                          v
                     Retry policy
            (backoff or exponential decay)
                          |
                All retries exhausted?
                   /              \
                 No                Yes
                  |                 |
             Retry again    Fault tolerance policy
                               /              \
                          Allowed          Prohibited
                             |                  |
                      DLQ configured?     Stream blocked,
                        /        \        task status -> Ready
                      Yes         No
                       |          |
                 Send to DLQ   Discard event

Retry policies
A retry policy controls how EventBridge reattempts delivery after a failure. Each event stream supports two retry policies:
| Policy | Max retries | Retry interval | Max duration | Default |
|---|---|---|---|---|
| Backoff retry | 3 | Random, 10 to 20 seconds between attempts | -- | Yes |
| Exponential decay retry | 176 | Starts at 1 s, doubles up to 512 s | 1 day | No |
Backoff retry
Backoff retry is the default policy. EventBridge retries a failed event up to 3 times, with a random interval of 10 to 20 seconds between consecutive attempts. Use this policy when you expect transient failures that resolve quickly.
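To make the timing concrete, the following is a minimal sketch of the backoff schedule, not EventBridge's actual implementation. The names `backoff_delays`, `BACKOFF_MAX_RETRIES`, and the delay constants are hypothetical, and drawing each delay from a uniform distribution is an assumption; the text above only states that the interval is random within 10 to 20 seconds.

```python
import random

# Values taken from the policy description above.
BACKOFF_MAX_RETRIES = 3
BACKOFF_MIN_DELAY_S = 10.0
BACKOFF_MAX_DELAY_S = 20.0


def backoff_delays(rng=None):
    """Return one randomly drawn wait (in seconds) per retry attempt.

    Assumption: delays are uniformly distributed in [10, 20] seconds.
    """
    rng = rng or random.Random()
    return [
        rng.uniform(BACKOFF_MIN_DELAY_S, BACKOFF_MAX_DELAY_S)
        for _ in range(BACKOFF_MAX_RETRIES)
    ]


# A seeded generator makes the illustration repeatable.
delays = backoff_delays(random.Random(0))

# Upper bound on time spent waiting between attempts (delivery time excluded).
worst_case_wait_s = BACKOFF_MAX_RETRIES * BACKOFF_MAX_DELAY_S
```

Under these assumptions the retries are exhausted after at most 60 seconds of waiting, which is why this policy suits only failures that clear up quickly.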
Exponential decay retry
Exponential decay retry provides a longer retry window for targets that may take longer to recover. EventBridge retries a failed event up to 176 times over a maximum period of 1 day. The interval doubles with each attempt, up to a ceiling of 512 seconds:
1 s, 2 s, 4 s, 8 s, 16 s, 32 s, 64 s, 128 s, 256 s, 512 s

The interval reaches 512 seconds on the tenth retry. That retry and the 166 that follow, 167 retries in total, use 512-second intervals.
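The schedule can be reproduced with a short calculation. This is a sketch under stated assumptions, not EventBridge's code: the function name `exponential_decay_intervals` and its parameters are hypothetical, and it assumes that 176 is the total number of retries and that the first retry waits 1 second.

```python
def exponential_decay_intervals(max_retries=176, first_s=1, ceiling_s=512):
    """Return the wait (in seconds) before each retry.

    The wait doubles after every attempt until it reaches the ceiling,
    then stays at the ceiling for all remaining retries.
    """
    intervals = []
    delay = first_s
    for _ in range(max_retries):
        intervals.append(delay)
        delay = min(delay * 2, ceiling_s)
    return intervals


intervals = exponential_decay_intervals()
total_wait_s = sum(intervals)

print(intervals[:10])        # the doubling phase: 1 s up to 512 s
print(intervals.count(512))  # how many retries use the 512 s ceiling
print(total_wait_s / 3600)   # total waiting time in hours
```

With these inputs the waits add up to 86,015 seconds, roughly 23.9 hours, which is consistent with the stated maximum duration of 1 day.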
Non-retryable errors
If retries cannot be performed due to errors such as invalid resource configurations, the task status changes to Start Failed regardless of the retry or fault tolerance policy. EventBridge does not retry these errors because the underlying issue requires manual intervention.
Fault tolerance policies
A fault tolerance policy controls how EventBridge handles an event that still fails after all retries are exhausted. Each event stream supports two fault tolerance policies:
| Policy | Behavior after retries exhausted | Effect on subsequent events |
|---|---|---|
| Fault tolerance allowed | Event is sent to the DLQ (if configured) or discarded | Processing continues |
| Fault tolerance prohibited | Task status changes to Ready | Processing blocked until you resolve the issue |
Fault tolerance allowed
When fault tolerance is allowed, delivery failures do not block event processing. After all retries are exhausted, EventBridge sends the event to the DLQ if one is configured, or discards it otherwise, and then continues processing subsequent events.
Choose this policy when event loss is acceptable or when you have a DLQ configured to capture failed events.
Fault tolerance prohibited
When fault tolerance is prohibited, delivery failures block event processing after all retries are exhausted. The task status changes to Ready, and no further events are processed until you resolve the issue.
Choose this policy when every event must be delivered and you prefer to halt processing rather than lose events.
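The combined effect of the retry outcome, the fault tolerance policy, and the DLQ setting can be summarized as a small decision function. This is an illustrative model of the rules described in this topic, not an EventBridge API: `Outcome` and `outcome_after_failure` are hypothetical names, and the three boolean inputs are assumed to be known to the caller.

```python
from enum import Enum


class Outcome(Enum):
    START_FAILED = "task status changes to Start Failed"
    STREAM_BLOCKED = "stream blocked, task status changes to Ready"
    SENT_TO_DLQ = "event sent to the DLQ, processing continues"
    DISCARDED = "event discarded, processing continues"


def outcome_after_failure(retryable, fault_tolerance_allowed, dlq_configured):
    """Return what happens to an event whose delivery ultimately fails."""
    # Non-retryable errors (for example, invalid resource configuration)
    # bypass both the retry policy and the fault tolerance policy.
    if not retryable:
        return Outcome.START_FAILED
    # From here on, all retries are assumed to be exhausted.
    if not fault_tolerance_allowed:
        return Outcome.STREAM_BLOCKED
    return Outcome.SENT_TO_DLQ if dlq_configured else Outcome.DISCARDED


# Fault tolerance allowed, but no DLQ: the event is lost.
print(outcome_after_failure(True, True, False).value)
```

Only one combination loses data silently: fault tolerance allowed with no DLQ configured.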
Dead-letter queues
A dead-letter queue (DLQ) captures events that fail delivery after all retries are exhausted. When you enable a DLQ on a task, EventBridge sends the raw event data to the DLQ instead of discarding it. The DLQ feature is disabled by default.
Supported DLQ targets
The following services are supported as DLQ targets:
| Service | Description |
|---|---|
| ApsaraMQ for RocketMQ | Message queue service |
| Simple Message Queue (formerly MNS) | Lightweight message queue service |
| ApsaraMQ for Kafka | Kafka-compatible message queue service |
| EventBridge event bus | Route failed events to another event bus for further processing |
When to enable a DLQ
Enable a DLQ when you need to:
- Inspect and debug events that failed delivery
- Reprocess failed events after fixing the root cause
- Maintain a record of all delivery failures for auditing
If you use the Fault tolerance allowed policy without a DLQ, failed events are permanently discarded after retries are exhausted. To prevent data loss, configure a DLQ before you select the Fault tolerance allowed policy.