Understand retry policies and dead-letter queues for handling message failures - ApsaraMQ for Kafka

Message delivery can fail due to downstream outages, misconfigurations, or transient network issues. Message Integration retries delivery automatically based on the configured retry policy. A fault tolerance policy controls what happens when retries are exhausted. For undeliverable messages, a dead-letter queue (DLQ) captures the raw data for later inspection and reprocessing.

Choose a retry and fault tolerance strategy

Use the following table to select the right combination of retry policy and fault tolerance policy for your workload.

Scenario	Retry policy	Fault tolerance	Dead-letter queue
Fast failure detection; stale messages lose value	Backoff retry (default)	Allowed	Optional
Transient failures that may take time to resolve	Exponential decay retry	Allowed	Recommended
Every message must be delivered, even at the cost of blocking	Backoff retry or exponential decay retry	Prohibited	Optional

Retry policies

A retry policy controls how Message Integration retries failed message deliveries. Two policies are available:

Policy	Max retries	Interval	Total duration
Backoff retry (default)	3	10 to 20 seconds (random)	N/A
Exponential decay retry	176	1 to 512 seconds (exponential)	1 day

Backoff retry

Backoff retry is the default policy. A failed message is retried up to 3 times. The interval between consecutive retries is a random value from 10 to 20 seconds.

Use backoff retry when fast failure detection matters more than delivery persistence. For example, choose this policy when downstream issues are typically resolved quickly or when stale messages lose their value.
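The timing of backoff retry can be modelled with a short simulation. This is a minimal sketch, not the service's implementation: the function name `backoff_intervals` and its parameters are hypothetical, and it assumes that one random wait precedes each retry.

```python
import random


def backoff_intervals(max_retries=3, low=10.0, high=20.0, rng=None):
    """Return one random wait in seconds per retry, each drawn from [low, high]."""
    rng = rng or random.Random()
    return [rng.uniform(low, high) for _ in range(max_retries)]


# A seeded generator keeps the illustration reproducible.
intervals = backoff_intervals(rng=random.Random(0))
total_wait = sum(intervals)  # cumulative waiting time across the 3 retries
```

Under these assumptions the cumulative wait falls between 30 and 60 seconds, not counting the time each delivery attempt takes, which is why this policy suits fast failure detection.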

Exponential decay retry

Exponential decay retry increases the interval between retries exponentially, up to a maximum of 512 seconds. A failed message is retried up to 176 times over a total window of 1 day.

The retry intervals follow this progression:

1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s

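After reaching 512 seconds, retries continue at that interval. In total, 167 of the 176 retries use the 512-second interval, counting the first one in the progression above, and the intervals add up to 86,015 seconds, or roughly 1 day.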

Use exponential decay retry for transient failures that may take longer to resolve, such as service outages or network issues.
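The schedule can be checked with a few lines of arithmetic. The following sketch rebuilds the interval list from the figures stated in this section (176 retries, doubling from 1 second, capped at 512 seconds). The function name `exponential_decay_intervals` is a hypothetical helper, not part of any product API.

```python
def exponential_decay_intervals(max_retries=176, cap=512):
    """Return the wait in seconds before each retry: doubling from 1 s, capped at `cap`."""
    return [min(2 ** i, cap) for i in range(max_retries)]


schedule = exponential_decay_intervals()
total_seconds = sum(schedule)    # cumulative wait across all retries
at_cap = schedule.count(512)     # how many waits use the maximum interval
total_hours = total_seconds / 3600
```

With these figures, 167 of the 176 waits are 512 seconds long and the waits sum to 86,015 seconds, about 23.9 hours, which is consistent with the stated 1-day window.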

Fault tolerance policies

A fault tolerance policy determines how Message Integration handles a message after all retries are exhausted.

Policy	After retries exhausted	Task status
Fault tolerance allowed	Message is sent to the dead-letter queue if one is configured; otherwise it is discarded. Other messages continue processing.	Unchanged
Fault tolerance prohibited	All message processing stops.	Changes to Ready

Fault tolerance allowed

When fault tolerance is allowed, a delivery failure does not block the processing of other messages. After the maximum number of retries is reached, the failed message is:

Sent to the dead-letter queue, if you configured one.
Discarded, if no dead-letter queue exists.
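The two outcomes can be sketched as a small routing function. This is an illustration only: `handle_exhausted` is a hypothetical name, and a plain Python list stands in for a real dead-letter queue destination.

```python
from typing import List, Optional


def handle_exhausted(message: bytes, dead_letter_queue: Optional[List[bytes]]) -> str:
    """Route a message whose retries are exhausted and report what happened to it."""
    if dead_letter_queue is not None:
        dead_letter_queue.append(message)  # keep the raw data for later inspection
        return "dead-lettered"
    return "discarded"  # no DLQ configured: the message is dropped


dlq: List[bytes] = []
outcome_with_dlq = handle_exhausted(b"order-42", dlq)
outcome_without_dlq = handle_exhausted(b"order-43", None)
```

In both cases the function returns immediately, which mirrors the point that a failed message does not block the messages behind it.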

Fault tolerance prohibited

When fault tolerance is prohibited, a delivery failure blocks all message processing after retries are exhausted. The task status automatically changes to Ready, and processing stops until the issue is resolved.

Note

If retries cannot be performed due to invalid resource configurations (for example, a deleted or misconfigured target), the task status changes to Startup Failed regardless of the fault tolerance policy.
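The status rules in this section can be summarized as a decision function. This is a sketch under stated assumptions: `FaultTolerance`, `task_status_after_failure`, and the `config_valid` flag are hypothetical names, and "Running" is a placeholder for whatever status the task had before the failure.

```python
from enum import Enum


class FaultTolerance(Enum):
    ALLOWED = "allowed"
    PROHIBITED = "prohibited"


def task_status_after_failure(current_status, policy, config_valid=True):
    """Return the task status once a message can no longer be retried."""
    if not config_valid:
        return "Startup Failed"  # retries cannot run, regardless of the policy
    if policy is FaultTolerance.PROHIBITED:
        return "Ready"           # processing stops until the issue is resolved
    return current_status        # fault tolerance allowed: status is unchanged


allowed = task_status_after_failure("Running", FaultTolerance.ALLOWED)
prohibited = task_status_after_failure("Running", FaultTolerance.PROHIBITED)
broken = task_status_after_failure("Running", FaultTolerance.ALLOWED, config_valid=False)
```

The configuration check comes first because, as the note states, an invalid resource configuration overrides the fault tolerance policy.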

Dead-letter queues

A dead-letter queue (DLQ) captures messages that fail to be delivered after all retries are exhausted. Instead of discarding these messages, Message Integration sends the raw data to the DLQ for later inspection and reprocessing.

Dead-letter queues are scoped to individual tasks and are disabled by default.

Supported destinations

The following services can serve as dead-letter queue targets:

ApsaraMQ for RocketMQ queues
Simple Message Queue (formerly MNS) queues
ApsaraMQ for Kafka queues
EventBridge event buses

When to enable a dead-letter queue

Enable a dead-letter queue when:

Failed messages contain business-critical data that must not be lost.
You need to inspect and debug delivery failures after the fact.
Downstream consumers require eventual delivery of all messages, even after transient failures.