ApsaraMQ for Kafka: Retry policies and dead-letter queues

Last Updated: Mar 11, 2026

Message delivery can fail due to downstream outages, misconfigurations, or transient network issues. Message Integration retries delivery automatically based on the configured retry policy. A fault tolerance policy controls what happens when retries are exhausted. For undeliverable messages, a dead-letter queue (DLQ) captures the raw data for later inspection and reprocessing.

Choose a retry and fault tolerance strategy

Use the following table to select the right combination of retry policy and fault tolerance policy for your workload.

| Scenario | Retry policy | Fault tolerance | Dead-letter queue |
| --- | --- | --- | --- |
| Fast failure detection; stale messages lose value | Backoff retry (default) | Allowed | Optional |
| Transient failures that may take time to resolve | Exponential decay retry | Allowed | Recommended |
| Every message must be delivered, even at the cost of blocking | Backoff or Exponential decay | Prohibited | Optional |

Retry policies

A retry policy controls how Message Integration retries failed message deliveries. Two policies are available:

| Policy | Max retries | Interval | Total duration |
| --- | --- | --- | --- |
| Backoff retry (default) | 3 | 10 to 20 seconds (random) | N/A |
| Exponential decay retry | 176 | 1 to 512 seconds (exponential) | 1 day |

Backoff retry

Backoff retry is the default policy. A failed message is retried up to 3 times. The interval between consecutive retries is a random value from 10 to 20 seconds.

Use backoff retry when fast failure detection matters more than delivery persistence. For example, choose this policy when downstream issues are typically resolved quickly or when stale messages lose their value.
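The backoff retry behavior can be sketched in a few lines of Python. This is an illustration of the timing rules described above, not the service's actual implementation: the function names, the `deliver` callback, and the choice of a uniform random distribution are all assumptions made for the example.

```python
import random
import time

MAX_RETRIES = 3          # backoff retry: up to 3 retries
MIN_INTERVAL_S = 10.0    # lower bound of the random interval
MAX_INTERVAL_S = 20.0    # upper bound of the random interval


def backoff_retry_delays(rng=None):
    """Return one random delay (in seconds) per retry.

    Assumption: the interval is drawn uniformly from 10 to 20 seconds.
    The source only says the value is random within that range.
    """
    rng = rng or random.Random()
    return [rng.uniform(MIN_INTERVAL_S, MAX_INTERVAL_S) for _ in range(MAX_RETRIES)]


def deliver_with_backoff(deliver, message, sleep=time.sleep, rng=None):
    """Attempt delivery once, then retry up to MAX_RETRIES times.

    `deliver` is a hypothetical callback that returns True on success.
    Returns True if any attempt succeeds, False once retries are exhausted.
    """
    if deliver(message):
        return True
    for delay in backoff_retry_delays(rng):
        sleep(delay)  # wait 10 to 20 seconds before the next attempt
        if deliver(message):
            return True
    return False
```

In the worst case, this sketch gives up after three retries, which is between 30 and 60 seconds after the first failed attempt. Passing a custom `sleep` function makes the logic easy to exercise without waiting.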

Exponential decay retry

Exponential decay retry increases the interval between retries exponentially, up to a maximum of 512 seconds. A failed message is retried up to 176 times over a total window of 1 day.

The retry intervals follow this progression:

1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s

After the interval reaches 512 seconds, the remaining 166 retries continue at that interval, so the 512-second interval is used 167 times in total.

Use exponential decay retry for transient failures that may take longer to resolve, such as service outages or network issues.
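To check how the exponential decay schedule adds up to roughly one day, the following Python sketch generates the intervals from the numbers given in this section (176 retries, doubling from 1 second, capped at 512 seconds). The function name is hypothetical, and the code only models the arithmetic, not how the service schedules retries internally.

```python
def exponential_decay_intervals(max_retries=176, cap_seconds=512):
    """Return the wait (in seconds) before each retry.

    The interval starts at 1 second and doubles after every retry
    until it reaches the cap, then stays at the cap.
    """
    intervals = []
    interval = 1
    for _ in range(max_retries):
        intervals.append(interval)
        interval = min(interval * 2, cap_seconds)
    return intervals


intervals = exponential_decay_intervals()
total = sum(intervals)

print(len(intervals))            # 176
print(intervals[:10])            # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(intervals.count(512))      # 167
print(total)                     # 86015
print(f"{total / 3600:.1f} h")   # 23.9 h
```

The waits sum to 86,015 seconds, just under the 86,400 seconds in a day, which matches the stated total window of 1 day.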

Fault tolerance policies

A fault tolerance policy determines how Message Integration handles a message after all retries are exhausted.

| Policy | After retries exhausted | Task status |
| --- | --- | --- |
| Fault tolerance allowed | Message goes to the dead-letter queue or is discarded. Other messages continue processing. | Unchanged |
| Fault tolerance prohibited | All message processing stops. | Changes to Ready |

Fault tolerance allowed

When fault tolerance is allowed, a delivery failure does not block the processing of other messages. After the maximum number of retries is reached, the failed message is:

  • Sent to the dead-letter queue, if you configured one.

  • Discarded, if no dead-letter queue exists.

Fault tolerance prohibited

When fault tolerance is prohibited, a delivery failure blocks all message processing after retries are exhausted. The task status automatically changes to Ready, and processing stops until the issue is resolved.

Note

If retries cannot be performed due to invalid resource configurations (for example, a deleted or misconfigured target), the task status changes to Startup Failed regardless of the fault tolerance policy.
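The decision rules in this section can be summarized as a small Python function. This is a sketch for reasoning about the outcomes, not an API of the service: the `FaultTolerance` enum, the `Outcome` type, the action labels, and the `retries_possible` flag are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FaultTolerance(Enum):
    ALLOWED = "allowed"
    PROHIBITED = "prohibited"


@dataclass(frozen=True)
class Outcome:
    message_action: Optional[str]  # what happens to the failed message (None: not specified)
    task_status: Optional[str]     # new task status (None: status is unchanged)


def outcome_after_failure(policy, dlq_configured, retries_possible=True):
    """Model what happens to a message that could not be delivered."""
    if not retries_possible:
        # Invalid resource configuration: applies regardless of the policy.
        return Outcome(message_action=None, task_status="Startup Failed")
    if policy is FaultTolerance.ALLOWED:
        # Other messages keep flowing; only the failed message is affected.
        action = "send_to_dlq" if dlq_configured else "discard"
        return Outcome(message_action=action, task_status=None)
    # Fault tolerance prohibited: processing stops until the issue is resolved.
    return Outcome(message_action="stop_processing", task_status="Ready")


print(outcome_after_failure(FaultTolerance.ALLOWED, dlq_configured=True))
print(outcome_after_failure(FaultTolerance.ALLOWED, dlq_configured=False))
print(outcome_after_failure(FaultTolerance.PROHIBITED, dlq_configured=True))
```

Note that with fault tolerance allowed and no dead-letter queue, the message is lost, which is why a dead-letter queue matters when the data must be kept.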

Dead-letter queues

A dead-letter queue (DLQ) captures messages that fail to be delivered after all retries are exhausted. Instead of discarding these messages, Message Integration sends the raw data to the DLQ for later inspection and reprocessing.

Dead-letter queues are scoped to individual tasks and are disabled by default.

Supported destinations

The following services can serve as dead-letter queue targets:

  • ApsaraMQ for RocketMQ queues

  • Simple Message Queue (formerly MNS) queues

  • ApsaraMQ for Kafka topics

  • EventBridge event buses

When to enable a dead-letter queue

Enable a dead-letter queue when:

  • Failed messages contain business-critical data that must not be lost.

  • You need to inspect and debug delivery failures after the fact.

  • Downstream consumers require eventual delivery of all messages, even after transient failures.