Message delivery can fail due to downstream outages, misconfigurations, or transient network issues. Message Integration retries delivery automatically based on the configured retry policy. A fault tolerance policy controls what happens when retries are exhausted. For undeliverable messages, a dead-letter queue (DLQ) captures the raw data for later inspection and reprocessing.
Choose a retry and fault tolerance strategy
Use the following table to select the right combination of retry policy and fault tolerance policy for your workload.
| Scenario | Retry policy | Fault tolerance | Dead-letter queue |
|---|---|---|---|
| Fast failure detection; stale messages lose value | Backoff retry (default) | Allowed | Optional |
| Transient failures that may take time to resolve | Exponential decay retry | Allowed | Recommended |
| Every message must be delivered, even at the cost of blocking | Backoff retry or exponential decay retry | Prohibited | Optional |
Retry policies
A retry policy controls how Message Integration retries failed message deliveries. Two policies are available:
| Policy | Max retries | Interval | Total duration |
|---|---|---|---|
| Backoff retry (default) | 3 | 10 to 20 seconds (random) | N/A |
| Exponential decay retry | 176 | 1 to 512 seconds (exponential) | 1 day |
Backoff retry
Backoff retry is the default policy. A failed message is retried up to 3 times. The interval between consecutive retries is a random value from 10 to 20 seconds.
Use backoff retry when fast failure detection matters more than delivery persistence. For example, choose this policy when downstream issues are typically resolved quickly or when stale messages lose their value.
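The wait time implied by this policy can be estimated with a short sketch. This is an illustration only, not Message Integration's implementation: the function name `backoff_intervals` and the use of a uniform distribution are assumptions, since the policy only states that the interval is a random value from 10 to 20 seconds.

```python
import random

# Values taken from the policy description above.
MAX_RETRIES = 3
MIN_INTERVAL_S = 10.0
MAX_INTERVAL_S = 20.0


def backoff_intervals(rng=None):
    """Return one possible list of waits (in seconds) before each retry.

    Hypothetical helper: a uniform distribution is assumed here; the
    service only documents "a random value from 10 to 20 seconds".
    """
    if rng is None:
        rng = random.Random()
    return [rng.uniform(MIN_INTERVAL_S, MAX_INTERVAL_S) for _ in range(MAX_RETRIES)]


intervals = backoff_intervals(random.Random(0))  # seeded for repeatability
total_wait_s = sum(intervals)
print(f"waits: {[round(i, 1) for i in intervals]}, total: {total_wait_s:.1f}s")
```

Under these assumptions, the combined wait before retries are exhausted falls between 30 and 60 seconds, which is why this policy suits workloads that need fast failure detection.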
Exponential decay retry
Exponential decay retry increases the interval between retries exponentially, up to a maximum of 512 seconds. A failed message is retried up to 176 times over a total window of 1 day.
The retry intervals follow this progression:
1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s
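Only the first nine retries use the shorter intervals (1 to 256 seconds). The remaining 167 retries each use the 512-second interval, which brings the total retry window to about 86,015 seconds, or roughly one day.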
Use exponential decay retry for transient failures that may take longer to resolve, such as service outages or network issues.
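The schedule can be checked with a few lines of arithmetic. The sketch below assumes the interval starts at 1 second, doubles on every retry, and is capped at 512 seconds for all 176 retries. The function name `exponential_decay_intervals` is hypothetical and does not correspond to any Message Integration API.

```python
MAX_RETRIES = 176  # from the policy description
CAP_S = 512        # maximum interval in seconds


def exponential_decay_intervals(max_retries=MAX_RETRIES, cap_s=CAP_S):
    """Return the wait (in seconds) before each retry: doubling, then capped."""
    return [min(2 ** i, cap_s) for i in range(max_retries)]


intervals = exponential_decay_intervals()
total_s = sum(intervals)

print(intervals[:10])                    # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(intervals.count(CAP_S))            # 167 retries wait the full 512 seconds
print(total_s, round(total_s / 3600, 2)) # 86015 23.89
```

Under these assumptions the waits add up to 86,015 seconds, or about 23.9 hours, which matches the stated total window of 1 day.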
Fault tolerance policies
A fault tolerance policy determines how Message Integration handles a message after all retries are exhausted.
| Policy | After retries exhausted | Task status |
|---|---|---|
| Fault tolerance allowed | Message goes to the dead-letter queue if one is configured; otherwise it is discarded. Other messages continue processing. | Unchanged |
| Fault tolerance prohibited | All message processing stops. | Changes to Ready |
Fault tolerance allowed
When fault tolerance is allowed, a delivery failure does not block the processing of other messages. After the maximum number of retries is reached, the failed message is:
- Sent to the dead-letter queue, if you configured one.
- Discarded, if no dead-letter queue exists.
Fault tolerance prohibited
When fault tolerance is prohibited, a delivery failure blocks all message processing after retries are exhausted. The task status automatically changes to Ready, and processing stops until the issue is resolved.
If retries cannot be performed due to invalid resource configurations (for example, a deleted or misconfigured target), the task status changes to Startup Failed regardless of the fault tolerance policy.
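The two fault tolerance policies can be summarized as a small decision model. The following is a sketch for reasoning about the behavior described above, not the service's code. The `Task` class, the function names, and the initial `"Running"` status are all hypothetical; only the `Ready` and `Startup Failed` statuses come from this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FaultTolerance(Enum):
    ALLOWED = "allowed"
    PROHIBITED = "prohibited"


@dataclass
class Task:
    """Hypothetical model of a task; not a Message Integration type."""
    fault_tolerance: FaultTolerance
    dead_letter_queue: Optional[list] = None  # None means no DLQ is configured
    status: str = "Running"                   # assumed initial status


def on_retries_exhausted(task: Task, raw_message: bytes) -> str:
    """Model what happens to a message once all retries have failed."""
    if task.fault_tolerance is FaultTolerance.PROHIBITED:
        task.status = "Ready"  # processing stops until the issue is resolved
        return "blocked"
    if task.dead_letter_queue is not None:
        task.dead_letter_queue.append(raw_message)  # raw data kept for inspection
        return "dead-lettered"
    return "discarded"  # allowed, but no DLQ: the message is dropped


def on_invalid_resource(task: Task) -> None:
    """Model an invalid target: applies regardless of the fault tolerance policy."""
    task.status = "Startup Failed"


with_dlq = Task(FaultTolerance.ALLOWED, dead_letter_queue=[])
print(on_retries_exhausted(with_dlq, b"payload"), with_dlq.status)

strict = Task(FaultTolerance.PROHIBITED)
print(on_retries_exhausted(strict, b"payload"), strict.status)
```

In this model, only the prohibited policy changes the task status after retries are exhausted, while the presence of a dead-letter queue decides whether an undeliverable message is kept or dropped.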
Dead-letter queues
A dead-letter queue (DLQ) captures messages that fail to be delivered after all retries are exhausted. Instead of discarding these messages, Message Integration sends the raw data to the DLQ for later inspection and reprocessing.
Dead-letter queues are scoped to individual tasks and are disabled by default.
Supported destinations
The following services can serve as dead-letter queue targets:
- ApsaraMQ for RocketMQ queues
- Simple Message Queue (formerly MNS) queues
- ApsaraMQ for Kafka queues
- EventBridge event buses
When to enable a dead-letter queue
Enable a dead-letter queue when:
- Failed messages contain business-critical data that must not be lost.
- You need to inspect and debug delivery failures after the fact.
- Downstream consumers require eventual delivery of all messages, even after transient failures.