Fault drill

Test your application's resilience to zone-level failures by stopping and resuming the ApsaraMQ for RocketMQ service in a specific availability zone (AZ). A zone failure can make the service instances in that zone unavailable, causing partial or complete service disruption. Use fault drills to find weaknesses in your messaging architecture before an actual failure occurs.

ApsaraMQ for RocketMQ supports multi-zone deployment and provides a built-in fault drill feature that simulates this scenario. During a drill, the service in one zone of your instance is stopped and then resumed, mimicking a real zone-level outage.

Supported editions

Fault drill is available for Dedicated Edition instances in the Serverless series and Platinum Edition instances in the non-Serverless series (subscription and pay-as-you-go) of ApsaraMQ for RocketMQ 5.x.

Plan your drill

Before starting a fault drill, take the following steps to maximize its value:

Define steady-state metrics. Identify the metrics that matter for your application, such as message throughput, consumer lag, end-to-end latency, and error rate. Record baseline values so you can compare them during and after the drill.
Review your architecture. Confirm that your producers and consumers are deployed across multiple zones and that your client SDK is configured for automatic reconnection.
Verify cluster capacity. Make sure the remaining zones have enough capacity to handle the full workload. If the remaining capacity is insufficient after a zone is stopped, service interruptions may occur.
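
The capacity check above comes down to simple arithmetic: the combined capacity of the zones that stay up must be at least your peak workload. The following Python sketch illustrates the check. The per-zone capacity figures, the zone names, and the function name `can_absorb_zone_loss` are hypothetical and are not part of any ApsaraMQ for RocketMQ API. Replace the numbers with values that reflect your own instance specification and traffic.

```python
def can_absorb_zone_loss(zone_capacity_tps, peak_load_tps, stopped_zone):
    """Return True if the zones left after stopping one can carry the peak load."""
    # Sum the capacity of every zone except the one that will be stopped.
    remaining = sum(
        tps for zone, tps in zone_capacity_tps.items() if zone != stopped_zone
    )
    return remaining >= peak_load_tps


# Hypothetical figures for illustration only.
capacity = {"zone-a": 4000, "zone-b": 4000, "zone-c": 4000}

print(can_absorb_zone_loss(capacity, peak_load_tps=7500, stopped_zone="zone-b"))
print(can_absorb_zone_loss(capacity, peak_load_tps=9000, stopped_zone="zone-b"))
```

With these example numbers, the two remaining zones provide 8,000 TPS, so a peak of 7,500 TPS fits and a peak of 9,000 TPS does not.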

Expected behaviors during a drill

The following behaviors are expected during and immediately after a fault drill:

Behavior	When it occurs
Client connections are briefly interrupted and then automatically reconnected	During the zone stop
Delivery of accumulated (backlogged) messages is delayed	During the zone stop
Ordered messages may be briefly delivered out of order	During the zone stop
Duplicate messages may appear	After the service is resumed

Note

While a drill task is running, you cannot upgrade, downgrade, or modify the instance.

How a fault drill works

A fault drill follows five sequential stages:

Create task --> Stop service --> Verify application --> Resume service --> End drill
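
If you track drills in a runbook or script, it can help to model the five stages explicitly so that no stage is skipped. The following Python sketch is one way to do that. The `DrillStage` enum and the `next_stage` helper are hypothetical names used for local bookkeeping only; they do not call the console or any ApsaraMQ for RocketMQ API.

```python
from enum import Enum


class DrillStage(Enum):
    """The five sequential stages of a fault drill, in order."""
    CREATE_TASK = 1
    STOP_SERVICE = 2
    VERIFY_APPLICATION = 3
    RESUME_SERVICE = 4
    END_DRILL = 5


def next_stage(stage):
    """Return the stage that follows `stage`, or None after END_DRILL."""
    members = list(DrillStage)  # Enum members iterate in definition order.
    idx = members.index(stage)
    return members[idx + 1] if idx + 1 < len(members) else None


# Walk through the stages in order, as a checklist.
stage = DrillStage.CREATE_TASK
while stage is not None:
    print(stage.name)
    stage = next_stage(stage)
```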

Run a fault drill

Step 1: Create a drill task

Log on to the ApsaraMQ for RocketMQ console. In the top navigation bar, select a region, such as China (Hangzhou).
In the left-side navigation pane, choose RocketMQ Copilot > Fault Drill.
On the Fault Drill page, click Create Task.

In the Create Task panel, configure the following parameters, and then click OK.

Parameter	Description
Task Name	A descriptive name for the drill task, such as `az-b-drill-2026-03`
Instance	The ApsaraMQ for RocketMQ instance to test

Step 2: Stop the service in a zone

On the Fault Drill page, click the name of the drill task.
On the drill details page, select a zone and click Stop Service. The service in the selected zone begins shutting down. Client connections to that zone are interrupted.

Step 3: Verify your application

While the zone is down, verify that your application continues to function:

Message production: Confirm that producers can still send messages through the remaining zones.
Message consumption: Confirm that consumers reconnect and continue processing messages.
Latency and errors: Compare current metrics against the baseline values you recorded during planning. Check for abnormal spikes.
Alerts: Review any alerts triggered by the zone stop.

Identify and fix any issues before proceeding.
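
The baseline comparison in this step can be automated in a simple way. The following Python sketch flags any metric that moved more than a given fraction in the wrong direction relative to its baseline. The metric names, the sample values, and the 20% tolerance are assumptions chosen for illustration. How you collect the values (for example, from your monitoring system) is outside the scope of this sketch.

```python
# Metrics where a drop, not a rise, indicates a problem (assumed names).
HIGHER_IS_BETTER = {"throughput_tps"}


def find_regressions(baseline, current, tolerance=0.2):
    """Return {metric: (baseline, current)} for metrics that got worse
    by more than `tolerance`, expressed as a fraction of the baseline."""
    regressions = {}
    for name, base in baseline.items():
        now = current[name]
        if name in HIGHER_IS_BETTER:
            worse = now < base * (1 - tolerance)
        else:
            worse = now > base * (1 + tolerance)
        if worse:
            regressions[name] = (base, now)
    return regressions


# Hypothetical values recorded before the drill and during the zone stop.
baseline = {"throughput_tps": 1000, "consumer_lag": 50,
            "p99_latency_ms": 40, "error_rate": 0.001}
during_stop = {"throughput_tps": 950, "consumer_lag": 400,
               "p99_latency_ms": 45, "error_rate": 0.001}

print(find_regressions(baseline, during_stop))
```

In this example only `consumer_lag` exceeds the tolerance, which is consistent with the expected delay in delivering accumulated messages during the zone stop. Choose a tolerance per metric if a single threshold is too coarse for your workload.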

Step 4: Resume the service

On the Fault Drill page, click the name of the drill task.
On the drill details page, click Resume Service. The service in the stopped zone is restored. Duplicate messages may appear briefly as the zone rejoins the cluster.

Step 5: End the drill

On the Fault Drill page, click the name of the drill task.
On the drill details page, click End Drill.

Best practices

Start in a pre-production environment. Run your first drill against a staging instance before testing production workloads.
Schedule drills during low-traffic windows. Minimize the impact on end users by choosing off-peak hours.
Make consumers idempotent. Because duplicate messages may appear after the service is resumed, design your consumers so that processing the same message more than once has no additional effect.
Set up monitoring alerts. Configure alerts on message throughput, consumer lag, and error rate so you can detect anomalies during the drill in real time.
Document results. Record the drill date, zone tested, observed behaviors, and any issues found. Use these records to track improvements over time.
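
To illustrate the idempotent consumer practice above, the following Python sketch wraps a message handler so that a message with an already-seen ID is skipped. It is a minimal illustration, not the SDK's consumer API: the `IdempotentHandler` class is a hypothetical name, and the in-memory set is an assumption made to keep the example short. In a real deployment, the set of processed keys would typically live in a durable store shared by all consumer instances, and a business key (such as an order ID) is often a more reliable deduplication key than the message ID.

```python
class IdempotentHandler:
    """Wrap a processing function so each message ID is processed once."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # In-memory only; assumed for illustration.

    def handle(self, message_id, body):
        """Process the message unless its ID was already handled.

        Returns True if the message was processed, False if it was a duplicate.
        """
        if message_id in self._seen:
            return False
        self._process(body)
        # Record the ID only after processing succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen.add(message_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)

handler.handle("msg-001", "create order 42")
handler.handle("msg-001", "create order 42")  # Duplicate after resume.
handler.handle("msg-002", "create order 43")

print(processed)
```

The duplicate delivery of `msg-001` is ignored, so `processed` contains each order once.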