Test your application's resilience to zone-level failures by stopping and resuming the ApsaraMQ for RocketMQ service in a specific availability zone (AZ). A zone failure can make service instances in that zone unavailable, causing partial or complete service disruption. Use fault drills to identify vulnerabilities in your messaging architecture before an actual failure occurs.
ApsaraMQ for RocketMQ supports multi-zone deployment and provides a built-in fault drill feature that simulates this scenario. During a drill, the service in one zone of your instance is stopped and then resumed, mimicking a real zone-level outage.
Supported editions
Fault drill is available only for Platinum Edition instances of the ApsaraMQ for RocketMQ 5.x series.
Plan your drill
Before starting a fault drill, take the following steps to maximize its value:
Define steady-state metrics. Identify the metrics that matter for your application, such as message throughput, consumer lag, end-to-end latency, and error rate. Record baseline values so you can compare them during and after the drill.
Review your architecture. Confirm that your producers and consumers are deployed across multiple zones and that your client SDK is configured for automatic reconnection.
Verify cluster capacity. Make sure the remaining zones have enough capacity to handle the full workload. If the remaining capacity is insufficient after a zone is stopped, service interruptions may occur.
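The capacity check can be approximated with simple arithmetic before the drill. The following is a minimal sketch, assuming the stopped zone's load is spread across the surviving zones; the zone names, load figures, and capacity figures are hypothetical, and the real capacity model depends on your instance specification.

```python
def surviving_zone_utilization(zone_loads, zone_capacities, stopped_zone):
    """Project utilization of the remaining zones once one zone is stopped.

    Assumes the full workload (including the stopped zone's share) is
    redistributed across the zones that stay up.
    """
    total_load = sum(zone_loads.values())
    remaining_capacity = sum(
        capacity
        for zone, capacity in zone_capacities.items()
        if zone != stopped_zone
    )
    if remaining_capacity <= 0:
        raise ValueError("no capacity remains outside the stopped zone")
    return total_load / remaining_capacity


# Hypothetical peak throughput and capacity per zone, in messages per second.
loads = {"zone-a": 4000, "zone-b": 4000, "zone-c": 4000}
capacities = {"zone-a": 8000, "zone-b": 8000, "zone-c": 8000}

utilization = surviving_zone_utilization(loads, capacities, "zone-b")
print(f"Projected utilization without zone-b: {utilization:.0%}")
# prints: Projected utilization without zone-b: 75%
```

A projected value close to or above 100% suggests the remaining zones could not absorb the workload, so the drill would likely cause an interruption rather than a clean failover.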
Expected behaviors during a drill
The following behaviors are expected during and immediately after a fault drill:
| Behavior | When it occurs |
|---|---|
| Client connections are briefly interrupted and then automatically reconnected | During the zone stop |
| Delivery of accumulated (backlogged) messages is delayed | During the zone stop |
| Messages in ordered (FIFO) topics may be briefly delivered out of order | During the zone stop |
| Duplicate messages may appear | After the service is resumed |
While a drill task is running, you cannot upgrade, downgrade, or modify the instance.
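To see whether out-of-order delivery actually affects your application during a drill, a consumer can track the highest sequence number seen per ordering key. This is a sketch only: it assumes your producer embeds a monotonically increasing sequence number in each message, which is an application-level convention and not something the service provides. The class and field names are hypothetical.

```python
class OrderChecker:
    """Detect messages that arrive with a lower sequence number than one already seen."""

    def __init__(self):
        self._highest_seq = {}   # ordering key -> highest sequence number seen
        self.violations = []     # (ordering key, highest seen, late sequence number)

    def observe(self, ordering_key, seq):
        """Return True if the message is in order, False if it arrived late."""
        highest = self._highest_seq.get(ordering_key)
        if highest is not None and seq < highest:
            self.violations.append((ordering_key, highest, seq))
            return False
        self._highest_seq[ordering_key] = seq
        return True


checker = OrderChecker()
for key, seq in [("order-1", 1), ("order-1", 3), ("order-1", 2), ("order-2", 1)]:
    checker.observe(key, seq)

print(checker.violations)
# prints: [('order-1', 3, 2)]
```

A repeated sequence number is not counted as a violation here, because duplicates are a separate expected behavior.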
How a fault drill works
A fault drill follows five sequential stages:
Create task --> Stop service --> Verify application --> Resume service --> End drill
Run a fault drill
Step 1: Create a drill task
Log on to the ApsaraMQ for RocketMQ console. In the top navigation bar, select a region, such as China (Hangzhou).
In the left-side navigation pane, choose .
On the Fault Drill page, click Create Task.
In the Create Task panel, configure the following parameters, and then click OK.
| Parameter | Description |
|---|---|
| Task Name | A descriptive name for the drill task, such as az-b-drill-2026-03 |
| Instance | The ApsaraMQ for RocketMQ instance to test |
Step 2: Stop the service in a zone
On the Fault Drill page, click the name of the drill task.
On the drill details page, select a zone and click Stop Service. The service in the selected zone begins shutting down. Client connections to that zone are interrupted.
Step 3: Verify your application
While the zone is down, verify that your application continues to function:
Message production: Confirm that producers can still send messages through the remaining zones.
Message consumption: Confirm that consumers reconnect and continue processing messages.
Latency and errors: Compare current metrics against the baseline values you recorded during planning. Check for abnormal spikes.
Alerts: Review any alerts triggered by the zone stop.
Identify and fix any issues before proceeding.
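The comparison against baseline values can be automated so that deviations stand out quickly while the zone is down. The sketch below assumes you already collect the metrics yourself; the metric names, sample values, and threshold factors are hypothetical and should be tuned to your workload.

```python
def find_deviations(baseline, current, thresholds):
    """Return the names of metrics that fall outside their allowed range.

    thresholds maps a metric name to (kind, factor):
      "min": current value must stay at or above baseline * factor
      "max": current value must stay at or below baseline * factor
    """
    deviations = []
    for name, (kind, factor) in thresholds.items():
        limit = baseline[name] * factor
        value = current[name]
        too_low = (kind == "min") and (value < limit)
        too_high = (kind == "max") and (value > limit)
        if too_low or too_high:
            deviations.append(name)
    return deviations


# Hypothetical values recorded before the drill and during the zone stop.
baseline = {"throughput": 10000, "consumer_lag": 500, "p99_latency_ms": 40}
during_drill = {"throughput": 9200, "consumer_lag": 2100, "p99_latency_ms": 65}
thresholds = {
    "throughput": ("min", 0.8),      # tolerate up to a 20% drop
    "consumer_lag": ("max", 3.0),    # tolerate up to 3x the baseline lag
    "p99_latency_ms": ("max", 2.0),  # tolerate up to 2x the baseline latency
}

print(find_deviations(baseline, during_drill, thresholds))
# prints: ['consumer_lag']
```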
Step 4: Resume the service
On the Fault Drill page, click the name of the drill task.
On the drill details page, click Resume Service. The service in the stopped zone is restored. Duplicate messages may appear briefly as the zone rejoins the cluster.
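Because duplicates may appear at this point, consumers generally need to recognize a message they have already processed. The following is a minimal sketch of that idea, assuming each message carries a stable deduplication key such as a business ID. The class name and the in-memory set are illustrative assumptions; a production consumer would more likely use a durable store shared across consumer instances.

```python
class IdempotentHandler:
    """Wrap a processing function so that each deduplication key is processed once."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # in-memory only; lost on restart (illustration)

    def handle(self, dedup_key, body):
        """Process the message unless its key was already handled.

        Returns True if the message was processed, False if it was skipped.
        """
        if dedup_key in self._seen:
            return False
        self._process(body)
        # Mark as seen only after processing succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen.add(dedup_key)
        return True


processed = []
handler = IdempotentHandler(processed.append)

handler.handle("order-1001", "charge customer")
handler.handle("order-1001", "charge customer")  # duplicate delivery, skipped
handler.handle("order-1002", "ship parcel")

print(processed)
# prints: ['charge customer', 'ship parcel']
```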
Step 5: End the drill
On the Fault Drill page, click the name of the drill task.
On the drill details page, click End Drill.
Best practices
Start in a pre-production environment. Run your first drill against a staging instance before testing production workloads.
Schedule drills during low-traffic windows. Minimize the impact on end users by choosing off-peak hours.
Enable idempotent consumers. Because duplicate messages may appear after the service is resumed, design your consumers to handle duplicates gracefully.
Set up monitoring alerts. Configure alerts on message throughput, consumer lag, and error rate so you can detect anomalies during the drill in real time.
Document results. Record the drill date, zone tested, observed behaviors, and any issues found. Use these records to track improvements over time.
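Drill results are easier to compare over time when they are stored in a consistent structure. One possible shape, shown here as a sketch, is a small record serialized to JSON. The field names and sample values are hypothetical and simply mirror the items listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DrillRecord:
    drill_date: date
    zone: str
    observed_behaviors: list = field(default_factory=list)
    issues: list = field(default_factory=list)

    def to_json(self):
        """Serialize the record, writing the date in ISO 8601 form."""
        data = asdict(self)
        data["drill_date"] = self.drill_date.isoformat()
        return json.dumps(data, indent=2)


record = DrillRecord(
    drill_date=date(2026, 3, 14),
    zone="zone-b",
    observed_behaviors=["clients reconnected automatically", "brief consumer lag"],
    issues=["one consumer group was deployed in a single zone"],
)
print(record.to_json())
```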