ApsaraMQ for RocketMQ: Fault drill

Last Updated: Mar 10, 2026

Test your application's resilience to zone-level failures by stopping and resuming the ApsaraMQ for RocketMQ service in a specific availability zone (AZ). A zone failure can make service instances in that zone unavailable, causing partial or complete service disruption. Use fault drills to identify vulnerabilities in your messaging architecture before an actual failure occurs.

ApsaraMQ for RocketMQ supports multi-zone deployment and provides a built-in fault drill feature that simulates this scenario. During a drill, the service in one zone of your instance is stopped and then resumed, mimicking a real zone-level outage.

Supported editions

Fault drill is available only for Platinum Edition instances of the ApsaraMQ for RocketMQ 5.x series.

Plan your drill

Before starting a fault drill, take the following steps to maximize its value:

  1. Define steady-state metrics. Identify the metrics that matter for your application, such as message throughput, consumer lag, end-to-end latency, and error rate. Record baseline values so you can compare them during and after the drill.

  2. Review your architecture. Confirm that your producers and consumers are deployed across multiple zones and that your client SDK is configured for automatic reconnection.

  3. Verify cluster capacity. Make sure the remaining zones have enough capacity to handle the full workload. If the remaining capacity is insufficient after a zone is stopped, service interruptions may occur.
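The capacity check in step 3 can be reasoned about with a small calculation. The following is a minimal sketch, assuming you know an approximate per-zone throughput capacity and your peak load; the zone names, the TPS figures, and the `headroom` safety margin are all hypothetical inputs, not values provided by ApsaraMQ for RocketMQ.

```python
from typing import Dict


def can_survive_zone_loss(
    zone_capacity_tps: Dict[str, float],
    peak_load_tps: float,
    headroom: float = 0.2,
) -> Dict[str, bool]:
    """For each zone, report whether the other zones could absorb the peak load.

    zone_capacity_tps: assumed throughput capacity per zone (hypothetical figures).
    peak_load_tps:     peak workload the whole instance must carry.
    headroom:          extra safety margin on top of the peak (0.2 = 20%).
    """
    total_capacity = sum(zone_capacity_tps.values())
    required = peak_load_tps * (1 + headroom)
    return {
        zone: (total_capacity - capacity) >= required
        for zone, capacity in zone_capacity_tps.items()
    }


if __name__ == "__main__":
    zones = {"zone-a": 5000.0, "zone-b": 5000.0, "zone-c": 5000.0}
    for zone, ok in can_survive_zone_loss(zones, peak_load_tps=8000.0).items():
        print(f"stop {zone}: {'enough capacity' if ok else 'insufficient capacity'}")
```

If any zone maps to `False`, stopping that zone would leave less capacity than the peak load plus the margin, which is the situation the step above warns about.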

Expected behaviors during a drill

The following behaviors are expected during and immediately after a fault drill:

| Behavior | When it occurs |
| --- | --- |
| Client connections are briefly interrupted and then automatically reconnected | During the zone stop |
| Delivery of stacked messages is delayed | During the zone stop |
| Messages in sequential topics may be briefly delivered out of order | During the zone stop |
| Duplicate messages may appear | After the service is resumed |
Note: While a drill task is running, you cannot upgrade, downgrade, or modify the instance.
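To observe the out-of-order and duplicate behaviors listed above in your own test traffic, one option is to have test producers embed a per-key sequence number in each message and check the delivery log afterwards. The sketch below assumes exactly that; the `(key, sequence)` tuples and the function name are hypothetical and are not part of any RocketMQ SDK.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_delivery_anomalies(
    deliveries: Iterable[Tuple[str, int]],
) -> List[Tuple[str, int, str]]:
    """Scan (key, sequence) pairs in delivery order and flag anomalies.

    A sequence number already seen for the same key is a "duplicate".
    A new sequence number lower than the highest seen for that key
    is "out_of_order".
    """
    seen: Dict[str, Set[int]] = {}
    highest: Dict[str, int] = {}
    anomalies: List[Tuple[str, int, str]] = []
    for key, seq in deliveries:
        key_seen = seen.setdefault(key, set())
        if seq in key_seen:
            anomalies.append((key, seq, "duplicate"))
        elif key in highest and seq < highest[key]:
            anomalies.append((key, seq, "out_of_order"))
        key_seen.add(seq)
        highest[key] = max(highest.get(key, seq), seq)
    return anomalies


if __name__ == "__main__":
    log = [("order-1", 1), ("order-1", 2), ("order-1", 4),
           ("order-1", 3), ("order-1", 4), ("order-2", 1)]
    for key, seq, kind in find_delivery_anomalies(log):
        print(f"{key} seq={seq}: {kind}")
```

Running a check like this on the consumer's delivery log after the drill gives a concrete count of reordered and duplicated messages to record in the drill results.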

How a fault drill works

A fault drill follows five sequential stages:

Create task --> Stop service --> Verify application --> Resume service --> End drill
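If you track drills in a runbook script, the five stages above can be modeled as a strictly ordered sequence. This is a minimal sketch of a local tracker; the class and method names are hypothetical, and it does not call or represent the console's internal task model.

```python
from enum import Enum
from typing import List, Optional


class DrillStage(Enum):
    """The five stages, in the order the flow above lists them."""
    CREATE_TASK = 1
    STOP_SERVICE = 2
    VERIFY_APPLICATION = 3
    RESUME_SERVICE = 4
    END_DRILL = 5


def next_stage(current: DrillStage) -> Optional[DrillStage]:
    """Return the stage that follows `current`, or None after the last one."""
    if current is DrillStage.END_DRILL:
        return None
    return DrillStage(current.value + 1)


class DrillRunbook:
    """Hypothetical local tracker that only allows moving to the next stage."""

    def __init__(self) -> None:
        self.stage = DrillStage.CREATE_TASK
        self.history: List[DrillStage] = [self.stage]

    def advance(self) -> DrillStage:
        following = next_stage(self.stage)
        if following is None:
            raise RuntimeError("The drill has already ended.")
        self.stage = following
        self.history.append(following)
        return following


if __name__ == "__main__":
    runbook = DrillRunbook()
    while next_stage(runbook.stage) is not None:
        print("now at:", runbook.advance().name)
```

Because `advance` can only move forward by one stage, the tracker cannot record a resumed service before a verification step, which mirrors the sequential flow.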

Run a fault drill

Step 1: Create a drill task

  1. Log on to the ApsaraMQ for RocketMQ console. In the top navigation bar, select a region, such as China (Hangzhou).

  2. In the left-side navigation pane, choose RocketMQ Copilot > Fault Drill.

  3. On the Fault Drill page, click Create Task.

  4. In the Create Task panel, configure the following parameters, and then click OK.

    | Parameter | Description |
    | --- | --- |
    | Task Name | A descriptive name for the drill task, such as az-b-drill-2026-03 |
    | Instance | The ApsaraMQ for RocketMQ instance to test |
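If you run drills regularly, a consistent naming scheme makes past tasks easier to find. The helper below is a small sketch that reproduces the pattern of the example name above (zone label, then year and month); the function name and the zone label argument are hypothetical, and any naming constraints enforced by the console are not covered here.

```python
from datetime import date


def drill_task_name(zone_label: str, when: date) -> str:
    """Build a task name such as 'az-b-drill-2026-03'.

    zone_label: short label for the zone under test, for example "b".
    when:       planned drill date; only the year and month are used.
    """
    label = zone_label.strip().lower()
    if not label:
        raise ValueError("zone_label must not be empty")
    return f"az-{label}-drill-{when:%Y-%m}"


if __name__ == "__main__":
    print(drill_task_name("b", date(2026, 3, 10)))
```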

Step 2: Stop the service in a zone

  1. On the Fault Drill page, click the name of the drill task.

  2. On the drill details page, select a zone and click Stop Service. The service in the selected zone begins shutting down. Client connections to that zone are interrupted.
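While connections to the stopped zone are interrupted, individual send calls in your application may fail before the client reconnects. The following is a minimal sketch of an application-level retry with exponential backoff and jitter. It assumes a hypothetical zero-argument `send` callable that raises `ConnectionError` on failure; a real client SDK raises its own exception types, so adapt the `except` clause to the SDK you use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def send_with_retry(
    send: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send` until it succeeds or `max_attempts` is reached.

    The wait before each retry is a random value between 0 and an
    exponentially growing cap ("full jitter"), limited by `max_delay`.
    `sleep` is injectable so the function can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

A wrapper like this supplements, rather than replaces, the automatic reconnection that the client SDK performs. Keep `max_attempts` and `max_delay` small enough that a prolonged outage is reported to your monitoring instead of being retried silently.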

Step 3: Verify your application

While the zone is down, verify that your application continues to function:

  • Message production: Confirm that producers can still send messages through the remaining zones.

  • Message consumption: Confirm that consumers reconnect and continue processing messages.

  • Latency and errors: Compare current metrics against the baseline values you recorded during planning. Check for abnormal spikes.

  • Alerts: Review any alerts triggered by the zone stop.

Identify and fix any issues before proceeding.
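Comparing metrics against the baseline is easier to repeat if the comparison is scripted. The sketch below flags metrics whose relative change from the baseline exceeds a tolerance. The metric names, the sample values, and the 25% default tolerance are illustrative assumptions; use the metrics and thresholds you defined while planning.

```python
from typing import Dict


def compare_to_baseline(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.25,
) -> Dict[str, float]:
    """Return the metrics whose relative change exceeds `tolerance`.

    The result maps metric name to relative change, where 0.5 means +50%
    and -0.5 means -50%. A metric with a zero baseline is reported as
    infinite change if its current value is non-zero.
    Metrics missing from `current` are skipped.
    """
    deviations: Dict[str, float] = {}
    for name, base in baseline.items():
        if name not in current:
            continue
        value = current[name]
        if base == 0:
            change = float("inf") if value != 0 else 0.0
        else:
            change = (value - base) / base
        if abs(change) > tolerance:
            deviations[name] = change
    return deviations


if __name__ == "__main__":
    baseline = {"throughput_tps": 1000.0, "consumer_lag": 200.0,
                "p99_latency_ms": 40.0, "error_rate": 0.0}
    during_drill = {"throughput_tps": 950.0, "consumer_lag": 900.0,
                    "p99_latency_ms": 45.0, "error_rate": 0.0}
    for metric, change in compare_to_baseline(baseline, during_drill).items():
        print(f"{metric}: {change:+.0%} vs baseline")
```

Some deviation is expected while a zone is stopped, for example higher consumer lag, so treat the output as a list of items to review rather than as automatic failures.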

Step 4: Resume the service

  1. On the Fault Drill page, click the name of the drill task.

  2. On the drill details page, click Resume Service. The service in the stopped zone is restored. Duplicate messages may appear briefly as the zone rejoins the cluster.

Step 5: End the drill

  1. On the Fault Drill page, click the name of the drill task.

  2. On the drill details page, click End Drill.

Best practices

  • Start in a pre-production environment. Run your first drill against a staging instance before testing production workloads.

  • Schedule drills during low-traffic windows. Minimize the impact on end users by choosing off-peak hours.

  • Make consumers idempotent. Because duplicate messages may appear after the service is resumed, design your consumers so that processing the same message more than once has no additional effect.

  • Set up monitoring alerts. Configure alerts on message throughput, consumer lag, and error rate so you can detect anomalies during the drill in real time.

  • Document results. Record the drill date, zone tested, observed behaviors, and any issues found. Use these records to track improvements over time.
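The idempotent-consumer practice above can be illustrated with a small deduplication wrapper. This is a minimal in-memory sketch: the class name, the `message_id` argument, and the bounded cache size are hypothetical, and it is not an API of any RocketMQ SDK. A production consumer would typically deduplicate on a business key and keep the record in a durable store so that it survives restarts.

```python
from collections import OrderedDict
from typing import Any, Callable


class IdempotentHandler:
    """Wrap a message handler so that repeated message IDs are skipped."""

    def __init__(self, handler: Callable[[Any], None], max_tracked: int = 10_000) -> None:
        self._handler = handler
        self._max_tracked = max_tracked
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def handle(self, message_id: str, payload: Any) -> bool:
        """Process the payload once per message ID.

        Returns True if the payload was processed, False if it was
        skipped as a duplicate.
        """
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep recently seen IDs longest
            return False
        # Record the ID only after the handler succeeds, so a failed
        # attempt can be retried when the message is redelivered.
        self._handler(payload)
        self._seen[message_id] = None
        if len(self._seen) > self._max_tracked:
            self._seen.popitem(last=False)  # evict the oldest tracked ID
        return True


if __name__ == "__main__":
    processed = []
    consumer = IdempotentHandler(processed.append)
    for msg_id, body in [("msg-1", "create order"), ("msg-1", "create order"), ("msg-2", "pay")]:
        consumer.handle(msg_id, body)
    print(processed)
```

The bounded cache trades memory for a limited deduplication window: an ID that has been evicted is processed again if it reappears, so size `max_tracked` to cover the period in which duplicates are plausible after the service is resumed.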