Multi-region instance disaster recovery - ApsaraMQ for MQTT

When a regional outage or network disruption affects your primary ApsaraMQ for MQTT instance, cross-region disaster recovery redirects device traffic to a secondary instance in another region. Devices reconnect without configuration changes because ApsaraMQ for MQTT handles permission compatibility across instances during the switchover.

Important

The disaster recovery feature addresses only the issue of permission compatibility during traffic switchover between instances. It does not synchronize metadata, permissions, or message data. You must manually create identical metadata and permissions on all involved instances.

Limitations

Edition: Enterprise Platinum Edition instances only.
Activation: Submit a ticket to enable disaster recovery.
Instance count: No limit on the number of instances or regions. Deploy three or more instances for higher availability.
Data isolation: Messages on each instance are isolated. The disaster recovery feature does not synchronize message data between instances.

How it works

The following example uses two ApsaraMQ for MQTT instances in the China (Hangzhou) and China (Shanghai) regions.

An edge device connects to a cloud-based ApsaraMQ for MQTT instance through either its region endpoint or the global disaster recovery endpoint.
The cloud-based instance is compatible with the device parameters (instance ID, username, and password) and maps the connection to the instance in the device's region. Both instances must belong to the same Alibaba Cloud account.
The backend server is deployed across multiple regions and connected through Cloud Enterprise Network (CEN) over virtual private cloud (VPC). It subscribes to device status notifications from ApsaraMQ for MQTT instances in every region and maintains a global route table. When sending a message to a device, the backend server queries the route table to identify which instance the device is connected to, then pushes the message through that instance.
Each ApsaraMQ for MQTT instance provides an internal IP address for cross-region VPC connectivity.
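The global route table described above can be sketched as a small in-memory map. This is a minimal illustration, not the product's implementation: the class name, the `connect` and `disconnect` event names, and the instance and client IDs are all assumptions, and the real status notification format may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GlobalRouteTable:
    """Maps each client ID to the instance it is currently connected to."""

    routes: Dict[str, str] = field(default_factory=dict)

    def on_status(self, instance_id: str, client_id: str, event: str) -> None:
        # "connect" and "disconnect" are hypothetical event names.
        if event == "connect":
            self.routes[client_id] = instance_id
        elif event == "disconnect":
            # Clear the route only if it still points at the reporting instance,
            # so a late disconnect from the old instance cannot erase the
            # route that a newer connect already recorded.
            if self.routes.get(client_id) == instance_id:
                del self.routes[client_id]

    def lookup(self, client_id: str) -> Optional[str]:
        # Returns None when the device is not known to be online anywhere.
        return self.routes.get(client_id)


table = GlobalRouteTable()
table.on_status("mqtt-hangzhou", "device-001", "connect")
table.on_status("mqtt-shanghai", "device-001", "connect")     # device moved
table.on_status("mqtt-hangzhou", "device-001", "disconnect")  # arrives late
target_instance = table.lookup("device-001")
```

The backend server would then push the message through the instance named in `target_instance`. Guarding the disconnect branch matters during a switchover, when notifications from two regions can arrive out of order.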

Note

During a switchover, only the public endpoint (virtual IP address, or VIP) changes. Internal connections are not affected.

Key concepts

Permission compatibility

Disaster recovery does not synchronize data between instances, including permissions. To maintain consistent access control:

Create identical metadata on both the primary and secondary instances.
Configure the same topic and group permissions on every instance involved in disaster recovery.
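Because nothing is synchronized automatically, it can help to check for drift between instances before relying on a switchover. The following sketch compares permission rules that you have already exported from each instance. The rule tuple layout and the instance names are assumptions for illustration, and how you export the rules is up to you.

```python
from typing import Dict, Set, Tuple

# (resource type, resource name, account, permission): a hypothetical layout.
Rule = Tuple[str, str, str, str]


def find_missing_rules(instances: Dict[str, Set[Rule]]) -> Dict[str, Set[Rule]]:
    """Return, per instance, the rules that exist on some other instance
    but are missing on this one. An empty result means all instances match."""
    all_rules: Set[Rule] = set().union(*instances.values())
    missing = {}
    for name, rules in instances.items():
        gap = all_rules - rules
        if gap:
            missing[name] = gap
    return missing


hangzhou = {
    ("topic", "sensors", "device-account", "pub"),
    ("group", "GID_backend", "server-account", "sub"),
}
shanghai = {
    ("topic", "sensors", "device-account", "pub"),
}
drift = find_missing_rules({"hangzhou": hangzhou, "shanghai": shanghai})
```

Here `drift` reports that the group permission is missing on the Shanghai instance, which would cause the backend subscription to fail after a switchover to that region.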

Instance ID mapping

Each device stores its instance ID, username, and password locally. When a switchover redirects traffic to a different region, the target instance cannot recognize these parameters by default. ApsaraMQ for MQTT resolves this through instance ID mapping: after the instance ID switches from Instance A to Instance B, the device automatically uses Instance B's context for connection and messaging.

Disaster recovery domain names

Two types of domain names are available:

Domain name type	Behavior	When to use
Instance domain name	Routes to a specific instance in the nearest region. Each instance has its own domain name. During a switchover, the primary instance's domain name is pointed to the secondary instance's VIP.	Nearest-region access with manual failover control.
Global disaster recovery domain name	Routes to any healthy instance. When the primary instance fails, its VIP is automatically removed from this domain name.	Automatic failover without nearest-region routing.

Switchover methods

ApsaraMQ for MQTT supports three switchover methods. Choose the one that fits your operational model.

Method	Initiated by	Automation level
API switchover	You (manual)	Manual trigger via API call
Automatic inspection	ApsaraMQ for MQTT	Fully automatic
Device connection behavior	Device DNS resolution	Passive (depends on DNS cache TTL)

API switchover

Call the DisasterDowngrade or DisasterRecovery API operation to manually trigger a switchover. During the switchover:

The primary instance's domain name is repointed to the secondary instance's VIP.
The primary instance's VIP is removed from the global disaster recovery domain name.
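The two DNS-level effects of a switchover can be modelled with a toy record set. This sketch does not call the DisasterDowngrade or DisasterRecovery operations, and it is not how the service is implemented. The class, the domain names, and the IP addresses (taken from documentation-reserved ranges) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DisasterRecoveryDns:
    # Instance domain name -> VIP it currently resolves to.
    instance_records: Dict[str, str]
    # VIPs currently served behind the global disaster recovery domain name.
    global_record: Set[str] = field(default_factory=set)

    def switch_over(self, primary_domain: str, secondary_domain: str) -> None:
        primary_vip = self.instance_records[primary_domain]
        secondary_vip = self.instance_records[secondary_domain]
        # Effect 1: the primary's domain name now resolves to the secondary's VIP.
        self.instance_records[primary_domain] = secondary_vip
        # Effect 2: the primary's VIP no longer appears behind the global name.
        self.global_record.discard(primary_vip)


dns = DisasterRecoveryDns(
    instance_records={
        "hangzhou.mqtt.example.com": "203.0.113.10",
        "shanghai.mqtt.example.com": "198.51.100.20",
    },
    global_record={"203.0.113.10", "198.51.100.20"},
)
dns.switch_over("hangzhou.mqtt.example.com", "shanghai.mqtt.example.com")
```

After the call, both instance domain names resolve to the Shanghai VIP, and the global record contains only that VIP.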

Automatic inspection

A global inspection mechanism continuously monitors instance health across regions. If an instance fails health checks, traffic is automatically switched to a healthy instance without manual intervention.

Device connection behavior

Devices reach instances through VIPs resolved from DNS, so a switchover in progress does not interrupt connected devices. After a domain name is repointed to a different VIP:

Existing connections continue until the device reconnects.
New connections resolve to the updated VIP.
DNS cache expiration determines when devices pick up the new VIP. To force an immediate switchover, disconnect devices from the primary instance so they re-resolve DNS and connect to the secondary instance.
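On the device side, the key point is to resolve the domain name again on every reconnect attempt instead of reusing a previously resolved IP address. The following is a minimal sketch under stated assumptions: `connect` is a hypothetical callable that you supply (for example, a wrapper around your MQTT client's connect call), and the retry count and delays are illustrative. Note that `socket.getaddrinfo` goes through the operating system resolver, which may keep its own cache.

```python
import socket
import time
from typing import Callable, List


def resolve_ipv4(host: str, port: int) -> List[str]:
    """Resolve the host name again instead of reusing an earlier result."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return [info[4][0] for info in infos]


def reconnect(
    host: str,
    port: int,
    connect: Callable[[str, int], object],
    resolve: Callable[[str, int], List[str]] = resolve_ipv4,
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Try to connect, re-resolving DNS before every attempt."""
    for attempt in range(attempts):
        try:
            addresses = resolve(host, port)
        except OSError:  # socket.gaierror is a subclass of OSError
            addresses = []
        for ip in addresses:
            try:
                return connect(ip, port)
            except OSError:
                continue  # try the next address, if any
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise ConnectionError(
        f"could not connect to {host}:{port} after {attempts} attempts"
    )
```

Many MQTT client libraries resolve the host name on each reconnect on their own, in which case this logic is not needed. It is worth confirming for your client, because a device that pins the old IP address keeps retrying the failed primary instance.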