Elastic Compute Service: ECS data backup and high availability architecture

Last Updated: Apr 01, 2026

To address logical errors such as accidental deletions and physical failures like zone or regional outages, ECS provides a layered protection strategy covering two core aspects: data durability and business continuity. For data backup and recovery, ECS offers granular recovery options ranging from file-level to block storage-level. For business continuity, ECS provides high availability architectures that withstand zone-level and region-level failures, enabling cloud architectures that meet varying business continuity objectives.

Enhance cloud business resilience

When building cloud-based applications, data durability and business continuity form two pillars that ensure stable operations. Data durability protects core data recoverability through backup mechanisms against logical errors or physical damage. Business continuity eliminates single points of failure and maintains service availability through redundant architectures and automated operations. Based on business stage, budget, and disaster recovery requirements, choose from the following options:

  • Cost-sensitive with limited resources: primarily needs daily data protection.

    No complex architectural changes required. Focus on building cost-effective data backup mechanisms. Refer to Data backup and recovery.

  • Business in growth stage: requires resilience against data center failures.

    As businesses grow, extended service interruptions result in significant losses. To guard against zone-level (data center) failures, implement multi-zone high availability deployment. When a zone becomes unavailable, the system automatically routes traffic to healthy zones within the same region.

  • Business requires resilience against city-level disasters.

    For financial services, gaming, and cross-border e-commerce, single-region high availability no longer suffices. Build a cross-region high availability architecture to withstand extreme regional disasters such as natural catastrophes or large-scale network outages, safeguarding your business lifeline.

Data backup and recovery

Data backup and recovery addresses data loss caused by corruption, accidental deletions, or infrastructure failures.

  1. Based on features and protection scope, flexibly combine protection mechanisms.

    • Snapshots: Back up cloud disk data without client installation.

      Charged based on snapshot type and capacity. See Snapshot billing.
    • ECS File Backup Basic Edition: File-level backup through client installation within the region. Enables quick restoration of deleted files with simple recovery operations.

      Each Alibaba Cloud account (including RAM users) shares a 100 GiB free quota across all regions. Usage beyond this quota is charged based on total block storage capacity attached to ECS. For details, see file-backup-essential-edition-benefit-description.
    • Cloud disks: Leverages native cloud disk capabilities to enable cross-zone/cross-region data backup.

      • Regional ESSDs: Data is replicated across multiple zones within the same region. During zone-level failures, force-attach the disk to instances in other zones for recovery.

        Charged based on disk capacity. See Block storage devices.
      • Async Replication: Asynchronously replicates disk data to another disk in a different zone or region based on block storage replication capabilities. On failure, trigger a manual failover, then attach the secondary disk to a standby instance.

        Fees include capacity charges for the target disk. For cross-region replication, additional disk replication charges apply.
  2. Set your Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets. Quantify the cost of downtime and data loss with your stakeholders, then regularly validate your recovery targets through business continuity drills.

    RPO: Maximum tolerable data loss, measured in time.
    RTO: Maximum time from failure to full recovery.
    Important

    RPO and RTO are business metrics, not technical guarantees. Estimate your actual end-to-end values. Tighter targets cost more.
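To make the two metrics concrete, the following is a minimal sketch of how end-to-end values might be estimated before a drill. The `RecoveryPlan` class, its field names, and the recovery step names are hypothetical and not part of any ECS API. It assumes backups start at a fixed interval and that recovery steps run one after another.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical description of a backup schedule and recovery runbook."""
    backup_interval_min: float            # time between the starts of two backups
    backup_duration_min: float            # time one backup needs to complete
    recovery_steps_min: dict[str, float]  # assumed step name -> duration in minutes

    def worst_case_rpo(self) -> float:
        # If a failure hits just before the current backup completes, the last
        # usable backup started one interval plus one backup duration earlier.
        return self.backup_interval_min + self.backup_duration_min

    def estimated_rto(self) -> float:
        # Assumes the recovery steps run sequentially, so durations add up.
        return sum(self.recovery_steps_min.values())


def meets_targets(plan: RecoveryPlan, rpo_target_min: float, rto_target_min: float) -> dict:
    """Compare estimated values with the business targets (all in minutes)."""
    return {
        "rpo_ok": plan.worst_case_rpo() <= rpo_target_min,
        "rto_ok": plan.estimated_rto() <= rto_target_min,
    }


# Example: daily backups that take 30 minutes, with a four-step recovery runbook.
plan = RecoveryPlan(
    backup_interval_min=1440,
    backup_duration_min=30,
    recovery_steps_min={
        "detect_failure": 10,
        "restore_disk": 20,
        "attach_and_boot": 10,
        "verify_application": 15,
    },
)
result = meets_targets(plan, rpo_target_min=1440, rto_target_min=60)
```

In this example a nominal "24-hour" backup schedule does not meet a 24-hour RPO target once the backup duration is counted, which is why the end-to-end estimate matters more than the schedule alone.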

Multi-zone high availability

A single-instance deployment puts availability at risk: any failure, such as a hardware issue or a process crash, interrupts service. Deploy multiple instances across zones behind ALB so that health checks trigger automatic failover.

  • Application Load Balancer (ALB): Spreads traffic across healthy instances. Health checks pull unhealthy instances out of rotation. ALB works with Auto Scaling to swap out failures and bring new instances online.

  • Relational Database Service (RDS): RDS high-availability editions use primary-standby architecture across zones to persist data.
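The health-check behavior described above can be illustrated with a small local simulation. This is a sketch of the general technique only, not ALB's implementation or API: the `Backend` and `HealthCheckedPool` names, the failure threshold of three, and the rule that a single successful probe restores a backend are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    zone: str
    consecutive_failures: int = 0
    healthy: bool = True


class HealthCheckedPool:
    """Round-robin over backends that currently pass health checks."""

    def __init__(self, backends, unhealthy_threshold=3):
        self.backends = list(backends)
        self.unhealthy_threshold = unhealthy_threshold  # assumed value
        self._next = 0

    def record_probe(self, name, ok):
        """Record one health probe result for the named backend."""
        for backend in self.backends:
            if backend.name == name:
                if ok:
                    # Simplification: one success marks the backend healthy again.
                    backend.consecutive_failures = 0
                    backend.healthy = True
                else:
                    backend.consecutive_failures += 1
                    if backend.consecutive_failures >= self.unhealthy_threshold:
                        backend.healthy = False
                return
        raise KeyError(name)

    def pick(self):
        """Return the next healthy backend in round-robin order."""
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            raise RuntimeError("no healthy backends")
        backend = healthy[self._next % len(healthy)]
        self._next += 1
        return backend


pool = HealthCheckedPool([Backend("web-1", "zone-a"), Backend("web-2", "zone-b")])

# Simulate a zone-a failure: three failed probes take web-1 out of rotation.
for _ in range(3):
    pool.record_probe("web-1", ok=False)
zones_during_outage = {pool.pick().zone for _ in range(4)}
```

While `web-1` is marked unhealthy, every request in the simulation lands in `zone-b`, which is the within-region failover the multi-zone deployment relies on.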

To further optimize performance and costs:

Getting started

Cross-region high availability

When a business is deployed in a single region, an extreme natural disaster or large-scale network outage that takes down the entire region's data centers can interrupt service completely. To ensure business continuity, build a cross-region high availability architecture. The core approach is to extend cross-zone high availability across regions, using redundant systems, global traffic management, and data synchronization to enable automatic regional failover.

  • Global Traffic Manager (GTM): Routes users based on geography or latency. Health checks monitor each region; when a region fails, GTM redirects traffic to healthy regions via DNS.

  • Data Transmission Service (DTS): Supports real-time bidirectional data sync across regions for active-active (unit-based) and disaster recovery scenarios.
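The routing decision that GTM makes can be sketched as a pure function: prefer the lowest-latency region among those that pass health checks. The function name, the input shapes, and the latency figures below are hypothetical and do not reflect GTM's actual configuration model. The region IDs are used only as sample values.

```python
def resolve_region(region_health, client_latency_ms):
    """Pick the healthy region with the lowest latency for a client.

    region_health: dict mapping region ID -> bool (result of health checks)
    client_latency_ms: dict mapping region ID -> measured latency in ms
    Returns a region ID, or None when no region is healthy.
    """
    candidates = [region for region, healthy in region_health.items() if healthy]
    if not candidates:
        return None
    # Regions without a latency measurement sort last.
    return min(candidates, key=lambda r: client_latency_ms.get(r, float("inf")))


health = {"cn-hangzhou": True, "cn-beijing": True}
latency = {"cn-hangzhou": 12, "cn-beijing": 35}

normal = resolve_region(health, latency)

# Simulate a regional outage: health checks for cn-hangzhou start failing.
health["cn-hangzhou"] = False
during_outage = resolve_region(health, latency)
```

Because the redirect happens at the DNS layer, clients that have cached an earlier answer keep using it until the record's TTL expires, so the observed failover time includes that cache lifetime.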

To further optimize performance and costs:

Getting started

Cross-region high availability: How GTM implements off-site disaster recovery. Use GTM to route traffic across regions and maintain service availability during regional incidents.

Business continuity drills

Once your solution is in place, drill regularly:

  1. Simulate failures. Run full recovery. Measure actual RPO and RTO against targets.

  2. Confirm that quotas, network configs, and security policies in the failover zone or region are ready.

  3. After recovery, verify that data is complete and consistent and that applications work.
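For the data verification step, one possible approach is to compare checksums of the restored files against a manifest taken from the source. The sketch below uses only the Python standard library; the function names and the directory layout are assumptions, and it covers file content only (not permissions, ownership, or database-level consistency).

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its content digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(expected, restored):
    """Report files that are missing, unexpected, or changed after recovery."""
    return {
        "missing": sorted(set(expected) - set(restored)),
        "extra": sorted(set(restored) - set(expected)),
        "changed": sorted(
            name
            for name in expected.keys() & restored.keys()
            if expected[name] != restored[name]
        ),
    }
```

A drill might call `build_manifest` on the source before the simulated failure and again on the restored copy, then treat any non-empty list from `compare_manifests` as a failed check. Files written after the last backup are expected to show up as differences, which gives a direct view of the actual RPO.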