ECS data backup and high availability architecture - Elastic Compute Service

To address logical errors such as accidental deletions and physical failures like zone or regional outages, ECS provides a layered protection strategy covering two core aspects: data durability and business continuity. For data backup and recovery, ECS offers granular recovery options ranging from file-level to block storage-level. For business continuity, ECS provides high availability architectures that withstand zone-level and region-level failures, enabling cloud architectures that meet varying business continuity objectives.

Enhance cloud business resilience

When you build cloud-based applications, data durability and business continuity are the two pillars of stable operations. Data durability keeps core data recoverable after logical errors or physical damage by means of backup mechanisms. Business continuity eliminates single points of failure and maintains service availability through redundant architectures and automated operations. Choose from the following options based on your business stage, budget, and disaster recovery requirements:

Cost-sensitive with limited resources: primarily needs daily data protection.
No complex architectural changes are required. Focus on building cost-effective data backup mechanisms. Refer to Data backup and recovery.
Business in growth stage: requires resilience against data center failures.
As a business grows, extended service interruptions result in significant losses. To guard against zone-level (data center) failures, implement a multi-zone high availability deployment. When a zone becomes unavailable, the system automatically routes traffic to healthy zones within the same region.
Business requires resilience against city-level disasters.
For financial services, gaming, and cross-border e-commerce, single-region high availability is no longer sufficient. Build a cross-region high availability architecture to withstand extreme regional disasters such as natural catastrophes or large-scale network outages.

Data backup and recovery

Data backup and recovery addresses data loss caused by corruption, accidental deletions, or infrastructure failures.

Combine the following protection mechanisms according to their features and protection scope.
- Snapshots: Backs up cloud disk data without client installation
  Charged based on snapshot type and capacity. See Snapshot billing.
  - Create snapshots: Create snapshots periodically. Roll back the disk when needed to recover from accidental deletions or failed application changes.
    Alternatively, create a custom image, then restore by replacing the OS.
  - Replicate snapshots: Create an automatic snapshot policy with cross-region replication enabled to handle region-level failures. On failure, create a data disk from the replicated snapshot and mount it to a standby instance.
- ECS File Backup Basic Edition: File-level backup through client installation within the region. Enables quick restoration of deleted files with simple recovery operations.
  Each Alibaba Cloud account (including RAM users) shares a 100 GiB free quota across all regions. Usage beyond this quota is charged based on total block storage capacity attached to ECS. For details, see file-backup-essential-edition-benefit-description.
- Cloud disks: Leverages native cloud disk capabilities to enable cross-zone/cross-region data backup.
  - Regional ESSDs: Data is replicated across multiple zones within the same region. During zone-level failures, force-attach the disk to instances in other zones for recovery.
    Charged based on disk capacity. See Block storage devices.
  - Async Replication: Asynchronously replicates disk data to another disk in a different zone or region based on block storage replication capabilities. On failure, trigger a manual failover, then attach the secondary disk to a standby instance.
    Fees include capacity charges for the target disk. For cross-region replication, additional disk replication charges apply.
Set your Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets. Quantify the cost of downtime and data loss with your stakeholders, then regularly validate your recovery targets through business continuity drills.
RPO: Maximum tolerable data loss, measured in time.
RTO: Maximum time from failure to full recovery.
Important
RPO and RTO are business metrics, not technical guarantees. Estimate your actual end-to-end values. Tighter targets cost more.
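
As a rough illustration of estimating end-to-end values, the sketch below derives a worst-case RPO from a backup interval plus replication lag, and an RTO from detection, restore, and verification time. The function names and the sample durations are hypothetical and are not taken from any ECS API; substitute figures measured in your own environment.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta,
                   replication_lag: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on data loss, expressed as time.

    A failure just before the next backup loses one full interval of writes,
    plus any delay before the latest copy reaches the recovery location.
    """
    return backup_interval + replication_lag


def estimated_rto(detect: timedelta, restore: timedelta,
                  verify: timedelta) -> timedelta:
    """End-to-end recovery time: detection + restore + verification."""
    return detect + restore + verify


# Hypothetical figures: snapshots every 6 hours, 30 minutes of copy lag.
rpo = worst_case_rpo(timedelta(hours=6), timedelta(minutes=30))
rto = estimated_rto(timedelta(minutes=10), timedelta(minutes=45),
                    timedelta(minutes=20))
print(rpo, rto)  # 6:30:00 1:15:00
```

If the estimate exceeds the target agreed with stakeholders, shorten the backup interval or move to a replication-based mechanism, accepting the higher cost.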

Multi-zone high availability

A single-instance deployment is a single point of failure: any failure, such as a hardware issue or a process crash, interrupts the service. Deploy multiple instances across zones behind an ALB so that health checks trigger automatic failover.

Application Load Balancer (ALB): Spreads traffic across healthy instances. Health checks pull unhealthy instances out of rotation. ALB works with Auto Scaling to swap out failures and bring new instances online.
Relational Database Service (RDS): RDS high-availability editions use primary-standby architecture across zones to persist data.
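
The health-check behavior described above can be pictured with a small simulation. This is a simplified model, not the ALB implementation or its API: the `Backend` and `HealthCheckedPool` names, the consecutive-failure threshold, and the rule that a single successful probe restores a backend are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    zone: str
    failures: int = 0  # consecutive failed health probes


class HealthCheckedPool:
    """Round-robin over backends that have not crossed the failure threshold."""

    def __init__(self, backends, unhealthy_threshold=3):
        self.backends = list(backends)
        self.unhealthy_threshold = unhealthy_threshold
        self._counter = 0

    def record_probe(self, name, ok):
        """Store a probe result; a success resets the failure count."""
        for backend in self.backends:
            if backend.name == name:
                backend.failures = 0 if ok else backend.failures + 1

    def healthy(self):
        return [b for b in self.backends
                if b.failures < self.unhealthy_threshold]

    def route(self):
        """Return the next healthy backend, skipping unhealthy ones."""
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy backends in any zone")
        backend = candidates[self._counter % len(candidates)]
        self._counter += 1
        return backend


pool = HealthCheckedPool(
    [Backend("ecs-a", "zone-a"), Backend("ecs-b", "zone-b")],
    unhealthy_threshold=2,
)
pool.record_probe("ecs-a", ok=False)
pool.record_probe("ecs-a", ok=False)  # zone-a instance is now out of rotation
print([pool.route().name for _ in range(3)])  # ['ecs-b', 'ecs-b', 'ecs-b']
```

Placing at least one backend in each zone is what lets the pool keep serving when every instance in one zone fails its probes.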

To further optimize performance and costs:

Store static files such as images and scripts in Object Storage Service (OSS) and leverage Content Delivery Network (CDN) to improve access speed.
Use Auto Scaling (ESS) to handle traffic fluctuations. ESS adjusts capacity by demand and replaces failed instances automatically.
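
To make "adjusts capacity by demand and replaces failed instances" concrete, here is a minimal sketch of a generic target-tracking calculation. It is an assumption for illustration, not the algorithm ESS uses; the function names, the CPU metric, and the group limits are hypothetical.

```python
import math


def desired_capacity(healthy: int, avg_cpu: float, target_cpu: float,
                     min_size: int, max_size: int) -> int:
    """Scale the healthy count so average CPU moves toward the target,
    then clamp the result to the group's size limits."""
    if healthy == 0:
        return min_size
    raw = math.ceil(healthy * avg_cpu / target_cpu)
    return max(min_size, min(max_size, raw))


def capacity_delta(healthy: int, avg_cpu: float, target_cpu: float,
                   min_size: int, max_size: int) -> int:
    """Positive: instances to launch. Negative: instances to remove."""
    return desired_capacity(healthy, avg_cpu, target_cpu,
                            min_size, max_size) - healthy


# One of four instances failed; the remaining three run at 80% CPU.
print(capacity_delta(healthy=3, avg_cpu=80, target_cpu=60,
                     min_size=2, max_size=10))  # 1
```

Counting only healthy instances means a failed instance is replaced by the same rule that handles load changes.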

Getting started

Multi-AZ traffic distribution: High-availability architecture on the cloud - Deploy ALB across availability zones to achieve zone-level high availability.
Self-healing with Auto Scaling: Automatic elasticity and stable delivery - Combine ALB with Auto Scaling to automatically replace failed instances and scale with demand.

Cross-region high availability

When a business is deployed in a single region, an extreme natural disaster or large-scale network outage that takes down the region's data centers can interrupt the service completely. To ensure business continuity, build a cross-region high availability architecture. The core approach is to extend cross-zone high availability across regions by combining redundant systems, global traffic management, and data synchronization, so that failover between regions is automatic.

Global Traffic Manager (GTM): Routes users based on geography or latency. Health checks monitor each region; when a region fails, GTM redirects traffic to healthy regions via DNS.
Data Transmission Service (DTS): Supports real-time bidirectional data sync across regions for active-active (unit-based) and disaster recovery scenarios.
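
The routing decision described for GTM can be sketched as a pure function: among regions that pass health checks, answer with the one that has the lowest latency for the user. This models the idea only and does not call GTM; the `resolve` function and the latency figures are hypothetical, and the region IDs are used as sample values.

```python
def resolve(user_latency_ms: dict, region_healthy: dict) -> str:
    """Pick the healthy region with the lowest latency for this user."""
    candidates = [region for region, ok in region_healthy.items()
                  if ok and region in user_latency_ms]
    if not candidates:
        raise RuntimeError("no healthy region to route to")
    return min(candidates, key=lambda region: user_latency_ms[region])


latency = {"cn-hangzhou": 12.0, "cn-beijing": 38.0}
health = {"cn-hangzhou": True, "cn-beijing": True}
print(resolve(latency, health))  # cn-hangzhou

health["cn-hangzhou"] = False    # regional health check fails
print(resolve(latency, health))  # cn-beijing
```

Because the redirect happens through DNS, clients that cached the previous answer may keep using it until the record's TTL expires, so include that delay when you estimate RTO.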

To further optimize performance and costs:

Store static files such as images and scripts in Object Storage Service (OSS) and leverage Content Delivery Network (CDN) to improve access speed.
Use Auto Scaling (ESS) to handle traffic fluctuations. ESS adjusts capacity by demand and replaces failed instances automatically.

Getting started

Cross-region high availability: How GTM implements off-site disaster recovery - Use GTM to route traffic across regions and maintain service availability during regional incidents.

Business continuity drills

Once your solution is in place, drill regularly:

Simulate failures. Run full recovery. Measure actual RPO and RTO against targets.
Confirm that quotas, network configs, and security policies in the failover zone or region are ready.
After recovery, verify that data is complete and consistent and that applications work as expected.
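
One possible way to record a drill and compare the measured values against targets is sketched below. The `DrillRecord` type, its field names, and the timestamps are hypothetical; the sketch assumes you can identify the last write that survived recovery and the moment the service was confirmed healthy again.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillRecord:
    last_recoverable_write: datetime  # newest data present after recovery
    failure_injected: datetime        # when the simulated failure started
    service_restored: datetime        # when the service was verified healthy

    @property
    def measured_rpo(self) -> timedelta:
        return self.failure_injected - self.last_recoverable_write

    @property
    def measured_rto(self) -> timedelta:
        return self.service_restored - self.failure_injected


def evaluate(record: DrillRecord, rpo_target: timedelta,
             rto_target: timedelta) -> dict:
    """Compare measured values against the agreed targets."""
    return {
        "rpo_met": record.measured_rpo <= rpo_target,
        "rto_met": record.measured_rto <= rto_target,
    }


record = DrillRecord(
    last_recoverable_write=datetime(2024, 1, 1, 9, 40),
    failure_injected=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 10, 50),
)
print(evaluate(record, rpo_target=timedelta(minutes=30),
               rto_target=timedelta(minutes=45)))
# {'rpo_met': True, 'rto_met': False}
```

A missed target, as with the RTO in this sample, is the signal to revisit either the architecture or the target itself.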