ECS data reliability with triplicate storage - Elastic Compute Service

ESSD cloud disks achieve 99.9999999% (nine nines) data reliability with local redundancy and 99.9999999999% (twelve nines) with zone redundancy through triplicate storage and end-to-end data validation.
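The two reliability figures can be compared with a little arithmetic. The sketch below assumes each figure is read as the probability that a given unit of data survives, so the loss probability is 1 minus the figure. The source does not state the time period or the unit of data, and the helper names are hypothetical.

```python
from fractions import Fraction


def loss_probability(nines: int) -> Fraction:
    """Loss probability for a reliability figure with `nines` nines.

    Nine nines (99.9999999%) is 1 - 10**-9, so the loss probability is 10**-9.
    Fractions are used to avoid floating-point rounding at these magnitudes.
    """
    return Fraction(1, 10 ** nines)


def expected_losses(units: int, nines: int) -> Fraction:
    """Expected number of lost units out of `units` independent units."""
    return units * loss_probability(nines)


local_redundancy = loss_probability(9)    # local redundancy, nine nines
zone_redundancy = loss_probability(12)    # zone redundancy, twelve nines

# How many times smaller the loss probability is with zone redundancy.
improvement = local_redundancy / zone_redundancy
```

Under these assumptions, moving from nine to twelve nines reduces the loss probability by a factor of 1,000.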

Technical advantages

Data durability: Each piece of data is replicated to three copies on different physical nodes and racks. If one or two replicas become unavailable, the remaining replicas continue to serve reads and writes.
Data integrity: The system generates and verifies a checksum at each stage of the write and storage process. A mismatch triggers immediate error correction to prevent data corruption during transfer and storage. This validation is hardware-accelerated with negligible impact on read and write performance.
Automatic fault recovery: When the system detects a storage node failure or insufficient replicas, it restores data from a healthy replica to re-establish the full three-replica state. The recovery process is transparent to your applications.

Protection scenarios

Data unavailability due to hardware failure
- Challenge: Disk damage, server downtime, or rack power outages can make data on the affected physical devices inaccessible.
- Technical protection: The triplicate storage mechanism distributes data across different physical nodes. On failure, the system fails over to healthy replicas and rebuilds a new replica in the background with no impact on your business.

Silent data corruption
- Challenge: Memory bit flips, network transmission errors, or disk firmware degradation can cause undetected data corruption that is difficult to catch with traditional methods.
- Technical protection: End-to-end data validation generates checksums at each step of the write process. On read, the system verifies these checksums and triggers immediate error correction on mismatch, ensuring the data read matches the data written.

Important

These technologies protect against hardware failures and data corruption at the infrastructure layer. Application-level risks such as accidental deletion or virus attacks require snapshots.

Triplicate storage mechanism

Triplicate storage addresses data unavailability caused by hardware failures. The system replicates each piece of data written to a cloud disk into three copies and stores them on different physical nodes.

Data write process

The system uses a multi-replica synchronous write mechanism. A write succeeds only when data is written to all replicas; otherwise it fails. This ensures strong consistency: any subsequent read returns the most recently written data.
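The all-or-nothing rule above can be illustrated with a minimal in-memory sketch. The `Replica` class and `synchronous_write` function are hypothetical names, not part of any Alibaba Cloud API, and the rollback of partially applied writes is an assumption: the source only states that the write fails.

```python
class Replica:
    """Hypothetical in-memory stand-in for one storage replica."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blocks = {}  # offset -> bytes

    def write(self, offset, data):
        if not self.healthy:
            raise IOError(f"replica {self.name} is unavailable")
        self.blocks[offset] = data


_MISSING = object()  # marks "no previous value at this offset"


def synchronous_write(replicas, offset, data):
    """Return True only if every replica stored the data.

    If any replica fails, undo the writes already applied (an assumption
    made for this sketch) and return False.
    """
    applied = []
    for replica in replicas:
        previous = replica.blocks.get(offset, _MISSING)
        try:
            replica.write(offset, data)
        except IOError:
            for done, old in applied:
                if old is _MISSING:
                    done.blocks.pop(offset, None)
                else:
                    done.blocks[offset] = old
            return False
        applied.append((replica, previous))
    return True
```

Because a write is acknowledged only after all replicas hold the new data, a read served by any replica after a successful write sees that data.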

Replica placement strategy

To prevent correlated failures such as multiple replicas lost to a rack power outage, the triplicate storage mechanism follows this placement strategy:

Rack isolation: The three replicas are distributed across storage nodes on different racks. A single-machine or single-rack failure does not affect data availability.
Fault domain isolation: For local-redundancy ESSD cloud disks, replicas are distributed across different racks within the same zone. For zone-redundancy ESSD cloud disks, replicas span different zones, upgrading disaster recovery from rack-level to zone-level.
Load balancing: While satisfying isolation requirements, the system also considers storage capacity, I/O load, and network topology to balance resource utilization and performance.
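The placement rules above can be sketched as a greedy selection: isolation is a hard constraint and load is a tiebreaker. This is an illustration only, not the actual scheduler. The `Node` type, the single `load` score (standing in for capacity, I/O load, and network topology), and the `place_replicas` function are all hypothetical. For local redundancy, the candidate list is assumed to already be restricted to one zone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    zone: str
    load: float  # hypothetical combined score; lower is better


def place_replicas(nodes, copies=3, zone_redundant=False):
    """Pick `copies` nodes, each in a different fault domain.

    The fault domain is the rack for local redundancy and the zone for
    zone redundancy. Less loaded nodes are preferred.
    """
    def domain(node):
        return node.zone if zone_redundant else node.rack

    chosen, used_domains = [], set()
    for node in sorted(nodes, key=lambda n: n.load):
        if domain(node) in used_domains:
            continue  # would put two replicas in the same fault domain
        chosen.append(node)
        used_domains.add(domain(node))
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough distinct fault domains for all replicas")
```

Sorting by load first and filtering by fault domain second means the isolation requirement is never traded away for a lighter node.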

Fault recovery process

When the system detects insufficient replicas, it selects a healthy storage node that meets the isolation policy and copies data from an existing replica to restore the three-replica state. This process is transparent to your applications and requires no manual intervention.
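A minimal sketch of this recovery step follows, assuming replicas and spare nodes are described by plain dictionaries and that rack isolation is the policy in force. The field names and the `repair` function are hypothetical.

```python
def repair(replicas, spares, copies=3):
    """Return the replica set after re-replication.

    replicas: list of dicts with keys "node", "rack", "healthy", "data".
    spares:   list of dicts with keys "node", "rack" (candidate targets).

    Unhealthy replicas are dropped. New replicas are copied from a healthy
    one onto spare nodes whose rack is not already used. If too few
    suitable spares exist, the result has fewer than `copies` entries.
    """
    healthy = [r for r in replicas if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy replica left to copy from")

    used_racks = {r["rack"] for r in healthy}
    source = healthy[0]
    for spare in spares:
        if len(healthy) >= copies:
            break
        if spare["rack"] in used_racks:
            continue  # would violate rack isolation
        healthy.append({
            "node": spare["node"],
            "rack": spare["rack"],
            "healthy": True,
            "data": source["data"],  # copy from an existing healthy replica
        })
        used_racks.add(spare["rack"])
    return healthy
```

In this sketch the caller never sees the repair: reads continue against the healthy replicas while the new copy is created.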

End-to-end data validation

End-to-end data validation addresses silent data corruption during transfer and storage.

Validation process

At each stage of the write and storage process, the system uses a Cyclic Redundancy Check (CRC) to verify data integrity.

After an I/O request is initiated: Data enters the block storage path and an initial checksum is generated.
After memory copy: After data is copied to the compute node's memory, the system compares the checksum to detect errors.
After network transmission: When data reaches the storage node's network layer, the system compares the checksum to detect bit errors during transfer.
Upon receipt by the storage node: After data is written to the storage node's memory, the system compares the checksum.
When data is persisted to disk: After data is written to disk, the system compares the checksum.

If the system finds a checksum mismatch at any stage, it immediately triggers error handling. This validation is hardware-accelerated with negligible impact on read and write performance.
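The staged checks above can be sketched as follows. The source names CRC but not the variant, so CRC-32 from Python's standard `zlib` module is an assumption here. The stage names, the `validate_path` function, and the use of callables to simulate each hop are hypothetical.

```python
import zlib

# Stages at which the checksum is re-verified, following the list above.
STAGES = [
    "memory copy",
    "network transmission",
    "storage node receipt",
    "disk persistence",
]


def validate_path(data: bytes, hops):
    """Carry data through each hop and re-verify its checksum.

    hops: list of (stage_name, move) pairs, where `move` is a callable that
    simulates transferring the bytes and returns what arrived.

    Returns the name of the first stage whose checksum does not match the
    one generated when the I/O request was initiated, or None if all match.
    """
    expected = zlib.crc32(data)  # initial checksum at I/O initiation
    current = data
    for stage, move in hops:
        current = move(current)
        if zlib.crc32(current) != expected:
            return stage
    return None
```

Checking after every hop, rather than only at the end, identifies where the corruption occurred, which is what allows error handling to differ by location.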

Error handling

Error handling differs based on where the error occurs:

Network transport layer: The system retransmits data until validation passes.
Storage media: The system marks the bad block and reads correct data from another replica for recovery.
Memory: The Error-Correcting Code (ECC) mechanism corrects errors, and the system retries the I/O operation.
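Two of these handlers can be sketched in a few lines, again assuming CRC-32 from `zlib` as the checksum. The function names and the `max_attempts` cap are hypothetical. The source says data is retransmitted until validation passes, and the cap exists only so the sketch always terminates. ECC correction happens in memory hardware and is not modeled.

```python
import zlib


def send_with_retransmit(data: bytes, channel, max_attempts=5):
    """Network transport layer: resend until the received bytes validate.

    channel: callable simulating one transmission; returns the bytes that
    arrived. Returns (received_bytes, attempts_used).
    """
    expected = zlib.crc32(data)
    for attempt in range(1, max_attempts + 1):
        received = channel(data)
        if zlib.crc32(received) == expected:
            return received, attempt
    raise IOError("validation still failing after retransmission")


def read_with_fallback(replicas, expected_crc: int):
    """Storage media: skip replicas whose block fails validation.

    replicas: list of (replica_name, stored_bytes) pairs.
    Returns (good_bytes, names_of_replicas_marked_bad).
    """
    bad = []
    for name, stored in replicas:
        if zlib.crc32(stored) == expected_crc:
            return stored, bad
        bad.append(name)  # mark the bad block; try another replica
    raise IOError("no replica passed validation")
```

The two paths differ because a transfer error can be repaired by sending the same bytes again, while a bad block on disk needs a different, intact copy.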

FAQ

Does the triplicate storage mechanism mean I have to pay for three times the storage capacity?

No. Triplicate storage is a built-in data reliability feature. Alibaba Cloud bears the cost of the 3x storage redundancy. You pay only for the cloud disk capacity you purchase. For example, a 40 GiB cloud disk provides 40 GiB of usable and billable capacity.

How can I further protect my data?

- Create an automatic snapshot policy for regular backups. Use snapshots to roll back a cloud disk if an issue occurs.
- Copy snapshots across regions. On failure, create a data disk from a snapshot and attach it to a standby instance.

Can the triplicate storage mechanism prevent all types of data loss?

Triplicate storage protects against hardware failures at the infrastructure layer. Application-level risks such as accidental deletion or virus attacks require snapshots.

How does the triplicate storage mechanism ensure data consistency?

The system uses a multi-replica synchronous write mechanism. A write succeeds only when data is written to all replicas; otherwise it fails. This ensures strong consistency: any subsequent read returns the most recently written data.
