Solution options
Elasticsearch (ES) offers the following remote disaster recovery solutions:
OSS snapshot backup and restoration: Back up index data to Object Storage Service (OSS) for persistent storage. The first snapshot is a full backup, and subsequent snapshots are incremental backups. You can use a cross-cluster OSS repository to restore snapshot data to a target ES instance. For more information, see Back up and restore data by using a cross-cluster OSS repository.
Logstash: You can configure a pipeline to read data from a source cluster, process it, and then write it to a target cluster. This approach is ideal for migrating data between major versions or when data filtering and transformation are required. For more information, see Quick start.
Reindex: The built-in ES Reindex API lets you copy all or a subset of data from one index to another, including across clusters. This is ideal for one-time migrations of small datasets. For more information, see Migrate data by using the Reindex API.
Cross-Cluster Replication (CCR): CCR automatically replicates writable indexes from a leader cluster to one or more follower clusters asynchronously and incrementally. It supports near-real-time synchronization, which makes it suitable for disaster recovery scenarios with strict recovery point objective (RPO) and recovery time objective (RTO) requirements. For more information, see Replicate data across clusters by using CCR.
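To make the API-driven options above more concrete, the following minimal sketch builds the REST requests for a snapshot restore, a reindex from a remote cluster, and a CCR follower index. It only constructs the method, path, and JSON body and does not send anything. The repository name `dr_repo`, the host `http://source-es:9200`, the cluster alias `leader`, and all index names are hypothetical placeholders. The sketch also assumes that the snapshot repository is already registered, that the remote host is allowed by the target cluster's reindex whitelist, and that the leader cluster is already configured as a remote cluster on the follower side.

```python
import json


def snapshot_restore(repository, snapshot, indices):
    """Request that restores selected indexes from an existing snapshot."""
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    return "POST", path, {"indices": indices}


def reindex_from_remote(remote_host, source_index, dest_index, query=None):
    """Request that copies an index (or a filtered subset) from a remote cluster."""
    source = {"remote": {"host": remote_host}, "index": source_index}
    if query is not None:
        # An optional query limits the copy to a subset of the documents.
        source["query"] = query
    return "POST", "/_reindex", {"source": source, "dest": {"index": dest_index}}


def ccr_follow(follower_index, remote_cluster, leader_index):
    """Request that creates a follower index replicating a leader index."""
    path = f"/{follower_index}/_ccr/follow"
    body = {"remote_cluster": remote_cluster, "leader_index": leader_index}
    return "PUT", path, body


if __name__ == "__main__":
    # All names below are illustrative placeholders, not real resources.
    requests_to_send = [
        snapshot_restore("dr_repo", "snapshot_1", "orders"),
        reindex_from_remote(
            "http://source-es:9200",
            "orders",
            "orders_copy",
            query={"term": {"status": "paid"}},
        ),
        ccr_follow("orders_follower", "leader", "orders"),
    ]
    for method, path, body in requests_to_send:
        print(method, path)
        print(json.dumps(body, indent=2))
```

Each function returns a `(method, path, body)` tuple, so the same values can be pasted into the Kibana console or passed to any HTTP client. Logstash is omitted because it is driven by a pipeline configuration file rather than by REST calls.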
Solution comparison
Solution | Use cases | RPO | RTO | Limitations |
OSS snapshot | Periodic backup and recovery of large-scale data (from gigabytes to petabytes). | Hours to days (depending on the snapshot interval). | Several hours (depending on data volume and shard recovery time). | Does not support continuous synchronization. Service may need to be stopped during recovery. |
Logstash | Data migration with low real-time requirements, migration that requires data filtering or transformation, or migration between major versions. | Seconds to minutes (depending on synchronization frequency). | Several hours (depending on data volume and instance performance). | Batch synchronization only; not real-time. Does not support synchronizing delete operations. |
Reindex | One-time index migration for small datasets. | Not applicable (one-time operation). | Minutes to hours (depending on data volume). | Does not support continuous synchronization. Inefficient for large-scale data migrations. |
CCR | Remote disaster recovery, read/write splitting, and geo-proximity access. | Near-zero (seconds). | Seconds to minutes. | Follower indexes are read-only. Requires identical mapping and shard counts. |
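The RPO of the OSS snapshot option depends on the snapshot schedule. As a rough worked example, the sketch below estimates the worst-case data loss for periodic snapshots. It rests on a simplifying assumption that is not stated in the table: a snapshot captures data as of the moment it starts and becomes usable only after it completes, so a failure just before a snapshot finishes forces a restore from the previous one. The function name and the sample values are illustrative.

```python
from datetime import timedelta


def worst_case_snapshot_rpo(interval, snapshot_duration):
    """Estimate the worst-case data loss window for periodic snapshots.

    Assumes the failure happens just before the current snapshot completes,
    so the newest usable snapshot started one full interval earlier.
    """
    return interval + snapshot_duration


# Example: a snapshot every 6 hours that takes 30 minutes to complete.
rpo = worst_case_snapshot_rpo(timedelta(hours=6), timedelta(minutes=30))
print(rpo)  # 6:30:00
```

Under these assumptions, shortening the snapshot interval is the main lever for lowering the RPO, but it cannot reach the seconds-level RPO listed for CCR.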
For remote disaster recovery scenarios with strict RPO and real-time requirements, CCR is the best choice for the following reasons:
CCR synchronizes data within seconds, minimizing data loss.
If the leader cluster fails, you can fail over to a follower cluster to restore service without the delay of a snapshot recovery.
Although the initial deployment cost is higher, CCR is more cost-effective in the long run because it prevents the business losses that data loss would cause.
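The failover mentioned above usually involves converting the follower index into a regular, writable index, because follower indexes are read-only. The following sketch lists the sequence of Elasticsearch REST calls commonly used for this: pause replication, close the index, unfollow the leader, and reopen the index. The index name `orders_follower` is a hypothetical placeholder, and the sketch only returns the steps without sending them. Redirecting client traffic to the follower cluster is a separate step that is not shown.

```python
def promote_follower_steps(follower_index):
    """Return the ordered REST calls that turn a follower into a regular index."""
    return [
        # 1. Stop pulling changes from the leader index.
        ("POST", f"/{follower_index}/_ccr/pause_follow"),
        # 2. The index must be closed before it can be unfollowed.
        ("POST", f"/{follower_index}/_close"),
        # 3. Remove the follower metadata so the index becomes a regular index.
        ("POST", f"/{follower_index}/_ccr/unfollow"),
        # 4. Reopen the index so that it accepts reads and writes.
        ("POST", f"/{follower_index}/_open"),
    ]


if __name__ == "__main__":
    for method, path in promote_follower_steps("orders_follower"):
        print(method, path)
```

The order matters: an index that is still actively following, or that is still open, cannot be unfollowed.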