All Products
Search
Document Center

Elasticsearch:Select a synchronization method

Last Updated:Jun 29, 2023

If you encounter slow queries when you use an ApsaraDB RDS database, you can synchronize data from the database to an Alibaba Cloud Elasticsearch cluster for data queries and analytics. Alibaba Cloud Elasticsearch is a Lucene-based, distributed search and analytics engine. It allows you to store, query, and analyze large amounts of datasets in near real time. You can use Data Transmission Service (DTS), Logstash, DataWorks, or Canal to synchronize data from an ApsaraDB RDS database to an Alibaba Cloud Elasticsearch cluster. This topic describes the use scenarios of each method. You can select a method based on your business requirements.

Method

Description

Use scenario

Usage note

References

Use DTS to synchronize data in real time

DTS synchronizes data by subscribing to binary logs. Data synchronization by using DTS has only a millisecond-level latency, without negative impacts imposed on the source database.

You have a high requirement for real-time performance of data synchronization.

  • DTS uses the read and write resources of the source database and destination cluster when DTS initializes full data. This may increase the loads of the database and cluster.

  • You can customize mappings for indexes. However, you must make sure that the fields defined in the mappings are the same as the fields in the source database.

  • You must purchase a data synchronization instance in the DTS console. For information about how to purchase a data synchronization instance, see Purchase a DTS instance. For information about the pricing of DTS, see Billing overview.

Use DTS to synchronize MySQL data to an Alibaba Cloud Elasticsearch cluster in real time

Use the logstash-input-jdbc plug-in to synchronize data

You can use the logstash-input-jdbc plug-in to query the data in an ApsaraDB RDS database and migrate the data to an Elasticsearch cluster. During data synchronization, the plug-in uses a round-robin method to identify the latest inserted or updated data in the database on a regular basis. Then, the plug-in queries all identified data at a time and migrates the data to an Elasticsearch cluster. Data synchronization by using the logstash-input-jdbc plug-in has poorer real-time performance than data synchronization by using DTS and has a second-level latency.

  • You want to synchronize full data and can accept a latency of a few seconds.

  • You want to query specific data at a time and synchronize the data.

  • Before you can use this method, you must upload an SQL JDBC driver that is compatible with the version of the ApsaraDB RDS database.

  • You must add the IP addresses of the nodes in your Logstash cluster to the IP address whitelist of your ApsaraDB RDS instance.

  • Your Logstash cluster and ApsaraDB RDS instance must reside in the same time zone. This prevents inconsistent timestamps during data synchronization.

  • You must make sure that the value of the _id field in your Elasticsearch cluster is the same as the value of the id field in the ApsaraDB RDS database.

  • When you insert or update data in your ApsaraDB RDS database, you must make sure that the related record contains a field that indicates the insertion or update time.

Use Logstash to synchronize data from ApsaraDB RDS for MySQL to Elasticsearch

Use DataWorks to synchronize offline data

DataWorks is a comprehensive service that provides modules such as Data Integration, DataStudio, and Data Quality. You can use DataWorks to import and store structured data, convert and develop the data, and then synchronize the processed data to Elasticsearch clusters or other data systems.

  • You want to synchronize offline big data. DataWorks can collect offline data at a minimum interval of minutes.

  • You want to customize query statements, perform joint queries on multiple tables, and then synchronize data.

  • You want to synchronize all data in your ApsaraDB RDS database.

  • You must activate the DataWorks service.

  • If a high transmission speed is required or the network environment is complex, you must customize resource groups.

  • You must add the IP address of the resource group that you use to the IP address whitelist of your ApsaraDB RDS instance.

Use DataWorks to synchronize data from a MySQL database to an Alibaba Cloud Elasticsearch cluster

Use Canal to synchronize MySQL data

Canal synchronizes data in real time by subscribing to binary logs. Data synchronization by using Canal has only a millisecond-level latency, without negative impacts imposed on the source database.

You have a high requirement for real-time performance of data synchronization.

  • You must build a Canal environment on an Elastic Compute Service (ECS) instance. However, this increases data synchronization costs.

  • You cannot use Canal 1.1.4 to synchronize data to an Elasticsearch V7.X cluster. We recommend that you use Canal 1.1.5 or methods such as Logstash and DTS to synchronize MySQL data to an Elasticsearch V7.X cluster.

  • You can customize mappings for indexes. However, you must make sure that the fields defined in the mappings are the same as the fields in the source database.

  • You must make sure the availability of Canal. Otherwise, faults may occur or your business may be affected. For example, you must make sure that data synchronization is not interrupted in scenarios such as ECS instance restart and exit of Canal due to exceptions.

Use Canal to synchronize MySQL data to Alibaba Cloud Elasticsearch