Combining Elasticsearch with DBs: Offline Data Synchronization

Released by ELK Geek

Definition

First of all, let us clarify the definitions of real-time, instant, and offline modes.

Instant

In instant mode, a data change is visible to queries as soon as it is made. This relies on an internal transaction isolation mechanism: queries on the affected data are blocked until the update completes. For example, in a single-instance relational database, you can query a data change immediately after it is made.

Real-time

In real-time mode, data is synchronized between heterogeneous or homogeneous data sources within an acceptably short time, usually measured in seconds, milliseconds, or microseconds. You define the time limit based on business requirements and implementation capabilities. For example, if MySQL synchronization between primary and secondary databases cannot be completed in instant mode, it must at least be completed in real-time mode. For the same reason, read/write splitting in business systems is only suitable for data that does not need to be instant.

Offline

Compared with real-time mode, the offline mode does not impose strong timeliness requirements. Instead, it emphasizes data throughput. Synchronization in offline mode usually takes minutes, hours, or days. For example, in big data business intelligence (BI) applications, the data synchronization time is generally required to be T+1.
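As a rough illustration of what T+1 means in practice, the following Python sketch computes the time window that a job running on a given day would synchronize, namely the whole of the previous day. The function name `t_plus_1_window` and the half-open window convention are assumptions made for this example and do not come from any of the tools discussed in this article.

```python
from datetime import date, datetime, time, timedelta
from typing import Tuple


def t_plus_1_window(run_date: date) -> Tuple[datetime, datetime]:
    """Return the half-open [start, end) window covering the day before run_date."""
    end = datetime.combine(run_date, time.min)  # midnight at the start of run_date
    start = end - timedelta(days=1)             # midnight at the start of the previous day
    return start, end


# A job that runs on 2024-03-02 synchronizes the data generated on 2024-03-01.
start, end = t_plus_1_window(date(2024, 3, 2))
print(start, end)
```

A half-open window avoids synchronizing a record twice when it falls exactly on midnight.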

The previous article, "Combining Elasticsearch with DBs: Real-time Data Synchronization", describes a technical solution for real-time data synchronization in a real-world project. The solution is based on the Change Data Capture (CDC) mechanism. It involves a long technology stack and many intermediate steps, so it requires complex coding and high investment, which makes it hard to put into production quickly. In fact, in most business system scenarios, offline synchronization is the most frequently needed capability. To this end, this article describes how to synchronize DB data to Elasticsearch in offline mode.

Background

Offline synchronization from DBs to Elasticsearch is suitable for many business scenarios. Compared with real-time synchronization, offline synchronization better demonstrates the benefits of replacing DBs with Elasticsearch as a query engine.

The following sections describe several business scenarios that involve offline synchronization from databases (DBs) to Elasticsearch.

Queries of Historical Data

In the e-commerce or logistics industry, a large amount of order data is generated every day. Data of outstanding orders is queried in real time at a high frequency, requiring a lot of upstream and downstream data transfer and processing. By contrast, data of closed orders is queried at a low frequency, but the volume of historical orders grows quickly. If a data system treats real-time data and historical data equally in its architecture design, it will soon suffer serious performance problems. For example, updates and queries on real-time data slow down, user experience deteriorates, and even a single query can cause serious congestion in the system because it involves a large volume of historical data.

Conventionally, real-time data is separated from historical data. Real-time data is queried frequently, and one query involves a small volume of data, allowing a fast system response. By contrast, historical data is queried less frequently, and one query involves a large volume of data, resulting in a slow but acceptable response.

Due to the limitations of traditional relational databases, we can use Elasticsearch to handle such a scenario. An Elasticsearch real-time cluster is used for queries of real-time data, while an Elasticsearch historical cluster is used for queries of historical data. The CDC mechanism is used for the synchronization of real-time data. Historical data synchronization requires a dedicated offline data channel.
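The split between a real-time cluster and a historical cluster can be sketched as a small routing function. This is only an illustration: the cluster addresses, the index names, and the `status` and `closed_at` fields are hypothetical, and a real system would choose its own criteria for when an order counts as historical.

```python
from typing import Dict, Tuple

# Hypothetical cluster addresses, used only for illustration.
REALTIME_CLUSTER = "http://es-realtime:9200"
HISTORY_CLUSTER = "http://es-history:9200"


def route_order(order: Dict[str, str]) -> Tuple[str, str]:
    """Return (cluster, index) for an order document.

    Orders that are not closed stay in the real-time cluster. Closed orders
    go to the historical cluster, with one index per closing year.
    """
    if order["status"] != "closed":
        return REALTIME_CLUSTER, "orders-open"
    year = order["closed_at"][:4]  # expects an ISO date such as "2023-11-05"
    return HISTORY_CLUSTER, f"orders-{year}"


print(route_order({"id": "1", "status": "paid"}))
print(route_order({"id": "2", "status": "closed", "closed_at": "2023-11-05"}))
```

Splitting historical data into one index per year is one possible design. It keeps each index bounded in size and lets old years be archived independently.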

Business and Technical Reconstruction

As your business changes and develops, the business system needs to be reconstructed accordingly. This involves both the business model and the technical architecture. When the existing database system can no longer meet business requirements, whether because of staffing changes or technology evolution, you need a more suitable database system.

Even the renowned MongoDB is sometimes applied to scenarios it does not suit, to say nothing of the limitations of relational databases. For example, you can shift a variety of business systems based on MongoDB storage to Elasticsearch for faster speed and lower costs (for more information, see the article: Why You Should Migrate from MongoDB to Elasticsearch). After such a system reconstruction, you need to synchronize all the MongoDB data to Elasticsearch. This is an offline synchronization scenario.

Technical Products

Compared with real-time synchronization, offline synchronization has much lower technical complexity and less demanding scenarios, so there are more technical solutions to choose from. Many excellent specialized tools are available that only need simple configuration. The following sections introduce several popular products that are my personal favorites. I will briefly analyze their features and architectural principles.

Logstash

Logstash is an Elastic product and an essential part of the Elastic Stack.

Logstash was developed with JRuby based on the free open-source Java Virtual Machine (JVM) platform.
The architecture is concise and outstanding, with a clear hierarchy of input -> filter -> output.
Logstash supports pipeline models and dependencies between multiple pipelines.
Logstash has diversified features and supports various data sources and custom editing, including Ruby scripts.
Logstash supports the general Java Database Connectivity (JDBC) protocol. It uses SQL statements to extract data from DBs to Elasticsearch. It also supports cron-style scheduling, which allows periodic synchronization that approaches real time.
Logstash can be integrated with Elastic Stack and allows you to view monitoring data in Kibana.
Logstash runs on a single instance, without the need for cluster communication between multiple instances. It does not support the cluster pattern, making it easy to deploy.
Logstash is a lightweight non-platform tool.
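The input -> filter -> output hierarchy can be pictured as three chained stages. The sketch below is plain Python, not Logstash configuration or Logstash internals. The function names, the `orders` table, and the field names are hypothetical, and an in-memory SQLite database stands in for the JDBC source. The output stage only renders the body of an Elasticsearch bulk request instead of sending it.

```python
import json
import sqlite3
from typing import Dict, Iterable, Iterator


def jdbc_input(conn: sqlite3.Connection, sql: str) -> Iterator[Dict]:
    """Input stage: run a SQL statement and emit one event (dict) per row."""
    cursor = conn.execute(sql)
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


def rename_filter(events: Iterable[Dict], mapping: Dict[str, str]) -> Iterator[Dict]:
    """Filter stage: rename fields, leaving unmapped fields unchanged."""
    for event in events:
        yield {mapping.get(key, key): value for key, value in event.items()}


def bulk_output(events: Iterable[Dict], index: str, id_field: str) -> str:
    """Output stage: render events as an Elasticsearch bulk request body (NDJSON)."""
    lines = []
    for event in events:
        # Each document takes two lines: an action line, then the document source.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(event[id_field])}}))
        lines.append(json.dumps(event))
    # A bulk body must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "closed"), (2, "paid")])

events = jdbc_input(conn, "SELECT id, order_status FROM orders ORDER BY id")
events = rename_filter(events, {"order_status": "status"})
print(bulk_output(events, index="orders", id_field="id"), end="")
```

Because each stage is a generator, rows flow through the pipeline one at a time and the whole result set never has to be held in memory.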

Logstash is more popular in the O&M community than in the development community, largely due to the popularity of the Elastic Stack. In fact, Logstash is one of the simplest and most useful tools for data synchronization.

DataX

DataX is a data synchronization tool provided by Alibaba. It is used for offline data synchronization between databases.

DataX was developed based on Java and adopts a plug-in mechanism.
The architecture is simple, consisting of two conceptual modules: reader -> writer.
DataX supports data synchronization from DBs to Elasticsearch, uses SQL statements, and is limited to offline synchronization.
DataX runs on a single instance, without the need for cluster communication between multiple instances. It does not support the cluster pattern, making it easy to use.
DataX is a lightweight non-platform tool.
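The reader -> writer model, and the reason it achieves high throughput, can be sketched as a loop that pulls records from a reader and hands them to a writer in batches. This is a conceptual illustration in Python, not DataX code or a DataX job definition. The names `batched` and `transfer` are hypothetical, and a list stands in for the Elasticsearch writer.

```python
from typing import Callable, Iterable, Iterator, List


def batched(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group records from the reader into lists of at most `size` items."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def transfer(reader: Iterable[dict], writer: Callable[[List[dict]], None],
             batch_size: int) -> int:
    """Pull every record from the reader and hand it to the writer in batches."""
    total = 0
    for batch in batched(reader, batch_size):
        writer(batch)
        total += len(batch)
    return total


# Stand-ins for a database reader and an Elasticsearch writer.
source_rows = ({"id": i} for i in range(5))
written_batches: List[List[dict]] = []
count = transfer(source_rows, written_batches.append, batch_size=2)
print(count, [len(b) for b in written_batches])
```

Writing in batches rather than row by row reduces the number of round trips to the target system, which is usually what dominates throughput in offline synchronization.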

DataX is not the best tool. However, by virtue of its simplicity, good performance, and large throughput, plus the reputation of Alibaba, DataX is highly supported in the development community and widely used in one-time offline data synchronization projects.

NiFi

NiFi is a top-level Apache project. It is used for data synchronization.

NiFi was developed based on Java and adopts the plug-in mechanism, supporting custom development.
The architecture is well designed, with processors as its core modules.
NiFi has powerful features and supports complex scripting functions, including Java, JavaScript, Python, and Ruby.
NiFi allows you to synchronize data from DBs to Elasticsearch by using SQL statements or the CDC mechanism.
NiFi supports real-time synchronization and offline synchronization. It also supports a variety of processor combinations.
NiFi provides a friendly user interface for visual configuration.
NiFi is a platform product that supports cluster deployment.
NiFi is easy to use for beginners, but difficult to master.

NiFi has a long history. However, in China it is not as popular as Sqoop from the Hadoop ecosystem. That said, recent versions of Cloudera's Distribution including Apache Hadoop (CDH) have integrated it. I personally like NiFi very much due to its platform-based system architecture. Compared with NiFi, DataX and Logstash are only small tools.

Flink

Currently, Flink is the most popular stream processing product in the big data community. It is widely used in the real-time computing field.

Flink is a platform product developed based on Java.
Flink adopts a distributed architecture with a rich set of mechanisms, including checkpointing, failure recovery, and state persistence.
Flink provides a layered programming model, with a hierarchy of DataStream -> DataSet -> Table -> SQL.
Flink supports stream computing and offline computing, allowing you to synchronize data from a JDBC data source to Elasticsearch in offline mode.
Flink runs in cluster mode, and you need to write your own data processing code.
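The idea behind checkpointing and state persistence can be illustrated with a single-process Python sketch: the job saves its progress after each batch, so a restarted job resumes where the previous run stopped. This is a deliberately simplified assumption and is not how Flink implements distributed checkpoints. The names `load_offset`, `save_offset`, and `run_job`, as well as the JSON checkpoint file, are hypothetical.

```python
import json
import os
import tempfile
from typing import List, Optional, Sequence


def load_offset(path: str) -> int:
    """Return the last saved offset, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["offset"]


def save_offset(path: str, offset: int) -> None:
    """Persist the offset by writing a temporary file and renaming it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp_path, path)


def run_job(records: Sequence[int], sink: List[int], path: str,
            batch_size: int = 2, crash_at: Optional[int] = None) -> int:
    """Copy records to the sink in batches, checkpointing after each batch."""
    offset = load_offset(path)
    while offset < len(records):
        if crash_at is not None and offset >= crash_at:
            raise RuntimeError("simulated failure")
        batch = records[offset:offset + batch_size]
        sink.extend(batch)
        offset += len(batch)
        save_offset(path, offset)
    return offset


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "checkpoint.json")
    sink: List[int] = []
    try:
        run_job(list(range(5)), sink, checkpoint, crash_at=2)
    except RuntimeError:
        pass  # the first run stops after one batch
    run_job(list(range(5)), sink, checkpoint)  # the second run resumes from the checkpoint
    print(sink)
```

In this sketch, a failure between writing a batch and saving the offset would cause that batch to be written again after a restart, so the target should tolerate repeated writes, for example by using stable document IDs.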

Flink is positioned in the real-time computing field, but it also provides strong support for offline computing due to the characteristics of its underlying architecture design. Flink provides a friendly programming model for quick development and deployment. Spark is similar to Flink, so I will not describe it in detail here.

ETL Tools

Many professional extract, transform, and load (ETL) products are available. For more information, see the BI materials. I will not provide much detail here.

Kettle
DataStage
Informatica

Balancing Your Technologies

The preceding sections describe several popular data synchronization tools. Some support only offline synchronization, while others support both offline and real-time synchronization. Many more tools are available. Each product has its own limitations and advantages, so do not treat them as interchangeable, and do not use them all together. Balance your tools based on your actual business and technology needs.

Objective Thinking

Each tool is specifically positioned and needs to be evaluated objectively. The requirements for offline data synchronization in business systems are diverse and there is no perfect product that meets the requirements of all scenarios.

For example, DataX is a direct synchronization tool without data processing features. Although its MySQL module supports incremental synchronization, continuous batch-to-batch synchronization has to be triggered manually. This causes some inconvenience, but the data throughput is very good.

Logstash does not support JDBC-based write plug-ins. This may lead some people to conclude that Logstash is not a good tool because it cannot synchronize data between multiple databases. In fact, Logstash is a great tool suitable for many data synchronization scenarios. It provides a scheduled batch-to-batch continuous synchronization mechanism that is unavailable in DataX, which greatly reduces your workload.
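Batch-to-batch incremental synchronization usually relies on remembering how far the previous batch got. The Python sketch below shows this with a watermark on an increasing `id` column, similar in spirit to the tracked value that the Logstash JDBC input exposes to SQL statements as `:sql_last_value`. The `orders` table, the `fetch_increment` function, and the in-memory SQLite source are hypothetical, and the approach assumes that rows are only appended, so updates to existing rows are not captured.

```python
import sqlite3
from typing import List, Tuple


def fetch_increment(conn: sqlite3.Connection, last_value: int,
                    limit: int = 1000) -> Tuple[List[tuple], int]:
    """Return rows with id greater than last_value, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, status FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_value, limit),
    ).fetchall()
    # Keep the old watermark when nothing new was found.
    new_last_value = rows[-1][0] if rows else last_value
    return rows, new_last_value


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "closed"), (2, "closed")])

last_value = 0
batch, last_value = fetch_increment(conn, last_value)  # first run: rows 1 and 2
conn.execute("INSERT INTO orders VALUES (3, 'closed')")
batch, last_value = fetch_increment(conn, last_value)  # second run: only row 3
print(batch, last_value)
```

A scheduler that calls such a function at fixed intervals gives continuous batch-to-batch synchronization without manual triggering, provided that the watermark is stored somewhere that survives restarts.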

Hybrid and Combined Use

A wide range of offline data synchronization scenarios are encountered in business systems, and a single data synchronization tool cannot handle them all. Instead, you need to combine multiple tools into a hybrid system.

For example, NiFi provides a lot of powerful features and can read data from local files. However, the NiFi installation package may occupy several gigabytes of your disk space. If the size of the data to be synchronized is small or even smaller than the NiFi installation package, select a standalone tool such as Logstash.

Technology Integration

Each tool has its own architecture design concept and unique technical characteristics. To master and integrate them well, you have to invest a certain amount of time and effort. Otherwise, you may run into many problems. For example, at the development level, it is easy for beginners to start writing data synchronization code with Flink. However, as a platform product, Flink itself is highly complex, especially in terms of architecture and O&M.

Levels of Technology Integration Capabilities

At the development level, you must understand various application program interface (API) features of the products to skillfully address various needs.
At the architecture level, you should have an in-depth understanding of the fundamental principles of the products, and an objective understanding of their basic capabilities.
At the O&M level, you must master the O&M capabilities of the products and develop countermeasures to protect against various exceptions.

Summary

Lessons Learned

Different people approach data synchronization in different ways. Readily available tools let you respond to demands quickly, while custom-written programs can better meet custom demands. Therefore, select tools based on your preferences and needs.

Elasticsearch has many excellent features, allowing it to handle almost any scenario, including real-time data and historical data scenarios. Elasticsearch saves you a lot of time.

Content Source

This article summarizes our practices when combining DBs and Elasticsearch in business systems. This article is intended for reference. This is an original article. If you wish to reproduce it, please indicate the source.

Special Courses

Currently, Elasticsearch is used by almost all information technology companies in China, from small studios to large companies with thousands of employees. It is widely used and highly regarded in many fields. Elasticsearch is easy to get started with. However, you need to invest a lot of time to become an Elasticsearch expert. Therefore, we have designed special courses to help more individual and enterprise users master Elasticsearch.

The courses start with data synchronization, including offline and real-time data synchronization, methods of importing DB data to Elasticsearch, and methods of handling different types of data. For more information, see Practices for DB and Elasticsearch Data Synchronization. The special courses concentrate more on sample code from case studies than theory. We provide the following course series:

Logstash series
DataX series
NiFi series
Flink series
CDC series

About the Author

Li Meng is an Elastic Stack user and a certified Elasticsearch engineer. Since his first explorations into Elasticsearch in 2012, he has gained in-depth experience in the development, architecture, and operation and maintenance (O&M) of the Elastic Stack and has carried out a variety of large and mid-sized projects. He provides enterprises with Elastic Stack consulting, training, and tuning services. He has years of practical experience and is an expert in various technical fields, such as big data, machine learning, and system architecture.

Declaration: This article is reproduced with authorization from Li Meng, the original author. The author reserves the right to hold users legally liable in the case of unauthorized use.
