Application of RocketMQ in Data Heterogeneous System


The era of data has arrived, and the value of data is becoming increasingly important. In the face of ubiquitous data, in order to utilize it, a data center has emerged.

The data center roughly has six functions:

1. Data collection

2. ETL

3. Data calculation

4. Storage

5. Data analysis

6. Data Display

This article explains the core role of RocketMQ in data collection, ETL, and data calculation in the data center


There are already many data synchronization solutions in the industry, which mainly focus on two aspects

1. Peer to peer synchronization

2. Offline synchronization

The architecture of most synchronization solutions in the industry is similar to the following figure:

As an industrial grade component, I don't know how to do multiple-choice questions. Point to point synchronization is necessary, as is multi to many synchronization. Offline synchronization cannot be omitted, and real-time synchronization is also necessary. Heterogeneous and synchronous data components, with the following component architecture:

Heterogeneous synchronization components based on message middleware and their advantages

Multiple types of sources

From the above figure, it can be observed that there are many types of sources.

Gatherer/sdk: mainly external data, such as customer data, and the source of data collected by the APP

RPC: Mainly valuable data generated by internal business systems

Agent: Collect logs, system and hardware operation information

Data source: Reading data from various stores

Peak shaving and valley filling

The read speed of the source and the write data of the link are both uncontrollable. In general, the efficiency of a source is often several times that of a sink, which may lead to the unavailability of the gather service and cause serious accidents. No one knows when the peak will occur, and such unpredictable events pose a great threat to the stability and high availability of the entire system. So message middleware was introduced as a buffer

Heterogeneous multiple data sources

Customers have two product systems, online and offline, and the recommended behavior of the planned two systems can be shared. It is necessary to simultaneously collect product data from two customer systems and synchronize five storage. A heterogeneous solution based on message middleware can handle problems with great elegance.

Better resource allocation

The first process represents an architecture with multiple sources and sinks, while the second process represents an architecture with multiple sources and sinks. Under heterogeneous architecture design, the number of sources and sinks and whether they run can be flexibly matched. This greatly saves server resources.


Why do we choose RocketMQ in many message oriented middleware? Because RocketMQ's many features help us solve many problems. The specific issues are as follows:

data security

There is a basic principle in the components of data synchronization that data cannot be lost. For scenarios like the SAAS platform, there are many uncertain factors and unpredictable situations, and data recovery is a very troublesome task. If there are differences between synchronized data and customer internal data in real-time scenarios, it may lead to very fatal events. Compared to other messaging middleware, data security is the most important thing. The following features of RocketMQ ensure that data is not lost

The overall architect design of RocketMQ ensures data security. Master slave synchronous replication, broker synchronous drop disk

2. Retrying messages after consumption failure

3. Deadletter queue:

1) No need to maintain additional storage points after message failure

1) Abnormal operations that cause the queue to not exist and other exceptions can be sent to the private message queue

Parallel consumption and synchronous consumption

Data synchronization is divided into three behaviors: add, modify, and delete. Tables (result sets) can be classified as: appending tables and modifying tables

The addition table only has the addition behavior, which is suitable for parallel consumption

There are three behaviors for modifying a table: adding, modifying, and deleting. In the case of high concurrency and multithreading of big data, it is easy for the data center to not execute according to the data behavior of the business, resulting in inconsistency between the data in the data center and the business data. In order to ensure consistency between execution sequence and business operation sequence, RocketMQ's synchronous consumption feature has been selected to ensure that the operation execution sequence remains unchanged

The above figure illustrates the situation of inconsistent data in parallel consumption. In an ideal scenario, the sequence of source is 1234, so the execution sequence of sink is 6785. But the actual execution sequence is 5678. So using RocketMQ's sequential messaging feature ensures data consistency

Sequential messages actually only allow one consumer to receive messages, while other consumers will continue to compete for consumption rights

Classification parallelism

The efficiency of synchronous consumption is much lower than that of parallel consumption, and the write speed of sink is far lower than that of source reading data, often resulting in a large amount of data accumulation, resulting in poor consistency between synchronous data and business data, sometimes unbearable for low-end businesses.

At the first synchronization, it is full synchronization and there are no modification operations. So parallel synchronization is used. Afterwards, it was changed to synchronous consumption. To improve the performance of synchronous consumption. After in-depth analysis and research, it was found that classifying data can improve efficiency. Therefore, RocketMQ based queue and tag implemented parallel classification.

Observation and operation and maintenance capabilities

There are currently a large number of synchronized topics, and creating, deleting, testing, locating, observing, and searching these topics is a challenge. Every development, testing, product, etc. involved in the project must observe and maintain the topics. RocketMQ console can easily help us solve these problems, greatly improving overall development efficiency and progress

Preliminary implementation of data tracking based on message trajectories

The above is a diagram illustrating the flow of data within the system in the recommended business scenario. Such a flow is called a "task" internally, and each node is an "operator". A certain piece of data may not produce the expected result of the "task" due to an uncontrollable factor. In complex tasks and systems with high concurrency and performance, it is necessary to monitor the flow of data, detect anomalies in the flow in a timely manner, and quickly correct data and issues. So we need a capability: data tracking

Through observation, it was found that most of the current operators read data from RocketMQ. Therefore, the first generation of data tracking capability was designed based on RocketMQ's message trajectories


RocketMQ connect is a heterogeneous open source component based on RocketMQ implementation that supports synchronization between multiple data sources. Nowadays, more and more enterprises and companies are using RocketMQ to support business and data platforms, and the data on each platform is in a flow state. Using RocketMQ connect can easily and quickly build a streaming data platform under the existing architecture.

The RocketMQ connect architecture has two distinct characteristics:

1. Decentralized design and dependency free architecture design

2. SPI based pluggable design

Decentralized design

Connect cli sends heterogeneous tasks to any connect runtime, while runtime simply processes the task information and sends it to the broker. All connect runtime within the cluster will receive tasks and store them locally. The runtime starts and runs tasks without directly relying on the broker.

In the overall RocketMQ connect architecture design, there are no other components used, ensuring the overall simplicity and elegance

Design of pluggable array based on SPI








Connector class: The execution object of source

Source record converter: data processing object

Topic: Configuration of file source

Filename: Configuration of file source

From the task information, it can be seen that starting a task requires providing the required source or sink execution class for the task. RocketMQ connect will search for the startup class in the plugin directory.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us