Application of RocketMQ in Data Heterogeneous System
Scenarios
The era of data has arrived, and the value of data is becoming increasingly important. In the face of ubiquitous data, in order to utilize it, a data center has emerged.
The data center roughly has six functions:
1. Data collection
2. ETL
3. Data calculation
4. Storage
5. Data analysis
6. Data Display
This article explains the core role of RocketMQ in data collection, ETL, and data calculation in the data center
choice
There are already many data synchronization solutions in the industry, which mainly focus on two aspects
1. Peer to peer synchronization
2. Offline synchronization
The architecture of most synchronization solutions in the industry is similar to the following figure:
As an industrial grade component, I don't know how to do multiple-choice questions. Point to point synchronization is necessary, as is multi to many synchronization. Offline synchronization cannot be omitted, and real-time synchronization is also necessary. Heterogeneous and synchronous data components, with the following component architecture:
Heterogeneous synchronization components based on message middleware and their advantages
Multiple types of sources
From the above figure, it can be observed that there are many types of sources.
Gatherer/sdk: mainly external data, such as customer data, and the source of data collected by the APP
RPC: Mainly valuable data generated by internal business systems
Agent: Collect logs, system and hardware operation information
Data source: Reading data from various stores
Peak shaving and valley filling
The read speed of the source and the write data of the link are both uncontrollable. In general, the efficiency of a source is often several times that of a sink, which may lead to the unavailability of the gather service and cause serious accidents. No one knows when the peak will occur, and such unpredictable events pose a great threat to the stability and high availability of the entire system. So message middleware was introduced as a buffer
Heterogeneous multiple data sources
Customers have two product systems, online and offline, and the recommended behavior of the planned two systems can be shared. It is necessary to simultaneously collect product data from two customer systems and synchronize five storage. A heterogeneous solution based on message middleware can handle problems with great elegance.
Better resource allocation
The first process represents an architecture with multiple sources and sinks, while the second process represents an architecture with multiple sources and sinks. Under heterogeneous architecture design, the number of sources and sinks and whether they run can be flexibly matched. This greatly saves server resources.
thorough
Why do we choose RocketMQ in many message oriented middleware? Because RocketMQ's many features help us solve many problems. The specific issues are as follows:
data security
There is a basic principle in the components of data synchronization that data cannot be lost. For scenarios like the SAAS platform, there are many uncertain factors and unpredictable situations, and data recovery is a very troublesome task. If there are differences between synchronized data and customer internal data in real-time scenarios, it may lead to very fatal events. Compared to other messaging middleware, data security is the most important thing. The following features of RocketMQ ensure that data is not lost
The overall architect design of RocketMQ ensures data security. Master slave synchronous replication, broker synchronous drop disk
2. Retrying messages after consumption failure
3. Deadletter queue:
1) No need to maintain additional storage points after message failure
1) Abnormal operations that cause the queue to not exist and other exceptions can be sent to the private message queue
Parallel consumption and synchronous consumption
Data synchronization is divided into three behaviors: add, modify, and delete. Tables (result sets) can be classified as: appending tables and modifying tables
The addition table only has the addition behavior, which is suitable for parallel consumption
There are three behaviors for modifying a table: adding, modifying, and deleting. In the case of high concurrency and multithreading of big data, it is easy for the data center to not execute according to the data behavior of the business, resulting in inconsistency between the data in the data center and the business data. In order to ensure consistency between execution sequence and business operation sequence, RocketMQ's synchronous consumption feature has been selected to ensure that the operation execution sequence remains unchanged
The above figure illustrates the situation of inconsistent data in parallel consumption. In an ideal scenario, the sequence of source is 1234, so the execution sequence of sink is 6785. But the actual execution sequence is 5678. So using RocketMQ's sequential messaging feature ensures data consistency
Sequential messages actually only allow one consumer to receive messages, while other consumers will continue to compete for consumption rights
Classification parallelism
The efficiency of synchronous consumption is much lower than that of parallel consumption, and the write speed of sink is far lower than that of source reading data, often resulting in a large amount of data accumulation, resulting in poor consistency between synchronous data and business data, sometimes unbearable for low-end businesses.
At the first synchronization, it is full synchronization and there are no modification operations. So parallel synchronization is used. Afterwards, it was changed to synchronous consumption. To improve the performance of synchronous consumption. After in-depth analysis and research, it was found that classifying data can improve efficiency. Therefore, RocketMQ based queue and tag implemented parallel classification.
Observation and operation and maintenance capabilities
There are currently a large number of synchronized topics, and creating, deleting, testing, locating, observing, and searching these topics is a challenge. Every development, testing, product, etc. involved in the project must observe and maintain the topics. RocketMQ console can easily help us solve these problems, greatly improving overall development efficiency and progress
Preliminary implementation of data tracking based on message trajectories
The above is a diagram illustrating the flow of data within the system in the recommended business scenario. Such a flow is called a "task" internally, and each node is an "operator". A certain piece of data may not produce the expected result of the "task" due to an uncontrollable factor. In complex tasks and systems with high concurrency and performance, it is necessary to monitor the flow of data, detect anomalies in the flow in a timely manner, and quickly correct data and issues. So we need a capability: data tracking
Through observation, it was found that most of the current operators read data from RocketMQ. Therefore, the first generation of data tracking capability was designed based on RocketMQ's message trajectories
RocketMQ-connect
RocketMQ connect is a heterogeneous open source component based on RocketMQ implementation that supports synchronization between multiple data sources. Nowadays, more and more enterprises and companies are using RocketMQ to support business and data platforms, and the data on each platform is in a flow state. Using RocketMQ connect can easily and quickly build a streaming data platform under the existing architecture.
The RocketMQ connect architecture has two distinct characteristics:
1. Decentralized design and dependency free architecture design
2. SPI based pluggable design
Decentralized design
Connect cli sends heterogeneous tasks to any connect runtime, while runtime simply processes the task information and sends it to the broker. All connect runtime within the cluster will receive tasks and store them locally. The runtime starts and runs tasks without directly relying on the broker.
In the overall RocketMQ connect architecture design, there are no other components used, ensuring the overall simplicity and elegance
Design of pluggable array based on SPI
json
{
"connector-class":"org.apache.rocketmq.connect.file.FileSourceConnector",
"topic":"fileTopic",
"filename":"/opt/source-file/source-file.txt",
"source-record-converter":"org.apache.rocketmq.connect.runtime.converter.JsonConverter"
}
Connector class: The execution object of source
Source record converter: data processing object
Topic: Configuration of file source
Filename: Configuration of file source
From the task information, it can be seen that starting a task requires providing the required source or sink execution class for the task. RocketMQ connect will search for the startup class in the plugin directory.
The era of data has arrived, and the value of data is becoming increasingly important. In the face of ubiquitous data, in order to utilize it, a data center has emerged.
The data center roughly has six functions:
1. Data collection
2. ETL
3. Data calculation
4. Storage
5. Data analysis
6. Data Display
This article explains the core role of RocketMQ in data collection, ETL, and data calculation in the data center
choice
There are already many data synchronization solutions in the industry, which mainly focus on two aspects
1. Peer to peer synchronization
2. Offline synchronization
The architecture of most synchronization solutions in the industry is similar to the following figure:
As an industrial grade component, I don't know how to do multiple-choice questions. Point to point synchronization is necessary, as is multi to many synchronization. Offline synchronization cannot be omitted, and real-time synchronization is also necessary. Heterogeneous and synchronous data components, with the following component architecture:
Heterogeneous synchronization components based on message middleware and their advantages
Multiple types of sources
From the above figure, it can be observed that there are many types of sources.
Gatherer/sdk: mainly external data, such as customer data, and the source of data collected by the APP
RPC: Mainly valuable data generated by internal business systems
Agent: Collect logs, system and hardware operation information
Data source: Reading data from various stores
Peak shaving and valley filling
The read speed of the source and the write data of the link are both uncontrollable. In general, the efficiency of a source is often several times that of a sink, which may lead to the unavailability of the gather service and cause serious accidents. No one knows when the peak will occur, and such unpredictable events pose a great threat to the stability and high availability of the entire system. So message middleware was introduced as a buffer
Heterogeneous multiple data sources
Customers have two product systems, online and offline, and the recommended behavior of the planned two systems can be shared. It is necessary to simultaneously collect product data from two customer systems and synchronize five storage. A heterogeneous solution based on message middleware can handle problems with great elegance.
Better resource allocation
The first process represents an architecture with multiple sources and sinks, while the second process represents an architecture with multiple sources and sinks. Under heterogeneous architecture design, the number of sources and sinks and whether they run can be flexibly matched. This greatly saves server resources.
thorough
Why do we choose RocketMQ in many message oriented middleware? Because RocketMQ's many features help us solve many problems. The specific issues are as follows:
data security
There is a basic principle in the components of data synchronization that data cannot be lost. For scenarios like the SAAS platform, there are many uncertain factors and unpredictable situations, and data recovery is a very troublesome task. If there are differences between synchronized data and customer internal data in real-time scenarios, it may lead to very fatal events. Compared to other messaging middleware, data security is the most important thing. The following features of RocketMQ ensure that data is not lost
The overall architect design of RocketMQ ensures data security. Master slave synchronous replication, broker synchronous drop disk
2. Retrying messages after consumption failure
3. Deadletter queue:
1) No need to maintain additional storage points after message failure
1) Abnormal operations that cause the queue to not exist and other exceptions can be sent to the private message queue
Parallel consumption and synchronous consumption
Data synchronization is divided into three behaviors: add, modify, and delete. Tables (result sets) can be classified as: appending tables and modifying tables
The addition table only has the addition behavior, which is suitable for parallel consumption
There are three behaviors for modifying a table: adding, modifying, and deleting. In the case of high concurrency and multithreading of big data, it is easy for the data center to not execute according to the data behavior of the business, resulting in inconsistency between the data in the data center and the business data. In order to ensure consistency between execution sequence and business operation sequence, RocketMQ's synchronous consumption feature has been selected to ensure that the operation execution sequence remains unchanged
The above figure illustrates the situation of inconsistent data in parallel consumption. In an ideal scenario, the sequence of source is 1234, so the execution sequence of sink is 6785. But the actual execution sequence is 5678. So using RocketMQ's sequential messaging feature ensures data consistency
Sequential messages actually only allow one consumer to receive messages, while other consumers will continue to compete for consumption rights
Classification parallelism
The efficiency of synchronous consumption is much lower than that of parallel consumption, and the write speed of sink is far lower than that of source reading data, often resulting in a large amount of data accumulation, resulting in poor consistency between synchronous data and business data, sometimes unbearable for low-end businesses.
At the first synchronization, it is full synchronization and there are no modification operations. So parallel synchronization is used. Afterwards, it was changed to synchronous consumption. To improve the performance of synchronous consumption. After in-depth analysis and research, it was found that classifying data can improve efficiency. Therefore, RocketMQ based queue and tag implemented parallel classification.
Observation and operation and maintenance capabilities
There are currently a large number of synchronized topics, and creating, deleting, testing, locating, observing, and searching these topics is a challenge. Every development, testing, product, etc. involved in the project must observe and maintain the topics. RocketMQ console can easily help us solve these problems, greatly improving overall development efficiency and progress
Preliminary implementation of data tracking based on message trajectories
The above is a diagram illustrating the flow of data within the system in the recommended business scenario. Such a flow is called a "task" internally, and each node is an "operator". A certain piece of data may not produce the expected result of the "task" due to an uncontrollable factor. In complex tasks and systems with high concurrency and performance, it is necessary to monitor the flow of data, detect anomalies in the flow in a timely manner, and quickly correct data and issues. So we need a capability: data tracking
Through observation, it was found that most of the current operators read data from RocketMQ. Therefore, the first generation of data tracking capability was designed based on RocketMQ's message trajectories
RocketMQ-connect
RocketMQ connect is a heterogeneous open source component based on RocketMQ implementation that supports synchronization between multiple data sources. Nowadays, more and more enterprises and companies are using RocketMQ to support business and data platforms, and the data on each platform is in a flow state. Using RocketMQ connect can easily and quickly build a streaming data platform under the existing architecture.
The RocketMQ connect architecture has two distinct characteristics:
1. Decentralized design and dependency free architecture design
2. SPI based pluggable design
Decentralized design
Connect cli sends heterogeneous tasks to any connect runtime, while runtime simply processes the task information and sends it to the broker. All connect runtime within the cluster will receive tasks and store them locally. The runtime starts and runs tasks without directly relying on the broker.
In the overall RocketMQ connect architecture design, there are no other components used, ensuring the overall simplicity and elegance
Design of pluggable array based on SPI
json
{
"connector-class":"org.apache.rocketmq.connect.file.FileSourceConnector",
"topic":"fileTopic",
"filename":"/opt/source-file/source-file.txt",
"source-record-converter":"org.apache.rocketmq.connect.runtime.converter.JsonConverter"
}
Connector class: The execution object of source
Source record converter: data processing object
Topic: Configuration of file source
Filename: Configuration of file source
From the task information, it can be seen that starting a task requires providing the required source or sink execution class for the task. RocketMQ connect will search for the startup class in the plugin directory.
Related Articles
-
A detailed explanation of Hadoop core architecture HDFS
Knowledge Base Team
-
What Does IOT Mean
Knowledge Base Team
-
6 Optional Technologies for Data Storage
Knowledge Base Team
-
What Is Blockchain Technology
Knowledge Base Team
Explore More Special Offers
-
Short Message Service(SMS) & Mail Service
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00