DeltaLake's practical sharing in industrial brain
With the release of Yunqi Industrial Brain 3.0 in 2020, Industrial Brain has experienced many years of development. This article will share with you the excellent practice of using DeltaLake in the construction of industrial data center, mainly including:
(1) Processing of heterogeneous flow messages in different places
(2) Data analysis of stream batch fusion
(3) Transaction processing and algorithm support
1. Processing of heterogeneous flow messages in different places
For industrial enterprises, data sources are often scattered around the world. Group-level users often want to obtain data from the data center
DeltaLake and Structured Streaming work together to accomplish the following two things:
1. Summarize the Kafka real-time data of each plant area and write it to Kafka in the middle station for real-time data application
2. Archive the Kafka data of each plant area into the HDFS of the middle station for offline analysis type data application
There are many big data components that can complete the above tasks, such as Flink and Flume, but the following reasons make us finally choose DeltaLake:
1. Support regular consumption of multiple Kafka topics
Using SubscribePattern, you can use regular rules to consume data from multiple topics at the same time, which is very convenient when there are many topics in a park that need to be consumed
2. Support for HDFS and encapsulation of small file merging
When encountering the scenario of "writing Kafka's data to HDFS in real time", it is also convenient to use DeltaLake. There are two main reasons:
1. Natural support for writing HDFS can eliminate the need to write HDFS Sinker when using Flink, or the trouble caused by additional operation and maintenance of Flume cluster
2. Each streaming warehousing scenario is a trade-off between performance and timeliness for data architects. Whether it is Flink, Flume or SparkStreaming, there will be "rolling write capacity (or number) threshold" and "rolling write time threshold" designs. In the actual implementation process, the two will be weighed according to the different business requirements for data latency and performance. For example, for scenarios with low latency tolerance, you can set the capacity or number threshold to be very small (even 1) to allow new data to be written quickly. However, the side effect of this is the frequent IO of Sinker. For example, many small files are generated in HDFS, which affects the performance of data read and write or DataNode; In scenarios with high latency tolerance, delivery engineers often choose to increase the number threshold and time threshold to bring better IO performance, but sacrifice data latency. This is a general method, but in the actual production process, you will find that the cost of maintaining many different configurations for many stream jobs is still not small.
Using DeltaLake to process can be much easier. You can set the rolling write threshold of all stream jobs to the same (for example, they are small), so that all stream jobs can get better data delay. At the same time, combined with DeltaLake's feature functions, Optimize and Vacuum, configure scheduled tasks to execute periodically, merge or delete small files, to ensure HDFS performance, This can make the whole data development work much simpler and better operation and maintenance.
Reference to the Optimize feature
2. Data analysis of stream batch fusion
In the process of production and manufacturing, the stable operation of machinery and equipment is crucial to the quality of finished products. The most intuitive way to determine whether the equipment is stable is to view the historical trend of some sensors for a long time. During the implementation of actual projects, delivery engineers often use stream operations to process and write a large number of sensor timing data in Kafka into OLAP storage (such as Alibaba Cloud ADB, TSDB or HBase), To support the real-time query requirements of high concurrency and low response time of the upper data analysis application.
But the actual situation is often much more complicated than this. Because the level of informatization and digitalization of industrial enterprises is generally not high, and the degree of automation of production processes in different industries is also uneven, there are many real-time data of equipment that are not accurate in fact. They need to be used after a number of time (minutes or hours), after manual intervention or recalculation.
Therefore, in the actual implementation process, a "rolling coverage" mode is often used to continuously rewrite the data in OLAP storage, and OLAP is divided into "real-time incremental area" and "periodic coverage area", as shown in the following figure:
The above figure is an OLAP storage, and all data is divided into orange and blue parts. The upper data application can query the data of these two areas without distinction. The only difference is that the latest orange data is obtained from Kafka in real time by the stream computing job and written after processing; The blue area is written after the historical data is periodically calculated (adding correction logic), and the real-time data of yesterday or longer ago is revised. In this way, the cycle repeats, while ensuring the timeliness of the data, the historical data is revised and covered to ensure the correctness of the data.
In the past, a Lambda architecture of flow+batch is often used, and two different computing engines are used to process flow and batch, as shown in the following figure:
The disadvantages of Lambda architecture can also be seen from this. It is a laborious task to maintain two codes on two different platforms, and also to ensure that their computing logic is completely consistent. After the introduction of DeltaLake, things become relatively simple. Spark's natural design of streaming and batch integration has well solved the problem of code reuse and cross-platform logic unification. In combination with the characteristics of DeltaLake (such as ACID, OPTIMIZE, etc.), This work can be done more elegantly, as shown in the following figure:
It is also worth mentioning that the integration of streaming and batch is not a unique feature of Spark, but Alibaba Cloud EMR encapsulates SQL on top of SparkSQL and Spark Streaming, enabling business personnel to use the syntax similar to Flink SQL for job development at a lower threshold, making the code reuse and operation and maintenance work in streaming and batch scenarios easier, which is of great significance for improving project delivery efficiency, Click here for specific reference.
3. Transaction processing and algorithm support
The traditional data warehouse rarely introduces transactions in the modeling process. Because the data warehouse needs to reflect the changes of data, it often uses methods such as slowly changing dimensions to record the changes of data status, instead of using ACID to keep the data warehouse consistent with the business system.
However, during the implementation of the industrial data center, transactions have their own unique use scenarios, such as production scheduling, which is a major issue of concern to every industrial enterprise. Production scheduling is often carried out at the group level, and the production demand of the current period is reasonably decomposed and arranged according to customer orders, material inventory and factory capacity, so as to achieve reasonable capacity allocation; Scheduling is often more microscopic. At the factory level, production plans are dynamically adjusted in real time according to work orders, materials and actual production conditions to maximize resource utilization. They are all planning problems that require many data fusion solutions, as shown in the following figure:
The raw data required by the production scheduling algorithm often comes from multiple business systems, such as ERP providing order and plan data, WMS providing material data, and MES providing work order and process data. These data must be integrated (physically and logically) to be used as the effective input of the production scheduling algorithm. Therefore, in the implementation process, a unified storage is often required to store the data from each system. At the same time, the production scheduling algorithm also has certain requirements for the effectiveness of data. It needs to input data that can be consistent with all business systems as much as possible, so as to truly reflect the production situation at that time, so as to better schedule.
In the past, we dealt with this scenario as follows:
1) Use the CDC capability of each business system, or write separate programs to poll and obtain data changes in quasi-real time
2) Write to the relational database, and process the logic of data merge in this process, so that the data in the relational database and the business system data are consistent in quasi-real-time
3) When the production scheduling engine is triggered, it pulls data from the RDB for calculation
This architecture has some obvious problems, mainly including:
1) Replace big data storage with RDB, and query data into memory during calculation, which is very difficult for large data volume
2) If Hive engine is used to replace the middle RDB, although ACID is supported in Hive3. X, real-time performance and MapReduce programming framework's support for algorithm (solver) cannot meet the engineering requirements
At present, we are trying to introduce DeltaLake and optimize the architecture by combining the features of Spark
The optimized architecture has the following advantages:
1) Use HDFS+Spark instead of RDB as the middle storage to solve the storage problem when the data volume is large
2) Use Spark Streaming+DeltaLake to dock the original data, use the ACID feature of DeltaLake to process the merge logic when the data enters the middle storage, and simultaneously merge+optimize the data when streaming in to ensure the read and write performance
3) The production scheduling engine no longer transfers the query data from the middle desk to the memory calculation, but encapsulates the algorithm tasks into Spark jobs and sends them to the computing platform to complete the calculation. In this way, using the good support of Spark ML programming framework for algorithms and Python, as well as Spark's own distributed computing ability, the planning algorithm that requires multiple rounds of iteration can be distributed
4) Use DeltaLake's Time Travel feature to manage or rollback the data version, which is very beneficial for the debugging and evaluation of the algorithm model
4. Summary
1) The core capability of DeltaLake, ACID, is very helpful for applications with high requirements for real-time and accuracy of data, especially algorithm applications, which can make more effective use of Spark's natural support for ML
2) Combining DeltaLake's Optimize+Vacuum and Streaming's streaming warehousing capability, it will have better compatibility when docking upstream Kafka data in large quantities, and can effectively reduce operation and maintenance costs
3) Streaming SQL development stream jobs encapsulated by Alibaba Cloud EMR team can effectively reduce the development threshold and cost during the implementation of large-scale data center projects
At present, the application of DeltaLake in Industrial Brain is still in the experimental stage. For example, multiple scenarios such as streaming warehousing, production scheduling engine, and streaming batch fusion are being applied in multiple projects of Industrial Brain. At the same time, these scenarios are gradually becoming standard products of Industrial Brain. Later, combined with the visual editing and copying capabilities of Industrial Brain 3.0's data+algorithm scenes, they can be quickly copied to discrete manufacturing, automobile In the scenario of steel and other industries, AI capabilities are used to benefit Chinese industry.
(1) Processing of heterogeneous flow messages in different places
(2) Data analysis of stream batch fusion
(3) Transaction processing and algorithm support
1. Processing of heterogeneous flow messages in different places
For industrial enterprises, data sources are often scattered around the world. Group-level users often want to obtain data from the data center
DeltaLake and Structured Streaming work together to accomplish the following two things:
1. Summarize the Kafka real-time data of each plant area and write it to Kafka in the middle station for real-time data application
2. Archive the Kafka data of each plant area into the HDFS of the middle station for offline analysis type data application
There are many big data components that can complete the above tasks, such as Flink and Flume, but the following reasons make us finally choose DeltaLake:
1. Support regular consumption of multiple Kafka topics
Using SubscribePattern, you can use regular rules to consume data from multiple topics at the same time, which is very convenient when there are many topics in a park that need to be consumed
2. Support for HDFS and encapsulation of small file merging
When encountering the scenario of "writing Kafka's data to HDFS in real time", it is also convenient to use DeltaLake. There are two main reasons:
1. Natural support for writing HDFS can eliminate the need to write HDFS Sinker when using Flink, or the trouble caused by additional operation and maintenance of Flume cluster
2. Each streaming warehousing scenario is a trade-off between performance and timeliness for data architects. Whether it is Flink, Flume or SparkStreaming, there will be "rolling write capacity (or number) threshold" and "rolling write time threshold" designs. In the actual implementation process, the two will be weighed according to the different business requirements for data latency and performance. For example, for scenarios with low latency tolerance, you can set the capacity or number threshold to be very small (even 1) to allow new data to be written quickly. However, the side effect of this is the frequent IO of Sinker. For example, many small files are generated in HDFS, which affects the performance of data read and write or DataNode; In scenarios with high latency tolerance, delivery engineers often choose to increase the number threshold and time threshold to bring better IO performance, but sacrifice data latency. This is a general method, but in the actual production process, you will find that the cost of maintaining many different configurations for many stream jobs is still not small.
Using DeltaLake to process can be much easier. You can set the rolling write threshold of all stream jobs to the same (for example, they are small), so that all stream jobs can get better data delay. At the same time, combined with DeltaLake's feature functions, Optimize and Vacuum, configure scheduled tasks to execute periodically, merge or delete small files, to ensure HDFS performance, This can make the whole data development work much simpler and better operation and maintenance.
Reference to the Optimize feature
2. Data analysis of stream batch fusion
In the process of production and manufacturing, the stable operation of machinery and equipment is crucial to the quality of finished products. The most intuitive way to determine whether the equipment is stable is to view the historical trend of some sensors for a long time. During the implementation of actual projects, delivery engineers often use stream operations to process and write a large number of sensor timing data in Kafka into OLAP storage (such as Alibaba Cloud ADB, TSDB or HBase), To support the real-time query requirements of high concurrency and low response time of the upper data analysis application.
But the actual situation is often much more complicated than this. Because the level of informatization and digitalization of industrial enterprises is generally not high, and the degree of automation of production processes in different industries is also uneven, there are many real-time data of equipment that are not accurate in fact. They need to be used after a number of time (minutes or hours), after manual intervention or recalculation.
Therefore, in the actual implementation process, a "rolling coverage" mode is often used to continuously rewrite the data in OLAP storage, and OLAP is divided into "real-time incremental area" and "periodic coverage area", as shown in the following figure:
The above figure is an OLAP storage, and all data is divided into orange and blue parts. The upper data application can query the data of these two areas without distinction. The only difference is that the latest orange data is obtained from Kafka in real time by the stream computing job and written after processing; The blue area is written after the historical data is periodically calculated (adding correction logic), and the real-time data of yesterday or longer ago is revised. In this way, the cycle repeats, while ensuring the timeliness of the data, the historical data is revised and covered to ensure the correctness of the data.
In the past, a Lambda architecture of flow+batch is often used, and two different computing engines are used to process flow and batch, as shown in the following figure:
The disadvantages of Lambda architecture can also be seen from this. It is a laborious task to maintain two codes on two different platforms, and also to ensure that their computing logic is completely consistent. After the introduction of DeltaLake, things become relatively simple. Spark's natural design of streaming and batch integration has well solved the problem of code reuse and cross-platform logic unification. In combination with the characteristics of DeltaLake (such as ACID, OPTIMIZE, etc.), This work can be done more elegantly, as shown in the following figure:
It is also worth mentioning that the integration of streaming and batch is not a unique feature of Spark, but Alibaba Cloud EMR encapsulates SQL on top of SparkSQL and Spark Streaming, enabling business personnel to use the syntax similar to Flink SQL for job development at a lower threshold, making the code reuse and operation and maintenance work in streaming and batch scenarios easier, which is of great significance for improving project delivery efficiency, Click here for specific reference.
3. Transaction processing and algorithm support
The traditional data warehouse rarely introduces transactions in the modeling process. Because the data warehouse needs to reflect the changes of data, it often uses methods such as slowly changing dimensions to record the changes of data status, instead of using ACID to keep the data warehouse consistent with the business system.
However, during the implementation of the industrial data center, transactions have their own unique use scenarios, such as production scheduling, which is a major issue of concern to every industrial enterprise. Production scheduling is often carried out at the group level, and the production demand of the current period is reasonably decomposed and arranged according to customer orders, material inventory and factory capacity, so as to achieve reasonable capacity allocation; Scheduling is often more microscopic. At the factory level, production plans are dynamically adjusted in real time according to work orders, materials and actual production conditions to maximize resource utilization. They are all planning problems that require many data fusion solutions, as shown in the following figure:
The raw data required by the production scheduling algorithm often comes from multiple business systems, such as ERP providing order and plan data, WMS providing material data, and MES providing work order and process data. These data must be integrated (physically and logically) to be used as the effective input of the production scheduling algorithm. Therefore, in the implementation process, a unified storage is often required to store the data from each system. At the same time, the production scheduling algorithm also has certain requirements for the effectiveness of data. It needs to input data that can be consistent with all business systems as much as possible, so as to truly reflect the production situation at that time, so as to better schedule.
In the past, we dealt with this scenario as follows:
1) Use the CDC capability of each business system, or write separate programs to poll and obtain data changes in quasi-real time
2) Write to the relational database, and process the logic of data merge in this process, so that the data in the relational database and the business system data are consistent in quasi-real-time
3) When the production scheduling engine is triggered, it pulls data from the RDB for calculation
This architecture has some obvious problems, mainly including:
1) Replace big data storage with RDB, and query data into memory during calculation, which is very difficult for large data volume
2) If Hive engine is used to replace the middle RDB, although ACID is supported in Hive3. X, real-time performance and MapReduce programming framework's support for algorithm (solver) cannot meet the engineering requirements
At present, we are trying to introduce DeltaLake and optimize the architecture by combining the features of Spark
The optimized architecture has the following advantages:
1) Use HDFS+Spark instead of RDB as the middle storage to solve the storage problem when the data volume is large
2) Use Spark Streaming+DeltaLake to dock the original data, use the ACID feature of DeltaLake to process the merge logic when the data enters the middle storage, and simultaneously merge+optimize the data when streaming in to ensure the read and write performance
3) The production scheduling engine no longer transfers the query data from the middle desk to the memory calculation, but encapsulates the algorithm tasks into Spark jobs and sends them to the computing platform to complete the calculation. In this way, using the good support of Spark ML programming framework for algorithms and Python, as well as Spark's own distributed computing ability, the planning algorithm that requires multiple rounds of iteration can be distributed
4) Use DeltaLake's Time Travel feature to manage or rollback the data version, which is very beneficial for the debugging and evaluation of the algorithm model
4. Summary
1) The core capability of DeltaLake, ACID, is very helpful for applications with high requirements for real-time and accuracy of data, especially algorithm applications, which can make more effective use of Spark's natural support for ML
2) Combining DeltaLake's Optimize+Vacuum and Streaming's streaming warehousing capability, it will have better compatibility when docking upstream Kafka data in large quantities, and can effectively reduce operation and maintenance costs
3) Streaming SQL development stream jobs encapsulated by Alibaba Cloud EMR team can effectively reduce the development threshold and cost during the implementation of large-scale data center projects
At present, the application of DeltaLake in Industrial Brain is still in the experimental stage. For example, multiple scenarios such as streaming warehousing, production scheduling engine, and streaming batch fusion are being applied in multiple projects of Industrial Brain. At the same time, these scenarios are gradually becoming standard products of Industrial Brain. Later, combined with the visual editing and copying capabilities of Industrial Brain 3.0's data+algorithm scenes, they can be quickly copied to discrete manufacturing, automobile In the scenario of steel and other industries, AI capabilities are used to benefit Chinese industry.
Related Articles
-
A detailed explanation of Hadoop core architecture HDFS
Knowledge Base Team
-
What Does IOT Mean
Knowledge Base Team
-
6 Optional Technologies for Data Storage
Knowledge Base Team
-
What Is Blockchain Technology
Knowledge Base Team
Explore More Special Offers
-
Short Message Service(SMS) & Mail Service
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00