Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统. All rights reserved to the original author.
In the previous articles, we mentioned that Spark emerged because MR was too slow, and that it significantly improved performance. However, Spark is still not fast enough. The MPP architecture, which developed from traditional relational databases, meets our requirements for high-performance queries. Combined with HDFS and concepts such as the virtual segment, it also solves the scalability problem to a certain extent.
However, neither batch processing nor MPP may be fast enough in some complex scenarios, for example, when the query conditions are complex, many tables are joined, and the amount of computation is particularly large.
Multi-Dimensional OLAP (MOLAP) is one such scenario. As a type of OLAP, MOLAP usually involves many complex dimensions. These dimensions may be combined arbitrarily, resulting in an explosion of computation and data.
This scenario can also be optimized within the MPP architecture, but doing so is difficult and costly. Apache Kylin offers a different solution.
The idea behind Apache Kylin is not groundbreaking. Put simply, it trades space for time. Many fields solve problems with this idea (hash tables, for example), but applying it to MOLAP is what makes Kylin distinctive.
This means all statistics are computed in advance and stored in an online storage engine. When a query request arrives, the corresponding result is looked up directly, without performing complex computation on the spot.
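The precompute-then-look-up idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Kylin's implementation: the fact table, the dimension names, and the helpers build_cube and query are all made up for this example, and the only measure is SUM(sales).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (city, category, year, sales).
ROWS = [
    ("Beijing", "book", 2020, 10),
    ("Beijing", "food", 2020, 5),
    ("Hangzhou", "book", 2020, 7),
    ("Hangzhou", "book", 2021, 3),
]
DIMENSIONS = ("city", "category", "year")


def build_cube(rows, dimensions):
    """Precompute SUM(sales) for every combination of dimensions."""
    cube = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            names = tuple(dimensions[i] for i in dims)
            for row in rows:
                values = tuple(row[i] for i in dims)
                cube[(names, values)] += row[-1]
    return dict(cube)


def query(cube, **filters):
    """Answer a query with one dictionary lookup, without scanning ROWS."""
    names = tuple(d for d in DIMENSIONS if d in filters)
    values = tuple(filters[d] for d in names)
    return cube.get((names, values), 0)


CUBE = build_cube(ROWS, DIMENSIONS)
print(query(CUBE, city="Beijing"))              # 15
print(query(CUBE, category="book", year=2020))  # 17
print(query(CUBE))                              # 25 (grand total)
```

All the heavy work happens in build_cube, which corresponds to the offline build. The query itself costs one lookup, which is the whole point of trading space for time.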
The overall architecture is not complex:
Let's simplify it. It is mainly divided into three parts:
Build tasks usually run periodically, so a task scheduling system (such as Oozie, Azkaban, or Airflow) is also needed around Kylin, although it is not shown in this architecture diagram.
The core concept in Kylin or MOLAP, as mentioned above, is the Cube. As the name implies, a cube is a three-dimensional structure. If there are more dimensions, it will become a multi-dimensional cube.
Each dimension of a cube corresponds to a dimension in OLAP. Different dimensions can be included or left out, which gives different combinations. Each combination is called a cuboid, and all cuboids together form a cube.
MOLAP is computationally intensive because there are so many cuboid combinations. A 20-dimension cube has 2^20 cuboids, which is more than one million. Once the cardinality of each dimension is taken into account, the amount of computation is daunting.
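To get a feel for these numbers, here is a small back-of-the-envelope calculation in Python. The helper names and the cardinality of 10 per dimension are assumptions made for illustration. The upper bound on cells uses the identity that summing the product of cardinalities over every subset of dimensions equals the product of (cardinality + 1), and it assumes every value combination actually occurs in the data.

```python
from math import prod


def cuboid_count(num_dimensions):
    """Each dimension is either in a cuboid or not, so there are 2^n cuboids."""
    return 2 ** num_dimensions


def max_cells(cardinalities):
    """Upper bound on aggregated rows across all cuboids.

    A cuboid over a subset S of dimensions has at most prod(c_i for i in S)
    rows; summing over all subsets gives prod(c_i + 1).
    """
    return prod(c + 1 for c in cardinalities)


print(cuboid_count(20))       # 1048576
print(max_cells([2, 3]))      # 12 = 1 + 2 + 3 + 6
print(max_cells([10] * 20))   # 11 ** 20, roughly 6.7e20
```

Real data is far sparser than this bound, but the growth rate explains why pruning matters so much.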
Even though Kylin uses precomputation to reduce query latency, precomputation does not reduce the total amount of computation. It only moves the work ahead of query time.
Dimension pruning is the most intuitive and effective way to reduce computation, and it is where Kylin focuses its optimization.
Pruning can only be done from a business perspective. Otherwise, combinations that queries actually need may be pruned by mistake. Drawing on past business experience, Kylin abstracts the following common methods to reduce the number of cuboids:
Aggregation Group supports the following rules:
The derived dimension will be easier to understand through the following example:
As shown in the preceding tables, when no processing is performed, the dimensions have the following combinations:
But both A and B can be determined by X. If we set A to derived, the dimension combination becomes:
The number of combinations is reduced from 6 to 3. Storage and computing overhead are also correspondingly reduced.
However, queries on the derived dimension A still have to be supported, so a conversion is required at query time. First, find the results for all X, map each X to A according to the dimension table, and then aggregate by A.
Since the precomputed results do not contain A, this conversion and aggregation are done at query time, which has some impact on response time. Compared with the savings in precomputation resources, this is usually acceptable, although strictly speaking you need to decide whether to set a derived dimension according to the business scenario.
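The query-time conversion can be sketched as follows. The dimension table, the precomputed values, and the function name aggregate_by_derived are hypothetical. The sketch also assumes an additive measure such as SUM, so that re-aggregating the precomputed rows gives the correct answer.

```python
from collections import defaultdict

# Hypothetical dimension table: the key X determines the attribute A.
DIM_TABLE = {"x1": "a1", "x2": "a1", "x3": "a2"}

# Precomputed cuboid grouped by X only. A is derived, so it is not stored.
PRECOMPUTED_BY_X = {"x1": 100, "x2": 50, "x3": 30}


def aggregate_by_derived(precomputed, mapping):
    """Map each X to its A and re-aggregate at query time."""
    result = defaultdict(int)
    for x, measure in precomputed.items():
        result[mapping[x]] += measure
    return dict(result)


print(aggregate_by_derived(PRECOMPUTED_BY_X, DIM_TABLE))
# {'a1': 150, 'a2': 30}
```

The extra loop is the on-site cost mentioned above. It grows with the number of distinct X values, which is why the trade-off depends on the business scenario.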
In addition to cuboid pruning, Kylin provides other ways to reduce computation. If a scenario does not need exact deduplication, you can use count distinct based on HyperLogLog for approximate deduplication. If exact deduplication is required, you can use count distinct based on a bitmap.
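The difference between the two approaches can be illustrated with a toy version of each. This is a minimal sketch and not Kylin's implementation: the class names, the SHA-1 based hash, and the precision of 12 bits are choices made for this example. Both structures can be merged across partitions, which is what makes them suitable for precomputed results.

```python
import hashlib
import math


def hash64(value):
    """Stable 64-bit hash (truncated SHA-1, an arbitrary choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    """Approximate distinct count using 2^precision small registers."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, value):
        h = hash64(value)
        index = h >> (64 - self.p)              # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


class ExactBitmap:
    """Exact distinct count over non-negative integer IDs, one bit per ID."""

    def __init__(self):
        self.bits = 0

    def add(self, item_id):
        self.bits |= 1 << item_id

    def merge(self, other):
        self.bits |= other.bits

    def count(self):
        return bin(self.bits).count("1")


# Two hypothetical partitions with overlapping user IDs (10,000 distinct in total).
day1, day2 = range(0, 6000), range(4000, 10000)
hll1, hll2, bm1, bm2 = HyperLogLog(), HyperLogLog(), ExactBitmap(), ExactBitmap()
for uid in day1:
    hll1.add(uid)
    bm1.add(uid)
for uid in day2:
    hll2.add(uid)
    bm2.add(uid)
hll1.merge(hll2)
bm1.merge(bm2)
print("exact:", bm1.count())
print("approximate:", round(hll1.estimate()))
```

The bitmap is exact but its size grows with the range of IDs, while the HyperLogLog sketch keeps a fixed number of registers and accepts a small error in return.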
How the computed results are stored is another point that needs optimization.
Kylin stores the results in HBase by default. Considering the data structure and querying methods, HBase is indeed a good choice.
However, Kylin defines the data of each partition as a segment, and each segment corresponds to an HBase table. This puts a lot of pressure on HBase.
The number of segments may grow rapidly, which leads to a rapid increase in the number of HBase tables and a sharp increase in metadata management pressure. These factors may place a heavy burden on the stability and performance of the cluster.
There are two ideas to solve this problem.
One idea is to merge segments to reduce the number of tables.
After merging, the problem is alleviated, but once you need to rebuild some historical data, you can only rebuild the entire segment. You may find that only one day's data is abnormal, yet have to rebuild the data for a whole year.
This is another scenario that requires a trade-off. You can consider using a time window and delaying the merge for some time, so that large-scale recalculation is avoided as much as possible.
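The trade-off can be made concrete with a small simulation. The segment representation (a pair of first and last day), the function names, and the 7-day lag are assumptions for illustration only. The sketch assumes segments are sorted and contiguous.

```python
from datetime import date, timedelta


def daily_segments(start, days):
    """One segment per day, represented as (first_day, last_day)."""
    return [(start + timedelta(d), start + timedelta(d)) for d in range(days)]


def merge_older_than(segments, today, lag_days):
    """Merge segments that ended more than lag_days ago; keep recent ones as is."""
    cutoff = today - timedelta(days=lag_days)
    old = [s for s in segments if s[1] < cutoff]
    recent = [s for s in segments if s[1] >= cutoff]
    merged = [(old[0][0], old[-1][1])] if old else []
    return merged + recent


def rebuild_cost(segments, bad_day):
    """Days of data to recompute when bad_day must be fixed."""
    for first, last in segments:
        if first <= bad_day <= last:
            return (last - first).days + 1
    return 0


segments = daily_segments(date(2021, 1, 1), 365)
today = date(2022, 1, 1)

merged_all = merge_older_than(segments, today, lag_days=0)
merged_lag = merge_older_than(segments, today, lag_days=7)

print(len(segments), len(merged_all), len(merged_lag))   # 365 1 8
print(rebuild_cost(merged_all, date(2021, 12, 30)))      # 365
print(rebuild_cost(merged_lag, date(2021, 12, 30)))      # 1
print(rebuild_cost(merged_lag, date(2021, 6, 1)))        # 358
```

With the lag, a correction to recent data touches only one small segment, while the number of segments (and therefore tables) stays low. A correction to old data is still expensive, which is why the window should match how late corrections usually arrive.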
Another more thorough idea is to replace HBase.
The community version considered this idea early on, and some companies in the industry have their own implementations (such as Kylin on Druid).
However, the commercial version of Kylin, Kyligence Enterprise, eventually adopted a solution based on a resident Spark session plus Parquet. The open-source version is following suit, and this approach is likely to become the mainstream solution.
In the past, it was generally believed that result data belongs in a database. However, with the extensive optimizations in Spark and Parquet, you can get sufficient performance, at least for the MOLAP scenario. This is not surprising. A database is also built on a custom file format, combined with a large number of optimizations to achieve high performance.
After solving the two major problems of dimensional pruning and storage engine replacement, Kylin has earned a place in the OLAP field by using precomputation.
I think this is also the most important thing when we are designing our architecture. In many cases, innovation does not need to be groundbreaking. A little change in idea may bring unexpected gains.
In the last ten articles, from MR to Spark and from MPP to Kylin, we have focused on batch processing and solved one problem after another. However, this does not mean that a framework mentioned later is better than the earlier ones and can replace them.
Most of the time, there are no silver bullets, even if many frameworks aspire to solve every problem. More often than not, we need to choose the most suitable framework for a specific application scenario, while other frameworks may suit other scenarios better. Indeed, in large companies, these frameworks often coexist.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of the distributed system in a storytelling way. Stay tuned for the next one!