[Share] Challenges for OLAP in the Big Data Era
Posted: Oct 10, 2016 14:41
Value of data
Before discussing specific technologies, let's first ask why we need OLAP at all. What value does it bring to our department or company, and is it irreplaceable? Is that value cashable directly or indirectly? If you cannot answer these questions, that is a serious problem: it suggests the data quality is unacceptable, and massive data of poor quality is worthless.
Assume we already have a well-performing OLTP system. It works quite well as long as it is not burdened with heavy data processing, but as the business grows, analytical workloads will inevitably be separated from OLTP, so that OLAP and OLTP end up running independently.
Since OLAP stores a large amount of historical data, storage becomes the first and most important issue. There are various solutions, including HDFS, NoSQL, and distributed RDBMS, and you may choose among them depending on your circumstances. The following sections cover the specific aspects to consider.
Data synchronization and ETL
How do we move data from OLTP to OLAP? The synchronization mechanism needs to satisfy data consistency, zero data loss, real-time delivery, and other requirements.
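A minimal sketch of one common approach, watermark-based incremental synchronization, can make these requirements concrete. Everything here (the change log, the table names, the in-memory stores) is illustrative, not a specific product's API:

```python
from datetime import datetime

# Hypothetical OLTP change log: (timestamp, primary key, row data).
oltp_changes = [
    (datetime(2016, 10, 1, 0, 0), 1, {"id": 1, "amount": 100}),
    (datetime(2016, 10, 1, 0, 5), 2, {"id": 2, "amount": 250}),
    (datetime(2016, 10, 1, 0, 9), 1, {"id": 1, "amount": 120}),  # update to row 1
]

olap_store = {}           # target OLAP table, keyed by primary key
watermark = datetime.min  # last timestamp already synchronized

def sync_batch(changes, store, since):
    """Apply changes newer than `since`. Idempotent upserts keep the sync
    replay-safe (consistency, no data loss) if a batch is retried after a
    failure; the returned watermark marks progress for the next batch."""
    new_watermark = since
    for ts, pk, row in changes:
        if ts > since:
            store[pk] = row  # upsert: the latest version of the row wins
            new_watermark = max(new_watermark, ts)
    return new_watermark

watermark = sync_batch(oltp_changes, olap_store, watermark)
print(olap_store[1]["amount"])  # 120 -- latest version of row 1
```

Real pipelines usually read the change stream from a binlog or message queue rather than polling, but the watermark-plus-idempotent-upsert pattern is the same.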
How do we quickly locate the required records among massive historical data? If the data size exceeds 1 TB, a full-text search engine such as Solr or Elasticsearch becomes necessary.
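The reason engines like Solr and Elasticsearch can locate records quickly is the inverted index: a map from each term to the records containing it, so a query intersects small posting sets instead of scanning everything. A toy sketch (the records and terms are made up):

```python
from collections import defaultdict

# Toy corpus standing in for mass historical records.
records = {
    101: "order shipped to beijing warehouse",
    102: "payment failed for order in shanghai",
    103: "order delivered to beijing customer",
}

# Build an inverted index: term -> set of record ids containing it.
index = defaultdict(set)
for rid, text in records.items():
    for term in text.split():
        index[term].add(rid)

def search(*terms):
    """Return ids of records containing all the terms (AND semantics)."""
    result = None
    for term in terms:
        ids = index.get(term, set())
        result = ids if result is None else result & ids
    return sorted(result or [])

print(search("order", "beijing"))  # [101, 103]
```

Lookup cost now depends on the size of the posting lists, not on the total data volume, which is what makes needle-in-a-haystack queries over terabytes feasible.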
No employer would happily accept that costly mass data is used only for simple detailed queries; they would prefer the historical data to support more complex analysis. In that case a high-performance distributed computing engine such as Spark, Presto, or Impala is required.
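What these engines share is the partial-aggregation pattern: each node aggregates its own partition locally, and only the small partial results are merged. A minimal simulation with plain lists standing in for data blocks on different nodes (not any engine's actual API):

```python
from collections import Counter
from functools import reduce

# A fact table sharded into partitions, as a distributed engine would
# spread it across nodes (lists here stand in for HDFS blocks).
partitions = [
    [("beijing", 10), ("shanghai", 5)],
    [("beijing", 7), ("shenzhen", 3)],
    [("shanghai", 8)],
]

def map_partition(rows):
    """Per-node partial aggregation -- the 'map'/'combine' side."""
    acc = Counter()
    for city, amount in rows:
        acc[city] += amount
    return acc

def merge(a, b):
    """Merge two partial results -- the 'shuffle'/'reduce' side."""
    a.update(b)  # Counter.update adds counts together
    return a

totals = reduce(merge, map(map_partition, partitions), Counter())
print(totals["beijing"])  # 17
```

Because each partition is aggregated independently, the map step parallelizes across as many nodes as there are partitions, and only tiny per-node summaries cross the network.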
Data mining is a kind of data analysis that is more complicated than traditional analysis (this is only my personal understanding and may be a little confusing). This is where algorithms and machine learning can come into full play.
Big data or fast data
Data analysis must also consider another restrictive factor: time. If you want analysis results as soon as possible, a system like Druid is recommended.
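One reason Druid answers quickly is rollup: events are pre-aggregated at ingestion time along their dimensions, so queries scan far fewer rows than raw events. A rough illustration of the idea (the event schema is invented for the example):

```python
from collections import defaultdict

# Raw events: dimensions (hour, city) plus a metric (clicks).
events = [
    ("2016-10-10T14", "beijing", 1),
    ("2016-10-10T14", "beijing", 1),
    ("2016-10-10T14", "shanghai", 1),
    ("2016-10-10T15", "beijing", 1),
]

# Rollup at ingestion time: one stored row per (hour, city) combination.
rollup = defaultdict(int)
for hour, city, clicks in events:
    rollup[(hour, city)] += clicks  # aggregate now, not at query time

# A query now reads pre-aggregated rows only:
print(rollup[("2016-10-10T14", "beijing")])   # 2
print(len(rollup), "rows instead of", len(events))
```

The trade-off is that detail below the rollup granularity is lost, which is acceptable for time-series analytics where per-event detail is rarely needed.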
If the data size is below 10 TB, the data consists of structured and semi-structured records, and the detailed query conditions are fixed, then full-text search is unnecessary. If you want complex analysis results within a short time (say, within seconds), a distributed RDBMS is recommended.
If the data size is far greater than 10 TB, storage, querying, and analysis must be distributed across separate systems, so a technology stack is required to handle the total volume: HDFS for storage, Solr or Elasticsearch for search, and Spark, Presto, or Impala for analysis. To improve analysis efficiency, optimization should happen not only in the distributed computing engine but also at the storage layer: advanced columnar formats such as Parquet, ORC, and CarbonData can greatly improve analysis performance.
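Why columnar formats help can be shown with a tiny sketch: storing each column contiguously lets an aggregation read only the column it needs, instead of every field of every row. The table below is invented, and this is the layout idea only, not the Parquet/ORC file format itself:

```python
# Row-oriented layout: each record stored whole.
rows = [
    {"id": 1, "city": "beijing",  "amount": 100},
    {"id": 2, "city": "shanghai", "amount": 250},
    {"id": 3, "city": "beijing",  "amount": 50},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def sum_amount_columnar(cols):
    """SELECT SUM(amount): touches only the 'amount' column,
    skipping 'id' and 'city' entirely (column pruning)."""
    return sum(cols["amount"])

print(sum_amount_columnar(columns))  # 400
```

On disk the same pruning means reading one column's bytes instead of the whole table, and same-typed values stored together also compress and vectorize far better, which is where most of the analysis speedup comes from.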