Database equivalent query and statistical information
Introduction: Statistical information is used to provide data support for cost estimation of the optimizer. One of the most important requirements is cardinality estimation in the equivalent query (EQUALS, IN, etc.) scenario.
Concept
Statistical information is used to provide data support for cost estimation of the optimizer. One of the most important requirements is cardinality estimation in the equivalent query (EQUALS, IN, etc.) scenario. Consider the following cases
Our customer has set the index for gender and name at the same time. At this time, the optimizer will filter the number of rows based on each condition to estimate the cost of selecting different indexes. For this case, if the gender index is selected, the query process is probably not good.
If the statistics can provide an accurate cardinality, for example, Name='wy 'can filter out 5 rows, and Gender='female' can only filter out half of the data, the optimizer will know, and then select KEY_ NAME is a cheaper option.
In order to support the estimation process here, statistical information needs to decide what algorithm to use, which also determines how to collect and estimate.
Algorithm
The cardinality estimation in the database is similar to Data Sketching. There are generally two options for cardinality estimation in the equivalent scenario, based on sample or full data.
SAMPLE
When the data volume grows, sample is a very effective means to improve the estimation efficiency. But sample must at least face the following problems
● Randomization of sampling
● Difference estimation between sampling data and overall data
○ How many samples can ensure the error rate
● How to sample when new data appears
Sampling randomization is a very practical problem. For example, there are two methods for sampling in storage, one is to sample by page, and the other is to collect a part of data on each page. Although the first scheme is very efficient, poor randomness easily leads to poor estimation effect. Although the second scheme has higher estimation effect, it is equivalent to a full scan in IO.
In the paper Every Row Counts: Combining Sketches and Sampling for Accurate Group By Result Estimates, the author compares the GEE/AE algorithm based on Sample with the HyperLogLog sketch algorithm oriented to full data, and the accuracy of HyperLogLog is perfect. Of course, this is just a scenario of NDV calculation.
When we need to determine the characteristics of a single data in the cluster, sample is difficult to answer such questions. For example, there is a problem with data, that is, whether a data exists in the cluster. If the data does not exist in the sample set, it may not exist as a whole, or it may just not be sampled.
For another example, when estimating ndv through sampling, if the ndv in the sampled data (assuming 100000 rows) is close to 100000, the ndv of the overall data (assuming 10000000 rows) may be 100000 or 10000000.
NDV: number of the different value
The data set of Sample is more suitable for solving the problem of data characteristics under range conditions. For example, the histogram is constructed from the sample data to deal with the query problem of range conditions, or the cardinal number of multi table equivalent join is estimated from the sample set. These issues are beyond the scope of this article and will not be discussed temporarily.
Non Sample
● frequency: The cardinal number estimation of the equivalent scenario is to obtain the number of a value in the set, which is also a problem to obtain frequency
CountMin Sketch
CountMin Sketch is usually used to solve frequency problems. It is a bit like Bloom filter. It also requires an array and a collection of hash functions. When traversing data, each data will be mapped to a position in the array by each hash function, and the counter at that position will be increased by 1. In the estimation phase, take the minimum value of all counters corresponding to this data. It can be predicted that it will ensure that the estimated value is greater than or equal to the true value, and then ensure the error rate through some means. The specific principle will not be repeated here.
The biggest problem with CountMin Sketch is that when the size of the Sketch is not appropriate, it is easy to overestimate the estimation due to Hash collision. Whether you increase the number of rows (hash functions) or columns (counters), you can reduce the error rate. CountMin Sketch has advantages in processing a large number of duplicate data, but it is easy to produce overly inflated results when processing non skewed data.
Of course, CountMin was also used to handle hot scenarios at the beginning. However, it is not very friendly in scenarios where the data is distributed evenly.
HyperLogLog+TopN
In PolarDBX, we use HyperLogLog+TopN to deal with frequency after weighing. The specific approach is as follows.
HyperLogLog is used to estimate the accurate NDV value, and TopN is used to process data skew. The specific algorithm is roughly as follows:
if (topN.get(i) != 0) {
return (long)(topN.get(i) / sampleRate);
}
return (rowcount / cardinality);
HyperLogLog needs to scan full data, and TopN can be built through sample data.
This kind of data structure is excellent in processing data with an overall average distribution and only a few data scenarios with severe tilt. Use rowcount/cardinality to stabilize most non skewed data, and use TopN to solve skewed data. However, the sample of TopN may cause large deviation in skewed data.
Considering the requirements of the optimizer, it is possible to clearly distinguish tilted data from non tilted data by estimating values, which has met the selection requirements of Plan.
Test
Based on the simulated random data, we compared the performance of CountMin Sketch and HyperLogLog+TopN in several specific scenarios.
The main observed data are as follows:
● Maximum error: the maximum difference between the estimated value and the true value
● Error square error: represents the stability of the error range
The algorithm initialization parameters are as follows:
● CountMin Sketch initialization parameter epsOfTotalCount=0.0001, confidence=0.99, forming a long array of 7 * 20000, and the skeleton occupies about 1M space
● TopN has a fixed sample of 10W data, with no more than 5000 maximum values reserved. The HyperLogLog accuracy is 1.04/sqrt (2 ^ 16)=0.004, and the space is about 50K or less
Among the maximum errors, CountMin mainly comes from the expansion of equilibrium data. The maximum error of Hll+TopN at 10W comes from a single value estimated as rowcount/cardinality, but this error is very small. In 100W and 1000W scenarios, the maximum error of Hll+TopN is far greater than CountMin, but on the whole, the square error of HLL+TopN is more stable than the square error. These errors mainly come from the unrecorded Skew data in TopN (no sample or insufficient TopN capacity).
On the whole, with the growth of data volume, the estimation of CountMin Sketch for skewed data will be more accurate, but the estimation of non skewed data will have a serious deviation, which is reflected in the high square error in big data scenarios. The estimation of Hll+TopN non skewed data is very stable and accurate. The estimation of skewed data will expand seriously with the growth of data volume or the non randomness of sample.
Considering the scenario of base number estimation of equivalent conditions, if you want to ensure the accuracy of skewed data, you should choose CountMin Sketch. If you want to ensure the accuracy of non skewed data, as long as the skewed data can be identified, it is better to use Hll+TopN.
Note that the above test results are only for simulated data scenarios.
collection
Although CountMin and HyperLogLog answer different questions (one is frequency, the other is ndv), they are essentially based on Sketch to describe the overall data. In addition to small memory consumption, the algorithm based on Sketch has a great advantage that multiple Sketches can be combined and used to describe multi shard data.
Most statistics modules need to collect data without considering the impact of multiple partitions, such as the maximum and minimum values, rowcount, number of null values, and column length. But ndv is different, and ndv values of multiple partitions cannot be simply summed up. However, the Sketch data of HyperLogLog can be used to estimate the overall ndv value by merging multiple partitions. CountMin Sketches can also be merged.
CountMin and HyperLogLog must scan full data if they want high accuracy, but this will at least have a huge impact on IO. The Sketch type algorithm can create different Sketches for each shard data, and then only scan the shards with data changes and update their Sketches. This avoids scanning the full amount of data while still obtaining accurate estimates globally.
Epilogue
The approximate approach is often faster and more efficient
So far, the field of data characteristics has been studied for a long time, and a lot of tools have been developed. This paper briefly introduces the differences and impacts of some of the Sketch based technologies in database equivalent query scenarios.
It can be seen that these technologies provide the ability to solve specific problems in different scenarios, but they are not silver bullets. Only carefully weighed selection tools can bring better optimization effects.
Concept
Statistical information is used to provide data support for cost estimation of the optimizer. One of the most important requirements is cardinality estimation in the equivalent query (EQUALS, IN, etc.) scenario. Consider the following cases
Our customer has set the index for gender and name at the same time. At this time, the optimizer will filter the number of rows based on each condition to estimate the cost of selecting different indexes. For this case, if the gender index is selected, the query process is probably not good.
If the statistics can provide an accurate cardinality, for example, Name='wy 'can filter out 5 rows, and Gender='female' can only filter out half of the data, the optimizer will know, and then select KEY_ NAME is a cheaper option.
In order to support the estimation process here, statistical information needs to decide what algorithm to use, which also determines how to collect and estimate.
Algorithm
The cardinality estimation in the database is similar to Data Sketching. There are generally two options for cardinality estimation in the equivalent scenario, based on sample or full data.
SAMPLE
When the data volume grows, sample is a very effective means to improve the estimation efficiency. But sample must at least face the following problems
● Randomization of sampling
● Difference estimation between sampling data and overall data
○ How many samples can ensure the error rate
● How to sample when new data appears
Sampling randomization is a very practical problem. For example, there are two methods for sampling in storage, one is to sample by page, and the other is to collect a part of data on each page. Although the first scheme is very efficient, poor randomness easily leads to poor estimation effect. Although the second scheme has higher estimation effect, it is equivalent to a full scan in IO.
In the paper Every Row Counts: Combining Sketches and Sampling for Accurate Group By Result Estimates, the author compares the GEE/AE algorithm based on Sample with the HyperLogLog sketch algorithm oriented to full data, and the accuracy of HyperLogLog is perfect. Of course, this is just a scenario of NDV calculation.
When we need to determine the characteristics of a single data in the cluster, sample is difficult to answer such questions. For example, there is a problem with data, that is, whether a data exists in the cluster. If the data does not exist in the sample set, it may not exist as a whole, or it may just not be sampled.
For another example, when estimating ndv through sampling, if the ndv in the sampled data (assuming 100000 rows) is close to 100000, the ndv of the overall data (assuming 10000000 rows) may be 100000 or 10000000.
NDV: number of the different value
The data set of Sample is more suitable for solving the problem of data characteristics under range conditions. For example, the histogram is constructed from the sample data to deal with the query problem of range conditions, or the cardinal number of multi table equivalent join is estimated from the sample set. These issues are beyond the scope of this article and will not be discussed temporarily.
Non Sample
● frequency: The cardinal number estimation of the equivalent scenario is to obtain the number of a value in the set, which is also a problem to obtain frequency
CountMin Sketch
CountMin Sketch is usually used to solve frequency problems. It is a bit like Bloom filter. It also requires an array and a collection of hash functions. When traversing data, each data will be mapped to a position in the array by each hash function, and the counter at that position will be increased by 1. In the estimation phase, take the minimum value of all counters corresponding to this data. It can be predicted that it will ensure that the estimated value is greater than or equal to the true value, and then ensure the error rate through some means. The specific principle will not be repeated here.
The biggest problem with CountMin Sketch is that when the size of the Sketch is not appropriate, it is easy to overestimate the estimation due to Hash collision. Whether you increase the number of rows (hash functions) or columns (counters), you can reduce the error rate. CountMin Sketch has advantages in processing a large number of duplicate data, but it is easy to produce overly inflated results when processing non skewed data.
Of course, CountMin was also used to handle hot scenarios at the beginning. However, it is not very friendly in scenarios where the data is distributed evenly.
HyperLogLog+TopN
In PolarDBX, we use HyperLogLog+TopN to deal with frequency after weighing. The specific approach is as follows.
HyperLogLog is used to estimate the accurate NDV value, and TopN is used to process data skew. The specific algorithm is roughly as follows:
if (topN.get(i) != 0) {
return (long)(topN.get(i) / sampleRate);
}
return (rowcount / cardinality);
HyperLogLog needs to scan full data, and TopN can be built through sample data.
This kind of data structure is excellent in processing data with an overall average distribution and only a few data scenarios with severe tilt. Use rowcount/cardinality to stabilize most non skewed data, and use TopN to solve skewed data. However, the sample of TopN may cause large deviation in skewed data.
Considering the requirements of the optimizer, it is possible to clearly distinguish tilted data from non tilted data by estimating values, which has met the selection requirements of Plan.
Test
Based on the simulated random data, we compared the performance of CountMin Sketch and HyperLogLog+TopN in several specific scenarios.
The main observed data are as follows:
● Maximum error: the maximum difference between the estimated value and the true value
● Error square error: represents the stability of the error range
The algorithm initialization parameters are as follows:
● CountMin Sketch initialization parameter epsOfTotalCount=0.0001, confidence=0.99, forming a long array of 7 * 20000, and the skeleton occupies about 1M space
● TopN has a fixed sample of 10W data, with no more than 5000 maximum values reserved. The HyperLogLog accuracy is 1.04/sqrt (2 ^ 16)=0.004, and the space is about 50K or less
Among the maximum errors, CountMin mainly comes from the expansion of equilibrium data. The maximum error of Hll+TopN at 10W comes from a single value estimated as rowcount/cardinality, but this error is very small. In 100W and 1000W scenarios, the maximum error of Hll+TopN is far greater than CountMin, but on the whole, the square error of HLL+TopN is more stable than the square error. These errors mainly come from the unrecorded Skew data in TopN (no sample or insufficient TopN capacity).
On the whole, with the growth of data volume, the estimation of CountMin Sketch for skewed data will be more accurate, but the estimation of non skewed data will have a serious deviation, which is reflected in the high square error in big data scenarios. The estimation of Hll+TopN non skewed data is very stable and accurate. The estimation of skewed data will expand seriously with the growth of data volume or the non randomness of sample.
Considering the scenario of base number estimation of equivalent conditions, if you want to ensure the accuracy of skewed data, you should choose CountMin Sketch. If you want to ensure the accuracy of non skewed data, as long as the skewed data can be identified, it is better to use Hll+TopN.
Note that the above test results are only for simulated data scenarios.
collection
Although CountMin and HyperLogLog answer different questions (one is frequency, the other is ndv), they are essentially based on Sketch to describe the overall data. In addition to small memory consumption, the algorithm based on Sketch has a great advantage that multiple Sketches can be combined and used to describe multi shard data.
Most statistics modules need to collect data without considering the impact of multiple partitions, such as the maximum and minimum values, rowcount, number of null values, and column length. But ndv is different, and ndv values of multiple partitions cannot be simply summed up. However, the Sketch data of HyperLogLog can be used to estimate the overall ndv value by merging multiple partitions. CountMin Sketches can also be merged.
CountMin and HyperLogLog must scan full data if they want high accuracy, but this will at least have a huge impact on IO. The Sketch type algorithm can create different Sketches for each shard data, and then only scan the shards with data changes and update their Sketches. This avoids scanning the full amount of data while still obtaining accurate estimates globally.
Epilogue
The approximate approach is often faster and more efficient
So far, the field of data characteristics has been studied for a long time, and a lot of tools have been developed. This paper briefly introduces the differences and impacts of some of the Sketch based technologies in database equivalent query scenarios.
It can be seen that these technologies provide the ability to solve specific problems in different scenarios, but they are not silver bullets. Only carefully weighed selection tools can bring better optimization effects.
Related Articles

A detailed explanation of Hadoop core architecture HDFS
Knowledge Base Team

What Does IOT Mean
Knowledge Base Team

6 Optional Technologies for Data Storage
Knowledge Base Team

What Is Blockchain Technology
Knowledge Base Team
Explore More Special Offers

Short Message Service(SMS) & Mail Service
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00