PipelineDB uses five aggregate data types when it comes to physical storage. These data types play an important role in processing continuous data. In this article, we will take a look at what these data types are-along with their structures and algorithms-and then we will go through how PipelineDB uses each one of these data types.
Of course, you can learn more about the PipelineDB's data types, data structures, and algorithms by checking out the official documentation. The relevant page for it is this one. Here, we're going to try to keep things a bit more simple.
So first, let's quickly go over the five data types.
Later in this article, we will go over how PipelineDB uses these data types in the sections that follow this one.
You can also find the corresponding data types and functions on these websites. These data types can be used in common tables if you do not want to use them in Continuous Views (CVs), as we will do in this article.
The specific code is implemented in
src/backend/utils/adt/. Okay, now with all of that out of the way, let's look at how PipelineDB uses all of the above data types.
So let's start with Bloom Filter. In PipelineDB, Bloom Filter is implemented in
src/backend/pipeline/bloom.c. Consider the following:
The Bloom Filter data type is used very frequently and is a built-in data type of PipelineDB. We can use Bloom to store data in Continuous Views (CVs).
To illustrate this point, consider the following example. When creating a CV, you can use UDFs directly to generate the Bloom Filter-type fields.
After doing that, you'll want to insert some test data, as shown below:
There are also many UDFs available at this page.
Now there's Count-Min Sketch. In PipelineDB, a Count-Min (CM) Sketch is implemented in
As discussed before, a CM Sketch estimates the frequency of an element in a large stream, which is similar to a Bloom filter. However, a Bloom filter records whether the element exists or not. An error exists when you use the cm sketch.
Now consider the following example:
To this, let's insert some test data, as shown below:
To learn more about this, you can check out its description in the PipelineDB's official documentation.
In PipelineDB, Filtered-Space Saving (FSS) Top-K is implemented in
FSS Top-K is used to find the top-k elements in data streams. Our implementation that we show here is based on this paper.
In an actual environment, for example, where there may be computing top-10 elements in the last 5 minutes, FSS Top-K can efficiently save time and memory space. The following shows, more particularly, how FSS Top-K is used in PipelineDB:
Now, into this, insert some test data, as shown below:
Now, insert six data records, and query
v_fss, as done below:
After this, five records are found when
fss_topk is used. In each record, the value is
1. The inserted value
6 is missing. However, the output of
fss_topk_values contains all the values, and this output is the number of each element.
Now, insert a record again, as shown below:
Two records contain the value
5. Now, insert the value
6 appears and takes the place of
If you'd like to learn more, check out the description provided in PipelineDB's official documentation.
In PipelineDB, HyperLogLog (HLL) is implemented in
In most practical applications, HyperLogLog (HLL) is used to solve the count (distinct) problem, but in PipelineDB, it is also used to store the count (distinct) data.
Consider this example:
The physical storage format of the count (distinct) field corresponding to the created
v_hll is HLL. In the query process, the parser parses this field into the following function:
The following shows the execution plan:
Here's an example:
Regardless of the number of insert operations, the result is count (distinct).
In PipelineDB versions earlier than 0.9.7, this has a rather mediocre efficiency with scan operations sometimes negatively affect performance. In PipelineDB 0.9.7, however, this function has been improved significantly, mitigating these issues. To learn more about the releases and versions of PipelineDB, check out this page. This issue is also discussed at detail in this article, too.
Last, there's T-Digest. In PipelineDB, T-Digest is implemented in
As discussed above, T-Digest is mainly used for estimation and calculation based on percentiles and medians. It is similar to the built-in function percentile_cont in the database. This function is seldom used.
The following briefly introduces how to use T-Digest in PipelineDB.
First, consider the following example:
Now, in it, insert data as follows:
For it, the median of
2, and the median of
If you're interested in learning more about T-Digest, I recommend that you check out its official description in the PipelineDB documentation. In it, you'll learn how to use probabilistic data structures and algorithms.
digoal - September 6, 2019
digoal - September 20, 2019
Alibaba Clouder - September 6, 2018
digoal - June 26, 2019
digoal - September 17, 2019
digoal - June 26, 2019
An online MPP warehousing service based on the Greenplum Database open source programLearn More
Mitigate the scalability problem of single machine relational databases for large-scale online databases.Learn More
A reliable, cost-efficient backup service for continuous data protection.Learn More
An on-demand database hosting service for PPAS with automated monitoring, backup and disaster recovery capabilitiesLearn More
More Posts by digoal