Use Open Source PolarDB Bloom Filter Index to Filter Data

By digoal

Background

PolarDB's cloud-native computing-storage separation architecture offers cost-effective data storage, scalable operations, high-speed multi-machine parallel computing, and efficient data search and processing. By combining PolarDB with computing algorithms, it enables businesses to derive value from their data and turn it into productivity.

This article explores how to leverage the open-source PolarDB Bloom Filter Index to filter data based on any combination of fields. The test environment used for this demonstration is macOS + Docker. For detailed instructions on PolarDB deployment, please refer to the following article: Simple Deployment of PolarDB

Principle

Filtering data based on any combination of fields is typically required in data analysis scenarios, such as for BI analysts. This operation involves filtering numbers according to specific requirements, which often entails a lot of trial and error. To address this need for filtering any combination of fields, an efficient solution is required.

Creating an index for each combination of fields can accelerate the filtering process for any field. PolarDB for PostgreSQL also supports bitmap scans of multiple B-tree indexes, so you only need to create one index for each field. However, some drawbacks still exist:

• The total space occupied by each column's indexes is relatively large.

• The presence of more indexes can increase the response time (RT) caused by data manipulation language (DML) operations, impacting the write throughput.

To overcome these challenges, the concept of the bloom filter was introduced. The bloom filter is an index that supports data filtering based on any combination of fields (Note: bloom filter only supports equivalent combinations). Here's how it works:

• The bloom filter consists of an N-bit hash value.

• After hashing each value, it is mapped to N bits in the hash value. For example, with 2 bits, the value "hello" (11) may map to bits 1 and 8, while "world" (11) may map to bits 2 and 8.

To query the existence of "hello," you only need to check if the corresponding bit value matches the hash value.

The bloom filter has some lossy characteristics: the absence does not necessarily mean the value doesn't exist, and the presence does not necessarily mean it does exist. This is due to bit conflicts, where other column values may also map to the same bits (bit collisions). Therefore, a recheck is required for verification.

Why do bit conflicts occur?

For example, if 80 bits are used to store values from 32 columns, and each column has 2 bits, different column values may be mapped to the same bits simultaneously. The more fields involved, the higher the probability of collision.

In any case, the bloom filter provides a cost-effective and efficient method for filtering data based on any combination of fields, facilitating seamless queries.

Scenario Simulation and Architecture Design Practices

Create a test table and write 10 million records

=# CREATE TABLE tbloom AS  
   SELECT  
     (random() * 1000000)::int as i1,  
     (random() * 1000000)::int as i2,  
     (random() * 1000000)::int as i3,  
     (random() * 1000000)::int as i4,  
     (random() * 1000000)::int as i5,  
     (random() * 1000000)::int as i6  
   FROM  
  generate_series(1,10000000);  
SELECT 10000000

Full table scan performance

=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;  
                                              QUERY PLAN  
------------------------------------------------------------------------------------------------------  
 Seq Scan on tbloom  (cost=0.00..2137.14 rows=3 width=24) (actual time=16.971..16.971 rows=0 loops=1)  
   Filter: ((i2 = 898732) AND (i5 = 123451))  
   Rows Removed by Filter: 100000  
 Planning Time: 0.346 ms  
 Execution Time: 16.988 ms  
(5 rows)

Create a bloom index to cover six fields, consuming only 1.5 MB memory and improving query performance by nearly 50 times.

create extension bloom ;

=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);  
CREATE INDEX  
=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));  
 pg_size_pretty  
----------------  
 1584 kB  
(1 row)  
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;  
                                                     QUERY PLAN  
---------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on tbloom  (cost=1792.00..1799.69 rows=2 width=24) (actual time=0.388..0.388 rows=0 loops=1)  
   Recheck Cond: ((i2 = 898732) AND (i5 = 123451))  
   Rows Removed by Index Recheck: 29  
   Heap Blocks: exact=28  
   ->  Bitmap Index Scan on bloomidx  (cost=0.00..1792.00 rows=2 width=0) (actual time=0.356..0.356 rows=29 loops=1)  
         Index Cond: ((i2 = 898732) AND (i5 = 123451))  
 Planning Time: 0.099 ms  
 Execution Time: 0.408 ms  
(8 rows)

For a common btree index, create one index for each field, and there is a total of six indexes. The combination of multiple fields uses bitmap and|or scan, which has the highest efficiency, but takes up a lot of space and has a greater impact on the write throughput performance.

=# CREATE INDEX btreeidx1 ON tbloom (i1);  
CREATE INDEX  
=# CREATE INDEX btreeidx2 ON tbloom (i2);  
CREATE INDEX  
=# CREATE INDEX btreeidx3 ON tbloom (i3);  
CREATE INDEX  
=# CREATE INDEX btreeidx4 ON tbloom (i4);  
CREATE INDEX  
=# CREATE INDEX btreeidx5 ON tbloom (i5);  
CREATE INDEX  
=# CREATE INDEX btreeidx6 ON tbloom (i6);  
CREATE INDEX  
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;  
                                                        QUERY PLAN  
---------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on tbloom  (cost=24.34..32.03 rows=2 width=24) (actual time=0.028..0.029 rows=0 loops=1)  
   Recheck Cond: ((i5 = 123451) AND (i2 = 898732))  
   ->  BitmapAnd  (cost=24.34..24.34 rows=2 width=0) (actual time=0.027..0.027 rows=0 loops=1)  
         ->  Bitmap Index Scan on btreeidx5  (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.026 rows=0 loops=1)  
               Index Cond: (i5 = 123451)  
         ->  Bitmap Index Scan on btreeidx2  (cost=0.00..12.04 rows=500 width=0) (never executed)  
               Index Cond: (i2 = 898732)  
 Planning Time: 0.491 ms  
 Execution Time: 0.055 ms  
(9 rows)

Community

Use Open Source PolarDB Bloom Filter Index to Filter Data

Background

Principle

Scenario Simulation and Architecture Design Practices

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for PostgreSQL

PolarDB for Xscale

PolarDB for MySQL

Tair