Bloom filter indexes in ApsaraDB for SelectDB - ApsaraDB for SelectDB

This topic describes the usage notes and related operations when you use Bloom filter indexes in ApsaraDB for SelectDB.

Feature description

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a probabilistic data structure based on the mapping of multiple hash functions. It is used to check whether an element belongs to a set.

Scenarios: A Bloom filter is designed for the IN and = operators. This helps filter out a large amount of irrelevant data.
Features:
- Space-efficient data structure: consists of an extremely long binary bit array and a series of hash functions.
- Probabilistic result: A Bloom filter returns the following types of results when the Bloom filter checks whether an element belongs to a set:
  - true: The element may belong to the set. A probability of misjudgment exists.
  - false: The element does not belong to the set.
Implementation:
More principles
A Bloom filter is composed of an extremely long binary bit array and a series of hash functions. Initially, all the bits in the binary bit array are set to 0. If an element to be queried is specified, the element is calculated by using the hash functions and mapped to a series of bits. The offsets of all bits are set to 1 in the bit array.
The following figure shows an example of a Bloom filter with m set to 18 and k set to 3. m indicates the size of the bit array, and k indicates the number of hash functions. The x, y, and z elements in the set are hashed into the bit array by using three different hash functions. When the element w is queried, the element w is calculated by hash functions. Because one bit is 0, a result is returned, indicating that the element w does not belong to the set.
You can check whether an element belongs to a set based on the offsets that are calculated by the hash functions. If all the offsets are 1, the element belongs to the set. If one of the offsets is not 1, the element does not belong to the set.

Usage notes

You cannot create a Bloom filter index for a column of the TINYINT, FLOAT, or DOUBLE data type.
Bloom filter indexes accelerate only the queries that contain the filter conditions that are specified by the in and = operators.
To check whether a query matches a Bloom filter index, you can view the profile information of the query.

Create an index

Bloom filter indexes are created based on blocks. In each block, the values of a column are used as a set to generate a Bloom filter index.

In ApsaraDB for SelectDB, you can specify a Bloom filter index when you create a table or by performing ALTER operations on a table.

Create an index when you create a table

If you want to create a Bloom filter index during table creation, you can append the "bloom_filter_columns"="k1,k2,k3" setting to PROPERTIES in the table creation statement. In this example, k1,k2,k3 indicates the names of the key columns on which you want to create Bloom filter indexes.

The following sample code provides an example on how to create the Bloom filter indexes saler_id and category_id for the sale_detail_bloom table.

CREATE TABLE IF NOT EXISTS sale_detail_bloom(
    sale_date date NOT NULL COMMENT "The date on which the product was sold",
    customer_id int NOT NULL COMMENT "The ID of the customer",
    saler_id int NOT NULL COMMENT "The ID of the seller",
    sku_id int NOT NULL COMMENT "The ID of the product",
    category_id int NOT NULL COMMENT "The ID of the category to which the product belongs",
    sale_count int NOT NULL COMMENT "The number of products that were sold",
    sale_price DECIMAL(12,2) NOT NULL COMMENT "The unit price of the product",
    sale_amt DECIMAL(20,2) COMMENT "The total sales amount"
)
Duplicate KEY(sale_date, customer_id, saler_id, sku_id, category_id)
distributed BY hash(customer_id) buckets 3
PROPERTIES("bloom_filter_columns"="saler_id, category_id");

Create an index for an existing table

Execute the following statement to create an index for an existing table to modify the bloom_filter_columns parameter of the existing table:

ALTER TABLE <db.table_name> SET ("bloom_filter_columns" = "k1,k3");

View an index

Execute the following statement to view the Bloom filter index created for a table:

SHOW CREATE TABLE <table_name>;

The following example describes how to query the Bloom filter index created for the sale_detail_bloom table.

SHOW CREATE TABLE sale_detail_bloom;

The following result is returned:

+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table             | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| sale_detail_bloom | CREATE TABLE `sale_detail_bloom` (
  `sale_date` datev2 NOT NULL COMMENT 'The date on which the product was sold',
  `customer_id` int(11) NOT NULL COMMENT 'The ID of the customer',
  `saler_id` int(11) NOT NULL COMMENT 'The ID of the seller',
  `sku_id` int(11) NOT NULL COMMENT 'The ID of the product',
  `category_id` int(11) NOT NULL COMMENT 'The ID of the category to which the product belongs',
  `sale_count` int(11) NOT NULL COMMENT 'The number of products that were sold',
  `sale_price` decimalv3(12, 2) NOT NULL COMMENT 'The unit price of the product',
  `sale_amt` decimalv3(20, 2) NULL COMMENT 'The total sales amount'
) ENGINE=OLAP
DUPLICATE KEY(`sale_date`, `customer_id`, `saler_id`, `sku_id`, `category_id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`customer_id`) BUCKETS 3
PROPERTIES (
"bloom_filter_columns" = "category_id, saler_id",
"light_schema_change" = "true"
);                               |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.03 sec)

Delete an index

Execute the following statement to delete a Bloom filter index. During the deletion, the bloom_filter_columns parameter in the index column is deleted.

ALTER TABLE <db.table_name> SET ("bloom_filter_columns" = "");

Query the index change progress

Index creation, modification, and deletion are asynchronous. You can execute the following statement to view the index change progress:

SHOW ALTER TABLE COLUMN;

Best practices

If the following conditions are met, you can create a Bloom filter index for a column.

Data is not filtered based on prefixes.
The data to be queried is frequently filtered based on the column, and most of the filter conditions contain in and =.
Different from Bitmap indexes, Bloom filter indexes are suitable for high cardinality columns, such as a column that stores user IDs. This is because if Bloom filter indexes are created for a low-cardinality column, such as a column that stores user genders, each block almost contains all values. In this case, the Bloom filter indexes do not improve query performance.

Example

To query a short row that occupies 100 bytes in length, an HFile data block of 64 KB in size contains (64 × 1,024)/100 = 655.53 rows, which is about 700 rows. If a Bloom filter index can be created only on the initial row key of the data block, the Bloom filter index cannot provide fine-grained index information. This is because the row data to be queried may belong to the row range of the data block, the row data may not belong to the data block, the row data may not exist in the table, or the row data may belong to another HFile data block or even be stored in Memstore. In the preceding cases, additional I/O overheads are required when the data block is read from the disk, and the cache of the data block is abused. If a large data set is read with high concurrency, the performance of the cluster is seriously compromised.

Therefore, HBase provides Bloom filters, which allow you to perform a reverse test on the data stored in each data block. When you request to access a row, a Bloom filter first checks whether the row belongs to the data block. The Bloom filter returns a result that the row does not belong to the data block or that it does not know whether the row belongs to the data block. This is called a reverse test. Bloom filters also apply to cells in a row, and the same reverse test can be used when a column identifier is accessed.

However, Bloom filters cause costs. Bloom filter indexes occupy additional storage space. The number of Bloom filters increases as their index objects increase. Therefore, row-level Bloom filters occupy less storage space than column identifier-level Bloom filters. If the storage space is sufficient, Bloom filters can help you exploit the potential performance of your system.

You can specify Bloom filter indexes in ApsaraDB for SelectDB when you create a table or by performing the ALTER operations on a table.

Bloom filter indexes can also be created based on blocks. In each block, the values of the specified columns are used as a set to generate a Bloom filter index catalog, which is used to quickly filter data that does not meet the conditions during queries.