All Products
Search
Document Center

Tair:Bloom

Last Updated:Mar 04, 2024

Bloom is a space-efficient probabilistic data structure that consumes minimal memory to check whether an element exists among large amounts of data. TairBloom is an implementation of scalable Bloom filters (SBFs). It features dynamic scalability while maintaining a stable false positive rate during scaling.

Overview

You can use bitmaps on Redis data structures, such as hashes, sets, and strings, to implement similar features of TairBloom. However, these data structures may consume a large amount of memory or fail to maintain a stable false positive rate during dynamic scaling. For this reason, TairBloom is ideal for checking for large amounts of data when a specific false positive rate is allowed. You can use the built-in Bloom filters of TairBloom without further encapsulation or the need to create extra Bloom filters on your on-premises devices.

Features

  • Minimum consumption of memory.

  • Dynamic scaling.

  • Stable custom false positive rate during scaling.

Typical scenarios

TairBloom can be used for recommendation and crawler systems in industries such as live-streaming, music, and e-commerce.

  • Recommendation system: TairBloom records the IDs of articles that may have been recommended, queries these articles, determines duplicate articles, and then recommends new articles in which users might be interested.

  • Crawler system: TairBloom filters out the URLs that have been crawled to improve productivity.

Best practices

Build recommendation systems based on TairBloom

TairBloom records the IDs of articles that may have been recommended, queries these articles, determines duplicate articles, and then recommends new articles in which users might be interested. Pseudocode:

void recommendedSystem(userid) {
    while (true) {
        // Obtain a random article ID or a desired article ID. 
        docid = getDocByRandom()
        if (bf.exists(userid, docid)) {
            // If the article may have been recommended to a user, the next article is recommended. 
            continue;
        } else {
            // If the article has not been recommended to the user, the article is recommended. 
            sendRecommendMsg(docid);
            // If the article may have been recommended, the article ID is recorded in a Bloom filter. 
            bf.add(userid, docid);
            break;
        }
    }
}

Optimize crawler systems based on TairBloom

TairBloom filters out the URLs that may have been crawled to improve productivity. Pseudocode:

bool crawlerSystem( ) {
    while (true) {
        // Obtain a URL that you want to crawl. 
        url = getURLFromQueue()
        if (bf.exists(url_bloom, url)) {
            // If the URL may have been crawled, the URL is skipped. 
            continue;
        } else {
            // Crawl this URL. 
            doDownload(url)
            // Add this URL to a Bloom filter. 
            bf.add(url_bloom, url);
        }
    }
}

How it works

TairBloom is an implementation of SBFs. It features dynamic scalability while maintaining a stable false positive rate. SBFs are optimized Bloom filters. The following section describes the basic principles of Bloom filters and SBFs.

  • Bloom filters

    A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

    A new Bloom filter is a bit array of m bits, all of which are set to 0. The Bloom filter also includes a set of k different hash functions to generate a uniform random distribution. k is a constant that is less than m. When you add elements to the Bloom filter, the k hash functions map these elements to k bits in the bit array and set the k bits to a value of 1. In this case, a bit can be shared by multiple pieces of data. The following figure shows how to insert X1 and X2 into a Bloom filter that includes 3 hash functions.布隆过滤器插入x1、x2的过程

    When you query an element in a Bloom filter, you can use k hash functions to obtain k bits. If all of k bits have a value of 1, the element exists in the Bloom filter. Otherwise, the element does not exist. The following figure shows how to query Y1 and Y2 in the Bloom filter.查询y1y2

    The preceding figure demonstrates that although Y2 was never inserted into the Bloom filter, Y2 is determined to be existent in the Bloom filter. This scenario shows that Bloom filters have a false positive rate. The preceding analysis shows that Bloom filters have the following features:

    • A bit can be shared by multiple pieces of data.

    • Bloom filters have a false positive rate and the rate gets higher as the number of elements in a Bloom filter increases. However, Bloom filters do not have a false negative rate. If an element exists in a Bloom filter, the element is always determined to be existent.

    • An element can be added to a Bloom filter but cannot be removed from the Bloom filter. The reason is that bits can be shared. If you remove an element from the Bloom filter, other elements in the Bloom filter are affected.

  • SBFs

    When the number of elements increases, the false positive rate also increases. If you want to maintain a stable false positive rate, you must accordingly increase the size of the Bloom filter. However, the size cannot be increased due to structural limits. SBFs emerged in response to this issue. SBFs are a new type of Bloom filter that combines multiple Bloom filters into one.

    The following figure shows the basic model of an SBF. This SBF has two layers: BF0 and BF1. At first, the SBF contains only the BF0 layer. If you insert elements a, b, and c into this SBF and the BF0 layer is not large enough to maintain a specified false positive rate, the BF1 layer is created to increase the SBF size. Then, elements d, e, and f are inserted into the BF1 layer. A new BF2 layer is created if the BF1 layer also cannot maintain the specified positive rate.SBF模型

    SBFs insert data only into the last layer and query data from the last layer to the BF0 layer. For more information, see Scalable Bloom Filters.

Prerequisites

A Tair DRAM-based instance is created.

Note

The latest minor version provides more features and higher stability. We recommend that you update the instance to the latest minor version. For more information, see Update the minor version of an instance. If your instance is a cluster or read/write splitting instance, we recommend that you update the proxy nodes in the instance to the latest minor version. This ensures that all commands can be run as expected.

Precautions

  • The TairBloom data that you want to manage is stored in a Tair instance.

  • The initial capacity and false positive rate that meet your requirements must be calculated in advance. To create a TairBloom key that has a capacity far more than 100 elements, run the BF.RESERVE command instead of the BF.ADD command.

    The following section demonstrates the difference between the BF.ADD command and the BF.RESERVE command:

    • BF.ADD (also known as BF.MADD): If a TairBloom key does not exist when this command is run, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01. If you require a key capacity that is far more than 100, you can only scale up the key to support more elements.

      When a key is scaled up, more layers of Bloom filters are added to it. Every time a Bloom filter layer is added, the key capacity increases. In this case, if you perform a query on the key, multiple layers of Bloom filters need to be traversed and the performance of the key significantly deteriorates.

    • BF.RESERVE (also known as BF.INSERT): An initial capacity is specified when this command is run. This command specifies an initial capacity at the first layer of a key. If a key contains a smaller number of Bloom filter layers, the query speed of the key is faster.

    Note

    For example, assume that you want to insert 10,000,000 elements into a key and allows a false positive rate of 0.01. The created key is 176 MB in size if you use the BF.ADD command, or 16 MB in size if you use the BF.RESERVE command.

    Although a key can be scaled up, we recommend that you consider key scale-ups only as a safety measure and do not frequently use them. If the actual capacity exceeds the provisioned capacity of a key, you can scale up the key to make sure that write operations can be performed on the key and prevent live accidents.

    The following table lists a variety of key sizes supported by the BF.RESERVE command and their matched initial capacities and false positive rates.

    Capacity (number of elements)

    false positive:0.01

    false positive:0.001

    false positive:0.0001

    100,000

    0.12 MB

    0.25 MB

    0.25 MB

    1,000,000

    2 MB

    2 MB

    4 MB

    10,000,000

    16 MB

    32 MB

    32 MB

    100,000,000

    128 MB

    256 MB

    256 MB

    1,000,000,000

    2 GB

    2 GB

    4 GB

  • The size of a key cannot be decreased because elements can only be added to a key, not removed. To prevent capacity issues such as an out of memory (OOM) error, we recommend that you implement the following solutions:

    • Split business data. You can split business data to prevent large keys that affect query performance. If large keys exist, most requests are made to the instance that contains these keys. This may cause hotkeys or even skewed requests.

      You can distribute the split data to multiple keys. If your business data is stored in a Tair cluster instance, you can distribute the split data to multiple nodes in the instance to ensure that large keys and hotkeys do not occur.

    • Regularly rebuild keys. If possible, you can run the DEL command to delete data from a key, and then insert data from backend databases into the key to manage the key size.

      You can also create multiple keys to interchangeably use them. This way, the size of a single key is kept appropriate. The benefit of this solution is that you only need to create keys once. However, multiple keys must be created and some memory may be wasted.

Supported commands

Table 1. TairBloom commands

Command

Syntax

Description

BF.RESERVE

BF.RESERVE key error_rate capacity

Creates an empty TairBloom key that has a specified capacity and false positive rate. The capacity parameter specifies the capacity of the key and the error_rate parameter specifies the false positive rate of the key.

BF.ADD

BF.ADD key item

Adds an element to a TairBloom key.

BF.MADD

BF.MADD key item [item ...]

Adds multiple elements to a TairBloom key.

BF.EXISTS

BF.EXISTS key item

Checks whether an element exists in a TairBloom key.

BF.MEXISTS

BF.MEXISTS key item [item ...]

Checks whether multiple elements exist in a TairBloom key.

BF.INSERT

BF.INSERT key [CAPACITY cap] [ERROR error] [NOCREATE] ITEMS item [item ...]

Adds multiple elements to a TairBloom key. If the key does not exist, you can specify whether to create the key. You can also specify the capacity and false positive rate of the key.

BF.INFO

BF.INFO key

Retrieves information about a TairBloom key. The information includes the number of layers, the number of items at each layer, and the false positive rate.

DEL

DEL key [key ...]

Deletes one or more TairBloom keys.

Note

The elements that are already added to a key cannot be deleted. You can run the DEL command to delete all data from a key.

Note The following list describes the conventions for the command syntax used in this topic:
  • Uppercase keyword: indicates the command keyword.
  • Italic text: indicates variables.
  • [options]: indicates that the enclosed parameters are optional. Parameters that are not enclosed by brackets must be specified.
  • A|B: indicates that the parameters separated by the vertical bars (|) are mutually exclusive. Only one of the parameters can be specified.
  • ...: indicates that the parameter preceding this symbol can be repeatedly specified.

BF.RESERVE

Item

Description

Syntax

BF.RESERVE key error_rate capacity

Time complexity

O(1)

Command description

Creates an empty TairBloom key that has a specified capacity and false positive rate. The capacity parameter specifies the capacity of the key and the error_rate parameter specifies the false positive rate of the key.

Parameter

  • Key: the name of the key that you want to manage by running this command.

  • error_rate: the expected false positive rate. The value of this parameter must be between 0 and 1. A lower value indicates higher accuracy, memory usage, and CPU utilization of the key.

  • capacity: the initial capacity of the key. This parameter specifies the maximum number of elements that can be added to the key.

    When the number of elements added to the key exceeds the capacity value, a Bloom filter layer is added for the key. This process may deteriorate the query performance of the key. Every time a Bloom filter layer is added, the key capacity increases. In this case, if you perform a query on the key, multiple layers of Bloom filters may need to be traversed. If your workloads require high performance, we recommend that you add elements to a key based on your business requirements to prevent automatic scaling.

Output

  • If the operation is successful, OK is returned.

  • Otherwise, an error message is returned.

Example

Sample command:

BF.RESERVE BFKEY 0.01 100

Sample output:

OK

BF.ADD

Item

Description

Syntax

BF.ADD key item

Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds an element to a TairBloom key.

Note

If the key does not exist, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01.

Parameter

  • Key: the name of the key that you want to manage by running this command.

  • item: the element that you want to add to the key.

Output

  • If the element does not exist, the element is added and a value of 1 is returned.

  • If the element may already exist, the element is not added or updated and a value of 0 is returned.

  • Otherwise, an error message is returned.

Example

Sample command:

BF.ADD BFKEY item1

Sample output:

(integer) 1

BF.MADD

Item

Description

Syntax

BF.MADD key item [item ...]

Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds multiple elements to a TairBloom key.

Note

If the key does not exist, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01.

Parameter

  • Key: the name of the key that you want to manage by running this command.

  • item: the element that you want to add to the key. Multiple elements can be specified.

Output

  • If the element does not exist, the element is added and a value of 1 is returned.

  • If the element may already exist, the element is not added or updated and a value of 0 is returned.

  • Otherwise, an error message is returned.

Example

Sample command:

BF.MADD BFKEY item1 item2 item3

Sample output:

(integer) 1
(integer) 1
(integer) 1

BF.EXISTS

Item

Description

Syntax

BF.EXISTS key item

Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Checks whether an element exists in a TairBloom key.

Parameter

  • Key: the name of the key that you want to manage by running this command.

  • item: the element that you want to query.

Output

  • If the element does not exist, a value of 0 is returned.

  • If the element may exist, a value of 1 is returned.

  • Otherwise, an error message is returned.

Example

Sample command:

BF.EXISTS BFKEY item1

Sample output:

(integer) 1

BF.MEXISTS

Item

Description

Syntax

BF.MEXISTS key item [item ...]

Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Checks whether multiple elements exist in a TairBloom key.

Parameter

  • Key: the name of the key that you want to manage by running this command.

  • item: the element that you want to query. Multiple elements can be specified.

Output

  • If the element does not exist, a value of 0 is returned.

  • If the element may exist, a value of 1 is returned.

  • Otherwise, an error message is returned.

Example

Sample command:

BF.MEXISTS BFKEY item1 item5

Sample output:

(integer) 1
(integer) 0

BF.INSERT

Item

Description

Syntax

BF.INSERT key [CAPACITY cap] [ERROR error] [NOCREATE] ITEMS item [item ...]

Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds multiple elements to a TairBloom key. If the key does not exist, you can specify whether to create the key. You can also specify the capacity and false positive rate of the key.

Parameter

  • Key: the name of the key that you want to manage by running this command.

  • capacity: the initial capacity of the key. This parameter specifies the maximum number of elements that can be added to the key. If the key exists, this parameter is ignored.

    If the number of elements that are added to a key exceeds the capacity value, another Bloom filter layer is automatically added for the key. During the scaling, the number of elements in the key increases and the performance decreases. To query a specific element after a layer is added to the key, multiple layers may be traversed. The capacity of each new layer is twice that of the previous layer. If your workloads require high performance, we recommend that you add elements to a key based on your business requirements to prevent automatic scaling.

  • error_rate: the expected false positive rate. The value of this parameter must be between 0 and 1. A lower value indicates higher accuracy, memory usage, and CPU utilization of the key.

  • NOCREATE: specifies that the key is not automatically created if the key does not exist. This parameter cannot be specified together with the capacity or error_rate parameter.

  • item: the element that you want to add. Multiple elements can be specified.

Output

  • If the element does not exist, the element is added and a value of 1 is returned.

  • If the element may already exist, the element is not added or updated and a value of 0 is returned.

  • Otherwise, an error message is returned.

Example

Sample command:

BF.INSERT bfkey1 CAPACITY 10000 ERROR 0.001 ITEMS item1 item2 item3

Sample output:

(integer) 1
(integer) 1
(integer) 1

BF.INFO

Item

Description

Syntax

BF.INFO key

Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Retrieves information about a TairBloom key. The information includes the number of layers, the number of items at each layer, and the false positive rate.

Parameter

  • Key: the name of the key that you want to manage by running this command.

Output

  • If the operation is successful, information about the TairBloom key is returned.

  • Otherwise, an error message is returned.

Example

Sample command:

BF.INFO bk1

Sample output:

1) "total_items:6,num_blooms:2"
2) "bytes:4 bits:32 hashes:7 hashwidth:64 capacity:3 items:3 error_ratio:0.01"
3) "bytes:16 bits:128 hashes:9 hashwidth:64 capacity:10 items:3 error_ratio:0.0025"

BF.INFO response parameters:

  • total_items indicates the total number of elements, and num_blooms indicates the total number of Bloom filter layers.

  • Information of each Bloom filter layer:

    • bytes: the number of occupied bytes.

    • bits: the number of occupied bits, which is calculated as the number of bytes multiplied by 8.

    • hashes: the number of hash functions.

    • hashwidth: the width of hash functions.

    • capacity: the capacity.

    • items: the number of elements.

    • error_ratio: the false positive rate.