TairBloom is a space-efficient probabilistic data structure that consumes minimal memory to check whether an element exists. It features dynamic scalability while maintaining a stable false positive rate during scaling.

Introduction to TairBloom

TairBloom is a space-efficient probabilistic data structure that consumes minimal memory to check whether an element exists. It features dynamic scalability while maintaining a stable false positive rate during scaling.

You can use bitmaps on Redis data structures, such as hashes, sets, and strings, to implement similar features of TairBloom. However, these data structures may consume a large amount of memory or fail to maintain a stable false positive rate during dynamic scaling. For this reason, TairBloom is ideal for checking for large amounts of data when a specific false positive rate is allowed. You can use the built-in Bloom filters of TairBloom without further encapsulation or the need to create extra Bloom filters on your on-premises devices.

Main features
  • Minimum consumption of memory.
  • Dynamic scaling.
  • Stable custom false positive rate during scaling.
Typical scenarios
TairBloom can be used for recommendation and crawler systems in industries such as live-streaming, music, and e-commerce.
  • Recommendation system: TairBloom records the IDs of articles that may have been recommended, queries these articles, determines duplicate articles, and then recommends new articles in which users might be interested.
  • Crawler system: TairBloom filters out the URLs that have been crawled to improve productivity.
Best practices
TairBloom records the IDs of articles that may have been recommended, queries these articles, determines duplicate articles, and then recommends new articles in which users might be interested. Pseudocode:
void recommendedSystem(userid) {
    while (true) {
        // Obtain a random article ID or a desired article ID. 
        docid = getDocByRandom()
        if (bf.exists(userid, docid)) {
            // If the article may have been recommended to a user, the next article is recommended. 
            continue;
        } else {
            // If the article has not been recommended to the user, the article is recommended. 
            sendRecommendMsg(docid);
            // If the article may have been recommended, the article ID is recorded in a Bloom filter. 
            bf.add(userid, docid);
            break;
        }
    }
}
TairBloom filters out the URLs that may have been crawled to improve productivity. Pseudocode:
bool crawlerSystem( ) {
    while (true) {
        // Obtain a URL that you want to crawl. 
        url = getURLFromQueue()
        if (bf.exists(url_bloom, url)) {
            // If the URL may have been crawled, the URL is skipped. 
            continue;
        } else {
            // Crawl this URL. 
            doDownload(url)
            // Add this URL to a Bloom filter. 
            bf.add(url_bloom, url);
        }
    }
}

How it works

TairBloom is an implementation of scalable Bloom filters (SBFs). It features dynamic scalability while maintaining a stable false positive rate. SBFs are optimized Bloom filters. The following section describes the basic principles of Bloom filters and SBFs.

  • Bloom Filter

    A Bloom filter is a space-efficient probabilistic data structure conceived by Burton Howard Bloom in 1970. It is used to test whether an element is a member of a set.

    A new Bloom filter is a bit array of m bits, all of which are set to 0. The Bloom filter also includes a set of k different hash functions to generate a uniform random distribution. k is a constant that is less than m. When you add elements to the Bloom filter, the k hash functions map these elements to k bits in the bit array and set the k bits to a value of 1. In this case, a bit can be shared by multiple pieces of data. The following figure shows how to insert X1 and X2 into a Bloom filter that includes 3 hash functions. Insert X1 and X2 into a Bloom filter
    When you query an element in a Bloom filter, you can use k hash functions to obtain k bits. If all of k bits have a value of 1, the element exists in the Bloom filter. Otherwise, the element does not exist. The following figure shows how to query Y1 and Y2 in the Bloom filter. Query Y1 and Y2 in a Bloom filter
    The preceding figure demonstrates that although Y2 was never inserted into the Bloom filter, Y2 is determined to be existent in the Bloom filter. This scenario shows that Bloom filters have a false positive rate. The preceding analysis shows that Bloom filters have the following features:
    • A bit can be shared by multiple pieces of data.
    • Bloom filters have a false positive rate and the rate gets higher as the number of elements in a Bloom filter increases. However, Bloom filters do not have a false negative rate. If an element exists in a Bloom filter, the element is always determined to be existent.
    • An element can be added to a Bloom filter but cannot be removed from the Bloom filter. The reason is that bits can be shared. If you remove an element from the Bloom filter, other elements in the Bloom filter are affected.
  • Scalable Bloom Filter

    When an increasing number of elements are added to a Bloom filter, the false positive rate becomes higher. If you want to maintain a stable false positive rate, you must accordingly increase the size of the Bloom filter. However, the size cannot be increased due to structural limits. SBFs emerged in response to this issue. SBFs are a new type of Bloom filter that combines multiple Bloom filters into one.

    The following figure shows the basic model of an SBF. This SBF has two layers: BF0 and BF1. At first, the SBF contains only the BF0 layer. If you insert elements a, b, and c into this SBF and the BF0 layer is not large enough to maintain a specified false positive rate, the BF1 layer is created to increase the SBF size. Then, elements d, e, and f are inserted into the BF1 layer. A new BF2 layer is created if the BF1 layer also cannot maintain the specified positive rate. Model of SBFs

    SBFs insert data only into the last layer and query data from the last layer to the BF0 layer. For more information, see Scalable Bloom Filters.

Prerequisites

The instance is a performance-enhanced instance of the ApsaraDB for Redis Enhanced Edition (Tair). For more information about performance-enhanced instances, see Performance-enhanced instances.
Note The latest minor version provides more features and higher stability. We recommend that you update your instance to the latest minor version. For more information, see Update the minor version. If your instance is a cluster or read/write splitting instance, we recommend that you update the proxy nodes in the instance to the latest minor version. This ensures that all commands can be run as expected. For more information about cluster and read/write splitting instances, see Cluster master-replica instances and Read/write splitting instances.

Precautions

  • The TairBloom data that you want to manage is stored on a performance-enhanced instance.
  • The initial capacity and false positive rate that meet your requirements must be calculated in advance. To create a TairBloom key that has a capacity far more than 100 elements, run the BF.RESERVE command instead of the BF.ADD command.

    The following section demonstrates the difference between the BF.ADD command and the BF.RESERVE command:

    • BF.ADD (also known as BF.MADD): If a TairBloom key does not exist when this command is run, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01. If you require a key capacity that is far more than 100, you can only scale up the key to support more elements.

      When a key is scaled up, more layers of Bloom filters are added to it. Every time a Bloom filter layer is added, the key capacity increases. In this case, if you perform a query on the key, multiple layers of Bloom filters need to be traversed and the performance of the key significantly deteriorates.

    • BF.RESERVE (also known as BF.INSERT): An initial capacity is specified when this command is run. This command specifies an initial capacity at the first layer of a key. If a key contains a smaller number of Bloom filter layers, the query speed of the key is faster.
    Note For example, assume that you want to insert 10,000,000 elements into a key and allows an false positive rate of 0.01. The created key is 176 MB in size if you use the BF.ADD command, or 16 MB in size if you use the BF.RESERVE command.

    Although a key can be scaled up, we recommend that you consider key scale-ups only as a safety measure and do not frequently use them. If the actual capacity exceeds the provisioned capacity of a key, you can scale up the key to make sure that write operations can be performed on the key and prevent live accidents.

    The following table lists a variety of key sizes supported by the BF.RESERVE command and their matched initial capacities and false positive rates.

    Capacity (number of elements) false positive:0.01 false positive:0.001 false positive:0.0001
    100,000 0.12 MB 0.25 MB 0.25 MB
    1,000,000 2 MB 2 MB 4 MB
    10,000,000 16 MB 32 MB 32 MB
    100,000,000 128 MB 256 MB 256 MB
    1,000,000,000 2 GB 2 GB 4 GB
  • The size of a key cannot be decreased because elements can only be added to a key, not removed. To prevent capacity issues such as an out of memory (OOM) error, we recommend that you implement the following solutions:
    • Split business data. You can split business data to prevent large keys that affect query performance. If large keys exist, most requests are made to the instance that contains these keys. This may cause hotkeys or even skewed requests.

      You can distribute the split data to multiple keys. If your business data is stored in an ApsaraDB for Redis cluster instance, you can distribute the split data to multiple nodes in the instance to ensure that large keys and hotkeys do not occur.

    • Regularly rebuild keys. If possible, you can run the DEL command to delete data from a key, and then insert data from backend databases into the key to manage the key size.

      You can also create multiple keys to interchangeably use them. This way, the size of a single key is kept appropriate. The benefit of this solution is that you need to create keys only once. However, multiple keys must be created and some memory may be wasted.

Supported commands

Table 1. TairBloom commands
Command Syntax Description
BF.RESERVE BF.RESERVE key error_rate capacity

Creates an empty TairBloom key that has a specified capacity and false positive rate. The capacity parameter specifies the capacity of the key and the error_rate parameter specifies the false positive rate of the key.

BF.ADD BF.ADD key item

Adds an element to a TairBloom key.

BF.MADD BF.MADD key item [item ...]

Adds multiple elements to a TairBloom key.

BF.EXISTS BF.EXISTS key item

Checks whether an element exists in a TairBloom key.

BF.MEXISTS BF.MEXISTS key item [item ...]

Checks whether multiple elements exist in a TairBloom key.

BF.INSERT BF.INSERT key [CAPACITY cap] [ERROR error] [NOCREATE] ITEMS item [item ...]

Adds multiple elements to a TairBloom key. If the key does not exist, you can specify whether to create the key. You can also specify the capacity and false positive rate of the key.

DEL DEL key [key ...] Deletes one or more TairBloom keys.
Note The elements that are already added to a key cannot be deleted. You can run the DEL command to delete all data from a key.
Note The following section describes command syntax used in this topic:
  • Uppercase keyword: the command keyword.
  • Italic: Words in italic indicate variable information that you supply.
  • [options]: optional parameters. Parameters that are not included in brackets are required.
  • AB: specifies that these parameters are mutually exclusive. Select one of two or more parameters.
  • ...: specifies to repeat the preceding content.

BF.RESERVE

Item Description
Syntax BF.RESERVE key error_rate capacity
Time complexity O(1)
Command description

Creates an empty TairBloom key that has a specified capacity and false positive rate. The capacity parameter specifies the capacity of the key and the error_rate parameter specifies the false positive rate of the key.

Parameter
  • Key: the name of the key that you want to manage by running this command.
  • error_rate: the expected false positive rate. The value of this parameter must be between 0 and 1. A lower value indicates higher accuracy, memory usage, and CPU utilization of the key.
  • capacity: the initial capacity of the key. This parameter specifies the maximum number of elements that can be added to the key.

    When the number of elements added to the key exceeds the capacity value, a Bloom filter layer is added for the key. This process may deteriorate the query performance of the key. Every time a Bloom filter layer is added, the key capacity increases. In this case, if you perform a query on the key, multiple layers of Bloom filters may need to be traversed. If your workloads require high performance, we recommend that you add elements to a key based on your business requirements to prevent automatic scaling.

Output
  • If the operation is successful, OK is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.RESERVE BFKEY 0.01 100

Sample output:

OK

BF.ADD

Item Description
Syntax BF.ADD key item
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds an element to a TairBloom key.

Note If the key does not exist, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01.
Parameter
  • Key: the name of the key that you want to manage by running this command.
  • item: the element that you want to add to the key.
Output
  • If the element does not exist, the element is added and a value of 1 is returned.
  • If the element may already exist, the element is not added or updated and a value of 0 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.ADD BFKEY item1

Sample output:

(integer) 1

BF.MADD

Item Description
Syntax BF.MADD key item [item ...]
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds multiple elements to a TairBloom key.

Note If the key does not exist, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01.
Parameter
  • Key: the name of the key that you want to manage by running this command.
  • item: the element that you want to add to the key. Multiple elements can be specified.
Output
  • If the element does not exist, the element is added and a value of 1 is returned.
  • If the element may already exist, the element is not added or updated and a value of 0 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.MADD BFKEY item1 item2 item3

Sample output:

(integer) 1
(integer) 1
(integer) 1

BF.EXISTS

Item Description
Syntax BF.EXISTS key item
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Checks whether an element exists in a TairBloom key.

Parameter
  • Key: the name of the key that you want to manage by running this command.
  • item: the element that you want to query.
Output
  • If the element does not exist, a value of 0 is returned.
  • If the element may exist, a value of 1 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.EXISTS BFKEY item1

Sample output:

(integer) 1

BF.MEXISTS

Item Description
Syntax BF.MEXISTS key item [item ...]
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Checks whether multiple elements exist in a TairBloom key.

Parameter
  • Key: the name of the key that you want to manage by running this command.
  • item: the element that you want to query. Multiple elements can be specified.
Output
  • If the element does not exist, a value of 0 is returned.
  • If the element may exist, a value of 1 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.MEXISTS BFKEY item1 item5

Sample output:

(integer) 1
(integer) 0

BF.INSERT

Item Description
Syntax BF.INSERT key [CAPACITY cap] [ERROR error] [NOCREATE] ITEMS item [item ...]
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds multiple elements to a TairBloom key. If the key does not exist, you can specify whether to create the key. You can also specify the capacity and false positive rate of the key.

Parameter
  • Key: the name of the key that you want to manage by running this command.
  • capacity: the initial capacity of the key. This parameter specifies the maximum number of elements that can be added to the key. If the key exists, this parameter is ignored.

    If the number of elements that are added to a key exceeds the capacity value, another Bloom filter layer is automatically added for the key. During the scaling, the number of elements in the key increases and the performance decreases. To query a specific element after a layer is added to the key, multiple layers may be traversed. The capacity of each new layer is twice that of the previous layer. If your workloads require high performance, we recommend that you add elements to a key based on your business requirements to prevent automatic scaling.

  • error_rate: the expected false positive rate. The value of this parameter must be between 0 and 1. A lower value indicates higher accuracy, memory usage, and CPU utilization of the key.
  • NOCREATE: specifies that the key is not automatically created if the key does not exist. This parameter cannot be specified together with the capacity or error_rate parameter.
  • item: the element that you want to add. Multiple elements can be specified.
Output
  • If the element does not exist, the element is added and a value of 1 is returned.
  • If the element may already exist, the element is not added or updated and a value of 0 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.INSERT bfkey1 CAPACITY 10000 ERROR 0.001 ITEMS item1 item2 item3

Sample output:

(integer) 1
(integer) 1
(integer) 1