TairBloom is a Bloom filter that supports dynamic scaling. This topic describes the commands that are supported by the TairBloom data structure.

Introduction to TairBloom

TairBloom is a space-efficient probabilistic data structure that consumes minimal memory to check whether an element exists. It features dynamic scalability while maintaining a stable false positive rate during scaling.

You can use bitmaps on Redis data structures, such as hashes, sets, and strings, to implement similar features of TairBloom. However, these data structures may consume a large amount of memory or fail to maintain a stable false positive rate during dynamic scaling. For this reason, TairBloom is especially suitable for checking whether large amounts of data exist and a specific false positive rate is allowed in this situation. You can use the built-in Bloom filters of TairBloom without further encapsulation or the need to create extra Bloom filters on your on-premises machine.

Main features
  • Minimum consumption of memory.
  • Dynamic scaling.
  • Stable custom false positive rate during scaling.
Typical scenarios
TairBloom can be used for recommendation and crawler systems in industries such as live-streaming, music, and e-commerce.
  • Recommendation system: TairBloom records the IDs of articles that may have been recommended, queries these articles, determines duplicate articles, and then recommends new articles that users might be interested in.
  • Crawler system: TairBloom filters out the URLs that have been crawled to improve productivity.
Best practices
TairBloom records the IDs of articles that may have been recommended, queries these articles, determines duplicate articles, and then recommends new articles that users might be interested in. Pseudocode:
void recommendedSystem(userid) {
    while (true) {
        // Obtain a random article ID or a desired article ID. 
        docid = getDocByRandom()
        if (bf.exists(userid, docid)) {
            // If the article may have been recommended to a user, the next article is recommended. 
            continue;
        } else {
            // If the article has not been recommended to the user, the article is recommended. 
            sendRecommendMsg(docid);
            // If the article may have been recommended, the article ID is recorded in a Bloom filter. 
            bf.add(userid, docid);
            break;
        }
    }
}
TairBloom filters out the URLs that may have been crawled to improve productivity. Pseudocode:
bool crawlerSystem( ) {
    while (true) {
        // Obtain a URL that you want to crawl. 
        url = getURLFromQueue()
        if (bf.exists(url_bloom, url)) {
            // If the URL may have been crawled, the URL is skipped. 
            continue;
        } else {
            // Crawl this URL. 
            doDownload(url)
            // Add this URL to a Bloom filter. 
            bf.add(url_bloom, url);
        }
    }
}

How it works

TairBloom is an implementation of scalable Bloom filters (SBFs). It features dynamic scalability while maintaining a stable false positive rate. SBFs are optimized Bloom filters. The following section describes the basic principles of Bloom filters and SBFs.

  • Bloom Filter

    A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

    A new Bloom filter is a bit array of m bits, all of which are set to 0. The Bloom filter also includes a set of k different hash functions to generate a uniform random distribution. k is a constant that is smaller than m. When you add elements to the Bloom filter, the k hash functions map these elements to k bits in the bit array and set the k bits to a value of 1. In this case, a bit can be shared by multiple pieces of data. The following figure shows how to insert x1 and x2 into a Bloom filter that includes 3 hash functions. Insert x1 and x2 into a Bloom filter
    When you query an element in a Bloom filter, you can use k hash functions to obtain k bits. If all of k bits have a value of 1, the element exists in the Bloom filter. Otherwise, the element does not exist. The following figure shows how to query y1 and y2 in the Bloom filter. Query y1 and y2 in a Bloom filter
    The preceding figure demonstrates that although y2 was never inserted into the Bloom filter, y2 is determined to be existent in the Bloom filter. This scenario shows that Bloom filters have a false positive rate. The preceding analysis indicates that Bloom filters have the following features:
    • A bit can be shared by multiple pieces of data.
    • Bloom filters have a false positive rate and the rate gets higher as the number of elements in a Bloom filter increases. However, Bloom filters do not have a false negative rate. If an element exists in a Bloom filter, the element is always determined to be existent.
    • An element can be added to a Bloom filter but cannot be removed from the Bloom filter. The reason is that bits can be shared. If you remove an element from the Bloom filter, other elements in the Bloom filter are affected.
  • Scalable Bloom Filter

    When an increasing number of elements are added to a Bloom filter, the false positive rate becomes higher. If you want to maintain a stable false positive rate, you must accordingly increase the size of the Bloom filter. However, the size cannot be increased due to structural limits. SBFs emerged in response to this issue. SBFs are a new type of Bloom filter that combines multiple Bloom filters into one.

    The following figure shows the basic model of a SBF. This SBF has two layers: BF0 and BF1. At first, the SBF contains only the BF0 layer. If you insert elements a, b, and c into this SBF and this BF0 layer is not large enough to maintain a specified false positive rate, the BF1 layer is created to increase the SBF size. Then, elements d, e, and f are inserted into the BF1 layer. A new BF2 layer is created if the BF1 layer also cannot maintain the specified positive rate. Model of SBFs

    SBFs insert data only into the last layer and query data starting from the last layer to the BF0 layer. For more information, see Scalable Bloom Filters.

Prerequisites

  • The instance is a performance-enhanced instance of the ApsaraDB for Redis Enhanced Edition (Tair). For more information, see Performance-enhanced instances.
  • The instance is updated to the latest minor version. For more information, see Update the minor version.
    Note If your instance is a cluster instance or read/write splitting instance, your proxy nodes must also be of the latest minor version to ensure that all commands can be executed as expected. For more information about cluster instances and read/write splitting instances, see Cluster master-replica instances and Read/write splitting instances.

Precautions

  • The TairBloom data that you want to manage is stored on a performance-enhanced instance.
  • The initial capacity and false positive rate that meet your requirements must be calculated in advance. To create a TairBloom key that has a capacity far more than 100 elements, run the BF.RESERVE command instead of the BF.ADD command.

    The following section demonstrates the difference between the BF.ADD command and the BF.RESERVE command:

    • BF.ADD (also known as BF.MADD): If a TairBloom key does not exist when this command is run, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01. If you require a key capacity that is far greater than 100, you can only scale up the key to support more elements.

      When a key is scaled up, more layers of Bloom filters are added to it. Every time a Bloom filter layer is added, the key capacity exponentially increases. In this case, if you perform a query on the key, multiple layers of Bloom filters need to be traversed and the performance of the key significantly deteriorates.

    • BF.RESERVE (also known as BF.INSERT): An initial capacity is specified when this command is run. This command specifies an initial capacity at the first layer of a key. If a key contains a smaller number of Bloom filter layers, the query speed of the key is faster.
    Note For example, assume that you want to insert 10,000,000 elements into a key and allows an false positive rate of 0.01. The created key is 176 MB in size if you use the BF.ADD command, or 16 MB in size if you use the BF.RESERVE command.

    Although a key can be scaled up, we recommend that you do not frequently use key scale-ups and consider them only as a safety measure. If the actual capacity exceeds the provisioned capacity of a key, you can scale up the key to make sure that write operations can be performed on the key and prevent live accidents.

    The following table lists a variety of key sizes supported by the BF.RESERVE command and their matched initial capacities and false positive rates.

    Capacity (number of elements) false positive:0.01 false positive:0.001 false positive:0.0001
    100,000 0.12 MB 0.25 MB 0.25 MB
    1,000,000 2 MB 2 MB 4 MB
    10,000,000 16 MB 32 MB 32 MB
    100,000,000 128 MB 256 MB 256 MB
    1,000,000,000 2 GB 2 GB 4 GB
  • The size of a key cannot be decreased because elements can only be added to a key, not removed. To prevent capacity issues such as an out of memory (OOM) error, we recommend that you implement the following solutions:
    • Split business data. You can split business data to prevent large keys that affect query performance. If large keys exist, most requests are made to the instance that contains these keys. This may cause hotkeys or even skewed requests.

      You can distribute the split data to multiple keys. If your business data is stored in an ApsaraDB for Redis cluster instance, you can distribute the split data to multiple nodes in the instance to ensure that large keys and hotkeys do not occur.

    • Regularly rebuild keys. If possible, you can run the DEL command to delete data from a key, and then insert data from backend databases into the key to manage the key size.

      You can also create multiple keys to interchangeably use them. This way, the size of a single key is kept appropriate. The benefit of this solution is that you need only to create keys for once. However, multiple keys must be created and some memory may be wasted.

Supported commands

Table 1. TairBloom commands
Parameter Syntax Description
BF.RESERVE BF.RESERVE <key> <error_rate> <capacity>

Creates an empty TairBloom key that has a specified capacity and false positive rate. The capacity parameter specifies the capacity of the key and the error_rate parameter specifies the false positive rate of the key.

BF.ADD BF.ADD <key> <item>

Adds an element to a TairBloom key.

BF.MADD BF.MADD <key> <item> [item...]

Adds multiple elements to a TairBloom key.

BF.EXISTS BF.EXISTS <key> <item>

Checks whether an element exists in a TairBloom key.

BF.MEXISTS BF.MEXISTS <key> <item> [item...]

Checks whether multiple elements exist in a TairBloom key.

BF.INSERT BF.INSERT <key> [CAPACITY cap] [ERROR error] [NOCREATE] ITEMS <item...>

Adds multiple elements to a TairBloom key. If the key does not exist, you can specify whether to create the key. You can also specify the capacity and false positive rate of the key.

DEL DEL <key> [key ...] Deletes one or more TairBloom keys.
Note The elements that are already added to a key cannot be deleted. You can run the DEL command to delete all data from a key.

BF.RESERVE

Item Description
Syntax BF.RESERVE <key> <error_rate> <capacity>
Time complexity O(1)
Command description

Creates an empty TairBloom key that has a specified capacity and false positive rate. The capacity parameter specifies the capacity of the key and the error_rate parameter specifies the false positive rate of the key.

Parameter
  • Key: the name of the key that you want to manage by running this command.
  • error_rate: the expected false positive rate. The value of this parameter must be between 0 and 1. A lower value indicates higher accuracy, memory usage, and CPU utilization of the key.
  • capacity: the initial capacity of the key. This parameter specifies the maximum number of elements that can be added to the key.

    When the number of elements added to the key exceeds the capacity value, a Bloom filter layer is added for the key. This process may deteriorate key performance. Every time a Bloom filter layer is added, the key capacity exponentially increases. In this case, if you perform a query on the key, multiple layers of Bloom filters may need to be traversed. If your workloads require high performance, we recommend that you add elements to a key based on your business requirements to prevent automatic scaling.

Output
  • If the operation is successful, OK is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.RESERVE BFKEY 0.01 100

Sample output:

OK

BF.ADD

Item Description
Syntax BF.ADD <key> <item>
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds an element to a TairBloom key.

Note If the key does not exist, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01.
Parameter
  • Key: the name of the key that you want to manage by running this command.
  • item: the element that you want to add to the key.
Output
  • If the element does not already exist, the element is added and a value of 1 is returned.
  • If the element may already exist, the element is not added or updated and a value of 0 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.ADD BFKEY item1

Sample output:

(integer) 1

BF.MADD

Item Description
Syntax BF.MADD <key> <item> [item...]
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds multiple elements to a TairBloom key.

Note If the key does not exist, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01.
Parameter
  • Key: the name of the key that you want to manage by running this command.
  • item: the element that you want to add to the key. Multiple elements can be specified.
Output
  • If the element does not already exist, the element is added and a value of 1 is returned.
  • If the element may already exist, the element is not added or updated and a value of 0 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.MADD BFKEY item1 item2 item3

Sample output:

(integer) 1
(integer) 1
(integer) 1

BF.EXISTS

Item Description
Syntax BF.EXISTS <key> <item>
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Checks whether an element exists in a TairBloom key.

Parameter
  • Key: the name of the key that you want to manage by running this command.
  • item: the element that you want to query.
Output
  • If the element does not exist, a value of 0 is returned.
  • If the element may exist, a value of 1 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.EXISTS BFKEY item1

Sample output:

(integer) 1

BF.MEXISTS

Item Description
Syntax BF.MEXISTS <key> <item> [item...]
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Checks whether multiple elements exist in a TairBloom key.

Parameter
  • Key: the name of the key that you want to manage by running this command.
  • item: the element that you want to query. Multiple elements can be specified.
Output
  • If the element does not exist, a value of 0 is returned.
  • If the element may exist, a value of 1 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.MEXISTS BFKEY item1 item5

Sample output:

(integer) 1
(integer) 0

BF.INSERT

Item Description
Syntax BF.INSERT <key> [CAPACITY cap] [ERROR error] [NOCREATE] ITEMS <item...>
Time complexity

O(log N), where N specifies the number of Bloom filter layers.

Command description

Adds multiple elements to a TairBloom key. If the key does not exist, you can specify whether to create the key. You can also specify the capacity and false positive rate of the key.

Parameter
  • Key: the name of the key that you want to manage by running this command.
  • capacity: the initial capacity of the key. This parameter specifies the maximum number of elements that can be added to the key. If the key already exists, this parameter is ignored.

    If the number of elements that are added to a key exceeds the capacity value, another Bloom filter layer is automatically added for the key. During the scaling, the number of elements in the key exponentially increases and the performance linearly decreases. To query a specific element after a layer is added to the key, multiple layers may be traversed. The capacity of each new layer is twice that of the previous layer. If your workloads require high performance, we recommend that you add elements to a key based on your business requirements to prevent automatic scaling.

  • error_rate: the expected false positive rate. The value of this parameter must be between 0 and 1. A lower value indicates higher accuracy, memory usage, and CPU utilization of the key.
  • NOCREATE: specifies that the key is not automatically created if the key does not exist. This parameter cannot be specified together with the capacity or error_rate parameter.
  • item: the element that you want to query. Multiple elements can be specified.
Output
  • If the element does not already exist, the element is added and a value of 1 is returned.
  • If the element may already exist, the element is not added or updated and a value of 0 is returned.
  • Otherwise, an error message is returned.
Example

Sample command:

BF.INSERT bfkey1 CAPACITY 10000 ERROR 0.001 ITEMS item1 item2 item3

Sample output:

(integer) 1
(integer) 1
(integer) 1