You can use the smlar extension in a database to calculate data similarity. The smlar extension also provides similarity operators that support GiST and GIN indexes.
Prerequisites
The extension is supported on the PolarDB for PostgreSQL clusters that run the following engine:- PostgreSQL 14 (revision version 14.5.1.0 or later)
- PostgreSQL 11 (revision version 1.1.28 or later)
Note You can execute one of the following statements to view the revision version of a PolarDB for PostgreSQL cluster:
- PostgreSQL 14
select version();
- PostgreSQL 11
show polar_version;
Overview
smlar is an open-source third-party extension for PolarDB for PostgreSQL. It provides functions that can calculate data similarity in databases and similarity operators that support GiST and GIN indexes. The smlar extension supports all built-in data types of PostgreSQL.
Usage
- Install the smlar extension.
CREATE EXTENSION smlar;
- Execute the following statement to calculate the similarity of arrays.
SELECT smlar('{3,2}'::int[], '{3,2,1}'); smlar ---------- 0.816497 (1 row) SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)' ); smlar ---------- 0.333333 (1 row)
Note For more information about other functions, see Functions and operators. - Remove the smlar extension.
DROP EXTENSION smlar;
Functions and operators
Function and operator | Description |
---|---|
float4 smlar(anyarray, anyarray) | Calculates the similarity of two arrays of the same data type. The arrays must use the same data type. |
float4 smlar(anyarray, anyarray, bool useIntersect) | Calculates the similarity of two arrays of custom composite types (elements and weights). The useIntersect parameter specifies whether only overlapping elements or all elements are involved in the operation. Note Execute the following statement to define composite types:
|
float4 smlar(anyarray a, anyarray b, text formula) | Calculates the similarity between two arrays of the same data type. The arrays are specified by the formula parameter. The predefined variables for the formula parameter:
|
anyarray % anyarray | Returns TRUE if the similarity of the two arrays exceeds the threshold. Otherwise, returns FALSE. |
text[] tsvector2textarray(tsvector) | Converts an array of the tsvector type to a string. |
anyarray array_unique(anyarray) | Sorts and deduplicates the array. |
float4 inarray(anyarray, anyelement) | Returns 1.0 if the element exists in the array. Otherwise, returns 0. |
float4 inarray(anyarray, anyelement, float4, float4) | Returns the third parameter if the element exists in the array. Otherwise, returns the fourth parameter. |
Parameters
Parameter | Description |
---|---|
smlar.threshold FLOAT | The similarity threshold. It is used to determine whether two arrays combined with the percent sign (%) operator are similar. |
smlar.persistent_cache BOOL | Specifies whether to store the cached global statistics in transaction-independent memory. |
smlar.type STRING | The formula used to calculate the similarity. The following similarity types are available: consine (default), tfidf, and overlap. |
smlar.stattable STRING | The name of the table that stores collection range statistics. You can execute the following statement to create a table:
|
smlar.tf_method STRING | The method used to calculate the term frequency (TF). Default value: n. Valid values:
|
smlar.idf_plus_one BOOL | The method used to calculate the inverse document frequency (IDF). Default value: FALSE. Valid values:
|
References
For more information about the smlar extension, see: