You can use the smlar extension in a database to calculate data similarity. The smlar extension also provides similarity operators that support GiST and GIN indexes.

Prerequisites

The extension is supported on the PolarDB for PostgreSQL clusters that run the following engine:
  • PostgreSQL 14 (revision version 14.5.1.0 or later)
  • PostgreSQL 11 (revision version 1.1.28 or later)
Note You can execute one of the following statements to view the revision version of a PolarDB for PostgreSQL cluster:
  • PostgreSQL 14
    select version();
  • PostgreSQL 11
    show polar_version;

Overview

smlar is an open-source third-party extension for PolarDB for PostgreSQL. It provides functions that can calculate data similarity in databases and similarity operators that support GiST and GIN indexes. The smlar extension supports all built-in data types of PostgreSQL.

Usage

  1. Install the smlar extension.
    CREATE EXTENSION smlar;
  2. Execute the following statement to calculate the similarity of arrays.
    SELECT smlar('{3,2}'::int[], '{3,2,1}');
      smlar
    ----------
     0.816497
    (1 row)
    
    SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)' );
      smlar
    ----------
     0.333333
    (1 row)
    Note For more information about other functions, see Functions and operators.
  3. Remove the smlar extension.
    DROP EXTENSION smlar;

Functions and operators

Function and operatorDescription
float4 smlar(anyarray, anyarray)Calculates the similarity of two arrays of the same data type. The arrays must use the same data type.
float4 smlar(anyarray, anyarray, bool useIntersect)Calculates the similarity of two arrays of custom composite types (elements and weights). The useIntersect parameter specifies whether only overlapping elements or all elements are involved in the operation.
Note Execute the following statement to define composite types:
CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);
float4 smlar(anyarray a, anyarray b, text formula)Calculates the similarity between two arrays of the same data type. The arrays are specified by the formula parameter. The predefined variables for the formula parameter:
  • N.i: the number of identical elements in two arrays (intersection).
  • N.a: the number of unique elements in the first array.
  • N.b: the number of unique elements in the second array.
anyarray % anyarrayReturns TRUE if the similarity of the two arrays exceeds the threshold. Otherwise, returns FALSE.
text[] tsvector2textarray(tsvector)Converts an array of the tsvector type to a string.
anyarray array_unique(anyarray)Sorts and deduplicates the array.
float4 inarray(anyarray, anyelement)Returns 1.0 if the element exists in the array. Otherwise, returns 0.
float4 inarray(anyarray, anyelement, float4, float4)Returns the third parameter if the element exists in the array. Otherwise, returns the fourth parameter.

Parameters

ParameterDescription
smlar.threshold FLOATThe similarity threshold. It is used to determine whether two arrays combined with the percent sign (%) operator are similar.
smlar.persistent_cache BOOLSpecifies whether to store the cached global statistics in transaction-independent memory.
smlar.type STRINGThe formula used to calculate the similarity. The following similarity types are available: consine (default), tfidf, and overlap.
smlar.stattable STRINGThe name of the table that stores collection range statistics. You can execute the following statement to create a table:
CREATE TABLE table_name (
  value   data_type UNIQUE,
  ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)
);
smlar.tf_method STRINGThe method used to calculate the term frequency (TF). Default value: n. Valid values:
  • n: simple counting
  • log: 1 + log(n)
  • const: 1
smlar.idf_plus_one BOOLThe method used to calculate the inverse document frequency (IDF). Default value: FALSE. Valid values:
  • FALSE :log(d/df)
  • TRUE :log(1 + d/df).

References

For more information about the smlar extension, see: