pg_similarity - PolarDB - Alibaba Cloud Documentation Center

pg_similarity is a third-party extension supported by PolarDB for PostgreSQL. You can use this extension for similarity distance calculation.

Prerequisites

The feature is supported on PolarDB for PostgreSQL clusters that run the following engine:

PostgreSQL 14 (revision version 14.10.18.0 or later)

Note

You can execute the following statement to query the revision version of your PolarDB for PostgreSQL cluster:

SELECT version();

Overview

The pg_similarity extension is used for similarity distance calculation. In addition to the traditional operators including =and <>, you can use ~~~ and! that are defined by pg_similarity for similarity queries. Each of these two operators represents a similarity function. pg_similarity contains three main components:

Functions: a set of functions that implements similarity algorithms that are available in the literature. These functions can be used as UDFs and will be the base for implementing the similarity operators.
Operators: a set of operators that are defined at the top of similarity functions. They use similarity functions to obtain the similarity threshold and compare the value to a user-defined threshold to determine whether it is a match.
Session variables: a set of variables that store similarity function parameters. These variables can be defined at run time.

Functions and operators

pg_similarity supports many well-known similarity algorithms. Each algorithm is suitable for a specific domain. The following algorithms are provided:

L1 Distance (as known as City Block or Manhattan Distance);
Cosine Distance;
Dice Coefficient;
Euclidean Distance;
Hamming Distance;
Jaccard Coefficient;
Jaro Distance;
Jaro-Winkler Distance;
Levenshtein Distance;
Matching Coefficient;
Monge-Elkan Coefficient;
Needleman-Wunsch Coefficient;
Overlap Coefficient;
Q-Gram Distance;
Smith-Waterman Coefficient;
Smith-Waterman-Gotoh Coefficient;
Soundex Distance.

Algorithms	Function definition	Operator	Whether to use Index	Parameters
L1 Distance	block(text, text) returns float8	~++	yes	pg_similarity.block_tokenizer (enum) pg_similarity.block_threshold (float8) pg_similarity.block_is_normalized (bool)
Cosine Distance	cosine(text, text) returns float8	~##	yes	pg_similarity.cosine_tokenizer (enum) pg_similarity.cosine_threshold (float8) pg_similarity.cosine_is_normalized (bool)
Dice Coefficient	dice(text, text) returns float8	~-~	yes	pg_similarity.dice_tokenizer (enum) pg_similarity.dice_threshold (float8) pg_similarity.dice_is_normalized (bool)
Euclidean Distance	euclidean(text, text) returns float8	~!!	yes	pg_similarity.euclidean_tokenizer (enum) pg_similarity.euclidean_threshold (float8) pg_similarity.euclidean_is_normalized (bool)
Hamming Distance	hamming(bit varying, bit varying) returns float8 hamming_text(text, text) returns float8	~@~	no	pg_similarity.hamming_threshold (float8) pg_similarity.hamming_is_normalized (bool)
Jaccard Coefficient	jaccard(text, text) returns float8	~??	yes	pg_similarity.jaccard_tokenizer (enum) pg_similarity.jaccard_threshold (float8) pg_similarity.jaccard_is_normalized (bool)
Jaro Distance	jaro(text, text) returns float8	~%%	no	pg_similarity.jaro_threshold (float8) pg_similarity.jaro_is_normalized (bool)
Jaro-Winkler Distance	jarowinkler(text, text) returns float8	~@@	no	pg_similarity.jarowinkler_threshold (float8) pg_similarity.jarowinkler_is_normalized (bool)
Levenshtein Distance	lev(text, text) returns float8	~==	no	pg_similarity.levenshtein_threshold (float8) pg_similarity.levenshtein_is_normalized (bool)
Matching Coefficient	matchingcoefficient(text, text) returns float8	~^^	yes	pg_similarity.matching_tokenizer (enum) pg_similarity.matching_threshold (float8) pg_similarity.matching_is_normalized (bool)
Monge-Elkan Coefficient	mongeelkan(text, text) returns float8	~\|\|	no	pg_similarity.mongeelkan_tokenizer (enum) pg_similarity.mongeelkan_threshold (float8) pg_similarity.mongeelkan_is_normalized (bool)
Needleman-Wunsch Coefficient	needlemanwunsch(text, text) returns float8	~#~	no	pg_similarity.nw_threshold (float8) pg_similarity.nw_is_normalized (bool)
Overlap Coefficient	overlapcoefficient(text, text) returns float8	~**	yes	pg_similarity.overlap_tokenizer (enum) pg_similarity.overlap_threshold (float8) pg_similarity.overlap_is_normalized (bool)
Q-Gram Distance	qgram(text, text) returns float8	~~~	yes	pg_similarity.qgram_threshold (float8) pg_similarity.qgram_is_normalized (bool)
Smith-Waterman Coefficient	smithwaterman(text, text) returns float8	~=~	no	pg_similarity.sw_threshold (float8) pg_similarity.sw_is_normalized (bool)
Smith-Waterman-Gotoh Coefficient	smithwatermangotoh(text, text) returns float8	~!~	no	pg_similarity.swg_threshold (float8) pg_similarity.swg_is_normalized (bool)
Soundex Distance	soundex(text, text) returns float8	~*~	no	-

Some parameters control the behavior of the pg_similarity functions and operators. These parameters can be classified into three categories: tokenizer, threshold, and normalized.

tokenizer: controls how the strings are tokenized. Valid values are alnum, gram, word, and camelcase. The default value is alnum. All tokens are lowercase. This option can be set at compile time. For more information, see PGS_IGNORE_CASE in the source code.
- alnum: Delimiters are any non-alphanumeric characters. This means that only alphabetic characters in the standard C locale and digits (0-9) are accepted in tokens. For example, the string "Euler_Taveira_de_Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", "22", "02", "2011".
- gram: An n-gram is a subsequence of length n. N-grams can be extracted from a string by using the sliding-by-one technique, which is sliding a window of length n throughout the string by one character. For example, the string "euler taveira" (using n = 3) is tokenized as "eul", "ule", "ler", "er ", "r t", " ta", "tav", "ave", "vei", "eir", and "ira". If you enable full n-gram, " e", " eu", "ra ", and "a " are added to the set of tokens. This option can be set at compile time. See PGS_FULL_NGRAM in the source code.
- word: Delimiters are white space characters including space, form-feed, newline, carriage return, horizontal tab, and vertical tab. For example, the string "Euler Taveira de Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", and "22/02/2011".
- camelcase: Delimiters are capitalized characters that are included as first token characters. For example, the string "EulerTaveira de Oliveira" is tokenized as "Euler", "Taveira de ", and "Oliveira".
Threshold: controls how flexible the match is. For each pair of strings, if the value that is calculated by using the corresponding similarity function is greater than or equal to the threshold value, it is considered a match. The valid value ranges from 0.0 to 1.0. The default value is 0.7.
normalized: controls whether the similarity coefficient or distance is normalized (between 0.0 and 1.0) or not. Normalized values are used automatically by operators to match strings, which means this parameter is applicable only if you are using similarity functions. The default value is true.

Usage

Create the extension

CREATE EXTENSION pg_similarity;

Configure parameters at run time

SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.7
(1 row)

SET pg_similarity.levenshtein_threshold TO 0.5;
SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.5
(1 row)

SET pg_similarity.cosine_tokenizer TO camelcase;

SET pg_similarity.euclidean_is_normalized TO false;

Create a test table

CREATE TABLE foo (a TEXT);
INSERT INTO foo VALUES
  ('Euler'),
  ('Oiler'),
  ('Euler Taveira de Oliveira'),
  ('Maria Taveira dos Santos'),
  ('Carlos Santos Silva');

CREATE TABLE bar (b TEXT);
INSERT INTO bar VALUES
  ('Euler T. de Oliveira'),
  ('Euller'),
  ('Oliveira, Euler Taveira'),
  ('Sr. Oliveira');

Use the jaro function

SELECT a, b, jaro(a, b) FROM foo, bar;
             a             |            b            |        jaro
---------------------------+-------------------------+--------------------
 Euler                     | Euler T. de Oliveira    |               0.75
 Euler                     | Euller                  | 0.9444444444444444
 Euler                     | Oliveira, Euler Taveira | 0.6057971014492753
 Euler                     | Sr. Oliveira            | 0.5055555555555555
 Oiler                     | Euler T. de Oliveira    | 0.4722222222222222
 Oiler                     | Euller                  |                0.7
 Oiler                     | Oliveira, Euler Taveira |  0.672463768115942
 Oiler                     | Sr. Oliveira            | 0.6722222222222223
 Euler Taveira de Oliveira | Euler T. de Oliveira    | 0.7980701754385964
 Euler Taveira de Oliveira | Euller                  | 0.6777777777777777
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.7731884057971014
 Euler Taveira de Oliveira | Sr. Oliveira            | 0.5922222222222222
 Maria Taveira dos Santos  | Euler T. de Oliveira    | 0.6023504273504273
 Maria Taveira dos Santos  | Euller                  | 0.3055555555555556
 Maria Taveira dos Santos  | Oliveira, Euler Taveira | 0.5350241545893719
 Maria Taveira dos Santos  | Sr. Oliveira            | 0.6342592592592593
 Carlos Santos Silva       | Euler T. de Oliveira    | 0.5421052631578946
 Carlos Santos Silva       | Euller                  | 0.3128654970760234
 Carlos Santos Silva       | Oliveira, Euler Taveira | 0.6066615814899567
 Carlos Santos Silva       | Sr. Oliveira            | 0.5077276524644945
(20 rows)

Use the levenshtein operator

SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.5
(1 row)

SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
             a             |          b           |        lev
---------------------------+----------------------+--------------------
 Euler                     | Euller               | 0.8333333333333334
 Oiler                     | Euller               |                0.5
 Euler Taveira de Oliveira | Euler T. de Oliveira |               0.76
(3 rows)

SET pg_similarity.levenshtein_threshold = 0.7;
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
             a             |          b           |        lev
---------------------------+----------------------+--------------------
 Euler                     | Euller               | 0.8333333333333334
 Euler Taveira de Oliveira | Euler T. de Oliveira |               0.76
(2 rows)

Results comparison

SELECT * FROM bar WHERE b ~@@ 'euler'; -- jaro-winkler operator
          b
----------------------
 Euler T. de Oliveira
 Euller
(2 rows)

SELECT * FROM bar WHERE b ~~~ 'euler'; -- qgram operator
 b
---
(0 rows)

SELECT * FROM bar WHERE b ~== 'euler'; -- levenshtein operator
   b
--------
 Euller
(1 row)

SELECT * FROM bar WHERE b ~## 'euler'; -- cosine operator
 b
---
(0 rows)