pg_similarity is a third-party extension supported by PolarDB for PostgreSQL. You can use this extension for similarity distance calculation.
Prerequisites
The feature is supported on PolarDB for PostgreSQL clusters that run the following engine:
PostgreSQL 14 (revision version 14.10.18.0 or later)
You can execute the following statement to query the revision version of your PolarDB for PostgreSQL cluster:
SELECT version();Overview
The pg_similarity extension is used for similarity distance calculation. In addition to the traditional operators including =and <>, you can use ~~~ and! that are defined by pg_similarity for similarity queries. Each of these two operators represents a similarity function. pg_similarity contains three main components:
Functions: a set of functions that implements similarity algorithms that are available in the literature. These functions can be used as UDFs and will be the base for implementing the similarity operators.
Operators: a set of operators that are defined at the top of similarity functions. They use similarity functions to obtain the similarity threshold and compare the value to a user-defined threshold to determine whether it is a match.
Session variables: a set of variables that store similarity function parameters. These variables can be defined at run time.
Functions and operators
pg_similarity supports many well-known similarity algorithms. Each algorithm is suitable for a specific domain. The following algorithms are provided:
L1 Distance (as known as City Block or Manhattan Distance);
Cosine Distance;
Dice Coefficient;
Euclidean Distance;
Hamming Distance;
Jaccard Coefficient;
Jaro Distance;
Jaro-Winkler Distance;
Levenshtein Distance;
Matching Coefficient;
Monge-Elkan Coefficient;
Needleman-Wunsch Coefficient;
Overlap Coefficient;
Q-Gram Distance;
Smith-Waterman Coefficient;
Smith-Waterman-Gotoh Coefficient;
Soundex Distance.
Algorithms | Function definition | Operator | Whether to use Index | Parameters |
L1 Distance | block(text, text) returns float8 | ~++ | yes | pg_similarity.block_tokenizer (enum) pg_similarity.block_threshold (float8) pg_similarity.block_is_normalized (bool) |
Cosine Distance | cosine(text, text) returns float8 | ~## | yes | pg_similarity.cosine_tokenizer (enum) pg_similarity.cosine_threshold (float8) pg_similarity.cosine_is_normalized (bool) |
Dice Coefficient | dice(text, text) returns float8 | ~-~ | yes | pg_similarity.dice_tokenizer (enum) pg_similarity.dice_threshold (float8) pg_similarity.dice_is_normalized (bool) |
Euclidean Distance | euclidean(text, text) returns float8 | ~!! | yes | pg_similarity.euclidean_tokenizer (enum) pg_similarity.euclidean_threshold (float8) pg_similarity.euclidean_is_normalized (bool) |
Hamming Distance | hamming(bit varying, bit varying) returns float8 hamming_text(text, text) returns float8 | ~@~ | no | pg_similarity.hamming_threshold (float8) pg_similarity.hamming_is_normalized (bool) |
Jaccard Coefficient | jaccard(text, text) returns float8 | ~?? | yes | pg_similarity.jaccard_tokenizer (enum) pg_similarity.jaccard_threshold (float8) pg_similarity.jaccard_is_normalized (bool) |
Jaro Distance | jaro(text, text) returns float8 | ~%% | no | pg_similarity.jaro_threshold (float8) pg_similarity.jaro_is_normalized (bool) |
Jaro-Winkler Distance | jarowinkler(text, text) returns float8 | ~@@ | no | pg_similarity.jarowinkler_threshold (float8) pg_similarity.jarowinkler_is_normalized (bool) |
Levenshtein Distance | lev(text, text) returns float8 | ~== | no | pg_similarity.levenshtein_threshold (float8) pg_similarity.levenshtein_is_normalized (bool) |
Matching Coefficient | matchingcoefficient(text, text) returns float8 | ~^^ | yes | pg_similarity.matching_tokenizer (enum) pg_similarity.matching_threshold (float8) pg_similarity.matching_is_normalized (bool) |
Monge-Elkan Coefficient | mongeelkan(text, text) returns float8 | ~|| | no | pg_similarity.mongeelkan_tokenizer (enum) pg_similarity.mongeelkan_threshold (float8) pg_similarity.mongeelkan_is_normalized (bool) |
Needleman-Wunsch Coefficient | needlemanwunsch(text, text) returns float8 | ~#~ | no | pg_similarity.nw_threshold (float8) pg_similarity.nw_is_normalized (bool) |
Overlap Coefficient | overlapcoefficient(text, text) returns float8 | ~** | yes | pg_similarity.overlap_tokenizer (enum) pg_similarity.overlap_threshold (float8) pg_similarity.overlap_is_normalized (bool) |
Q-Gram Distance | qgram(text, text) returns float8 | ~~~ | yes | pg_similarity.qgram_threshold (float8) pg_similarity.qgram_is_normalized (bool) |
Smith-Waterman Coefficient | smithwaterman(text, text) returns float8 | ~=~ | no | pg_similarity.sw_threshold (float8) pg_similarity.sw_is_normalized (bool) |
Smith-Waterman-Gotoh Coefficient | smithwatermangotoh(text, text) returns float8 | ~!~ | no | pg_similarity.swg_threshold (float8) pg_similarity.swg_is_normalized (bool) |
Soundex Distance | soundex(text, text) returns float8 | ~*~ | no | - |
Some parameters control the behavior of the pg_similarity functions and operators. These parameters can be classified into three categories: tokenizer, threshold, and normalized.
tokenizer: controls how the strings are tokenized. Valid values are
alnum,gram,word, andcamelcase. The default value isalnum. All tokens are lowercase. This option can be set at compile time. For more information, seePGS_IGNORE_CASEin the source code.alnum: Delimiters are any non-alphanumeric characters. This means that only alphabetic characters in the standard C locale and digits (0-9) are accepted in tokens. For example, the string "Euler_Taveira_de_Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", "22", "02", "2011".gram: An n-gram is a subsequence of length n. N-grams can be extracted from a string by using the sliding-by-one technique, which is sliding a window of length n throughout the string by one character. For example, the string "euler taveira" (using n = 3) is tokenized as "eul", "ule", "ler", "er ", "r t", " ta", "tav", "ave", "vei", "eir", and "ira". If you enable full n-gram, " e", " eu", "ra ", and "a " are added to the set of tokens. This option can be set at compile time. SeePGS_FULL_NGRAMin the source code.word: Delimiters are white space characters including space, form-feed, newline, carriage return, horizontal tab, and vertical tab. For example, the string "Euler Taveira de Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", and "22/02/2011".camelcase: Delimiters are capitalized characters that are included as first token characters. For example, the string "EulerTaveira de Oliveira" is tokenized as "Euler", "Taveira de ", and "Oliveira".
Threshold: controls how flexible the match is. For each pair of strings, if the value that is calculated by using the corresponding similarity function is greater than or equal to the threshold value, it is considered a match. The valid value ranges from 0.0 to 1.0. The default value is 0.7.
normalized: controls whether the similarity coefficient or distance is normalized (between 0.0 and 1.0) or not. Normalized values are used automatically by operators to match strings, which means this parameter is applicable only if you are using similarity functions. The default value is
true.
Usage
Create the extension
CREATE EXTENSION pg_similarity;Configure parameters at run time
SHOW pg_similarity.levenshtein_threshold;
pg_similarity.levenshtein_threshold
-------------------------------------
0.7
(1 row)
SET pg_similarity.levenshtein_threshold TO 0.5;
SHOW pg_similarity.levenshtein_threshold;
pg_similarity.levenshtein_threshold
-------------------------------------
0.5
(1 row)SET pg_similarity.cosine_tokenizer TO camelcase;SET pg_similarity.euclidean_is_normalized TO false;Create a test table
CREATE TABLE foo (a TEXT);
INSERT INTO foo VALUES
('Euler'),
('Oiler'),
('Euler Taveira de Oliveira'),
('Maria Taveira dos Santos'),
('Carlos Santos Silva');CREATE TABLE bar (b TEXT);
INSERT INTO bar VALUES
('Euler T. de Oliveira'),
('Euller'),
('Oliveira, Euler Taveira'),
('Sr. Oliveira');Use the jaro function
SELECT a, b, jaro(a, b) FROM foo, bar;
a | b | jaro
---------------------------+-------------------------+--------------------
Euler | Euler T. de Oliveira | 0.75
Euler | Euller | 0.9444444444444444
Euler | Oliveira, Euler Taveira | 0.6057971014492753
Euler | Sr. Oliveira | 0.5055555555555555
Oiler | Euler T. de Oliveira | 0.4722222222222222
Oiler | Euller | 0.7
Oiler | Oliveira, Euler Taveira | 0.672463768115942
Oiler | Sr. Oliveira | 0.6722222222222223
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.7980701754385964
Euler Taveira de Oliveira | Euller | 0.6777777777777777
Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.7731884057971014
Euler Taveira de Oliveira | Sr. Oliveira | 0.5922222222222222
Maria Taveira dos Santos | Euler T. de Oliveira | 0.6023504273504273
Maria Taveira dos Santos | Euller | 0.3055555555555556
Maria Taveira dos Santos | Oliveira, Euler Taveira | 0.5350241545893719
Maria Taveira dos Santos | Sr. Oliveira | 0.6342592592592593
Carlos Santos Silva | Euler T. de Oliveira | 0.5421052631578946
Carlos Santos Silva | Euller | 0.3128654970760234
Carlos Santos Silva | Oliveira, Euler Taveira | 0.6066615814899567
Carlos Santos Silva | Sr. Oliveira | 0.5077276524644945
(20 rows)Use the levenshtein operator
SHOW pg_similarity.levenshtein_threshold;
pg_similarity.levenshtein_threshold
-------------------------------------
0.5
(1 row)
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
a | b | lev
---------------------------+----------------------+--------------------
Euler | Euller | 0.8333333333333334
Oiler | Euller | 0.5
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76
(3 rows)SET pg_similarity.levenshtein_threshold = 0.7;
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
a | b | lev
---------------------------+----------------------+--------------------
Euler | Euller | 0.8333333333333334
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76
(2 rows)Results comparison
SELECT * FROM bar WHERE b ~@@ 'euler'; -- jaro-winkler operator
b
----------------------
Euler T. de Oliveira
Euller
(2 rows)SELECT * FROM bar WHERE b ~~~ 'euler'; -- qgram operator
b
---
(0 rows)SELECT * FROM bar WHERE b ~== 'euler'; -- levenshtein operator
b
--------
Euller
(1 row)SELECT * FROM bar WHERE b ~## 'euler'; -- cosine operator
b
---
(0 rows)