pg_similarity (similarity distance calculation) - PolarDB

pg_similarity is a third-party extension for PolarDB for PostgreSQL that adds string similarity distance calculation to SQL queries. Use it when you need to match records with typos, find near-duplicate names, or rank results by phonetic or edit-distance similarity.

Prerequisites

Before you begin, ensure that you have:

A PolarDB for PostgreSQL cluster running PostgreSQL 14 (revision version 14.10.18.0 or later)

To check your revision version, run:

SELECT version();

How it works

pg_similarity has three components:

Functions: implement 17 similarity algorithms as user-defined functions (UDFs). Call them directly in SELECT statements to get a similarity score.
Operators: wrap the functions and compare the score against a threshold. Use them in WHERE clauses to filter matching rows. In addition to the traditional operators = and <>, pg_similarity defines ~~~ and ! for similarity queries — each of these operators represents a similarity function.
Session variables: control function behavior at runtime — tokenization strategy, match threshold, and normalization.

Each algorithm has its own operator (for example, ~== for Levenshtein and ~@@ for Jaro-Winkler). All operators trigger the underlying similarity function and return rows where the score meets the configured threshold.

Supported algorithms

pg_similarity implements 17 algorithms. Each suits a different matching scenario:

Algorithm	Function	Operator	Index support	Parameters
L1 Distance (City Block / Manhattan Distance)	`block(text, text) returns float8`	`~++`	Yes	`pg_similarity.block_tokenizer`, `pg_similarity.block_threshold`, `pg_similarity.block_is_normalized`
Cosine Distance	`cosine(text, text) returns float8`	`~##`	Yes	`pg_similarity.cosine_tokenizer`, `pg_similarity.cosine_threshold`, `pg_similarity.cosine_is_normalized`
Dice Coefficient	`dice(text, text) returns float8`	`~-~`	Yes	`pg_similarity.dice_tokenizer`, `pg_similarity.dice_threshold`, `pg_similarity.dice_is_normalized`
Euclidean Distance	`euclidean(text, text) returns float8`	`~!!`	Yes	`pg_similarity.euclidean_tokenizer`, `pg_similarity.euclidean_threshold`, `pg_similarity.euclidean_is_normalized`
Hamming Distance	`hamming(bit varying, bit varying) returns float8` / `hamming_text(text, text) returns float8`	`~@~`	No	`pg_similarity.hamming_threshold`, `pg_similarity.hamming_is_normalized`
Jaccard Coefficient	`jaccard(text, text) returns float8`	`~??`	Yes	`pg_similarity.jaccard_tokenizer`, `pg_similarity.jaccard_threshold`, `pg_similarity.jaccard_is_normalized`
Jaro Distance	`jaro(text, text) returns float8`	`~%%`	No	`pg_similarity.jaro_threshold`, `pg_similarity.jaro_is_normalized`
Jaro-Winkler Distance	`jarowinkler(text, text) returns float8`	`~@@`	No	`pg_similarity.jarowinkler_threshold`, `pg_similarity.jarowinkler_is_normalized`
Levenshtein Distance	`lev(text, text) returns float8`	`~==`	No	`pg_similarity.levenshtein_threshold`, `pg_similarity.levenshtein_is_normalized`
Matching Coefficient	`matchingcoefficient(text, text) returns float8`	`~^^`	Yes	`pg_similarity.matching_tokenizer`, `pg_similarity.matching_threshold`, `pg_similarity.matching_is_normalized`
Monge-Elkan Coefficient	`mongeelkan(text, text) returns float8`	`~\|\|`	No	`pg_similarity.mongeelkan_tokenizer`, `pg_similarity.mongeelkan_threshold`, `pg_similarity.mongeelkan_is_normalized`
Needleman-Wunsch Coefficient	`needlemanwunsch(text, text) returns float8`	`~#~`	No	`pg_similarity.nw_threshold`, `pg_similarity.nw_is_normalized`
Overlap Coefficient	`overlapcoefficient(text, text) returns float8`	`~**`	Yes	`pg_similarity.overlap_tokenizer`, `pg_similarity.overlap_threshold`, `pg_similarity.overlap_is_normalized`
Q-Gram Distance	`qgram(text, text) returns float8`	`~~~`	Yes	`pg_similarity.qgram_threshold`, `pg_similarity.qgram_is_normalized`
Smith-Waterman Coefficient	`smithwaterman(text, text) returns float8`	`~=~`	No	`pg_similarity.sw_threshold`, `pg_similarity.sw_is_normalized`
Smith-Waterman-Gotoh Coefficient	`smithwatermangotoh(text, text) returns float8`	`~!~`	No	`pg_similarity.swg_threshold`, `pg_similarity.swg_is_normalized`
Soundex Distance	`soundex(text, text) returns float8`	`~*~`	No	—

When to use which algorithm

Goal	Recommended algorithm	Notes
General typo correction (edit distance)	Levenshtein Distance (`lev`)	Works well across languages; case-sensitive
Name matching with prefix bias	Jaro-Winkler Distance (`jarowinkler`)	Weights common prefixes higher than Jaro
Short name matching	Jaro Distance (`jaro`)	Effective for short strings and personal names
Phonetic name matching	Soundex Distance (`soundex`)	Operates on text inputs
Token-set similarity (bag of words)	Jaccard Coefficient (`jaccard`) or Dice Coefficient (`dice`)	Good for comparing sets of words regardless of order
Vector-space similarity (NLP / search)	Cosine Distance (`cosine`)	Useful for term-frequency-style comparisons
Sequence alignment	Smith-Waterman (`smithwaterman`) or Needleman-Wunsch (`needlemanwunsch`)	Local and global sequence alignment algorithms
Bit-string comparison	Hamming Distance (`hamming`)	Operates on `bit varying` inputs; no index support

Parameters

Three parameter categories control function and operator behavior.

tokenizer

Controls how strings are split into tokens before comparison. Applies only to token-based algorithms (those with a _tokenizer parameter). All tokens are lowercased.

Value	Delimiter	Example input	Tokens
`alnum` (default)	Any non-alphanumeric character	`Euler_Taveira_de_Oliveira 22/02/2011`	`Euler`, `Taveira`, `de`, `Oliveira`, `22`, `02`, `2011`
`word`	Whitespace (space, tab, newline, carriage return, form-feed, vertical tab)	`Euler Taveira de Oliveira 22/02/2011`	`Euler`, `Taveira`, `de`, `Oliveira`, `22/02/2011`
`gram`	Sliding window of length n (n-gram)	`euler taveira` (n=3)	`eul`, `ule`, `ler`, `er` , `r t`, `ta`, `tav`, `ave`, `vei`, `eir`, `ira`
`camelcase`	Capitalized characters (included as token start)	`EulerTaveira de Oliveira`	`Euler`, `Taveira de` , `Oliveira`

The gram tokenizer supports a full n-gram mode that pads the string with boundary characters (compile-time option PGS_FULL_NGRAM). This affects Q-Gram Distance results. Case sensitivity can also be changed at compile time via PGS_IGNORE_CASE.

threshold

Controls how strict the match is. If the similarity score for a pair of strings is greater than or equal to the threshold, the operator returns a match.

Valid range: 0.0 to 1.0
Default: 0.7

Lower values return more matches (looser); higher values return fewer (stricter).

normalized

Controls whether similarity functions return a value in the [0.0, 1.0] range.

Default: true
Operators always use normalized values internally. This parameter only affects the raw return value of similarity functions when called directly.

Usage

Install the extension

CREATE EXTENSION pg_similarity;

Configure parameters at runtime

Session variables are set with SET and inspected with SHOW:

-- Check the current levenshtein threshold (default: 0.7)
SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.7
(1 row)

-- Lower the threshold to match more loosely
SET pg_similarity.levenshtein_threshold TO 0.5;

-- Switch the cosine tokenizer to camelcase
SET pg_similarity.cosine_tokenizer TO camelcase;

-- Disable normalization for euclidean (return raw distance)
SET pg_similarity.euclidean_is_normalized TO false;

Set up test data

The following examples use two tables with name data:

CREATE TABLE foo (a TEXT);
INSERT INTO foo VALUES
  ('Euler'),
  ('Oiler'),
  ('Euler Taveira de Oliveira'),
  ('Maria Taveira dos Santos'),
  ('Carlos Santos Silva');

CREATE TABLE bar (b TEXT);
INSERT INTO bar VALUES
  ('Euler T. de Oliveira'),
  ('Euller'),
  ('Oliveira, Euler Taveira'),
  ('Sr. Oliveira');

Call a similarity function directly

Use the jaro function to get a similarity score for every pair of rows:

SELECT a, b, jaro(a, b) FROM foo, bar;

Output (20 rows, 5x4 cross join):

             a             |            b            |        jaro
---------------------------+-------------------------+--------------------
 Euler                     | Euler T. de Oliveira    |               0.75
 Euler                     | Euller                  | 0.9444444444444444
 Euler                     | Oliveira, Euler Taveira | 0.6057971014492753
 Euler                     | Sr. Oliveira            | 0.5055555555555555
 Oiler                     | Euler T. de Oliveira    | 0.4722222222222222
 Oiler                     | Euller                  |                0.7
 Oiler                     | Oliveira, Euler Taveira |  0.672463768115942
 Oiler                     | Sr. Oliveira            | 0.6722222222222223
 Euler Taveira de Oliveira | Euler T. de Oliveira    | 0.7980701754385964
 Euler Taveira de Oliveira | Euller                  | 0.6777777777777777
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.7731884057971014
 Euler Taveira de Oliveira | Sr. Oliveira            | 0.5922222222222222
 Maria Taveira dos Santos  | Euler T. de Oliveira    | 0.6023504273504273
 Maria Taveira dos Santos  | Euller                  | 0.3055555555555556
 Maria Taveira dos Santos  | Oliveira, Euler Taveira | 0.5350241545893719
 Maria Taveira dos Santos  | Sr. Oliveira            | 0.6342592592592593
 Carlos Santos Silva       | Euler T. de Oliveira    | 0.5421052631578946
 Carlos Santos Silva       | Euller                  | 0.3128654970760234
 Carlos Santos Silva       | Oliveira, Euler Taveira | 0.6066615814899567
 Carlos Santos Silva       | Sr. Oliveira            | 0.5077276524644945
(20 rows)

Filter rows using an operator

Use the ~== (Levenshtein) operator in a WHERE clause to return only rows that meet the threshold:

-- Threshold at 0.5: returns 3 matches
SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.5
(1 row)

SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
             a             |          b           |        lev
---------------------------+----------------------+--------------------
 Euler                     | Euller               | 0.8333333333333334
 Oiler                     | Euller               |                0.5
 Euler Taveira de Oliveira | Euler T. de Oliveira |               0.76
(3 rows)

-- Raise the threshold to 0.7: stricter match, returns 2 rows
SET pg_similarity.levenshtein_threshold = 0.7;
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
             a             |          b           |        lev
---------------------------+----------------------+--------------------
 Euler                     | Euller               | 0.8333333333333334
 Euler Taveira de Oliveira | Euler T. de Oliveira |               0.76
(2 rows)

Compare algorithms on the same data

Run multiple similarity algorithms against the same string pair to see how results differ:

-- jaro-winkler operator: returns 2 rows
SELECT * FROM bar WHERE b ~@@ 'euler';
          b
----------------------
 Euler T. de Oliveira
 Euller
(2 rows)

-- q-gram operator: returns 0 rows (sensitive to character n-grams)
SELECT * FROM bar WHERE b ~~~ 'euler';
 b
---
(0 rows)

-- levenshtein operator: returns 1 row
SELECT * FROM bar WHERE b ~== 'euler';
   b
--------
 Euller
(1 row)

-- cosine operator: returns 0 rows (low token overlap)
SELECT * FROM bar WHERE b ~## 'euler';
 b
---
(0 rows)

The same query against 'euler' returns different results depending on the algorithm. Jaro-Winkler matches because it rewards common prefixes; Q-Gram and Cosine require more token overlap and return nothing at the default threshold.

Performance considerations

Similarity comparisons are computed row by row and can be expensive on large tables. Apply the following strategies to keep queries fast.

Use index-supported algorithms

Eight algorithms support index acceleration (see the Index support column in the table above): L1 Distance, Cosine, Dice, Euclidean, Jaccard, Matching Coefficient, Overlap, and Q-Gram. Prefer these when filtering large datasets.

The nine algorithms without index support (Hamming, Jaro, Jaro-Winkler, Levenshtein, Monge-Elkan, Needleman-Wunsch, Smith-Waterman, Smith-Waterman-Gotoh, and Soundex) perform a full table scan. On large datasets, consider pre-filtering rows with a faster indexed algorithm before applying a slower one.

Limit result sets

When scanning large tables, add LIMIT to stop as soon as you have enough results:

SELECT b, lev('euler', b) AS score
FROM bar
WHERE b ~== 'euler'
ORDER BY score DESC
LIMIT 10;

Adjust the threshold before scanning

A stricter threshold (closer to 1.0) reduces the number of rows that the similarity function needs to evaluate:

SET pg_similarity.levenshtein_threshold = 0.8;
SELECT b FROM bar WHERE b ~== 'euler';

Limitations

Hamming Distance has no index support: The hamming function (for bit varying inputs) and hamming_text (for text inputs) do not support index acceleration. Use them only on small datasets or pre-filtered result sets.
Algorithms without index support can be slow at scale: Levenshtein, Jaro, Jaro-Winkler, Smith-Waterman, and others perform a full table scan. On large datasets, consider pre-filtering with a faster indexed algorithm before applying a slower one.
Compile-time options affect tokenization: The gram tokenizer behavior (full n-gram, case sensitivity) is controlled by PGS_FULL_NGRAM and PGS_IGNORE_CASE at compile time and cannot be changed at runtime.