All Products
Search
Document Center

PolarDB:pg_similarity (similarity distance calculation)

Last Updated:Mar 30, 2026

pg_similarity is a third-party extension for PolarDB for PostgreSQL that adds string similarity distance calculation to SQL queries. Use it when you need to match records with typos, find near-duplicate names, or rank results by phonetic or edit-distance similarity.

Prerequisites

Before you begin, ensure that you have:

  • A PolarDB for PostgreSQL cluster running PostgreSQL 14 (revision version 14.10.18.0 or later)

To check your revision version, run:

SELECT version();

How it works

pg_similarity has three components:

  • Functions: implement 17 similarity algorithms as user-defined functions (UDFs). Call them directly in SELECT statements to get a similarity score.

  • Operators: wrap the functions and compare the score against a threshold. Use them in WHERE clauses to filter matching rows. In addition to the traditional operators = and <>, pg_similarity defines ~~~ and ! for similarity queries — each of these operators represents a similarity function.

  • Session variables: control function behavior at runtime — tokenization strategy, match threshold, and normalization.

Each algorithm has its own operator (for example, ~== for Levenshtein and ~@@ for Jaro-Winkler). All operators trigger the underlying similarity function and return rows where the score meets the configured threshold.

Supported algorithms

pg_similarity implements 17 algorithms. Each suits a different matching scenario:

Algorithm

Function

Operator

Index support

Parameters

L1 Distance (City Block / Manhattan Distance)

block(text, text) returns float8

~++

Yes

pg_similarity.block_tokenizer, pg_similarity.block_threshold, pg_similarity.block_is_normalized

Cosine Distance

cosine(text, text) returns float8

~##

Yes

pg_similarity.cosine_tokenizer, pg_similarity.cosine_threshold, pg_similarity.cosine_is_normalized

Dice Coefficient

dice(text, text) returns float8

~-~

Yes

pg_similarity.dice_tokenizer, pg_similarity.dice_threshold, pg_similarity.dice_is_normalized

Euclidean Distance

euclidean(text, text) returns float8

~!!

Yes

pg_similarity.euclidean_tokenizer, pg_similarity.euclidean_threshold, pg_similarity.euclidean_is_normalized

Hamming Distance

hamming(bit varying, bit varying) returns float8 / hamming_text(text, text) returns float8

~@~

No

pg_similarity.hamming_threshold, pg_similarity.hamming_is_normalized

Jaccard Coefficient

jaccard(text, text) returns float8

~??

Yes

pg_similarity.jaccard_tokenizer, pg_similarity.jaccard_threshold, pg_similarity.jaccard_is_normalized

Jaro Distance

jaro(text, text) returns float8

~%%

No

pg_similarity.jaro_threshold, pg_similarity.jaro_is_normalized

Jaro-Winkler Distance

jarowinkler(text, text) returns float8

~@@

No

pg_similarity.jarowinkler_threshold, pg_similarity.jarowinkler_is_normalized

Levenshtein Distance

lev(text, text) returns float8

~==

No

pg_similarity.levenshtein_threshold, pg_similarity.levenshtein_is_normalized

Matching Coefficient

matchingcoefficient(text, text) returns float8

~^^

Yes

pg_similarity.matching_tokenizer, pg_similarity.matching_threshold, pg_similarity.matching_is_normalized

Monge-Elkan Coefficient

mongeelkan(text, text) returns float8

~||

No

pg_similarity.mongeelkan_tokenizer, pg_similarity.mongeelkan_threshold, pg_similarity.mongeelkan_is_normalized

Needleman-Wunsch Coefficient

needlemanwunsch(text, text) returns float8

~#~

No

pg_similarity.nw_threshold, pg_similarity.nw_is_normalized

Overlap Coefficient

overlapcoefficient(text, text) returns float8

~**

Yes

pg_similarity.overlap_tokenizer, pg_similarity.overlap_threshold, pg_similarity.overlap_is_normalized

Q-Gram Distance

qgram(text, text) returns float8

~~~

Yes

pg_similarity.qgram_threshold, pg_similarity.qgram_is_normalized

Smith-Waterman Coefficient

smithwaterman(text, text) returns float8

~=~

No

pg_similarity.sw_threshold, pg_similarity.sw_is_normalized

Smith-Waterman-Gotoh Coefficient

smithwatermangotoh(text, text) returns float8

~!~

No

pg_similarity.swg_threshold, pg_similarity.swg_is_normalized

Soundex Distance

soundex(text, text) returns float8

~*~

No

When to use which algorithm

Goal

Recommended algorithm

Notes

General typo correction (edit distance)

Levenshtein Distance (lev)

Works well across languages; case-sensitive

Name matching with prefix bias

Jaro-Winkler Distance (jarowinkler)

Weights common prefixes higher than Jaro

Short name matching

Jaro Distance (jaro)

Effective for short strings and personal names

Phonetic name matching

Soundex Distance (soundex)

Operates on text inputs

Token-set similarity (bag of words)

Jaccard Coefficient (jaccard) or Dice Coefficient (dice)

Good for comparing sets of words regardless of order

Vector-space similarity (NLP / search)

Cosine Distance (cosine)

Useful for term-frequency-style comparisons

Sequence alignment

Smith-Waterman (smithwaterman) or Needleman-Wunsch (needlemanwunsch)

Local and global sequence alignment algorithms

Bit-string comparison

Hamming Distance (hamming)

Operates on bit varying inputs; no index support

Parameters

Three parameter categories control function and operator behavior.

tokenizer

Controls how strings are split into tokens before comparison. Applies only to token-based algorithms (those with a _tokenizer parameter). All tokens are lowercased.

Value

Delimiter

Example input

Tokens

alnum (default)

Any non-alphanumeric character

Euler_Taveira_de_Oliveira 22/02/2011

Euler, Taveira, de, Oliveira, 22, 02, 2011

word

Whitespace (space, tab, newline, carriage return, form-feed, vertical tab)

Euler Taveira de Oliveira 22/02/2011

Euler, Taveira, de, Oliveira, 22/02/2011

gram

Sliding window of length n (n-gram)

euler taveira (n=3)

eul, ule, ler, er , r t, ta, tav, ave, vei, eir, ira

camelcase

Capitalized characters (included as token start)

EulerTaveira de Oliveira

Euler, Taveira de , Oliveira

The gram tokenizer supports a full n-gram mode that pads the string with boundary characters (compile-time option PGS_FULL_NGRAM). This affects Q-Gram Distance results. Case sensitivity can also be changed at compile time via PGS_IGNORE_CASE.

threshold

Controls how strict the match is. If the similarity score for a pair of strings is greater than or equal to the threshold, the operator returns a match.

  • Valid range: 0.0 to 1.0

  • Default: 0.7

Lower values return more matches (looser); higher values return fewer (stricter).

normalized

Controls whether similarity functions return a value in the [0.0, 1.0] range.

  • Default: true

  • Operators always use normalized values internally. This parameter only affects the raw return value of similarity functions when called directly.

Usage

Install the extension

CREATE EXTENSION pg_similarity;

Configure parameters at runtime

Session variables are set with SET and inspected with SHOW:

-- Check the current levenshtein threshold (default: 0.7)
SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.7
(1 row)

-- Lower the threshold to match more loosely
SET pg_similarity.levenshtein_threshold TO 0.5;

-- Switch the cosine tokenizer to camelcase
SET pg_similarity.cosine_tokenizer TO camelcase;

-- Disable normalization for euclidean (return raw distance)
SET pg_similarity.euclidean_is_normalized TO false;

Set up test data

The following examples use two tables with name data:

CREATE TABLE foo (a TEXT);
INSERT INTO foo VALUES
  ('Euler'),
  ('Oiler'),
  ('Euler Taveira de Oliveira'),
  ('Maria Taveira dos Santos'),
  ('Carlos Santos Silva');

CREATE TABLE bar (b TEXT);
INSERT INTO bar VALUES
  ('Euler T. de Oliveira'),
  ('Euller'),
  ('Oliveira, Euler Taveira'),
  ('Sr. Oliveira');

Call a similarity function directly

Use the jaro function to get a similarity score for every pair of rows:

SELECT a, b, jaro(a, b) FROM foo, bar;

Output (20 rows, 5x4 cross join):

             a             |            b            |        jaro
---------------------------+-------------------------+--------------------
 Euler                     | Euler T. de Oliveira    |               0.75
 Euler                     | Euller                  | 0.9444444444444444
 Euler                     | Oliveira, Euler Taveira | 0.6057971014492753
 Euler                     | Sr. Oliveira            | 0.5055555555555555
 Oiler                     | Euler T. de Oliveira    | 0.4722222222222222
 Oiler                     | Euller                  |                0.7
 Oiler                     | Oliveira, Euler Taveira |  0.672463768115942
 Oiler                     | Sr. Oliveira            | 0.6722222222222223
 Euler Taveira de Oliveira | Euler T. de Oliveira    | 0.7980701754385964
 Euler Taveira de Oliveira | Euller                  | 0.6777777777777777
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.7731884057971014
 Euler Taveira de Oliveira | Sr. Oliveira            | 0.5922222222222222
 Maria Taveira dos Santos  | Euler T. de Oliveira    | 0.6023504273504273
 Maria Taveira dos Santos  | Euller                  | 0.3055555555555556
 Maria Taveira dos Santos  | Oliveira, Euler Taveira | 0.5350241545893719
 Maria Taveira dos Santos  | Sr. Oliveira            | 0.6342592592592593
 Carlos Santos Silva       | Euler T. de Oliveira    | 0.5421052631578946
 Carlos Santos Silva       | Euller                  | 0.3128654970760234
 Carlos Santos Silva       | Oliveira, Euler Taveira | 0.6066615814899567
 Carlos Santos Silva       | Sr. Oliveira            | 0.5077276524644945
(20 rows)

Filter rows using an operator

Use the ~== (Levenshtein) operator in a WHERE clause to return only rows that meet the threshold:

-- Threshold at 0.5: returns 3 matches
SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.5
(1 row)

SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
             a             |          b           |        lev
---------------------------+----------------------+--------------------
 Euler                     | Euller               | 0.8333333333333334
 Oiler                     | Euller               |                0.5
 Euler Taveira de Oliveira | Euler T. de Oliveira |               0.76
(3 rows)

-- Raise the threshold to 0.7: stricter match, returns 2 rows
SET pg_similarity.levenshtein_threshold = 0.7;
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
             a             |          b           |        lev
---------------------------+----------------------+--------------------
 Euler                     | Euller               | 0.8333333333333334
 Euler Taveira de Oliveira | Euler T. de Oliveira |               0.76
(2 rows)

Compare algorithms on the same data

Run multiple similarity algorithms against the same string pair to see how results differ:

-- jaro-winkler operator: returns 2 rows
SELECT * FROM bar WHERE b ~@@ 'euler';
          b
----------------------
 Euler T. de Oliveira
 Euller
(2 rows)

-- q-gram operator: returns 0 rows (sensitive to character n-grams)
SELECT * FROM bar WHERE b ~~~ 'euler';
 b
---
(0 rows)

-- levenshtein operator: returns 1 row
SELECT * FROM bar WHERE b ~== 'euler';
   b
--------
 Euller
(1 row)

-- cosine operator: returns 0 rows (low token overlap)
SELECT * FROM bar WHERE b ~## 'euler';
 b
---
(0 rows)

The same query against 'euler' returns different results depending on the algorithm. Jaro-Winkler matches because it rewards common prefixes; Q-Gram and Cosine require more token overlap and return nothing at the default threshold.

Performance considerations

Similarity comparisons are computed row by row and can be expensive on large tables. Apply the following strategies to keep queries fast.

Use index-supported algorithms

Eight algorithms support index acceleration (see the Index support column in the table above): L1 Distance, Cosine, Dice, Euclidean, Jaccard, Matching Coefficient, Overlap, and Q-Gram. Prefer these when filtering large datasets.

The nine algorithms without index support (Hamming, Jaro, Jaro-Winkler, Levenshtein, Monge-Elkan, Needleman-Wunsch, Smith-Waterman, Smith-Waterman-Gotoh, and Soundex) perform a full table scan. On large datasets, consider pre-filtering rows with a faster indexed algorithm before applying a slower one.

Limit result sets

When scanning large tables, add LIMIT to stop as soon as you have enough results:

SELECT b, lev('euler', b) AS score
FROM bar
WHERE b ~== 'euler'
ORDER BY score DESC
LIMIT 10;

Adjust the threshold before scanning

A stricter threshold (closer to 1.0) reduces the number of rows that the similarity function needs to evaluate:

SET pg_similarity.levenshtein_threshold = 0.8;
SELECT b FROM bar WHERE b ~== 'euler';

Limitations

  • Hamming Distance has no index support: The hamming function (for bit varying inputs) and hamming_text (for text inputs) do not support index acceleration. Use them only on small datasets or pre-filtered result sets.

  • Algorithms without index support can be slow at scale: Levenshtein, Jaro, Jaro-Winkler, Smith-Waterman, and others perform a full table scan. On large datasets, consider pre-filtering with a faster indexed algorithm before applying a slower one.

  • Compile-time options affect tokenization: The gram tokenizer behavior (full n-gram, case sensitivity) is controlled by PGS_FULL_NGRAM and PGS_IGNORE_CASE at compile time and cannot be changed at runtime.