pg_similarity is a third-party extension for PolarDB for PostgreSQL that adds string similarity distance calculation to SQL queries. Use it when you need to match records with typos, find near-duplicate names, or rank results by phonetic or edit-distance similarity.
Prerequisites
Before you begin, ensure that you have:
A PolarDB for PostgreSQL cluster running PostgreSQL 14 (revision version 14.10.18.0 or later)
To check your revision version, run:
SELECT version();How it works
pg_similarity has three components:
Functions: implement 17 similarity algorithms as user-defined functions (UDFs). Call them directly in
SELECTstatements to get a similarity score.Operators: wrap the functions and compare the score against a threshold. Use them in
WHEREclauses to filter matching rows. In addition to the traditional operators=and<>,pg_similaritydefines~~~and!for similarity queries — each of these operators represents a similarity function.Session variables: control function behavior at runtime — tokenization strategy, match threshold, and normalization.
Each algorithm has its own operator (for example, ~== for Levenshtein and ~@@ for Jaro-Winkler). All operators trigger the underlying similarity function and return rows where the score meets the configured threshold.
Supported algorithms
pg_similarity implements 17 algorithms. Each suits a different matching scenario:
Algorithm | Function | Operator | Index support | Parameters |
L1 Distance (City Block / Manhattan Distance) |
|
| Yes |
|
Cosine Distance |
|
| Yes |
|
Dice Coefficient |
|
| Yes |
|
Euclidean Distance |
|
| Yes |
|
Hamming Distance |
|
| No |
|
Jaccard Coefficient |
|
| Yes |
|
Jaro Distance |
|
| No |
|
Jaro-Winkler Distance |
|
| No |
|
Levenshtein Distance |
|
| No |
|
Matching Coefficient |
|
| Yes |
|
Monge-Elkan Coefficient |
|
| No |
|
Needleman-Wunsch Coefficient |
|
| No |
|
Overlap Coefficient |
|
| Yes |
|
Q-Gram Distance |
|
| Yes |
|
Smith-Waterman Coefficient |
|
| No |
|
Smith-Waterman-Gotoh Coefficient |
|
| No |
|
Soundex Distance |
|
| No | — |
When to use which algorithm
Goal | Recommended algorithm | Notes |
General typo correction (edit distance) | Levenshtein Distance ( | Works well across languages; case-sensitive |
Name matching with prefix bias | Jaro-Winkler Distance ( | Weights common prefixes higher than Jaro |
Short name matching | Jaro Distance ( | Effective for short strings and personal names |
Phonetic name matching | Soundex Distance ( | Operates on text inputs |
Token-set similarity (bag of words) | Jaccard Coefficient ( | Good for comparing sets of words regardless of order |
Vector-space similarity (NLP / search) | Cosine Distance ( | Useful for term-frequency-style comparisons |
Sequence alignment | Smith-Waterman ( | Local and global sequence alignment algorithms |
Bit-string comparison | Hamming Distance ( | Operates on |
Parameters
Three parameter categories control function and operator behavior.
tokenizer
Controls how strings are split into tokens before comparison. Applies only to token-based algorithms (those with a _tokenizer parameter). All tokens are lowercased.
Value | Delimiter | Example input | Tokens |
| Any non-alphanumeric character |
|
|
| Whitespace (space, tab, newline, carriage return, form-feed, vertical tab) |
|
|
| Sliding window of length n (n-gram) |
|
|
| Capitalized characters (included as token start) |
|
|
Thegramtokenizer supports a full n-gram mode that pads the string with boundary characters (compile-time optionPGS_FULL_NGRAM). This affects Q-Gram Distance results. Case sensitivity can also be changed at compile time viaPGS_IGNORE_CASE.
threshold
Controls how strict the match is. If the similarity score for a pair of strings is greater than or equal to the threshold, the operator returns a match.
Valid range:
0.0to1.0Default:
0.7
Lower values return more matches (looser); higher values return fewer (stricter).
normalized
Controls whether similarity functions return a value in the [0.0, 1.0] range.
Default:
trueOperators always use normalized values internally. This parameter only affects the raw return value of similarity functions when called directly.
Usage
Install the extension
CREATE EXTENSION pg_similarity;Configure parameters at runtime
Session variables are set with SET and inspected with SHOW:
-- Check the current levenshtein threshold (default: 0.7)
SHOW pg_similarity.levenshtein_threshold;
pg_similarity.levenshtein_threshold
-------------------------------------
0.7
(1 row)
-- Lower the threshold to match more loosely
SET pg_similarity.levenshtein_threshold TO 0.5;
-- Switch the cosine tokenizer to camelcase
SET pg_similarity.cosine_tokenizer TO camelcase;
-- Disable normalization for euclidean (return raw distance)
SET pg_similarity.euclidean_is_normalized TO false;Set up test data
The following examples use two tables with name data:
CREATE TABLE foo (a TEXT);
INSERT INTO foo VALUES
('Euler'),
('Oiler'),
('Euler Taveira de Oliveira'),
('Maria Taveira dos Santos'),
('Carlos Santos Silva');
CREATE TABLE bar (b TEXT);
INSERT INTO bar VALUES
('Euler T. de Oliveira'),
('Euller'),
('Oliveira, Euler Taveira'),
('Sr. Oliveira');Call a similarity function directly
Use the jaro function to get a similarity score for every pair of rows:
SELECT a, b, jaro(a, b) FROM foo, bar;Output (20 rows, 5x4 cross join):
a | b | jaro
---------------------------+-------------------------+--------------------
Euler | Euler T. de Oliveira | 0.75
Euler | Euller | 0.9444444444444444
Euler | Oliveira, Euler Taveira | 0.6057971014492753
Euler | Sr. Oliveira | 0.5055555555555555
Oiler | Euler T. de Oliveira | 0.4722222222222222
Oiler | Euller | 0.7
Oiler | Oliveira, Euler Taveira | 0.672463768115942
Oiler | Sr. Oliveira | 0.6722222222222223
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.7980701754385964
Euler Taveira de Oliveira | Euller | 0.6777777777777777
Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.7731884057971014
Euler Taveira de Oliveira | Sr. Oliveira | 0.5922222222222222
Maria Taveira dos Santos | Euler T. de Oliveira | 0.6023504273504273
Maria Taveira dos Santos | Euller | 0.3055555555555556
Maria Taveira dos Santos | Oliveira, Euler Taveira | 0.5350241545893719
Maria Taveira dos Santos | Sr. Oliveira | 0.6342592592592593
Carlos Santos Silva | Euler T. de Oliveira | 0.5421052631578946
Carlos Santos Silva | Euller | 0.3128654970760234
Carlos Santos Silva | Oliveira, Euler Taveira | 0.6066615814899567
Carlos Santos Silva | Sr. Oliveira | 0.5077276524644945
(20 rows)Filter rows using an operator
Use the ~== (Levenshtein) operator in a WHERE clause to return only rows that meet the threshold:
-- Threshold at 0.5: returns 3 matches
SHOW pg_similarity.levenshtein_threshold;
pg_similarity.levenshtein_threshold
-------------------------------------
0.5
(1 row)
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
a | b | lev
---------------------------+----------------------+--------------------
Euler | Euller | 0.8333333333333334
Oiler | Euller | 0.5
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76
(3 rows)
-- Raise the threshold to 0.7: stricter match, returns 2 rows
SET pg_similarity.levenshtein_threshold = 0.7;
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
a | b | lev
---------------------------+----------------------+--------------------
Euler | Euller | 0.8333333333333334
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76
(2 rows)Compare algorithms on the same data
Run multiple similarity algorithms against the same string pair to see how results differ:
-- jaro-winkler operator: returns 2 rows
SELECT * FROM bar WHERE b ~@@ 'euler';
b
----------------------
Euler T. de Oliveira
Euller
(2 rows)
-- q-gram operator: returns 0 rows (sensitive to character n-grams)
SELECT * FROM bar WHERE b ~~~ 'euler';
b
---
(0 rows)
-- levenshtein operator: returns 1 row
SELECT * FROM bar WHERE b ~== 'euler';
b
--------
Euller
(1 row)
-- cosine operator: returns 0 rows (low token overlap)
SELECT * FROM bar WHERE b ~## 'euler';
b
---
(0 rows)The same query against 'euler' returns different results depending on the algorithm. Jaro-Winkler matches because it rewards common prefixes; Q-Gram and Cosine require more token overlap and return nothing at the default threshold.
Performance considerations
Similarity comparisons are computed row by row and can be expensive on large tables. Apply the following strategies to keep queries fast.
Use index-supported algorithms
Eight algorithms support index acceleration (see the Index support column in the table above): L1 Distance, Cosine, Dice, Euclidean, Jaccard, Matching Coefficient, Overlap, and Q-Gram. Prefer these when filtering large datasets.
The nine algorithms without index support (Hamming, Jaro, Jaro-Winkler, Levenshtein, Monge-Elkan, Needleman-Wunsch, Smith-Waterman, Smith-Waterman-Gotoh, and Soundex) perform a full table scan. On large datasets, consider pre-filtering rows with a faster indexed algorithm before applying a slower one.
Limit result sets
When scanning large tables, add LIMIT to stop as soon as you have enough results:
SELECT b, lev('euler', b) AS score
FROM bar
WHERE b ~== 'euler'
ORDER BY score DESC
LIMIT 10;Adjust the threshold before scanning
A stricter threshold (closer to 1.0) reduces the number of rows that the similarity function needs to evaluate:
SET pg_similarity.levenshtein_threshold = 0.8;
SELECT b FROM bar WHERE b ~== 'euler';Limitations
Hamming Distance has no index support: The
hammingfunction (forbit varyinginputs) andhamming_text(fortextinputs) do not support index acceleration. Use them only on small datasets or pre-filtered result sets.Algorithms without index support can be slow at scale: Levenshtein, Jaro, Jaro-Winkler, Smith-Waterman, and others perform a full table scan. On large datasets, consider pre-filtering with a faster indexed algorithm before applying a slower one.
Compile-time options affect tokenization: The
gramtokenizer behavior (full n-gram, case sensitivity) is controlled byPGS_FULL_NGRAMandPGS_IGNORE_CASEat compile time and cannot be changed at runtime.