UTL_MATCH is a built-in package in PolarDB for PostgreSQL (Compatible with Oracle) that calculates string similarity using the Jaro-Winkler algorithm. Use it to perform fuzzy matching, data deduplication, and data cleaning directly in the database—without transferring data to the application layer.
Prerequisites
Before you begin, ensure that you have:
A PolarDB for PostgreSQL (Compatible with Oracle) cluster running minor engine version 2.0.14.17.36.0 or later
To check your version, view the minor engine version in the console, or run:
SHOW polardb_version;If the version requirement is not met, upgrade the minor engine version.
How it works
The Jaro-Winkler algorithm compares two strings and returns a similarity score. It is designed for short strings such as names and addresses. The algorithm gives extra weight to matching characters at the beginning of the string (prefix matches), making it effective for catching common spelling variations.
Key characteristics:
Assigns higher weight to prefix matches
Sensitive to character order, but tolerates some character transposition
Case-insensitive:
'Hello'and'hello'have a similarity of 100%
Functions
UTL_MATCH provides two functions. Both accept two VARCHAR2 strings and differ only in how they express the result.
| Function | Return type | Return range | Description |
|---|---|---|---|
JARO_WINKLER(s1 VARCHAR2, s2 VARCHAR2) | BINARY_DOUBLE | 0.0–1.0, where 1.0 = exact match | Returns the Jaro-Winkler similarity score as a decimal |
JARO_WINKLER_SIMILARITY(s1 VARCHAR2, s2 VARCHAR2) | PLS_INTEGER | 0–100, where 100 = exact match | Returns the similarity as a percentage integer (the integer part of JARO_WINKLER × 100) |
Minimal examples
-- Raw score (0.0 to 1.0)
SELECT UTL_MATCH.JARO_WINKLER('John Smith', 'Jon Smith') FROM DUAL;
-- Returns approximately 0.973
-- Percentage (0 to 100)
SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY('John Smith', 'Jon Smith') FROM DUAL;
-- Returns 97Usage notes
NULL and empty string handling: If either input is NULL or an empty string, the function returns NULL, not 0. Handle this case explicitly to avoid unexpected NULL values in your logic.
Examples
Calculate basic string similarity
The following example uses both functions to compare strings across different similarity scenarios.
DECLARE
v_score BINARY_DOUBLE;
v_percent PLS_INTEGER;
BEGIN
-- 1. Compare identical strings
v_score := UTL_MATCH.JARO_WINKLER('hello', 'hello');
v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'hello');
DBMS_OUTPUT.PUT_LINE('Identical (hello vs hello) - Score: ' || v_score || ', Percentage: ' || v_percent);
-- 2. Compare similar strings with spelling errors
v_score := UTL_MATCH.JARO_WINKLER('hello', 'hallo');
v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'hallo');
DBMS_OUTPUT.PUT_LINE('Spelling error (hello vs hallo) - Score: ' || v_score || ', Percentage: ' || v_percent);
-- 3. Compare completely different strings
v_score := UTL_MATCH.JARO_WINKLER('hello', 'world');
v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'world');
DBMS_OUTPUT.PUT_LINE('Completely different (hello vs world) - Score: ' || v_score || ', Percentage: ' || v_percent);
-- 4. Compare strings with different cases
v_score := UTL_MATCH.JARO_WINKLER('Hello', 'hello');
v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('Hello', 'hello');
DBMS_OUTPUT.PUT_LINE('Different case (Hello vs hello) - Score: ' || v_score || ', Percentage: ' || v_percent);
END;
/Expected output:
Identical (hello vs hello) - Score: 1, Percentage: 100
Spelling error (hello vs hallo) - Score: 0.88, Percentage: 88
Completely different (hello vs world) - Score: 0.466666666666667, Percentage: 47
Different case (Hello vs hello) - Score: 1, Percentage: 100Fuzzy name matching
The following example compares a set of test names against a standard name using a similarity threshold. Names that meet or exceed the threshold are flagged as matches.
DECLARE
TYPE name_array IS VARRAY(10) OF VARCHAR2(100);
v_standard_name VARCHAR2(100) := 'John Smith';
v_test_names name_array := name_array(
'John Smith', -- Exact match
'Jon Smith', -- Misspelled first name
'John Smyth', -- Last name spelling variation
'J Smith', -- Abbreviated first name
'Smith John', -- Swapped order
'John Johnson' -- Completely different last name
);
v_similarity BINARY_DOUBLE;
v_threshold BINARY_DOUBLE := 0.8; -- Define the similarity threshold as 80%
BEGIN
DBMS_OUTPUT.PUT_LINE('Standard name: ' || v_standard_name || ', Match threshold: ' || v_threshold);
DBMS_OUTPUT.PUT_LINE('----------------------------------------');
FOR i IN 1..v_test_names.COUNT LOOP
v_similarity := UTL_MATCH.JARO_WINKLER(v_standard_name, v_test_names(i));
DBMS_OUTPUT.PUT_LINE(
'Input: ' || RPAD(v_test_names(i), 15) ||
' | Similarity: ' || ROUND(v_similarity, 3) ||
' | Is Match: ' || CASE WHEN v_similarity >= v_threshold THEN 'Yes' ELSE 'No' END
);
END LOOP;
END;
/Expected output:
Standard name: John Smith, Match threshold: 0.8
----------------------------------------
Input: John Smith | Similarity: 1 | Is Match: Yes
Input: Jon Smith | Similarity: 0.973 | Is Match: Yes
Input: John Smyth | Similarity: 0.96 | Is Match: Yes
Input: J Smith | Similarity: 0.781 | Is Match: No
Input: Smith John | Similarity: 0.533 | Is Match: No
Input: John Johnson | Similarity: 0.828 | Is Match: YesThe algorithm scores prefix-preserving variations (Jon Smith, John Smyth) much higher than reversed-order strings (Smith John), because word order affects the match position weights.
Detect duplicate customer records
In a customer relationship management (CRM) system, duplicate records can be created when customers register through different channels. The following example identifies potential duplicates by combining name and email similarity scores using a weighted average (70% name, 30% email).
DECLARE
TYPE customer IS RECORD (id INTEGER, name VARCHAR2(100), email VARCHAR2(100));
TYPE customer_array IS TABLE OF customer;
v_customers customer_array := customer_array(
customer(1, 'John Smith', 'john.smith@email.com'),
customer(2, 'Jon Smith', 'j.smith@email.com'), -- Similar name, similar email
customer(3, 'John Smyth', 'john.smyth@gmail.com'), -- Similar name, different email
customer(4, 'Jane Doe', 'jane.doe@email.com'), -- Completely different
customer(5, 'John Smith', 'johnsmith@email.com') -- Same name, similar email
);
v_name_sim BINARY_DOUBLE;
v_email_sim BINARY_DOUBLE;
v_overall_sim BINARY_DOUBLE;
v_threshold BINARY_DOUBLE := 0.85; -- Define the combined similarity threshold
BEGIN
DBMS_OUTPUT.PUT_LINE('Duplicate data detection analysis (Threshold: ' || v_threshold || ')');
DBMS_OUTPUT.PUT_LINE('==============================================');
FOR i IN 1..v_customers.COUNT LOOP
FOR j IN i + 1..v_customers.COUNT LOOP
v_name_sim := UTL_MATCH.JARO_WINKLER(v_customers(i).name, v_customers(j).name);
v_email_sim := UTL_MATCH.JARO_WINKLER(v_customers(i).email, v_customers(j).email);
-- Calculate a combined similarity score using a weighted average (70% weight for name, 30% weight for email)
v_overall_sim := (v_name_sim * 0.7) + (v_email_sim * 0.3);
IF v_overall_sim >= v_threshold THEN
DBMS_OUTPUT.PUT_LINE('Potential duplicate found:');
DBMS_OUTPUT.PUT_LINE(' Record ' || v_customers(i).id || ': ' || v_customers(i).name || ' | ' || v_customers(i).email);
DBMS_OUTPUT.PUT_LINE(' Record ' || v_customers(j).id || ': ' || v_customers(j).name || ' | ' || v_customers(j).email);
DBMS_OUTPUT.PUT_LINE(' -> Combined similarity: ' || ROUND(v_overall_sim * 100) || '%');
DBMS_OUTPUT.PUT_LINE('');
END IF;
END LOOP;
END LOOP;
END;
/Expected output:
Duplicate data detection analysis (Threshold: 0.85)
==============================================
Potential duplicate found:
Record 1: John Smith | john.smith@email.com
Record 2: Jon Smith | j.smith@email.com
-> Combined similarity: 95%
Potential duplicate found:
Record 1: John Smith | john.smith@email.com
Record 3: John Smyth | john.smyth@gmail.com
-> Combined similarity: 95%
Potential duplicate found:
Record 1: John Smith | john.smith@email.com
Record 5: John Smith | johnsmith@email.com
-> Combined similarity: 100%
Potential duplicate found:
Record 2: Jon Smith | j.smith@email.com
Record 3: John Smyth | john.smyth@gmail.com
-> Combined similarity: 90%
Potential duplicate found:
Record 2: Jon Smith | j.smith@email.com
Record 5: John Smith | johnsmith@email.com
-> Combined similarity: 95%
Potential duplicate found:
Record 3: John Smyth | john.smyth@gmail.com
Record 5: John Smith | johnsmith@email.com
-> Combined similarity: 95%