All Products
Search
Document Center

PolarDB:UTL_MATCH

Last Updated:Mar 28, 2026

UTL_MATCH is a built-in package in PolarDB for PostgreSQL (Compatible with Oracle) that calculates string similarity using the Jaro-Winkler algorithm. Use it to perform fuzzy matching, data deduplication, and data cleaning directly in the database—without transferring data to the application layer.

Prerequisites

Before you begin, ensure that you have:

  • A PolarDB for PostgreSQL (Compatible with Oracle) cluster running minor engine version 2.0.14.17.36.0 or later

To check your version, view the minor engine version in the console, or run:

SHOW polardb_version;

If the version requirement is not met, upgrade the minor engine version.

How it works

The Jaro-Winkler algorithm compares two strings and returns a similarity score. It is designed for short strings such as names and addresses. The algorithm gives extra weight to matching characters at the beginning of the string (prefix matches), making it effective for catching common spelling variations.

Key characteristics:

  • Assigns higher weight to prefix matches

  • Sensitive to character order, but tolerates some character transposition

  • Case-insensitive: 'Hello' and 'hello' have a similarity of 100%

Functions

UTL_MATCH provides two functions. Both accept two VARCHAR2 strings and differ only in how they express the result.

FunctionReturn typeReturn rangeDescription
JARO_WINKLER(s1 VARCHAR2, s2 VARCHAR2)BINARY_DOUBLE0.0–1.0, where 1.0 = exact matchReturns the Jaro-Winkler similarity score as a decimal
JARO_WINKLER_SIMILARITY(s1 VARCHAR2, s2 VARCHAR2)PLS_INTEGER0–100, where 100 = exact matchReturns the similarity as a percentage integer (the integer part of JARO_WINKLER × 100)

Minimal examples

-- Raw score (0.0 to 1.0)
SELECT UTL_MATCH.JARO_WINKLER('John Smith', 'Jon Smith') FROM DUAL;
-- Returns approximately 0.973

-- Percentage (0 to 100)
SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY('John Smith', 'Jon Smith') FROM DUAL;
-- Returns 97

Usage notes

NULL and empty string handling: If either input is NULL or an empty string, the function returns NULL, not 0. Handle this case explicitly to avoid unexpected NULL values in your logic.

Examples

Calculate basic string similarity

The following example uses both functions to compare strings across different similarity scenarios.

DECLARE
    v_score    BINARY_DOUBLE;
    v_percent  PLS_INTEGER;
BEGIN
    -- 1. Compare identical strings
    v_score := UTL_MATCH.JARO_WINKLER('hello', 'hello');
    v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'hello');
    DBMS_OUTPUT.PUT_LINE('Identical (hello vs hello) - Score: ' || v_score || ', Percentage: ' || v_percent);

    -- 2. Compare similar strings with spelling errors
    v_score := UTL_MATCH.JARO_WINKLER('hello', 'hallo');
    v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'hallo');
    DBMS_OUTPUT.PUT_LINE('Spelling error (hello vs hallo) - Score: ' || v_score || ', Percentage: ' || v_percent);

    -- 3. Compare completely different strings
    v_score := UTL_MATCH.JARO_WINKLER('hello', 'world');
    v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'world');
    DBMS_OUTPUT.PUT_LINE('Completely different (hello vs world) - Score: ' || v_score || ', Percentage: ' || v_percent);

    -- 4. Compare strings with different cases
    v_score := UTL_MATCH.JARO_WINKLER('Hello', 'hello');
    v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('Hello', 'hello');
    DBMS_OUTPUT.PUT_LINE('Different case (Hello vs hello) - Score: ' || v_score || ', Percentage: ' || v_percent);
END;
/

Expected output:

Identical (hello vs hello) - Score: 1, Percentage: 100
Spelling error (hello vs hallo) - Score: 0.88, Percentage: 88
Completely different (hello vs world) - Score: 0.466666666666667, Percentage: 47
Different case (Hello vs hello) - Score: 1, Percentage: 100

Fuzzy name matching

The following example compares a set of test names against a standard name using a similarity threshold. Names that meet or exceed the threshold are flagged as matches.

DECLARE
    TYPE name_array IS VARRAY(10) OF VARCHAR2(100);
    v_standard_name VARCHAR2(100) := 'John Smith';
    v_test_names    name_array := name_array(
        'John Smith',   -- Exact match
        'Jon Smith',    -- Misspelled first name
        'John Smyth',   -- Last name spelling variation
        'J Smith',      -- Abbreviated first name
        'Smith John',   -- Swapped order
        'John Johnson'  -- Completely different last name
    );
    v_similarity    BINARY_DOUBLE;
    v_threshold     BINARY_DOUBLE := 0.8; -- Define the similarity threshold as 80%
BEGIN
    DBMS_OUTPUT.PUT_LINE('Standard name: ' || v_standard_name || ', Match threshold: ' || v_threshold);
    DBMS_OUTPUT.PUT_LINE('----------------------------------------');

    FOR i IN 1..v_test_names.COUNT LOOP
        v_similarity := UTL_MATCH.JARO_WINKLER(v_standard_name, v_test_names(i));
        DBMS_OUTPUT.PUT_LINE(
            'Input: ' || RPAD(v_test_names(i), 15) ||
            ' | Similarity: ' || ROUND(v_similarity, 3) ||
            ' | Is Match: ' || CASE WHEN v_similarity >= v_threshold THEN 'Yes' ELSE 'No' END
        );
    END LOOP;
END;
/

Expected output:

Standard name: John Smith, Match threshold: 0.8
----------------------------------------
Input: John Smith      | Similarity: 1 | Is Match: Yes
Input: Jon Smith       | Similarity: 0.973 | Is Match: Yes
Input: John Smyth      | Similarity: 0.96 | Is Match: Yes
Input: J Smith         | Similarity: 0.781 | Is Match: No
Input: Smith John      | Similarity: 0.533 | Is Match: No
Input: John Johnson    | Similarity: 0.828 | Is Match: Yes

The algorithm scores prefix-preserving variations (Jon Smith, John Smyth) much higher than reversed-order strings (Smith John), because word order affects the match position weights.

Detect duplicate customer records

In a customer relationship management (CRM) system, duplicate records can be created when customers register through different channels. The following example identifies potential duplicates by combining name and email similarity scores using a weighted average (70% name, 30% email).

DECLARE
    TYPE customer IS RECORD (id INTEGER, name VARCHAR2(100), email VARCHAR2(100));
    TYPE customer_array IS TABLE OF customer;
    v_customers customer_array := customer_array(
        customer(1, 'John Smith', 'john.smith@email.com'),
        customer(2, 'Jon Smith',  'j.smith@email.com'),      -- Similar name, similar email
        customer(3, 'John Smyth', 'john.smyth@gmail.com'),   -- Similar name, different email
        customer(4, 'Jane Doe',   'jane.doe@email.com'),      -- Completely different
        customer(5, 'John Smith', 'johnsmith@email.com')      -- Same name, similar email
    );
    v_name_sim    BINARY_DOUBLE;
    v_email_sim   BINARY_DOUBLE;
    v_overall_sim BINARY_DOUBLE;
    v_threshold   BINARY_DOUBLE := 0.85; -- Define the combined similarity threshold
BEGIN
    DBMS_OUTPUT.PUT_LINE('Duplicate data detection analysis (Threshold: ' || v_threshold || ')');
    DBMS_OUTPUT.PUT_LINE('==============================================');

    FOR i IN 1..v_customers.COUNT LOOP
        FOR j IN i + 1..v_customers.COUNT LOOP
            v_name_sim := UTL_MATCH.JARO_WINKLER(v_customers(i).name, v_customers(j).name);
            v_email_sim := UTL_MATCH.JARO_WINKLER(v_customers(i).email, v_customers(j).email);

            -- Calculate a combined similarity score using a weighted average (70% weight for name, 30% weight for email)
            v_overall_sim := (v_name_sim * 0.7) + (v_email_sim * 0.3);

            IF v_overall_sim >= v_threshold THEN
                DBMS_OUTPUT.PUT_LINE('Potential duplicate found:');
                DBMS_OUTPUT.PUT_LINE('  Record ' || v_customers(i).id || ': ' || v_customers(i).name || ' | ' || v_customers(i).email);
                DBMS_OUTPUT.PUT_LINE('  Record ' || v_customers(j).id || ': ' || v_customers(j).name || ' | ' || v_customers(j).email);
                DBMS_OUTPUT.PUT_LINE('  -> Combined similarity: ' || ROUND(v_overall_sim * 100) || '%');
                DBMS_OUTPUT.PUT_LINE('');
            END IF;
        END LOOP;
    END LOOP;
END;
/

Expected output:

Duplicate data detection analysis (Threshold: 0.85)
==============================================
Potential duplicate found:
  Record 1: John Smith | john.smith@email.com
  Record 2: Jon Smith | j.smith@email.com
  -> Combined similarity: 95%
Potential duplicate found:
  Record 1: John Smith | john.smith@email.com
  Record 3: John Smyth | john.smyth@gmail.com
  -> Combined similarity: 95%
Potential duplicate found:
  Record 1: John Smith | john.smith@email.com
  Record 5: John Smith | johnsmith@email.com
  -> Combined similarity: 100%
Potential duplicate found:
  Record 2: Jon Smith | j.smith@email.com
  Record 3: John Smyth | john.smyth@gmail.com
  -> Combined similarity: 90%
Potential duplicate found:
  Record 2: Jon Smith | j.smith@email.com
  Record 5: John Smith | johnsmith@email.com
  -> Combined similarity: 95%
Potential duplicate found:
  Record 3: John Smyth | john.smyth@gmail.com
  Record 5: John Smith | johnsmith@email.com
  -> Combined similarity: 95%