All Products
Search
Document Center

PolarDB:UTL_MATCH

Last Updated:Oct 27, 2025

Handling external input, data integration, and data cleaning often involves strings that do not match exactly. These mismatches are caused by spelling errors, abbreviations, or inconsistent formats. For example, although John Smith and Jon Smyth refer to the same person, a simple equality comparison cannot identify their association. The UTL_MATCH built-in package in PolarDB for PostgreSQL (Compatible with Oracle) solves this problem. It uses the Jaro-Winkler algorithm to calculate the similarity between two strings. This lets you perform fuzzy matching, data deduplication, and data cleaning directly in the database.

Overview

UTL_MATCH is a built-in toolkit that is highly compatible with Oracle. It calculates the similarity between two strings. The package is based on the Jaro-Winkler algorithm. This algorithm works well for comparing short strings, such as names and addresses, because it gives a higher similarity score to strings that share a common prefix.

The package includes the following two core functions:

Function

Return type

Description

JARO_WINKLER(s1 VARCHAR2, s2 VARCHAR2)

BINARY_DOUBLE

Calculates the Jaro-Winkler similarity score for two strings. The return value is between 0.0 and 1.0. A score of 1.0 indicates an exact match.

JARO_WINKLER_SIMILARITY(s1 VARCHAR2, s2 VARCHAR2)

PLS_INTEGER

Calculates the Jaro-Winkler similarity percentage for two strings. The return value ranges from 0 to 100, where 100 indicates an exact match. This value is the integer part of the JARO_WINKLER function's result multiplied by 100.

Algorithm characteristics

  • Assigns a higher weight to matching characters at the beginning of the string.

  • Is sensitive to character order but allows for some character transposition.

  • Is case-insensitive. For example, the similarity between 'Hello' and 'hello' is 100%.

Scope

Your PolarDB for PostgreSQL (Compatible with Oracle) cluster must run minor engine version 2.0.14.17.36.0 or later.

Note

You can view the minor engine version in the console or by running the SHOW polardb_version; statement. If the requirement is not met, upgrade the minor engine version.

Features

  • Database-native processing: Perform fuzzy string matching directly within the database without transferring data to the application layer. This approach simplifies the system architecture and improves performance.

  • Efficient data cleaning: Quickly identify potential duplicate data that results from spelling errors, abbreviations, or format variations. This makes it a powerful tool for data quality management and data deduplication.

Notes

Handling null and empty strings: If an input parameter for the JARO_WINKLER or JARO_WINKLER_SIMILARITY function is NULL or an empty string, the function returns NULL, not 0. Handle this case in your code to prevent logic errors from unexpected NULL values.

Examples

The following examples show how to use UTL_MATCH in different use cases.

Example 1: Calculate basic string similarity

This example shows how to use the two core functions of UTL_MATCH to compare the similarity of different strings.

DECLARE
    v_score    BINARY_DOUBLE;
    v_percent  PLS_INTEGER;
BEGIN
    -- 1. Compare identical strings
    v_score := UTL_MATCH.JARO_WINKLER('hello', 'hello');
    v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'hello');
    DBMS_OUTPUT.PUT_LINE('Identical (hello vs hello) - Score: ' || v_score || ', Percentage: ' || v_percent);

    -- 2. Compare similar strings with spelling errors
    v_score := UTL_MATCH.JARO_WINKLER('hello', 'hallo');
    v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'hallo');
    DBMS_OUTPUT.PUT_LINE('Spelling error (hello vs hallo) - Score: ' || v_score || ', Percentage: ' || v_percent);

    -- 3. Compare completely different strings
    v_score := UTL_MATCH.JARO_WINKLER('hello', 'world');
    v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'world');
    DBMS_OUTPUT.PUT_LINE('Completely different (hello vs world) - Score: ' || v_score || ', Percentage: ' || v_percent);

    -- 4. Compare strings with different cases
    v_score := UTL_MATCH.JARO_WINKLER('Hello', 'hello');
    v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('Hello', 'hello');
    DBMS_OUTPUT.PUT_LINE('Different case (Hello vs hello) - Score: ' || v_score || ', Percentage: ' || v_percent);
END;
/

Expected output:

Identical (hello vs hello) - Score: 1, Percentage: 100
Spelling error (hello vs hallo) - Score: 0.88, Percentage: 88
Completely different (hello vs world) - Score: 0.466666666666667, Percentage: 47
Different case (Hello vs hello) - Score: 1, Percentage: 100

Example 2: Perform fuzzy name matching in data cleaning

This example shows how to use UTL_MATCH to perform fuzzy matching and filtering on names in a dataset. Set a similarity threshold to identify records with different spellings that might refer to the same person.

DECLARE
    TYPE name_array IS VARRAY(10) OF VARCHAR2(100);
    v_standard_name VARCHAR2(100) := 'John Smith';
    v_test_names    name_array := name_array(
        'John Smith',   -- Exact match
        'Jon Smith',    -- Misspelled first name
        'John Smyth',   -- Last name spelling variation
        'J Smith',      -- Abbreviated first name
        'Smith John',   -- Swapped order
        'John Johnson'  -- Completely different last name
    );
    v_similarity    BINARY_DOUBLE;
    v_threshold     BINARY_DOUBLE := 0.8; -- Define the similarity threshold as 80%
BEGIN
    DBMS_OUTPUT.PUT_LINE('Standard name: ' || v_standard_name || ', Match threshold: ' || v_threshold);
    DBMS_OUTPUT.PUT_LINE('----------------------------------------');

    FOR i IN 1..v_test_names.COUNT LOOP
        v_similarity := UTL_MATCH.JARO_WINKLER(v_standard_name, v_test_names(i));
        DBMS_OUTPUT.PUT_LINE(
            'Input: ' || RPAD(v_test_names(i), 15) ||
            ' | Similarity: ' || ROUND(v_similarity, 3) ||
            ' | Is Match: ' || CASE WHEN v_similarity >= v_threshold THEN 'Yes' ELSE 'No' END
        );
    END LOOP;
END;
/

Expected output:

Standard name: John Smith, Match threshold: 0.8
----------------------------------------
Input: John Smith      | Similarity: 1 | Is Match: Yes
Input: Jon Smith       | Similarity: 0.973 | Is Match: Yes
Input: John Smyth      | Similarity: 0.96 | Is Match: Yes
Input: J Smith         | Similarity: 0.781 | Is Match: No
Input: Smith John      | Similarity: 0.533 | Is Match: No
Input: John Johnson    | Similarity: 0.828 | Is Match: Yes

Result analysis: The Jaro-Winkler algorithm is highly tolerant of prefix matches and minor character changes. Therefore, it identifies Jon Smith and John Smyth as highly similar. In contrast, the similarity score for Smith John is much lower because the word order is reversed.

Example 3: Detect potential duplicate customer records

In a customer relationship management (CRM) system, duplicate customer records can be created from entries made through different channels. This example shows how to identify potential duplicate records by combining the similarity scores of name and email fields.

DECLARE
    TYPE customer IS RECORD (id INTEGER, name VARCHAR2(100), email VARCHAR2(100));
    TYPE customer_array IS TABLE OF customer;
    v_customers customer_array := customer_array(
        customer(1, 'John Smith', 'john.smith@email.com'),
        customer(2, 'Jon Smith',  'j.smith@email.com'),      -- Similar name, similar email
        customer(3, 'John Smyth', 'john.smyth@gmail.com'),   -- Similar name, different email
        customer(4, 'Jane Doe',   'jane.doe@email.com'),      -- Completely different
        customer(5, 'John Smith', 'johnsmith@email.com')      -- Same name, similar email
    );
    v_name_sim   BINARY_DOUBLE;
    v_email_sim  BINARY_DOUBLE;
    v_overall_sim BINARY_DOUBLE;
    v_threshold   BINARY_DOUBLE := 0.85; -- Define the combined similarity threshold
BEGIN
    DBMS_OUTPUT.PUT_LINE('Duplicate data detection analysis (Threshold: ' || v_threshold || ')');
    DBMS_OUTPUT.PUT_LINE('==============================================');

    FOR i IN 1..v_customers.COUNT LOOP
        FOR j IN i + 1..v_customers.COUNT LOOP
            v_name_sim := UTL_MATCH.JARO_WINKLER(v_customers(i).name, v_customers(j).name);
            v_email_sim := UTL_MATCH.JARO_WINKLER(v_customers(i).email, v_customers(j).email);

            -- Calculate a combined similarity score using a weighted average (70% weight for name, 30% weight for email)
            v_overall_sim := (v_name_sim * 0.7) + (v_email_sim * 0.3);

            IF v_overall_sim >= v_threshold THEN
                DBMS_OUTPUT.PUT_LINE('Potential duplicate found:');
                DBMS_OUTPUT.PUT_LINE('  Record ' || v_customers(i).id || ': ' || v_customers(i).name || ' | ' || v_customers(i).email);
                DBMS_OUTPUT.PUT_LINE('  Record ' || v_customers(j).id || ': ' || v_customers(j).name || ' | ' || v_customers(j).email);
                DBMS_OUTPUT.PUT_LINE('  -> Combined similarity: ' || ROUND(v_overall_sim * 100) || '%');
                DBMS_OUTPUT.PUT_LINE('');
            END IF;
        END LOOP;
    END LOOP;
END;
/

Expected output:

Duplicate data detection analysis (Threshold: 0.85)
==============================================
Potential duplicate found:
  Record 1: John Smith | john.smith@email.com
  Record 2: Jon Smith | j.smith@email.com
  -> Combined similarity: 95%
Potential duplicate found:
  Record 1: John Smith | john.smith@email.com
  Record 3: John Smyth | john.smyth@gmail.com
  -> Combined similarity: 95%
Potential duplicate found:
  Record 1: John Smith | john.smith@email.com
  Record 5: John Smith | johnsmith@email.com
  -> Combined similarity: 100%
Potential duplicate found:
  Record 2: Jon Smith | j.smith@email.com
  Record 3: John Smyth | john.smyth@gmail.com
  -> Combined similarity: 90%
Potential duplicate found:
  Record 2: Jon Smith | j.smith@email.com
  Record 5: John Smith | johnsmith@email.com
  -> Combined similarity: 95%
Potential duplicate found:
  Record 3: John Smyth | john.smyth@gmail.com
  Record 5: John Smith | johnsmith@email.com
  -> Combined similarity: 95%