Handling external input, data integration, and data cleaning often involves strings that do not match exactly. These mismatches are caused by spelling errors, abbreviations, or inconsistent formats. For example, although John Smith and Jon Smyth refer to the same person, a simple equality comparison cannot identify their association. The UTL_MATCH built-in package in PolarDB for PostgreSQL (Compatible with Oracle) solves this problem. It uses the Jaro-Winkler algorithm to calculate the similarity between two strings. This lets you perform fuzzy matching, data deduplication, and data cleaning directly in the database.
Overview
UTL_MATCH is a built-in toolkit that is highly compatible with Oracle. It calculates the similarity between two strings. The package is based on the Jaro-Winkler algorithm. This algorithm works well for comparing short strings, such as names and addresses, because it gives a higher similarity score to strings that share a common prefix.
The package includes the following two core functions:
Function | Return type | Description |
|
| Calculates the Jaro-Winkler similarity score for two strings. The return value is between 0.0 and 1.0. A score of 1.0 indicates an exact match. |
|
| Calculates the Jaro-Winkler similarity percentage for two strings. The return value ranges from 0 to 100, where 100 indicates an exact match. This value is the integer part of the |
Algorithm characteristics
Assigns a higher weight to matching characters at the beginning of the string.
Is sensitive to character order but allows for some character transposition.
Is case-insensitive. For example, the similarity between
'Hello'and'hello'is 100%.
Scope
Your PolarDB for PostgreSQL (Compatible with Oracle) cluster must run minor engine version 2.0.14.17.36.0 or later.
You can view the minor engine version in the console or by running the SHOW polardb_version; statement. If the requirement is not met, upgrade the minor engine version.
Features
Database-native processing: Perform fuzzy string matching directly within the database without transferring data to the application layer. This approach simplifies the system architecture and improves performance.
Efficient data cleaning: Quickly identify potential duplicate data that results from spelling errors, abbreviations, or format variations. This makes it a powerful tool for data quality management and data deduplication.
Notes
Handling null and empty strings: If an input parameter for the JARO_WINKLER or JARO_WINKLER_SIMILARITY function is NULL or an empty string, the function returns NULL, not 0. Handle this case in your code to prevent logic errors from unexpected NULL values.
Examples
The following examples show how to use UTL_MATCH in different use cases.
Example 1: Calculate basic string similarity
This example shows how to use the two core functions of UTL_MATCH to compare the similarity of different strings.
DECLARE
v_score BINARY_DOUBLE;
v_percent PLS_INTEGER;
BEGIN
-- 1. Compare identical strings
v_score := UTL_MATCH.JARO_WINKLER('hello', 'hello');
v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'hello');
DBMS_OUTPUT.PUT_LINE('Identical (hello vs hello) - Score: ' || v_score || ', Percentage: ' || v_percent);
-- 2. Compare similar strings with spelling errors
v_score := UTL_MATCH.JARO_WINKLER('hello', 'hallo');
v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'hallo');
DBMS_OUTPUT.PUT_LINE('Spelling error (hello vs hallo) - Score: ' || v_score || ', Percentage: ' || v_percent);
-- 3. Compare completely different strings
v_score := UTL_MATCH.JARO_WINKLER('hello', 'world');
v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('hello', 'world');
DBMS_OUTPUT.PUT_LINE('Completely different (hello vs world) - Score: ' || v_score || ', Percentage: ' || v_percent);
-- 4. Compare strings with different cases
v_score := UTL_MATCH.JARO_WINKLER('Hello', 'hello');
v_percent := UTL_MATCH.JARO_WINKLER_SIMILARITY('Hello', 'hello');
DBMS_OUTPUT.PUT_LINE('Different case (Hello vs hello) - Score: ' || v_score || ', Percentage: ' || v_percent);
END;
/Expected output:
Identical (hello vs hello) - Score: 1, Percentage: 100
Spelling error (hello vs hallo) - Score: 0.88, Percentage: 88
Completely different (hello vs world) - Score: 0.466666666666667, Percentage: 47
Different case (Hello vs hello) - Score: 1, Percentage: 100Example 2: Perform fuzzy name matching in data cleaning
This example shows how to use UTL_MATCH to perform fuzzy matching and filtering on names in a dataset. Set a similarity threshold to identify records with different spellings that might refer to the same person.
DECLARE
TYPE name_array IS VARRAY(10) OF VARCHAR2(100);
v_standard_name VARCHAR2(100) := 'John Smith';
v_test_names name_array := name_array(
'John Smith', -- Exact match
'Jon Smith', -- Misspelled first name
'John Smyth', -- Last name spelling variation
'J Smith', -- Abbreviated first name
'Smith John', -- Swapped order
'John Johnson' -- Completely different last name
);
v_similarity BINARY_DOUBLE;
v_threshold BINARY_DOUBLE := 0.8; -- Define the similarity threshold as 80%
BEGIN
DBMS_OUTPUT.PUT_LINE('Standard name: ' || v_standard_name || ', Match threshold: ' || v_threshold);
DBMS_OUTPUT.PUT_LINE('----------------------------------------');
FOR i IN 1..v_test_names.COUNT LOOP
v_similarity := UTL_MATCH.JARO_WINKLER(v_standard_name, v_test_names(i));
DBMS_OUTPUT.PUT_LINE(
'Input: ' || RPAD(v_test_names(i), 15) ||
' | Similarity: ' || ROUND(v_similarity, 3) ||
' | Is Match: ' || CASE WHEN v_similarity >= v_threshold THEN 'Yes' ELSE 'No' END
);
END LOOP;
END;
/Expected output:
Standard name: John Smith, Match threshold: 0.8
----------------------------------------
Input: John Smith | Similarity: 1 | Is Match: Yes
Input: Jon Smith | Similarity: 0.973 | Is Match: Yes
Input: John Smyth | Similarity: 0.96 | Is Match: Yes
Input: J Smith | Similarity: 0.781 | Is Match: No
Input: Smith John | Similarity: 0.533 | Is Match: No
Input: John Johnson | Similarity: 0.828 | Is Match: YesResult analysis: The Jaro-Winkler algorithm is highly tolerant of prefix matches and minor character changes. Therefore, it identifies Jon Smith and John Smyth as highly similar. In contrast, the similarity score for Smith John is much lower because the word order is reversed.
Example 3: Detect potential duplicate customer records
In a customer relationship management (CRM) system, duplicate customer records can be created from entries made through different channels. This example shows how to identify potential duplicate records by combining the similarity scores of name and email fields.
DECLARE
TYPE customer IS RECORD (id INTEGER, name VARCHAR2(100), email VARCHAR2(100));
TYPE customer_array IS TABLE OF customer;
v_customers customer_array := customer_array(
customer(1, 'John Smith', 'john.smith@email.com'),
customer(2, 'Jon Smith', 'j.smith@email.com'), -- Similar name, similar email
customer(3, 'John Smyth', 'john.smyth@gmail.com'), -- Similar name, different email
customer(4, 'Jane Doe', 'jane.doe@email.com'), -- Completely different
customer(5, 'John Smith', 'johnsmith@email.com') -- Same name, similar email
);
v_name_sim BINARY_DOUBLE;
v_email_sim BINARY_DOUBLE;
v_overall_sim BINARY_DOUBLE;
v_threshold BINARY_DOUBLE := 0.85; -- Define the combined similarity threshold
BEGIN
DBMS_OUTPUT.PUT_LINE('Duplicate data detection analysis (Threshold: ' || v_threshold || ')');
DBMS_OUTPUT.PUT_LINE('==============================================');
FOR i IN 1..v_customers.COUNT LOOP
FOR j IN i + 1..v_customers.COUNT LOOP
v_name_sim := UTL_MATCH.JARO_WINKLER(v_customers(i).name, v_customers(j).name);
v_email_sim := UTL_MATCH.JARO_WINKLER(v_customers(i).email, v_customers(j).email);
-- Calculate a combined similarity score using a weighted average (70% weight for name, 30% weight for email)
v_overall_sim := (v_name_sim * 0.7) + (v_email_sim * 0.3);
IF v_overall_sim >= v_threshold THEN
DBMS_OUTPUT.PUT_LINE('Potential duplicate found:');
DBMS_OUTPUT.PUT_LINE(' Record ' || v_customers(i).id || ': ' || v_customers(i).name || ' | ' || v_customers(i).email);
DBMS_OUTPUT.PUT_LINE(' Record ' || v_customers(j).id || ': ' || v_customers(j).name || ' | ' || v_customers(j).email);
DBMS_OUTPUT.PUT_LINE(' -> Combined similarity: ' || ROUND(v_overall_sim * 100) || '%');
DBMS_OUTPUT.PUT_LINE('');
END IF;
END LOOP;
END LOOP;
END;
/Expected output:
Duplicate data detection analysis (Threshold: 0.85)
==============================================
Potential duplicate found:
Record 1: John Smith | john.smith@email.com
Record 2: Jon Smith | j.smith@email.com
-> Combined similarity: 95%
Potential duplicate found:
Record 1: John Smith | john.smith@email.com
Record 3: John Smyth | john.smyth@gmail.com
-> Combined similarity: 95%
Potential duplicate found:
Record 1: John Smith | john.smith@email.com
Record 5: John Smith | johnsmith@email.com
-> Combined similarity: 100%
Potential duplicate found:
Record 2: Jon Smith | j.smith@email.com
Record 3: John Smyth | john.smyth@gmail.com
-> Combined similarity: 90%
Potential duplicate found:
Record 2: Jon Smith | j.smith@email.com
Record 5: John Smith | johnsmith@email.com
-> Combined similarity: 95%
Potential duplicate found:
Record 3: John Smyth | john.smyth@gmail.com
Record 5: John Smith | johnsmith@email.com
-> Combined similarity: 95%