smlar is an open-source extension for PolarDB for PostgreSQL that calculates array similarity inside the database. It supports GiST and GIN indexes for fast similarity searches, and works with all built-in PostgreSQL data types.
Three similarity algorithms are available: consine (default), TF-IDF, and overlap. Choose the algorithm based on your use case — cosine is a general-purpose vector similarity measure, TF-IDF weights terms by frequency and rarity (useful for text-like arrays), and overlap counts shared elements without normalization.
Supported versions
smlar requires one of the following PolarDB for PostgreSQL versions:
PolarDB for PostgreSQL 16 (revision version 2.0.16.9.6.0 or later)
PolarDB for PostgreSQL 14 (revision version 2.0.14.5.1.0 or later)
PolarDB for PostgreSQL 11 (revision version 2.0.11.9.28.0 or later)
To check your revision version, go to the PolarDB console or run SHOW polardb_version;. If the revision version does not meet the requirements, update it.
Install and remove smlar
Install:
CREATE EXTENSION smlar;The % operator in smlar conflicts with the % operator in rum. Installing both extensions in the same schema is not supported. To use smlar alongside rum, install them in separate schemas.
Remove:
DROP EXTENSION smlar;Calculate array similarity
Basic similarity — same data type:
SELECT smlar('{3,2}'::int[], '{3,2,1}');
smlar
----------
0.816497
(1 row)Custom formula:
SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)' );
smlar
----------
0.333333
(1 row)In the formula, N.i is the number of identical elements (intersection), N.a is the number of unique elements in the first array, and N.b is the number of unique elements in the second array.
Threshold-based similarity test:
Use the % operator to check whether the similarity of two arrays exceeds the threshold set by smlar.threshold:
-- Set the threshold
SET smlar.threshold = 0.6;
-- Confirm the current value
SHOW smlar.threshold;
smlar.threshold
-----------------
0.6
(1 row)
-- Test similarity (returns TRUE or FALSE)
SELECT '{3,2}'::int[] % '{3,2,1}'::int[];Functions and operators
| Function or operator | Description |
|---|---|
float4 smlar(anyarray, anyarray) | Calculates the similarity of two arrays of the same data type. |
float4 smlar(anyarray, anyarray, bool useIntersect) | Calculates the similarity of two arrays of a custom composite type (element and weight pairs). Set useIntersect to TRUE to include only overlapping elements, or FALSE to include all elements. Define the composite type with: CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4); |
float4 smlar(anyarray a, anyarray b, text formula) | Calculates the similarity using a custom formula. Predefined variables: N.i (intersection count), N.a (unique elements in array a), N.b (unique elements in array b). |
anyarray % anyarray | Returns TRUE if the similarity of the two arrays exceeds smlar.threshold. Otherwise returns FALSE. |
text[] tsvector2textarray(tsvector) | Converts a tsvector value to a text[] array. |
anyarray array_unique(anyarray) | Sorts and removes duplicate elements from the array. |
float4 inarray(anyarray, anyelement) | Returns 1.0 if the element is in the array; otherwise returns 0. |
float4 inarray(anyarray, anyelement, float4, float4) | Returns the third argument if the element is in the array; otherwise returns the fourth argument. |
Parameters
Set parameters using the PostgreSQL SET command or in postgresql.conf.
General parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
smlar.threshold | FLOAT | — | The minimum similarity score for the % operator to return TRUE. |
smlar.persistent_cache | BOOL | — | Whether to store cached global statistics in transaction-independent memory. |
smlar.type | STRING | consine | The similarity algorithm. Valid values: consine, tfidf, overlap. |
TF-IDF parameters
These parameters apply only when smlar.type = 'tfidf'.
| Parameter | Type | Default | Description |
|---|---|---|---|
smlar.stattable | STRING | — | The name of the table that stores collection-level statistics. Create it with: CREATE TABLE table_name (value data_type UNIQUE, ndoc int4 NOT NULL CHECK (ndoc>0)); |
smlar.tf_method | STRING | n | The term frequency (TF) calculation method. Valid values: n (simple count), log (1 + log(n)), const (1). |
smlar.idf_plus_one | BOOL | FALSE | The inverse document frequency (IDF) formula. FALSE uses log(d/df); TRUE uses log(1 + d/df). |