smlar is an open-source extension for PolarDB for PostgreSQL (Compatible with Oracle) that calculates similarity between arrays directly in the database. It supports GiST and GIN indexes for fast similarity searches across large datasets, and works with all built-in PostgreSQL data types.
Use cases: product recommendation engines, duplicate content detection, tag-based similarity matching, and search result ranking in e-commerce and search applications.
Prerequisites
Before you begin, ensure that you have:
A PolarDB for PostgreSQL (Compatible with Oracle) cluster running version 2.0 (revision 2.0.14.1.0 or later)
To check your revision version, run:
SHOW polar_version;How smlar calculates similarity
smlar supports three similarity algorithms, configured via the smlar.type parameter:
cosine (default): Measures the angle between two arrays treated as vectors. Best for comparing arrays of different lengths where element frequency matters.
tfidf: Weights elements by how rare they are across the dataset (term frequency × inverse document frequency). Best when some elements are more significant than others.
overlap: Counts shared elements divided by total elements. Best for simple set-overlap comparisons.
All three algorithms return a float4 value between 0 and 1, where 1 means the arrays are identical.
Install and remove smlar
Install the extension in your database:
CREATE EXTENSION smlar;The%operator of smlar conflicts with the%operator of rum. Do not create both extensions in the same schema.
To remove the extension:
DROP EXTENSION smlar;Calculate similarity
Basic similarity
The smlar(anyarray, anyarray) function returns the similarity score between two arrays of the same data type:
SELECT smlar('{3,2}'::int[], '{3,2,1}');Output:
smlar
----------
0.816497
(1 row)Custom formula
Use the three-argument form to apply a custom similarity formula:
SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)');Output:
smlar
----------
0.333333
(1 row)The formula variables are:
| Variable | Description |
|---|---|
N.i | Number of identical elements in both arrays (intersection) |
N.a | Number of unique elements in the first array |
N.b | Number of unique elements in the second array |
Threshold-based matching
Use the % operator to find records that exceed a similarity threshold. Set the threshold with smlar.threshold:
SET smlar.threshold = 0.7;
-- Find all products similar to a given feature set
SELECT product_id, features
FROM products
WHERE features % '{electronics,portable,wireless}'::text[];This returns TRUE if the similarity score exceeds the configured threshold, and FALSE otherwise.
Use indexes for large datasets
GiST and GIN indexes accelerate % operator queries on large tables. Without an index, smlar performs a full table scan.
Create a GiST index:
CREATE INDEX ON products USING gist(features);Create a GIN index:
CREATE INDEX ON products USING gin(features);When to use each index type:
| Index | Best for | Trade-off |
|---|---|---|
| GiST | Medium-sized datasets, frequent writes | Faster to build and update |
| GIN | Large datasets with high query volume | Slower to build, faster to query |
Functions and operators
| Function or operator | Description | Example | Result |
|---|---|---|---|
float4 smlar(anyarray, anyarray) | Calculates similarity between two arrays of the same data type | smlar('{1,2}'::int[], '{1,2,3}') | 0.816497 |
float4 smlar(anyarray, anyarray, bool useIntersect) | Calculates similarity between two arrays of custom composite types (elements and weights). Set useIntersect to true to use only overlapping elements; set to false to use all elements. | smlar(arr1, arr2, true) | float4 |
float4 smlar(anyarray a, anyarray b, text formula) | Calculates similarity using a custom formula with variables N.i, N.a, and N.b | smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i/(N.a+N.b)') | 0.333333 |
anyarray % anyarray | Returns TRUE if the similarity of the two arrays exceeds smlar.threshold; otherwise FALSE | '{1,2}'::int[] % '{1,2,3}'::int[] | t |
text[] tsvector2textarray(tsvector) | Converts a tsvector array to a text[] array | tsvector2textarray(to_tsvector('cat dogs')) | {cat,dogs} |
anyarray array_unique(anyarray) | Sorts and deduplicates an array | array_unique('{3,1,2,1}'::int[]) | {1,2,3} |
float4 inarray(anyarray, anyelement) | Returns 1.0 if the element exists in the array; otherwise 0 | inarray('{1,2,3}'::int[], 2) | 1 |
float4 inarray(anyarray, anyelement, float4, float4) | Returns the third parameter if the element exists in the array; otherwise the fourth parameter | inarray('{1,2,3}'::int[], 5, 1.0, 0.5) | 0.5 |
To use the composite-type variant of smlar, first define a composite type:
CREATE TYPE weighted_item AS (element text, weight float4);Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
smlar.threshold | float | — | Similarity threshold for the % operator. Two arrays are considered similar when their similarity score exceeds this value. |
smlar.type | string | cosine | Similarity algorithm. Valid values: cosine, tfidf, overlap. See How smlar calculates similarity. |
smlar.tf_method | string | n | Term frequency (TF) calculation method for tfidf. Valid values: n (simple count), log (1 + log(n)), const (1). |
smlar.idf_plus_one | bool | false | Inverse document frequency (IDF) formula. false uses log(d/df); true uses log(1 + d/df). |
smlar.stattable | string | — | Name of the table storing collection-wide statistics, required for tfidf. |
smlar.persistent_cache | bool | — | When true, stores cached global statistics in transaction-independent memory. |
Set up a statistics table for tfidf
The tfidf algorithm requires a statistics table that tracks how often each element appears across the collection. Create it with the following schema:
CREATE TABLE smlar_stats (
value data_type UNIQUE,
ndoc int4 (or bigint) NOT NULL CHECK (ndoc > 0)
);Replace data_type with the data type of your array elements, and use bigint instead of int4 if you need to support large document counts.
Then point smlar to this table:
SET smlar.stattable = 'smlar_stats';