Use the smlar extension to calculate data similarity - PolarDB

smlar is an open-source extension for PolarDB for PostgreSQL (Compatible with Oracle) that calculates similarity between arrays directly in the database. It supports GiST and GIN indexes for fast similarity searches across large datasets, and works with all built-in PostgreSQL data types.

Use cases: product recommendation engines, duplicate content detection, tag-based similarity matching, and search result ranking in e-commerce and search applications.

Prerequisites

Before you begin, ensure that you have:

A PolarDB for PostgreSQL (Compatible with Oracle) cluster running version 2.0 (revision 2.0.14.1.0 or later)

To check your revision version, run:

SHOW polar_version;

How smlar calculates similarity

smlar supports three similarity algorithms, configured via the smlar.type parameter:

cosine (default): Measures the angle between two arrays treated as vectors. Best for comparing arrays of different lengths where element frequency matters.
tfidf: Weights elements by how rare they are across the dataset (term frequency × inverse document frequency). Best when some elements are more significant than others.
overlap: Counts shared elements divided by total elements. Best for simple set-overlap comparisons.

All three algorithms return a float4 value between 0 and 1, where 1 means the arrays are identical.

Install and remove smlar

Install the extension in your database:

CREATE EXTENSION smlar;

The % operator of smlar conflicts with the % operator of rum. Do not create both extensions in the same schema.

To remove the extension:

DROP EXTENSION smlar;

Calculate similarity

Basic similarity

The smlar(anyarray, anyarray) function returns the similarity score between two arrays of the same data type:

SELECT smlar('{3,2}'::int[], '{3,2,1}');

Output:

  smlar
----------
 0.816497
(1 row)

Custom formula

Use the three-argument form to apply a custom similarity formula:

SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)');

Output:

  smlar
----------
 0.333333
(1 row)

The formula variables are:

Variable	Description
`N.i`	Number of identical elements in both arrays (intersection)
`N.a`	Number of unique elements in the first array
`N.b`	Number of unique elements in the second array

Threshold-based matching

Use the % operator to find records that exceed a similarity threshold. Set the threshold with smlar.threshold:

SET smlar.threshold = 0.7;

-- Find all products similar to a given feature set
SELECT product_id, features
FROM products
WHERE features % '{electronics,portable,wireless}'::text[];

This returns TRUE if the similarity score exceeds the configured threshold, and FALSE otherwise.

Use indexes for large datasets

GiST and GIN indexes accelerate % operator queries on large tables. Without an index, smlar performs a full table scan.

Create a GiST index:

CREATE INDEX ON products USING gist(features);

Create a GIN index:

CREATE INDEX ON products USING gin(features);

When to use each index type:

Index	Best for	Trade-off
GiST	Medium-sized datasets, frequent writes	Faster to build and update
GIN	Large datasets with high query volume	Slower to build, faster to query

Functions and operators

Function or operator	Description	Example	Result
`float4 smlar(anyarray, anyarray)`	Calculates similarity between two arrays of the same data type	`smlar('{1,2}'::int[], '{1,2,3}')`	`0.816497`
`float4 smlar(anyarray, anyarray, bool useIntersect)`	Calculates similarity between two arrays of custom composite types (elements and weights). Set `useIntersect` to `true` to use only overlapping elements; set to `false` to use all elements.	`smlar(arr1, arr2, true)`	`float4`
`float4 smlar(anyarray a, anyarray b, text formula)`	Calculates similarity using a custom formula with variables `N.i`, `N.a`, and `N.b`	`smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i/(N.a+N.b)')`	`0.333333`
`anyarray % anyarray`	Returns `TRUE` if the similarity of the two arrays exceeds `smlar.threshold`; otherwise `FALSE`	`'{1,2}'::int[] % '{1,2,3}'::int[]`	`t`
`text[] tsvector2textarray(tsvector)`	Converts a `tsvector` array to a `text[]` array	`tsvector2textarray(to_tsvector('cat dogs'))`	`{cat,dogs}`
`anyarray array_unique(anyarray)`	Sorts and deduplicates an array	`array_unique('{3,1,2,1}'::int[])`	`{1,2,3}`
`float4 inarray(anyarray, anyelement)`	Returns `1.0` if the element exists in the array; otherwise `0`	`inarray('{1,2,3}'::int[], 2)`	`1`
`float4 inarray(anyarray, anyelement, float4, float4)`	Returns the third parameter if the element exists in the array; otherwise the fourth parameter	`inarray('{1,2,3}'::int[], 5, 1.0, 0.5)`	`0.5`

To use the composite-type variant of smlar, first define a composite type:

CREATE TYPE weighted_item AS (element text, weight float4);

Parameters

Parameter	Type	Default	Description
`smlar.threshold`	float	—	Similarity threshold for the `%` operator. Two arrays are considered similar when their similarity score exceeds this value.
`smlar.type`	string	`cosine`	Similarity algorithm. Valid values: `cosine`, `tfidf`, `overlap`. See How smlar calculates similarity.
`smlar.tf_method`	string	`n`	Term frequency (TF) calculation method for `tfidf`. Valid values: `n` (simple count), `log` (1 + log(n)), `const` (1).
`smlar.idf_plus_one`	bool	`false`	Inverse document frequency (IDF) formula. `false` uses log(d/df); `true` uses log(1 + d/df).
`smlar.stattable`	string	—	Name of the table storing collection-wide statistics, required for `tfidf`.
`smlar.persistent_cache`	bool	—	When `true`, stores cached global statistics in transaction-independent memory.

Set up a statistics table for tfidf

The tfidf algorithm requires a statistics table that tracks how often each element appears across the collection. Create it with the following schema:

CREATE TABLE smlar_stats (
  value   data_type UNIQUE,
  ndoc    int4 (or bigint) NOT NULL CHECK (ndoc > 0)
);

Replace data_type with the data type of your array elements, and use bigint instead of int4 if you need to support large document counts.

Then point smlar to this table:

SET smlar.stattable = 'smlar_stats';

PolarDB:smlar