All Products
Search
Document Center

PolarDB:smlar

Last Updated:Mar 28, 2026

smlar is an open-source extension for PolarDB for PostgreSQL (Compatible with Oracle) that calculates similarity between arrays directly in the database. It supports GiST and GIN indexes for fast similarity searches across large datasets, and works with all built-in PostgreSQL data types.

Use cases: product recommendation engines, duplicate content detection, tag-based similarity matching, and search result ranking in e-commerce and search applications.

Prerequisites

Before you begin, ensure that you have:

  • A PolarDB for PostgreSQL (Compatible with Oracle) cluster running version 2.0 (revision 2.0.14.1.0 or later)

To check your revision version, run:

SHOW polar_version;

How smlar calculates similarity

smlar supports three similarity algorithms, configured via the smlar.type parameter:

  • cosine (default): Measures the angle between two arrays treated as vectors. Best for comparing arrays of different lengths where element frequency matters.

  • tfidf: Weights elements by how rare they are across the dataset (term frequency × inverse document frequency). Best when some elements are more significant than others.

  • overlap: Counts shared elements divided by total elements. Best for simple set-overlap comparisons.

All three algorithms return a float4 value between 0 and 1, where 1 means the arrays are identical.

Install and remove smlar

Install the extension in your database:

CREATE EXTENSION smlar;
The % operator of smlar conflicts with the % operator of rum. Do not create both extensions in the same schema.

To remove the extension:

DROP EXTENSION smlar;

Calculate similarity

Basic similarity

The smlar(anyarray, anyarray) function returns the similarity score between two arrays of the same data type:

SELECT smlar('{3,2}'::int[], '{3,2,1}');

Output:

  smlar
----------
 0.816497
(1 row)

Custom formula

Use the three-argument form to apply a custom similarity formula:

SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)');

Output:

  smlar
----------
 0.333333
(1 row)

The formula variables are:

VariableDescription
N.iNumber of identical elements in both arrays (intersection)
N.aNumber of unique elements in the first array
N.bNumber of unique elements in the second array

Threshold-based matching

Use the % operator to find records that exceed a similarity threshold. Set the threshold with smlar.threshold:

SET smlar.threshold = 0.7;

-- Find all products similar to a given feature set
SELECT product_id, features
FROM products
WHERE features % '{electronics,portable,wireless}'::text[];

This returns TRUE if the similarity score exceeds the configured threshold, and FALSE otherwise.

Use indexes for large datasets

GiST and GIN indexes accelerate % operator queries on large tables. Without an index, smlar performs a full table scan.

Create a GiST index:

CREATE INDEX ON products USING gist(features);

Create a GIN index:

CREATE INDEX ON products USING gin(features);

When to use each index type:

IndexBest forTrade-off
GiSTMedium-sized datasets, frequent writesFaster to build and update
GINLarge datasets with high query volumeSlower to build, faster to query

Functions and operators

Function or operatorDescriptionExampleResult
float4 smlar(anyarray, anyarray)Calculates similarity between two arrays of the same data typesmlar('{1,2}'::int[], '{1,2,3}')0.816497
float4 smlar(anyarray, anyarray, bool useIntersect)Calculates similarity between two arrays of custom composite types (elements and weights). Set useIntersect to true to use only overlapping elements; set to false to use all elements.smlar(arr1, arr2, true)float4
float4 smlar(anyarray a, anyarray b, text formula)Calculates similarity using a custom formula with variables N.i, N.a, and N.bsmlar('{1,4,6}'::int[], '{5,4,6}', 'N.i/(N.a+N.b)')0.333333
anyarray % anyarrayReturns TRUE if the similarity of the two arrays exceeds smlar.threshold; otherwise FALSE'{1,2}'::int[] % '{1,2,3}'::int[]t
text[] tsvector2textarray(tsvector)Converts a tsvector array to a text[] arraytsvector2textarray(to_tsvector('cat dogs')){cat,dogs}
anyarray array_unique(anyarray)Sorts and deduplicates an arrayarray_unique('{3,1,2,1}'::int[]){1,2,3}
float4 inarray(anyarray, anyelement)Returns 1.0 if the element exists in the array; otherwise 0inarray('{1,2,3}'::int[], 2)1
float4 inarray(anyarray, anyelement, float4, float4)Returns the third parameter if the element exists in the array; otherwise the fourth parameterinarray('{1,2,3}'::int[], 5, 1.0, 0.5)0.5

To use the composite-type variant of smlar, first define a composite type:

CREATE TYPE weighted_item AS (element text, weight float4);

Parameters

ParameterTypeDefaultDescription
smlar.thresholdfloatSimilarity threshold for the % operator. Two arrays are considered similar when their similarity score exceeds this value.
smlar.typestringcosineSimilarity algorithm. Valid values: cosine, tfidf, overlap. See How smlar calculates similarity.
smlar.tf_methodstringnTerm frequency (TF) calculation method for tfidf. Valid values: n (simple count), log (1 + log(n)), const (1).
smlar.idf_plus_oneboolfalseInverse document frequency (IDF) formula. false uses log(d/df); true uses log(1 + d/df).
smlar.stattablestringName of the table storing collection-wide statistics, required for tfidf.
smlar.persistent_cacheboolWhen true, stores cached global statistics in transaction-independent memory.

Set up a statistics table for tfidf

The tfidf algorithm requires a statistics table that tracks how often each element appears across the collection. Create it with the following schema:

CREATE TABLE smlar_stats (
  value   data_type UNIQUE,
  ndoc    int4 (or bigint) NOT NULL CHECK (ndoc > 0)
);

Replace data_type with the data type of your array elements, and use bigint instead of int4 if you need to support large document counts.

Then point smlar to this table:

SET smlar.stattable = 'smlar_stats';

References