All Products
Search
Document Center

ApsaraDB RDS:Cheminformatics and molecular retrieval (RDKit)

Last Updated:Mar 30, 2026

The RDKit plug-in adds cheminformatics capabilities to ApsaraDB RDS for PostgreSQL, including molecular storage, similarity search using Tanimoto and Dice coefficients, substructure search, and GiST indexing for large compound libraries.

Prerequisites

Before you begin, ensure that you have:

  • An RDS instance running PostgreSQL 12

Data types

RDKit introduces data types for working with molecular data.

Type Description Input format Example
mol Molecule SMILES (simplified molecular input line entry specification) 'c1ccccc1'::mol (benzene)
fp Molecular fingerprint bytea binary format
bfp Bit vector fingerprint Accepted by tanimoto_sml, dice_sml
sfp Sparse count vector fingerprint Accepted by tanimoto_sml, dice_sml
qmol Query molecule with SMARTS features SMARTS 'c1cccc[c,n]1'::qmol

For the full list of supported SQL statements, see RDKit SQL.

Enable the RDKit plug-in

Run the following command to create the extension:

postgres=# CREATE EXTENSION rdkit;
CREATE EXTENSION

Configure similarity thresholds

The % and # operators compare molecular similarity against configurable GUC thresholds. The default value for both is 0.5.

postgres=# SHOW rdkit.tanimoto_threshold;
 rdkit.tanimoto_threshold
--------------------------
 0.5
(1 row)

postgres=# SHOW rdkit.dice_threshold;
 rdkit.dice_threshold
----------------------
 0.5
(1 row)

GUC parameters

Parameter Description Default
rdkit.tanimoto_threshold Tanimoto similarity threshold for the % operator 0.5
rdkit.dice_threshold Dice similarity threshold for the # operator 0.5

Create indexes

Choose an index type based on the operation you plan to run.

B-tree and hash indexes support comparison operations on mol and fp columns:

CREATE INDEX molidx ON pgmol (mol);
CREATE INDEX molidx ON pgmol (fp);

GiST indexes support similarity and substructure search operators (mol % mol, mol # mol, mol @> mol, mol <@ mol, fp % fp, fp # fp):

CREATE INDEX molidx ON pgmol USING gist (mol);

Operators

Similarity operators

Operator Returns true when
mol % mol, fp % fp Tanimoto similarity < rdkit.tanimoto_threshold
mol # mol, fp # fp Dice similarity < rdkit.dice_threshold

Substructure operators

Operator Returns true when
mol @> mol Left operand contains right operand as a substructure
mol <@ mol Right operand contains left operand as a substructure

Functions

tanimoto_sml

Calculates the Tanimoto similarity between two fingerprints and returns a double precision value between 0 and 1.

postgres=# \df tanimoto_sml
                           List of functions
 Schema |     Name     | Result data type | Argument data types | Type
--------+--------------+------------------+---------------------+------
 public | tanimoto_sml | double precision | bfp, bfp            | func
 public | tanimoto_sml | double precision | sfp, sfp            | func
(2 rows)

dice_sml

Calculates the Dice similarity between two fingerprints and returns a double precision value between 0 and 1.

postgres=# \df dice_sml
                         List of functions
 Schema |   Name   | Result data type | Argument data types | Type
--------+----------+------------------+---------------------+------
 public | dice_sml | double precision | bfp, bfp            | func
 public | dice_sml | double precision | sfp, sfp            | func
(2 rows)

substruct

Returns true if the second argument is a substructure of the first.

postgres=# \df substruct
                         List of functions
 Schema |   Name    | Result data type | Argument data types | Type
--------+-----------+------------------+---------------------+------
 public | substruct | boolean          | mol, mol            | func
 public | substruct | boolean          | mol, qmol           | func
 public | substruct | boolean          | reaction, reaction  | func
(3 rows)

Usage notes

  • mol input and output use SMILES.

  • fp input and output use the bytea binary format.