The RDKit plug-in adds cheminformatics capabilities to ApsaraDB RDS for PostgreSQL, including molecular storage, similarity search using Tanimoto and Dice coefficients, substructure search, and GiST indexing for large compound libraries.
Prerequisites
Before you begin, ensure that you have:
-
An RDS instance running PostgreSQL 12
Data types
RDKit introduces data types for working with molecular data.
| Type | Description | Input format | Example |
|---|---|---|---|
mol |
Molecule | SMILES (simplified molecular input line entry specification) | 'c1ccccc1'::mol (benzene) |
fp |
Molecular fingerprint | bytea binary format |
— |
bfp |
Bit vector fingerprint | — | Accepted by tanimoto_sml, dice_sml |
sfp |
Sparse count vector fingerprint | — | Accepted by tanimoto_sml, dice_sml |
qmol |
Query molecule with SMARTS features | SMARTS | 'c1cccc[c,n]1'::qmol |
For the full list of supported SQL statements, see RDKit SQL.
Enable the RDKit plug-in
Run the following command to create the extension:
postgres=# CREATE EXTENSION rdkit;
CREATE EXTENSION
Configure similarity thresholds
The % and # operators compare molecular similarity against configurable GUC thresholds. The default value for both is 0.5.
postgres=# SHOW rdkit.tanimoto_threshold;
rdkit.tanimoto_threshold
--------------------------
0.5
(1 row)
postgres=# SHOW rdkit.dice_threshold;
rdkit.dice_threshold
----------------------
0.5
(1 row)
GUC parameters
| Parameter | Description | Default |
|---|---|---|
rdkit.tanimoto_threshold |
Tanimoto similarity threshold for the % operator |
0.5 |
rdkit.dice_threshold |
Dice similarity threshold for the # operator |
0.5 |
Create indexes
Choose an index type based on the operation you plan to run.
B-tree and hash indexes support comparison operations on mol and fp columns:
CREATE INDEX molidx ON pgmol (mol);
CREATE INDEX molidx ON pgmol (fp);
GiST indexes support similarity and substructure search operators (mol % mol, mol # mol, mol @> mol, mol <@ mol, fp % fp, fp # fp):
CREATE INDEX molidx ON pgmol USING gist (mol);
Operators
Similarity operators
| Operator | Returns true when |
|---|---|
mol % mol, fp % fp |
Tanimoto similarity < rdkit.tanimoto_threshold |
mol # mol, fp # fp |
Dice similarity < rdkit.dice_threshold |
Substructure operators
| Operator | Returns true when |
|---|---|
mol @> mol |
Left operand contains right operand as a substructure |
mol <@ mol |
Right operand contains left operand as a substructure |
Functions
tanimoto_sml
Calculates the Tanimoto similarity between two fingerprints and returns a double precision value between 0 and 1.
postgres=# \df tanimoto_sml
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+--------------+------------------+---------------------+------
public | tanimoto_sml | double precision | bfp, bfp | func
public | tanimoto_sml | double precision | sfp, sfp | func
(2 rows)
dice_sml
Calculates the Dice similarity between two fingerprints and returns a double precision value between 0 and 1.
postgres=# \df dice_sml
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+----------+------------------+---------------------+------
public | dice_sml | double precision | bfp, bfp | func
public | dice_sml | double precision | sfp, sfp | func
(2 rows)
substruct
Returns true if the second argument is a substructure of the first.
postgres=# \df substruct
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+-----------+------------------+---------------------+------
public | substruct | boolean | mol, mol | func
public | substruct | boolean | mol, qmol | func
public | substruct | boolean | reaction, reaction | func
(3 rows)
Usage notes
-
molinput and output use SMILES. -
fpinput and output use thebyteabinary format.