This topic describes how to use the RDKit plug-in of ApsaraDB RDS for PostgreSQL to implement functions such as molecular computing and search.

Prerequisites

Your RDS instance runs PostgreSQL 12.

Background information

RDKit supports two data types: the mol data type that is used to describe molecular types, and the fp data type that is used to describe molecular fingerprints. It allows for comparison computing, similarity computing based on the Tanimoto and Dice coefficients, and GiST indexing.

For more information about the SQL statements that are supported by RDKit, visit RDKit SQL.

Precautions

  • Input and output functions based on the mol data type comply with the simplified molecular input line entry specification (SMILES).
  • Input and output functions based on the fp data type comply with the bytea format that is used to store binary data.

Create the RDKit plug-in

postgres=# create extension rdkit ;
CREATE EXTENSION

Default parameter settings

postgres=# show rdkit.tanimoto_threshold ;
 rdkit.tanimoto_threshold 
--------------------------
 0.5
(1 row)

postgres=# show rdkit.dice_threshold;
 rdkit.dice_threshold 
----------------------
 0.5
(1 row)

Indexes supported

  • B-tree and hash indexes are supported for comparison computing operations that are based on the mol and fp data types. Examples:
    CREATE INDEX molidx ON pgmol (mol);
    CREATE INDEX molidx ON pgmol (fp);
  • GiST indexes are supported for the following operations that are based on the mol and fp data types: "mol % mol", "mol # mol", "mol @> mol", "mol <@ mol", "fp % fp", and "fp # fp." Example:
    CREATE INDEX molidx ON pgmol USING gist (mol);

Sample functions

  • The tanimoto_sml function calculates the degree of similarity based on the Tanimoto coefficient.
    postgres=# \df tanimoto_sml
                               List of functions
     Schema |     Name     | Result data type | Argument data types | Type 
    --------+--------------+------------------+---------------------+------
     public | tanimoto_sml | double precision | bfp, bfp            | func
     public | tanimoto_sml | double precision | sfp, sfp            | func
    (2 rows)
  • The dice_sml function calculates the degree of similarity based on the Dice coefficient.
    postgres=# \df dice_sml
                             List of functions
     Schema |   Name   | Result data type | Argument data types | Type 
    --------+----------+------------------+---------------------+------
     public | dice_sml | double precision | bfp, bfp            | func
     public | dice_sml | double precision | sfp, sfp            | func
    (2 rows)
  • If the second argument is a substructure of the first argument, the substruct function returns the TRUE value.
    postgres=# \df substruct
                             List of functions
     Schema |   Name    | Result data type | Argument data types | Type 
    --------+-----------+------------------+---------------------+------
     public | substruct | boolean          | mol, mol            | func
     public | substruct | boolean          | mol, qmol           | func
     public | substruct | boolean          | reaction, reaction  | func
    (3 rows)

Basic operations

  • mol % mol and fp % fp

    If the degree of similarity that is calculated based on the Tanimoto coefficient is less than the value of the rdkit.tanimoto_threshold GUC variable, the TRUE value is returned.

  • mol # mol and fp # fp

    If the degree of similarity that is calculated based on the Dice coefficient is less than the value of the rdkit.dice_threshold GUC variable, the TRUE value is returned.

  • mol @> mol

    If the left operand contains the right operand, the TRUE value is returned.

  • mol <@ mol

    If the right operand contains the left operand, the TRUE value is returned.