×
Community Blog PostgreSQL Chemical Analysis Module: RDKit Cartridge 1 for Environment Deployment

PostgreSQL Chemical Analysis Module: RDKit Cartridge 1 for Environment Deployment

In this article, we introduce the RDKit toolset for managing and analyzing large sets of chemical data and explain how PostgreSQL acts as the molecular database in this toolset.

By digoal

Background

Cheminformatics, an IT field, uses computers and computational programs to advance the collection, preservation, analysis, and management of large amounts of chemical data.

RDKit is an open source toolset used in cheminformatics. It features the following:

  • Business-friendly BSD license
  • Core data structures and algorithms in C++
  • Python 3.x wrappers generated using Boost.Python
  • Java and C# wrappers generated with SWIG
  • 2D and 3D molecular operations
  • Descriptor generation for machine learning
  • Molecular database cartridge for PostgreSQL
  • Cheminformatics nodes for KNIME (distributed from the KNIME community site)

PostgreSQL serves as the molecular database in this toolset. Important functions of PostgreSQL include:

RDKit Database Plug-in Reference

New features are as follows.

New Type

mol : an rdkit molecule. Can be created from a SMILES via direct type conversion, for example: ‘c1ccccc1’::mol creates a molecule from the SMILES ‘c1ccccc1’  
qmol : an rdkit molecule containing query features (i.e. constructed from SMARTS). Can be created from a SMARTS via direct type conversion, for example: ‘c1cccc[c,n]1’::qmol creates a query molecule from the SMARTS ‘c1cccc[c,n]1’  
sfp : a sparse count vector fingerprint (SparseIntVect in C++ and Python)  
bfp : a bit vector fingerprint (ExplicitBitVect in C++ and Python)  

Include the indexes supported by these types.

Similarity Search

% : operator used for similarity searches using Tanimoto similarity. Returns whether or not the Tanimoto similarity between two fingerprints (either two sfp or two bfp values) exceeds rdkit.tanimoto_threshold.  
# : operator used for similarity searches using Dice similarity. Returns whether or not the Dice similarity between two fingerprints (either two sfp or two bfp values) exceeds rdkit.dice_threshold.  
<%> : used for Tanimoto KNN searches (to return ordered lists of neighbors).  
<#> : used for Dice KNN searches (to return ordered lists of neighbors).  

Substructure and Exact Structure Search

@> : substructure search operator. Returns whether or not the mol or qmol on the right is a substructure of the mol on the left.  
<@ : substructure search operator. Returns whether or not the mol or qmol on the left is a substructure of the mol on the right.  
@= : returns whether or not two molecules are the same.  

Molecule Comparison

< : returns whether or not the left mol is less than the right mol  
> : returns whether or not the left mol is greater than the right mol  
= : returns whether or not the left mol is equal to the right mol  
<= : returns whether or not the left mol is less than or equal to the right mol  
>= : returns whether or not the left mol is greater than or equal to the right mol  

Fingerprint Operation Functions

  • Generating fingerprints
  • Working with fingerprints
  • Fingerprint I/O

Molecular-related Operation Functions

  • Molecule I/O and Validation
  • Substructure operations
  • Descriptors
  • Connectivity Descriptors
  • MCS

Deploy the RDKit Database Environment

Method 1: Deploy the Integrated Environment Using Anaconda

  1. Add users.
# useradd rdkit  
# su - rdkit  

2) Download and install Anaconda.

wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh  
  
bash ./Anaconda3-2019.10-Linux-x86_64.sh   

3) Set environment variables.

. /home/rdkit/.bashrc  
  
(base) [rdkit@pg11-test ~]$ which conda  
~/anaconda3/bin/conda  

4) Install RDKit and PostgreSQL that inherit the RDKit plug-in using Anaconda.

(base) [rdkit@pg11-test ~]$ conda install -c rdkit rdkit-postgresql  
  
Collecting package metadata (current_repodata.json): done  
Solving environment: done  
  
## Package Plan ##  
  
  environment location: /home/rdkit/anaconda3  
  
  added / updated specs:  
    - rdkit-postgresql  
  
The following packages will be downloaded:  
  
    package                    |            build  
    ---------------------------|-----------------  
    libboost-1.67.0            |       h46d08c1_4        13.0 MB  
    libpq-11.2                 |       h20c2e04_0         2.0 MB  
    postgresql-11.2            |       h20c2e04_0         3.7 MB  
    rdkit-postgresql-2019.09.1.0|       h5fa6086_0         2.9 MB  rdkit  
    ------------------------------------------------------------  
                                           Total:        21.5 MB  
  
The following NEW packages will be INSTALLED:  
  
  libboost           pkgs/main/linux-64::libboost-1.67.0-h46d08c1_4  
  libpq              pkgs/main/linux-64::libpq-11.2-h20c2e04_0  
  postgresql         pkgs/main/linux-64::postgresql-11.2-h20c2e04_0  
  rdkit-postgresql   rdkit/linux-64::rdkit-postgresql-2019.09.1.0-h5fa6086_0  
  
Proceed ([y]/n)? y    

5) After installing PostgreSQL, verify it. However, if you only install a few plugins now, how can you add plugins subsequently? For this, we provide examples later.

(base) [rdkit@pg11-test ~]$ which psql  
~/anaconda3/bin/psql  
(base) [rdkit@pg11-test ~]$ which initdb  
~/anaconda3/bin/initdb  
(base) [rdkit@pg11-test ~]$ which pgbench  
~/anaconda3/bin/pgbench  
  
cd /home/rdkit/anaconda3/share/extension  
  
(base) [rdkit@pg11-test extension]$ ll  
total 60  
-rw-rw-r-- 2 rdkit rdkit   332 Mar 14  2019 plpgsql--1.0.sql  
-rw-rw-r-- 2 rdkit rdkit   179 Mar 14  2019 plpgsql.control  
-rw-rw-r-- 2 rdkit rdkit   381 Mar 14  2019 plpgsql--unpackaged--1.0.sql  
-rw-rw-r-- 2 rdkit rdkit 42915 Oct 25 14:01 rdkit--3.8.sql  
-rw-rw-r-- 2 rdkit rdkit   135 Oct 25 14:01 rdkit.control  

6) The compilation parameters that Anaconda compiled for PostgreSQL are as follows:

(base) [rdkit@pg11-test ~]$ pg_config   
BINDIR = /home/rdkit/anaconda3/bin  
DOCDIR = /home/rdkit/anaconda3/share/doc  
HTMLDIR = /home/rdkit/anaconda3/share/doc  
INCLUDEDIR = /home/rdkit/anaconda3/include  
PKGINCLUDEDIR = /home/rdkit/anaconda3/include  
INCLUDEDIR-SERVER = /home/rdkit/anaconda3/include/server  
LIBDIR = /home/rdkit/anaconda3/lib  
PKGLIBDIR = /home/rdkit/anaconda3/lib  
LOCALEDIR = /home/rdkit/anaconda3/share/locale  
MANDIR = /home/rdkit/anaconda3/share/man  
SHAREDIR = /home/rdkit/anaconda3/share  
SYSCONFDIR = /home/rdkit/anaconda3/etc  
PGXS = /home/rdkit/anaconda3/lib/pgxs/src/makefiles/pgxs.mk  
CONFIGURE = '--prefix=/home/rdkit/anaconda3' '--with-readline' '--with-libraries=/home/rdkit/anaconda3/lib' '--with-includes=/home/rdkit/anaconda3/include' '--with-openssl' '--with-gssapi' 'CC=/tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc' 'CFLAGS=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -I/home/rdkit/anaconda3/include -fdebug-prefix-map=/tmp/build/80754af9/postgresql-split_1552510884761/work=/usr/local/src/conda/postgresql-split-11.2 -fdebug-prefix-map=/home/rdkit/anaconda3=/usr/local/src/conda-prefix' 'LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/rdkit/anaconda3/lib -Wl,-rpath-link,/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib' 'CPPFLAGS=-DNDEBUG -D_FORTIFY_SOURCE=2 -O2' 'CPP=/tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cpp' 'PKG_CONFIG_PATH=/home/rdkit/anaconda3/lib/pkgconfig'  
CC = /tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc  
CPPFLAGS = -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -D_GNU_SOURCE -I/home/rdkit/anaconda3/include  
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -I/home/rdkit/anaconda3/include -fdebug-prefix-map=/tmp/build/80754af9/postgresql-split_1552510884761/work=/usr/local/src/conda/postgresql-split-11.2 -fdebug-prefix-map=/home/rdkit/anaconda3=/usr/local/src/conda-prefix  
CFLAGS_SL = -fPIC  
LDFLAGS = -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/rdkit/anaconda3/lib -Wl,-rpath-link,/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib -Wl,--as-needed -Wl,-rpath,'/home/rdkit/anaconda3/lib',--disable-new-dtags  
LDFLAGS_EX =   
LDFLAGS_SL =   
LIBS = -lpgcommon -lpgport -lpthread -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm   
VERSION = PostgreSQL 11.2  

7) Prepare a database directory.

su - root  
(base) [root@pg11-test ~]# mkdir /data02/rdkit  
(base) [root@pg11-test ~]# chown rdkit:rdkit /data02/rdkit  

8) Set database environment variables.

su - rdkit  
  
vi .bash_profile  
  
export PGHOME=/home/rdkit/anaconda3  
export PGDATA=/data02/rdkit/pgdata  
export LD_LIBRARY_PATH=$PGHOME/lib:$LD_LIBRARY_PATH  
export PATH=$PGHOME/bin:$PATH  
export PGDATABASE=postgres  
export PGHOST=$PGDATA  
export PGPORT=8800  
export PGUSER=postgres  

9) Initialize the database.

initdb -D $PGDATA -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8 -U postgres  

10) Configure database parameters on machines with 512-gigabyte memory.

vi $PGDATA/postgresql.conf  
  
listen_addresses = '0.0.0.0'  
port = 8800  
max_connections = 1000  
superuser_reserved_connections = 13  
unix_socket_directories = '.'  
tcp_keepalives_idle = 60  
tcp_keepalives_interval = 10  
tcp_keepalives_count = 10  
shared_buffers = 32GB  
maintenance_work_mem = 2GB  
autovacuum_work_mem = -1  
dynamic_shared_memory_type = posix  
vacuum_cost_delay = 0  
bgwriter_delay = 10ms  
bgwriter_lru_maxpages = 1000  
bgwriter_lru_multiplier = 10.0  
effective_io_concurrency = 0  
max_worker_processes = 128  
max_parallel_maintenance_workers = 4  
max_parallel_workers_per_gather = 0  
max_parallel_workers = 24  
wal_level = replica  
fsync = on  
synchronous_commit = off  
full_page_writes = on  
wal_compression = on  
wal_buffers = 16MB  
wal_writer_delay = 10ms  
checkpoint_timeout = 15min  
max_wal_size = 96GB  
min_wal_size = 16GB  
checkpoint_completion_target = 0.3  
archive_mode = on  
archive_command = '/bin/date'  
max_wal_senders = 32  
max_replication_slots = 16  
max_standby_archive_delay = 300s  
max_standby_streaming_delay = 300s  
wal_receiver_status_interval = 1s  
random_page_cost = 1.1  
effective_cache_size = 512GB  
log_destination = 'csvlog'  
logging_collector = on  
log_filename = 'postgresql-%d.log'  
log_truncate_on_rotation = on  
log_checkpoints = on  
log_error_verbosity = verbose  
log_statement = 'ddl'  
log_timezone = 'PRC'  
autovacuum = on  
log_autovacuum_min_duration = 0  
autovacuum_max_workers = 3  
autovacuum_freeze_max_age = 1200000000  
autovacuum_multixact_freeze_max_age = 1400000000  
autovacuum_vacuum_cost_delay = 0ms  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'en_US.UTF-8'  
lc_monetary = 'en_US.UTF-8'  
lc_numeric = 'en_US.UTF-8'  
lc_time = 'en_US.UTF-8'  
default_text_search_config = 'pg_catalog.english'  
  
vi $PGDATA/pg_hba.conf  
host all all 0.0.0.0/0 md5  
host replication all 0.0.0.0/0 md5  

11) Start the database and install the plug-in.

pg_ctl start  
  
psql  
psql (11.2)  
Type "help" for help.  
  
postgres=# \l  
                               List of databases  
   Name    |  Owner   | Encoding | Collate |   Ctype    |   Access privileges     
-----------+----------+----------+---------+------------+-----------------------  
 postgres  | postgres | UTF8     | C       | en_US.utf8 |   
 template0 | postgres | UTF8     | C       | en_US.utf8 | =c/postgres          +  
           |          |          |         |            | postgres=CTc/postgres  
 template1 | postgres | UTF8     | C       | en_US.utf8 | =c/postgres          +  
           |          |          |         |            | postgres=CTc/postgres  
(3 rows)  
  
postgres=# create extension rdkit ;  
CREATE EXTENSION  

Add New Plug-ins

You need to add new plug-ins because there are only a few packaged plug-ins.

For example, install the PostgreSQL plug-ins: https://www.postgresql.org/ftp/source/v11.6/

(base) [rdkit@pg11-test ~]$ wget https://ftp.postgresql.org/pub/source/v11.6/postgresql-11.6.tar.bz2  
  
tar -jxvf postgresql-11.6.tar.bz2  
  
cd   
./configure --prefix=/home/rdkit/anaconda3  
cd contrib  

Modify Makefile.

vi Makefile  
  
删除  
  
vacuumlo  
oid2name  
pg_standby  

Installation:

USE_PGXS=1 make   
USE_PGXS=1 make install  

After installation, copy the lib and SQL files to the correct directory.

cd /home/rdkit/anaconda3  
  
mv lib/postgresql/* lib/  
mv share/postgresql/extension/* share/extension/  

Method 2: Install Source Code

Please refer to https://github.com/rdkit/rdkit/releases

wget https://github.com/rdkit/rdkit/archive/Release_2019_09_1.tar.gz   
tar -zxvf Release_2019_09_1.tar.gz   
cd rdkit-Release_2019_09_1  
  
mkdir build  
cd build  
  
必须使用python3  
  
export PATH  
export RDBASE  
export LD_LIBRARY_PATH  
  
cmake .. -DPYTHON_EXECUTABLE=/usr/bin/python3  
make  
make install  

cd rdkit-Release_2019_09_1/Code/PgSQL/rdkit  
USE_PGXS=1 make  
USE_PGXS=1 make install  

Reference Guide

Please refer to https://www.rdkit.org/docs/Cartridge.html#reference-guide

Download the test data:

Create the table with reference to documents, pour data, generate indexes, and then query and analyze the table.

References

0 0 0
Share on

digoal

173 posts | 10 followers

You may also like

Comments