Testing the Performance of PipelineDB for Real-Time Statistics on Virtual Machines

By Digoal

Background

PipelineDB is a type of streaming relational database developed based on PostgreSQL (PipelineDB 0.8.1 is developed based on PostgreSQL 9.4.4). This type of database features automatic processing of streaming data. Instead of raw data, PipelineDB only stores the processed data, so it is very suitable for common real-time streaming data processing scenarios.

Such scenarios include website traffic statistics, monitoring statistics for IT services, App Store access statistics, real-time statistics for IoT sensor data, and real-time statistics for logistics orders.

Introduction to PipelineDB

PipelineDB is based on, and is wire compatible with, PostgreSQL 9.4 and has added functionality including continuous SQL queries, probabilistic data structures, sliding windowing, and stream-table joins.

According to PostgreSQL's website,

"PipelineDB's fundamental abstraction is what is called a continuous view. These are much like regular SQL views, except that their defining SELECT queries can include streams as a source to read from. The most important property of continuous views is that they only store their output in the database. That output is then continuously updated incrementally as new data flows through streams, and raw stream data is discarded once all continuous views have read it."

Streaming Statistics Example

Create a continuous view without defining a table, which is similar to NoSQL.

pipeline=# CREATE CONTINUOUS VIEW v0 AS SELECT COUNT(*) FROM stream;    
CREATE CONTINUOUS VIEW  
pipeline=# CREATE CONTINUOUS VIEW v1 AS SELECT COUNT(*) FROM stream;  
CREATE CONTINUOUS VIEW

Activate the continuous view.

pipeline=# ACTIVATE;  
ACTIVATE 2

Write data into streams.

pipeline=# INSERT INTO stream (x) VALUES (1);  
INSERT 0 1  
pipeline=# SET stream_targets TO v0;  
SET  
pipeline=# INSERT INTO stream (x) VALUES (1);  
INSERT 0 1  
pipeline=# SET stream_targets TO DEFAULT;  
SET  
pipeline=# INSERT INTO stream (x) VALUES (1);  
INSERT 0 1

If you do not want to receive the streaming data, stop the streams.

pipeline=# DEACTIVATE;  
DEACTIVATE 2

Query the continuous view.

pipeline=# SELECT count FROM v0;  
 count  
-------  
     3  
(1 row)  
pipeline=# SELECT count FROM v1;  
 count  
-------  
     2  
(1 row)  
pipeline=#

How to Deploy PipelineDB

Install PipelineDB.

[root@digoal soft_bak]# rpm -ivh pipelinedb-0.8.1-centos6-x86_64.rpm   
Preparing...                ########################################### [100%]  
   1:pipelinedb             ########################################### [100%]  
/sbin/ldconfig: /opt/gcc4.9.3/lib/libstdc++.so.6.0.20-gdb.py is not an ELF file - it has the wrong magic bytes at the start.  
  
/sbin/ldconfig: /opt/gcc4.9.3/lib64/libstdc++.so.6.0.20-gdb.py is not an ELF file - it has the wrong magic bytes at the start.  
  
  
    ____  _            ___            ____  ____  
   / __ \(_)___  ___  / (_)___  ___  / __ \/ __ )  
  / /_/ / / __ \/ _ \/ / / __ \/ _ \/ / / / __  |  
 / ____/ / /_/ /  __/ / / / / /  __/ /_/ / /_/ /  
/_/   /_/ .___/\___/_/_/_/ /_/\___/_____/_____/  
       /_/  
  
PipelineDB successfully installed. To get started, initialize a  
database directory:  
  
pipeline-init -D <data directory>  
  
where <data directory> is a nonexistent directory where you'd  
like all of your database files to live.   
  
You can find the PipelineDB documentation at:  
  
http://docs.pipelinedb.com

Configure PipelineDB.

[root@digoal soft_bak]# cd /usr/lib/pipelinedb  
[root@digoal pipelinedb]# ll  
total 16  
drwxr-xr-x 2 root root 4096 Oct 15 10:47 bin  
drwxr-xr-x 5 root root 4096 Oct 15 10:47 include  
drwxr-xr-x 6 root root 4096 Oct 15 10:47 lib  
drwxr-xr-x 4 root root 4096 Oct 15 10:47 share  
  
[root@digoal pipelinedb]# useradd pdb  
[root@digoal pipelinedb]# vi /home/pdb/.bash_profile  
# add by digoal  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=1953  
export PGDATA=/data01/pg_root_1953  
export LANG=en_US.utf8  
export PGHOME=/usr/lib/pipelinedb  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=$PGDATA  
export PGDATABASE=pipeline  
export PGUSER=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi  
  
[root@digoal pipelinedb]# mkdir /data01/pg_root_1953  
[root@digoal pipelinedb]# chown pdb:pdb /data01/pg_root_1953  
[root@digoal pipelinedb]# chmod 700 /data01/pg_root_1953  
  
[root@digoal pipelinedb]# su - pdb  
pdb@digoal-> which psql  
/usr/lib/pipelinedb/bin/psql

Initialize the database.

pdb@digoal-> psql -V  
psql (PostgreSQL) 9.4.4  
  
pdb@digoal-> cd /usr/lib/pipelinedb/bin/  
pdb@digoal-> ll  
total 13M  
-rwxr-xr-x 1 root root  62K Sep 18 01:01 clusterdb  
-rwxr-xr-x 1 root root  62K Sep 18 01:01 createdb  
-rwxr-xr-x 1 root root  66K Sep 18 01:01 createlang  
-rwxr-xr-x 1 root root  63K Sep 18 01:01 createuser  
-rwxr-xr-x 1 root root  44K Sep 18 01:02 cs2cs  
-rwxr-xr-x 1 root root  58K Sep 18 01:01 dropdb  
-rwxr-xr-x 1 root root  66K Sep 18 01:01 droplang  
-rwxr-xr-x 1 root root  58K Sep 18 01:01 dropuser  
-rwxr-xr-x 1 root root 776K Sep 18 01:01 ecpg  
-rwxr-xr-x 1 root root  28K Sep 18 00:57 gdaladdo  
-rwxr-xr-x 1 root root  79K Sep 18 00:57 gdalbuildvrt  
-rwxr-xr-x 1 root root 1.3K Sep 18 00:57 gdal-config  
-rwxr-xr-x 1 root root  33K Sep 18 00:57 gdal_contour  
-rwxr-xr-x 1 root root 188K Sep 18 00:57 gdaldem  
-rwxr-xr-x 1 root root  74K Sep 18 00:57 gdalenhance  
-rwxr-xr-x 1 root root 131K Sep 18 00:57 gdal_grid  
-rwxr-xr-x 1 root root  83K Sep 18 00:57 gdalinfo  
-rwxr-xr-x 1 root root  90K Sep 18 00:57 gdallocationinfo  
-rwxr-xr-x 1 root root  42K Sep 18 00:57 gdalmanage  
-rwxr-xr-x 1 root root 236K Sep 18 00:57 gdal_rasterize  
-rwxr-xr-x 1 root root  25K Sep 18 00:57 gdalserver  
-rwxr-xr-x 1 root root  77K Sep 18 00:57 gdalsrsinfo  
-rwxr-xr-x 1 root root  49K Sep 18 00:57 gdaltindex  
-rwxr-xr-x 1 root root  33K Sep 18 00:57 gdaltransform  
-rwxr-xr-x 1 root root 158K Sep 18 00:57 gdal_translate  
-rwxr-xr-x 1 root root 168K Sep 18 00:57 gdalwarp  
-rwxr-xr-x 1 root root  41K Sep 18 01:02 geod  
-rwxr-xr-x 1 root root 1.3K Sep 18 00:51 geos-config  
lrwxrwxrwx 1 root root    4 Oct 15 10:47 invgeod -> geod  
lrwxrwxrwx 1 root root    4 Oct 15 10:47 invproj -> proj  
-rwxr-xr-x 1 root root  20K Sep 18 01:02 nad2bin  
-rwxr-xr-x 1 root root 186K Sep 18 00:57 nearblack  
-rwxr-xr-x 1 root root 374K Sep 18 00:57 ogr2ogr  
-rwxr-xr-x 1 root root  77K Sep 18 00:57 ogrinfo  
-rwxr-xr-x 1 root root 283K Sep 18 00:57 ogrlineref  
-rwxr-xr-x 1 root root  47K Sep 18 00:57 ogrtindex  
-rwxr-xr-x 1 root root  30K Sep 18 01:01 pg_config  
-rwxr-xr-x 1 root root  30K Sep 18 01:01 pg_controldata  
-rwxr-xr-x 1 root root  33K Sep 18 01:01 pg_isready  
-rwxr-xr-x 1 root root  39K Sep 18 01:01 pg_resetxlog  
-rwxr-xr-x 1 root root 183K Sep 18 01:02 pgsql2shp  
lrwxrwxrwx 1 root root    4 Oct 15 10:47 pipeline -> psql  
-rwxr-xr-x 1 root root  74K Sep 18 01:01 pipeline-basebackup  
lrwxrwxrwx 1 root root    9 Oct 15 10:47 pipeline-config -> pg_config  
-rwxr-xr-x 1 root root  44K Sep 18 01:01 pipeline-ctl  
-rwxr-xr-x 1 root root 355K Sep 18 01:01 pipeline-dump  
-rwxr-xr-x 1 root root  83K Sep 18 01:01 pipeline-dumpall  
-rwxr-xr-x 1 root root 105K Sep 18 01:01 pipeline-init  
-rwxr-xr-x 1 root root  50K Sep 18 01:01 pipeline-receivexlog  
-rwxr-xr-x 1 root root  56K Sep 18 01:01 pipeline-recvlogical  
-rwxr-xr-x 1 root root 153K Sep 18 01:01 pipeline-restore  
-rwxr-xr-x 1 root root 6.2M Sep 18 01:01 pipeline-server  
lrwxrwxrwx 1 root root   15 Oct 15 10:47 postmaster -> pipeline-server  
-rwxr-xr-x 1 root root  49K Sep 18 01:02 proj  
-rwxr-xr-x 1 root root 445K Sep 18 01:01 psql  
-rwxr-xr-x 1 root root 439K Sep 18 01:02 raster2pgsql  
-rwxr-xr-x 1 root root  62K Sep 18 01:01 reindexdb  
-rwxr-xr-x 1 root root 181K Sep 18 01:02 shp2pgsql  
-rwxr-xr-x 1 root root  27K Sep 18 00:57 testepsg  
-rwxr-xr-x 1 root root  63K Sep 18 01:01 vacuumdb  
  
pdb@digoal-> pipeline-init -D $PGDATA -U postgres -E UTF8 --locale=C -W  
pdb@digoal-> cd $PGDATA  
pdb@digoal-> ll  
total 108K  
drwx------ 5 pdb pdb 4.0K Oct 15 10:57 base  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 global  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_clog  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_dynshmem  
-rw------- 1 pdb pdb 4.4K Oct 15 10:57 pg_hba.conf  
-rw------- 1 pdb pdb 1.6K Oct 15 10:57 pg_ident.conf  
drwx------ 4 pdb pdb 4.0K Oct 15 10:57 pg_logical  
drwx------ 4 pdb pdb 4.0K Oct 15 10:57 pg_multixact  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_notify  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_replslot  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_serial  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_snapshots  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_stat  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_stat_tmp  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_subtrans  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_tblspc  
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_twophase  
-rw------- 1 pdb pdb    4 Oct 15 10:57 PG_VERSION  
drwx------ 3 pdb pdb 4.0K Oct 15 10:57 pg_xlog  
-rw------- 1 pdb pdb   88 Oct 15 10:57 pipelinedb.auto.conf  
-rw------- 1 pdb pdb  23K Oct 15 10:57 pipelinedb.conf

Configure stream processing parameters, such as setting the memory size, enabling or disabling synchronization, or setting the merged batch and number of worker processes.

pipelinedb.conf  
#------------------------------------------------------------------------------  
# CONTINUOUS VIEW OPTIONS  
#------------------------------------------------------------------------------  
  
# size of the buffer for storing unread stream tuples  
#tuple_buffer_blocks = 128MB  
  
# synchronization level for combiner commits; off, local, remote_write, or on  
#continuous_query_combiner_synchronous_commit = off  
  
# maximum amount of memory to use for combiner query executions  
#continuous_query_combiner_work_mem = 256MB  
  
# maximum memory to be used by the combiner for caching; this is independent  
# of combiner_work_mem  
#continuous_query_combiner_cache_mem = 32MB  
  
# the default fillfactor to use for continuous views  
#continuous_view_fillfactor = 50  
  
# the time in milliseconds a continuous query process will wait for a batch  
# to accumulate  
# continuous_query_max_wait = 10  
  
# the maximum number of events to accumulate before executing a continuous query  
# plan on them  
#continuous_query_batch_size = 10000  
  
# the number of parallel continuous query combiner processes to use for  
# each database  
#continuous_query_num_combiners = 2  
  
# the number of parallel continuous query worker processes to use for  
# each database  
#continuous_query_num_workers = 2  
  
# allow direct changes to be made to materialization tables?  
#continuous_query_materialization_table_updatable = off  
  
# inserts into streams should be synchronous?  
#synchronous_stream_insert = off  
  
# continuous views that should be affected when writing to streams.  
# it is string with comma separated values for continuous view names.  
#stream_targets = ''

Activate the database. As you can see, the native database supports PostgreSQL.

pdb@digoal-> pipeline-ctl start  
pdb@digoal-> psql pipeline postgres  
psql (9.4.4)  
Type "help" for help.  
  
pipeline=# \l  
                             List of databases  
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges     
-----------+----------+----------+---------+-------+-----------------------  
 pipeline  | postgres | UTF8     | C       | C     |   
 template0 | postgres | UTF8     | C       | C     | =c/postgres          +  
           |          |          |         |       | postgres=CTc/postgres  
 template1 | postgres | UTF8     | C       | C     | =c/postgres          +  
           |          |          |         |       | postgres=CTc/postgres  
(3 rows)  
pipeline=# \dx  
                                          List of installed extensions  
       Name       | Version  |   Schema   |                             Description                               
------------------+----------+------------+---------------------------------------------------------------------  
 plpgsql          | 1.0      | pg_catalog | PL/pgSQL procedural language  
 postgis          | 2.2.0dev | pg_catalog | PostGIS geometry, geography, and raster spatial types and functions  
 postgis_topology | 2.2.0dev | topology   | PostGIS topology spatial types and functions  
(3 rows)

Check which functions are added to PipelineDB. Some functions are added as plug-ins, such as PostGIS. Some functions can be used as references or used directly.

pipeline=# select proname from pg_proc order by oid desc;  
......  
 second  
 minute  
 hour  
 day  
 month  
 year  
......  
 cmsketch_empty  
 tdigest_add  
 tdigest_empty  
 tdigest_empty  
 bloom_add  
 bloom_empty  
 bloom_empty  
 hll_add  
 hll_empty  
 hll_empty  
......

Conduct Performance Testing on a Virtual Machine (VM) on Your Own Laptop

Create five continuous views. A continuous view is a view for which you do not need to create a base table.

CREATE CONTINUOUS VIEW v0 AS SELECT COUNT(*) FROM stream;   
CREATE CONTINUOUS VIEW v1 AS SELECT sum(x::int),count(*),avg(y::int) FROM stream;   
CREATE CONTINUOUS VIEW v001 AS SELECT sum(x::int),count(*),avg(y::int) FROM stream1;  
CREATE CONTINUOUS VIEW v002 AS SELECT sum(x::int),count(*),avg(y::int) FROM stream2;   
CREATE CONTINUOUS VIEW v003 AS SELECT sum(x::int),count(*),avg(y::int) FROM stream3;

Activate stream statistics.

activate;

View the data dictionary.

select relname from pg_class where relkind='C';

Conduct the batch insert test.

pdb@digoal-> vi test.sql  
insert into stream(x,y,z) select generate_series(1,1000),1,1;  
insert into stream1(x,y,z) select generate_series(1,1000),1,1;  
insert into stream2(x,y,z) select generate_series(1,1000),1,1;  
insert into stream3(x,y,z) select generate_series(1,1000),1,1;

The following provides the test result. Note that you must use "simple" or "extended" here. If you use "prepared", only the last SQL statement takes effect. It is not clear yet whether this is a PipelineDB or pgbench bug.

pdb@digoal-> /opt/pgsql/bin/pgbench -M extended -n -r -f ./test.sql -P 1 -c 10 -j 10 -T 100000  
progress: 1.0 s, 133.8 tps, lat 68.279 ms stddev 58.444  
progress: 2.0 s, 143.9 tps, lat 71.623 ms stddev 53.880  
progress: 3.0 s, 149.5 tps, lat 66.452 ms stddev 49.727  
progress: 4.0 s, 148.3 tps, lat 67.085 ms stddev 55.484  
progress: 5.1 s, 145.7 tps, lat 68.624 ms stddev 67.795

About 0.58 million records are written to the database every second, and the statistics for five continuous views are compiled.

All these operations are completed in-memory, so the speed is very fast. PipelineDB uses the worker process to merge data. The execution result of the top command during stress testing is as follows:

top - 11:23:07 up  2:49,  4 users,  load average: 1.83, 3.08, 1.78  
Tasks: 177 total,   5 running, 172 sleeping,   0 stopped,   0 zombie  
Cpu(s): 11.6%us, 15.0%sy, 10.3%ni, 63.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st  
Mem:   3916744k total,   605084k used,  3311660k free,    27872k buffers  
Swap:  1048572k total,        0k used,  1048572k free,   401748k cached  
  
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                               
11469 pdb       25   5  405m  75m  67m R 52.9  2.0   1:56.45 pipeline: bgworker: worker0 [pipeline]   
12246 pdb       20   0  400m  69m  67m S 14.3  1.8   0:10.55 pipeline: postgres pipeline [local] idle    
12243 pdb       20   0  400m  69m  67m S 13.3  1.8   0:10.45 pipeline: postgres pipeline [local] idle   
12248 pdb       20   0  400m  69m  67m S 13.3  1.8   0:10.40 pipeline: postgres pipeline [local] idle              
12244 pdb       20   0  400m  69m  67m S 12.6  1.8   0:10.50 pipeline: postgres pipeline [local] idle   
12237 pdb       20   0  400m  69m  67m R 12.3  1.8   0:10.52 pipeline: postgres pipeline [local] idle              
12247 pdb       20   0  402m  70m  67m R 12.3  1.8   0:10.70 pipeline: postgres pipeline [local] idle              
12245 pdb       20   0  401m  69m  67m S 12.0  1.8   0:10.78 pipeline: postgres pipeline [local] idle              
12235 pdb       20   0  400m  69m  67m S 11.3  1.8   0:10.88 pipeline: postgres pipeline [local] idle              
12239 pdb       20   0  400m  69m  67m S 11.0  1.8   0:10.79 pipeline: postgres pipeline [local] idle              
12241 pdb       20   0  400m  69m  67m S 11.0  1.8   0:10.53 pipeline: postgres pipeline [local] idle              
11466 pdb       20   0  119m 1480  908 R  5.3  0.0   0:58.39 pipeline: stats collector process                     
11468 pdb       25   5  401m  12m 9744 S  2.3  0.3   0:16.49 pipeline: bgworker: combiner0 [pipeline]              
12228 pdb       20   0  678m 3408  884 S  2.3  0.1   0:02.36 /opt/pgsql/bin/pgbench -M extended -n -r -f ./test.sql -P 1 -c 10 -j 10 -T 100000   
11464 pdb       20   0  398m  17m  16m S  1.7  0.4   0:10.47 pipeline: wal writer process                          
11459 pdb       20   0  398m 153m 153m S  0.0  4.0   0:00.37 /usr/lib/pipelinedb/bin/pipeline-server               
11460 pdb       20   0  115m  852  424 S  0.0  0.0   0:00.02 pipeline: logger process                              
11462 pdb       20   0  398m 3336 2816 S  0.0  0.1   0:00.06 pipeline: checkpointer process                        
11463 pdb       20   0  398m 2080 1604 S  0.0  0.1   0:00.08 pipeline: writer process                              
11465 pdb       20   0  401m 4460 1184 S  0.0  0.1   0:00.33 pipeline: autovacuum launcher process                 
11467 pdb       20   0  398m 1992 1056 S  0.0  0.1   0:00.00 pipeline: continuous query scheduler process  
  
pdb@digoal-> psql  
psql (9.4.4)  
Type "help" for help.  
pipeline=# select * from v0;  
  count    
---------  
 9732439  
(1 row)  
  
pipeline=# select * from v1;  
    sum     |  count  |          avg             
------------+---------+------------------------  
 4923514276 | 9837585 | 1.00000000000000000000  
(1 row)  
  
pipeline=# select * from v001;  
     sum      |  count   |          avg             
--------------+----------+------------------------  
 505023543131 | 11036501 | 1.00000000000000000000  
(1 row)  
  
pipeline=# select * from v002;  
      sum      |  count   |          avg             
---------------+----------+------------------------  
 1005065536319 | 12119513 | 1.00000000000000000000  
(1 row)  
  
pipeline=# select * from v003;  
     sum     |  count   |          avg             
-------------+----------+------------------------  
 14948355485 | 29867002 | 1.00000000000000000000  
(1 row)

After one billion streaming data records are written, the database size is still only 13 MB. This is because the streaming data is located in the memory and discarded after being processed.

pipeline=# \l+  
                                                              List of databases  
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges   | Size  | Tablespace |                Description                   
-----------+----------+----------+---------+-------+-----------------------+-------+------------+--------------------------------------------  
 pipeline  | postgres | UTF8     | C       | C     |                       | 13 MB | pg_default | default administrative connection database  
 template0 | postgres | UTF8     | C       | C     | =c/postgres          +| 12 MB | pg_default | unmodifiable empty database  
           |          |          |         |       | postgres=CTc/postgres |       |            |   
 template1 | postgres | UTF8     | C       | C     | =c/postgres          +| 12 MB | pg_default | default template for new databases  
           |          |          |         |       | postgres=CTc/postgres |       |            |   
(3 rows)

If your application has a similar scenario, this is the best solution.

Data from a Test Conducted on the Physical Machine

The results of the test conducted on the E5-2650 are as follows.

In this test, 10 PipelineDB instances are deployed and the preceding cases are executed. About 6 million data records are processed per second.

Combined with LVS, HAproxy, or JDBC LB, you can conduct large-scale real-time processing. This also indicates that the single-server performance of PipelineDB still has a lot of room for improvement.

Earlier versions of PipelineDB could not fully utilize the CPU. In the latest version, you do not need to deploy multiple instances, as you need only one instance for the entire CPU.

References

https://github.com/pipelinedb/pipelinedb

https://www.pipelinedb.com/

Community

Testing the Performance of PipelineDB for Real-Time Statistics on Virtual Machines

Background

Introduction to PipelineDB

Streaming Statistics Example

How to Deploy PipelineDB

Conduct Performance Testing on a Virtual Machine (VM) on Your Own Laptop

Data from a Test Conducted on the Physical Machine

References

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for PostgreSQL

Media Solution

AnalyticDB for PostgreSQL

ApsaraVideo Media Processing