×
Community Blog Similarity Analysis for PostgreSQL Text Databases

Similarity Analysis for PostgreSQL Text Databases

In this article, we will quickly go over some of the more common scenarios in which you can use PostgreSQL in the hope to inspire you to do more with PostgreSQL.

By Digoal.

When it comes to searches and all sorts of search filters, practically speaking, we can often find ourselves needing to search according to many different parameters. To name a few, these could be things such as having a similar appearance, or geographical closeness, or even similar personality traits, so on.

These aren't just technical questions, of course. They are ones that affect us all the time. For instance, let's if say you know what kind of particular item you want and you have a picture of it, but you find it hard to describe. In that case, it's a lot easier to search by image.

Well, luckily for us, PostgreSQL, which is arguably the world's most advanced open-source database, is powerful enough to support the background inner workings of any of these types of search queries and filtering scenarios. In reality, the limit isn't the technology, but rather the limit is our imagination!

In this article, we will quickly and briefly go over some of the more common scenarios in which you can use PostgreSQL in, specifically we will go over:

  • sorting by image or facial similarity
  • sorting by overlapping hobbies
  • sorting by age approximation
  • sorting by distance
  • sorting by textual similarity
  • sorting by word-break similarities

Of course, there's many more scenarios in which you can apply PostgreSQL to filter and sort search results. In reality, the applications are nearly endless! We hope that this blog can inspire you to do more with PostgreSQL.

Sorting by Image or Facial Similarity

Sorting by Image and facial similarity are both things that are very relevant nowadays-especially with the prominence of images and video in our daily lives. Image searches are now common on most online search engines. I'm sure nearly everyone has used the reverse-image search option on Google before. And now, on many e-commerce platforms, such as Alibaba's own Taobao (淘宝), you can search for items based on the power of AI.

Well, you can use PostgreSQL to empower image similarity searches, too. Consider the example given in this blog, for example.

Sorting by Overlapping Hobbies

You can also collect hobby data from a group of people, then group people by hobby or sort by hobby overlap to find the target group. Also, potential as more hobbies are gathered, and more information is gathered about these hobbies, you can do even more with this data. As this is relatively simple, we won't be discussing this in more detail here.

Sorting by Age Approximation

Alternatively, you could sort by ages or by age approximations. Actually, this is relatively simple, too. For example, you can enter the age 23, and have the system return the results that are the closest to this age.

Consider this example in PostgreSQL's official documentation. The below example is based on the example given by PostgreSQL themselves-where we use age approximation as the main search filter. For this particular output example, the system returns 10 pieces of data that are the closet to the age 100:

postgres=# create extension btree_gist;
CREATE EXTENSION
postgres=# create table test12(id int);
CREATE TABLE
postgres=# insert into test12 select trunc(random()*1000) from generate_series(1,100000);
INSERT 0 100000
postgres=# create index idx_test12 on test12 using gist(id);
CREATE INDEX
postgres=# select * from test12 order by id <-> 100 limit 10;
 id  
-----
 100
 100
 100
 100
 100
 100
 100
 100
 100
 100
(10 rows)

Sorting by Distance

You can also sort based on distance. For this, you can read more about the functions and operators involved here in this tutorial provided in PostgreSQL's official documentation. Anyway, before we get to distracted, let's look at example. In the example below, our search retrieves 10 points that are geographically nearest to the specific point we entered.

postgres=# create table test13(c1 point);
CREATE TABLE
postgres=# insert into test13 select ('('||trunc(random()*1000)||','||trunc(random()*5000)||')')::point from generate_series(1,10000);
INSERT 0 10000
postgres=# create index idx_test13 on test13 using gist(c1);
CREATE INDEX
postgres=# select * from test13 order by c1 <-> point '(1,10000)' limit 10;
     c1     
------------
 (58,4993)
 (191,4995)
 (48,4991)
 (326,4998)
 (99,4988)
 (205,4991)
 (348,4998)
 (53,4986)
 (174,4988)
 (136,4984)
(10 rows)

Sorting by Textual Similarity

Next, you can sort by similar features, or more specifically words and the repetition of words, in a text. Important to our discussion here is, of course, the code discussed at this page. Let me show you an example here. The example shows how 'Hello Digoal' and the Mandarin sentence of 'Nihao Digoal' (你好德哥) are markedly similar-which they ought to be, as they mean the same thing:

postgres=# create extension pg_trgm;
CREATE EXTENSION
postgres=# create table test14(c1 text);
CREATE TABLE
postgres=# insert into test14 values ('Hello Digoal'), ('China'), ('Hello China'), ('Nihao Digoal');
INSERT 0 4
postgres=# select * from test14;
      c1      
--------------
 Hello Digoal
 China
 Hello China
 Nihao Digoal
(4 rows)
postgres=# create index idx_test14 on test14 using gist(c1 gist_trgm_ops);
CREATE INDEX
postgres=# explain select *,c1 <-> 'Digoal' from test14 order by c1 <-> 'Digoal' limit 2;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Limit  (cost=0.13..4.17 rows=2 width=36)
   ->  Index Scan using idx_test14 on test14  (cost=0.13..8.21 rows=4 width=36)
         Order By: (c1 <-> 'Digoal'::text)
(3 rows)
postgres=# select *,c1 <-> 'Digoal' from test14 order by c1 <-> 'Digoal' limit 2;
      c1      | ?column? 
--------------+----------
 Hello Digoal | 0.461538
 Nihao Digoal | 0.461538
(2 rows)

Sorting by Word-Break Similarity

Last let's look at a more complex scenario. This is one of sorting by word-break similarities. Unlike sorting by textual similarity, this sorting scenario counts work-break similarity rather than similarities in the words themselves.

The rum plug-in can be used for this kind of sorting scenario. For this, we will be using this resource. Now consider this example:

export PATH=/home/digoal/pgsql9.6/bin:$PATH
git clone https://github.com/postgrespro/rum
cd rum
make USE_PGXS=1
make USE_PGXS=1 install
//
//
git clone https://github.com/jaiminpan/pg_jieba
cd pg_jieba
make USE_PGXS=1 
make USE_PGXS=1 install
//
//
postgres=# create extension rum;
CREATE EXTENSION
postgres=# create extension pg_jieba;
CREATE EXTENSION
// Word Example examples:
postgres=#  select * from to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.');
                                   to_tsvector                                    
----------------------------------------------------------------------------------
 'Chinese Academy of Sciences':5 'Xiao Ming':1 'Kyoto University':10 'graduated':3 'study':11 'master's degree':2 'Institute of Computing Technology':6
(1 row)
// With similarities
postgres=#  select * from rum_ts_distance(to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.') , to_tsquery(''Institute of Computing Technology'));
 rum_ts_distance 
-----------------
         16.4493
(1 row)
// Without similarities
postgres=#  select * from rum_ts_distance(to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.) , to_tsquery('Computer Science'));
 rum_ts_distance 
-----------------
        Infinity
(1 row)
// One or the other has similarities
postgres=# select * from rum_ts_distance(to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.') , to_tsquery('''Institute of Computing Technology | master's degree'));
 rum_ts_distance 
-----------------
         8.22467
(1 row)
// Both have similarities
postgres=# select * from rum_ts_distance(to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.') , to_tsquery('''Institute of Computing Technology & master's degree'));
 rum_ts_distance 
-----------------
         32.8987
(1 row)
// Order
postgres=# create table test15(c1 tsvector);
CREATE TABLE
postgres=# insert into test15 values (to_tsvector('jiebacfg', 'hello china, i''m digoal')), (to_tsvector('jiebacfg', 'hello world, i''m postgresql')), (to_tsvector('jiebacfg', 'how are you, i''m digoal'));
INSERT 0 3
postgres=# select * from test15;
                         c1                          
-----------------------------------------------------
 ' ':2,5,9 'china':3 'digoal':10 'hello':1 'm':8
 ' ':2,5,9 'hello':1 'm':8 'postgresql':10 'world':3
 ' ':2,4,7,11 'digoal':12 'm':10
(3 rows)
postgres=# create index idx_test15 on test15 using rum(c1 rum_tsvector_ops);
CREATE INDEX
postgres=# select *,c1 <=> to_tsquery('hello') from test15;
                         c1                          | ?column? 
-----------------------------------------------------+----------
 ' ':2,5,9 'china':3 'digoal':10 'hello':1 'm':8     |  16.4493
 ' ':2,5,9 'hello':1 'm':8 'postgresql':10 'world':3 |  16.4493
 ' ':2,4,7,11 'digoal':12 'm':10                     | Infinity
(3 rows)
postgres=# explain select *,c1 <=> to_tsquery('postgresql') from test15 order by c1 <=> to_tsquery('postgresql');
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Index Scan using idx_test15 on test15  (cost=3600.25..3609.06 rows=3 width=36)
   Order By: (c1 <=> to_tsquery('postgresql'::text))
(2 rows)
0 0 0
Share on

digoal

276 posts | 24 followers

You may also like

Comments