×
Community Blog How to Retrieve and Deduplicate Similar Text in PostgreSQL

How to Retrieve and Deduplicate Similar Text in PostgreSQL

This article attempts to answer a series of questions raised in the Alibaba Cloud Community's Q&A section regarding the retrieval and deduplication of similar text in PostgreSQL.

Background

This article seeks to address the following list of questions raised in the Q&A section of the Alibaba Cloud Community.

How can I delete similar data from a table?

For example, similar strings such as the following:

  • How to do it?
  • How to perform it?
  • How can I do it?
  • What should I do in this case?

How to Solve the Problem

Follow the steps below to resolve the issues listed above.

1) Use the pg_trgm word breaker provided by PostgreSQL to divide strings into multiple non-repeated tokens and determine the content similarity between two strings. We recommend using the word breaking method to divide the content into terms.
2) After dividing the content into terms, aggregate the terms and remove duplicate data.
3) Generate a Cartesian product (matrix) through self-association to calculate the similarity between each record and each other record. To calculate the similarity, simply divide the number of overlapping tokens by the number of tokens in the set after deduplication.
4) Remove unnecessary data based on similarity.

While dealing with huge data volumes, use a professional analytical programming language, such as PL/R.

Practical Example

First, install the PostgreSQL word breaker (AliCloudDB for PostgreSQL contains this plug-in. For more information, see the official manual.)

git clone https://github.com/jaiminpan/pg_jieba.git  
mv pg_jieba $PGSRC/contrib/  
export PATH=/home/digoal/pgsql9.5/bin:$PATH  
cd $PGSRC/contrib/pg_jieba  
make clean;make;make install  
  
git clone https://github.com/jaiminpan/pg_scws.git  
mv pg_jieba $PGSRC/contrib/  
export PATH=/home/digoal/pgsql9.5/bin:$PATH  
cd $PGSRC/contrib/pg_scws  
make clean;make;make install  

Create a plug-in as shown below.

psql  
# create extension pg_jieba;  
# create extension pg_scws;  

Next, create a test case.

create table tdup1 (id int primary key, info text);  
create extension pg_trgm;  
insert into tdup1 values (1, '银屑病怎么治?');  
insert into tdup1 values (2, '银屑病怎么治疗?');  
insert into tdup1 values (3, '银屑病怎么治疗好?');  
insert into tdup1 values (4, '银屑病怎么能治疗好?');  

Now, choose one of the word breakers as shown below.

postgres=# select to_tsvector('jiebacfg', info),* from tdup1 ;  
     to_tsvector     | id |         info           
---------------------+----+----------------------  
 '治':3 '银屑病':1   |  1 | 银屑病怎么治?  
 '治疗':3 '银屑病':1 |  2 | 银屑病怎么治疗?  
 '治疗':3 '银屑病':1 |  3 | 银屑病怎么治疗好?  
 '治疗':4 '银屑病':1 |  4 | 银屑病怎么能治疗好?  
(4 rows)  
  
postgres=# select to_tsvector('scwscfg', info),* from tdup1 ;  
            to_tsvector            | id |         info           
-----------------------------------+----+----------------------  
 '治':2 '银屑病':1                 |  1 | 银屑病怎么治?  
 '治疗':2 '银屑病':1               |  2 | 银屑病怎么治疗?  
 '好':3 '治疗':2 '银屑病':1        |  3 | 银屑病怎么治疗好?  
 '好':4 '治疗':3 '能':2 '银屑病':1 |  4 | 银屑病怎么能治疗好?  
(4 rows)  

Create the following three functions:

  • To calculate the set (after deduplication) of two arrays
postgres=# create or replace function array_union(text[], text[]) returns text[] as $$  
  select array_agg(c1) from (select c1 from unnest($1||$2) t(c1) group by c1) t;  
$$ language sql strict;  
CREATE FUNCTION  
  • To deduplicate the arrays
postgres=# create or replace function array_dist(text[]) returns text[] as $$           
  select array_agg(c1) from (select c1 from unnest($1) t(c1) group by c1) t;      
$$ language sql strict;  
CREATE FUNCTION  
  • To calculate the overlapping parts (after deduplication) of the two arrays
postgres=# create or replace function array_share(text[], text[]) returns text[] as $$  
  select array_agg(unnest) from (select unnest($1) intersect select unnest($2) group by 1) t;  
$$ language sql strict;  
CREATE FUNCTION  

The Cartesian result is as follows:

regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ') used to convert info into arrays

postgres=# with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)   
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2) t;  
 t1c1 | t2c1 |         t1c2         |         t2c2         |       t1c3        |       t2c3        | simulate   
------+------+----------------------+----------------------+-------------------+-------------------+----------  
    1 |    1 | 银屑病怎么治?       | 银屑病怎么治?       | {'银屑病','治'}   | {'银屑病','治'}   |     1.00  
    1 |    2 | 银屑病怎么治?       | 银屑病怎么治疗?     | {'银屑病','治'}   | {'银屑病','治疗'} |     0.33  
    1 |    3 | 银屑病怎么治?       | 银屑病怎么治疗好?   | {'银屑病','治'}   | {'银屑病','治疗'} |     0.33  
    1 |    4 | 银屑病怎么治?       | 银屑病怎么能治疗好? | {'银屑病','治'}   | {'银屑病','治疗'} |     0.33  
    2 |    1 | 银屑病怎么治疗?     | 银屑病怎么治?       | {'银屑病','治疗'} | {'银屑病','治'}   |     0.33  
    2 |    2 | 银屑病怎么治疗?     | 银屑病怎么治疗?     | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    2 |    3 | 银屑病怎么治疗?     | 银屑病怎么治疗好?   | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    2 |    4 | 银屑病怎么治疗?     | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    3 |    1 | 银屑病怎么治疗好?   | 银屑病怎么治?       | {'银屑病','治疗'} | {'银屑病','治'}   |     0.33  
    3 |    2 | 银屑病怎么治疗好?   | 银屑病怎么治疗?     | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    3 |    3 | 银屑病怎么治疗好?   | 银屑病怎么治疗好?   | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    3 |    4 | 银屑病怎么治疗好?   | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    4 |    1 | 银屑病怎么能治疗好? | 银屑病怎么治?       | {'银屑病','治疗'} | {'银屑病','治'}   |     0.33  
    4 |    2 | 银屑病怎么能治疗好? | 银屑病怎么治疗?     | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    4 |    3 | 银屑病怎么能治疗好? | 银屑病怎么治疗好?   | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    4 |    4 | 银屑病怎么能治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
(16 rows)  

The preceding result is actually a matrix, where simulate indicates the similarity to be calculated.

1

During deduplication, we only need the data in the upper or lower part of the diagonal of the matrix. Therefore, add one more condition to determine the similarity.

postgres=# with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)   
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t;  
 t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate   
------+------+--------------------+----------------------+-------------------+-------------------+----------  
    1 |    2 | 银屑病怎么治?     | 银屑病怎么治疗?     | {'银屑病','治'}   | {'银屑病','治疗'} |     0.33  
    1 |    3 | 银屑病怎么治?     | 银屑病怎么治疗好?   | {'银屑病','治'}   | {'银屑病','治疗'} |     0.33  
    1 |    4 | 银屑病怎么治?     | 银屑病怎么能治疗好? | {'银屑病','治'}   | {'银屑病','治疗'} |     0.33  
    2 |    3 | 银屑病怎么治疗?   | 银屑病怎么治疗好?   | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    2 |    4 | 银屑病怎么治疗?   | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    3 |    4 | 银屑病怎么治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
(6 rows)

First, specify the simulate. For example, data with a similarity greater than 0.5 needs to be deduplicated.

postgres=# with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)   
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5;  
 t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate   
------+------+--------------------+----------------------+-------------------+-------------------+----------  
    2 |    3 | 银屑病怎么治疗?   | 银屑病怎么治疗好?   | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    2 |    4 | 银屑病怎么治疗?   | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    3 |    4 | 银屑病怎么治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
(3 rows)  

Second, delete the records corresponding to the IDs in column t2c1.

delete from tdup1 where id in (with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)   
select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);  
  
例如 :   
postgres=# insert into tdup1 values (11, '白血病怎么治?');  
INSERT 0 1  
postgres=# insert into tdup1 values (22, '白血病怎么治疗?');  
INSERT 0 1  
postgres=# insert into tdup1 values (13, '白血病怎么治疗好?');  
INSERT 0 1  
postgres=# insert into tdup1 values (24, '白血病怎么能治疗好?');  
INSERT 0 1  
postgres=#   
postgres=# with t(c1,c2,c3) as                               
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)   
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5;  
 t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate   
------+------+--------------------+----------------------+-------------------+-------------------+----------  
    2 |    3 | 银屑病怎么治疗?   | 银屑病怎么治疗好?   | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    2 |    4 | 银屑病怎么治疗?   | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
    3 |    4 | 银屑病怎么治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} |     1.00  
   22 |   24 | 白血病怎么治疗?   | 白血病怎么能治疗好? | {'治疗','白血病'} | {'治疗','白血病'} |     1.00  
   13 |   22 | 白血病怎么治疗好? | 白血病怎么治疗?     | {'治疗','白血病'} | {'治疗','白血病'} |     1.00  
   13 |   24 | 白血病怎么治疗好? | 白血病怎么能治疗好? | {'治疗','白血病'} | {'治疗','白血病'} |     1.00  
(6 rows)  
  
postgres=# begin;  
BEGIN  
postgres=# delete from tdup1 where id in (with t(c1,c2,c3) as   
postgres(# (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)   
postgres(# select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
postgres(# simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);  
DELETE 4  
postgres=# select * from tdup1 ;  
 id |        info          
----+--------------------  
  1 | 银屑病怎么治?  
  2 | 银屑病怎么治疗?  
 11 | 白血病怎么治?  
 13 | 白血病怎么治疗好?  
(4 rows)  

Use a database to solve the problems that you may encounter because the JOIN filter is <> and <, while the hashjoin is not used.

This operation usually takes longer for large amounts of data.

postgres=# explain delete from tdup1 where id in (with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)   
select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Delete on tdup1  (cost=10005260133.58..10005260215.84 rows=2555 width=34)  
   ->  Hash Join  (cost=10005260133.58..10005260215.84 rows=2555 width=34)  
         Hash Cond: (tdup1.id = "ANY_subquery".t2c1)  
         ->  Seq Scan on tdup1  (cost=0.00..61.10 rows=5110 width=10)  
         ->  Hash  (cost=10005260131.08..10005260131.08 rows=200 width=32)  
               ->  HashAggregate  (cost=10005260129.08..10005260131.08 rows=200 width=32)  
                     Group Key: "ANY_subquery".t2c1  
                     ->  Subquery Scan on "ANY_subquery"  (cost=10000002667.20..10005252911.99 rows=2886838 width=32)  
                           ->  Subquery Scan on t  (cost=10000002667.20..10005224043.61 rows=2886838 width=4)  
                                 Filter: (t.simulate > 0.5)  
                                 CTE t  
                                   ->  Seq Scan on tdup1 tdup1_1  (cost=0.00..2667.20 rows=5110 width=36)  
                                 ->  Nested Loop  (cost=10000000000.00..10005113119.99 rows=8660513 width=68)  
                                       Join Filter: ((t1.c1 <> t2.c1) AND (t1.c1 < t2.c1))  
                                       ->  CTE Scan on t t1  (cost=0.00..102.20 rows=5110 width=36)  
                                       ->  CTE Scan on t t2  (cost=0.00..102.20 rows=5110 width=36)  
(16 rows)  

A better way is to use PL/R or R for matrix calculation and then filtering the result. You can also use an MPP database. For example, Greenplum processes a large amount of data with R and MADlib.

Summary

The article illustrates how to retrieve and deduplicate similar text in PostgreSQL by using examples. It applies Word Breaking functionality of PG to answer the questions regarding the deduplication of similar text. The other feature of PG, Window Query, is not used in this example. However, if your data does not have a primary key, use ctid and row_number to locate a unique record.

References

0 0 0
Share on

digoal

277 posts | 24 followers

You may also like

Comments