Multiple image processing business scenarios, such as image search, video deduplication, image beautification, and image deduplication, encounter the common challenge of managing the humongous volume of duplicate images and videos stored on the server.
For instance, users may upload many videos where specific video has different versions with different resolutions, audio tracks, or compression ratios. This leads to a large number of duplicate videos saved on the server. Another example is the identification of porn videos or images. It is crucial to identify such content but the challenge is how to identify porn videos and images.
On the other hand, image search is the second most common search engine after text search. Image search engines, such as Alibaba Cloud Image Search, allow you to upload a particular picture to the interface and search out a bunch of similar pictures.
This article describes how to screen out duplicate videos and how to implement an efficient image search using PostgreSQL. With the omnipotent API of PostgreSQL, it is easy to extend its image search function.
The PostgreSQL image search plug-in uses the mainstream Haar wavelet technology to store images after conversion. Take a look at its Wikipedia page to learn more about this technology.
You can also refer to this document to know more about how to search similar images in PostgreSQL.
Following are some examples to illustrate the Haar wavelet technology application.
Follow the steps below to install the PostgreSQL Image Search Plug-in:
Step 1. Add dependency on gd.h.
yum install -y gd-devel
Step 2. Download and install imgsmlr.
$ git clone https://github.com/postgrespro/imgsmlr $ cd imgsmlr $ export PGHOME=/home/digoal/pgsql9.5 $ export PATH=$PGHOME/bin:$PATH:. $ make USE_PGXS=1 $ make USE_PGXS=1 install
Step 3. Install the plug-in.
$ psql psql (9.5.3) Type "help" for help. postgres=# create extension imgsmlr; CREATE EXTENSION
The following are two newly added data types in imgsmlr.
|pattern||16388 bytes||Result of Haar wavelet transform on the image|
|signature||64 bytes||Short representation of pattern for fast search using GiST indexes|
The KNN operator and gist index method that supports the pattern and signature types can be used for similar image search.
|Operator||Left type||Right type||Return type||Description|
|<->||pattern||pattern||float8||Eucledian distance between two patterns|
|<->||signature||signature||float8||Eucledian distance between two signatures|
The following table shows several newly added functions. Now, you can convert binary images to the pattern type and convert the data stored in the pattern into the signature type.
|jpeg2pattern(bytea)||pattern||Convert jpeg image into pattern|
|png2pattern(bytea)||pattern||Convert png image into pattern|
|gif2pattern(bytea)||pattern||Convert gif image into pattern|
|pattern2signature(pattern)||signature||Create signature from pattern|
|shuffle_pattern(pattern)||pattern||Shuffle pattern for less sensitivity to image shift|
Follow the steps listed below to test the PostgreSQL Image Search Plug-in:
Step 1. Import some images as shown below (the more the better).
Step 2. Now, create the image table.
create table image (id serial, data bytea);
Step 3. Import the images to the database.
insert into image(data) select pg_read_binary_file('文件路径');
Step 4. Next, convert the images to the pattern and signature types.
CREATE TABLE pat AS ( SELECT id, shuffle_pattern(pattern) AS pattern, pattern2signature(pattern) AS signature FROM ( SELECT id, jpeg2pattern(data) AS pattern FROM image ) x );
Step 5. Create indexes as shown below.
ALTER TABLE pat ADD PRIMARY KEY (id); CREATE INDEX pat_signature_idx ON pat USING gist (signature);
Step 6. Now you can run an approximation query, such as querying images that are similar to id = : id images and retrieve the top 10 items on the similarity ranking list.
SELECT id, smlr FROM ( SELECT id, pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr FROM pat WHERE id <> :id ORDER BY signature <-> (SELECT signature FROM pat WHERE id = :id) LIMIT 100 ) x ORDER BY x.smlr ASC LIMIT 10
Here, you can use KNN indexing to quickly get output based on the similarity rankings.
Let's take a look at the following examples.
For video deduplication, extract keyframes in a video to generate the Cartesian product through self-correlation. Calculate the similarity between any two images of different videos, and when the similarity reaches a certain threshold value, the two videos are deemed the same.
Refer to the following example for better understanding.
创建图片表，并将所有视频的关键帧导入表中 create table image (id serial8 primary key, movie_id int, data bytea); 导入图片，假设为jpeg格式 ... 略 ... 生成patten 和 signature CREATE TABLE pat AS ( SELECT id, movie_id, shuffle_pattern(pattern) AS pattern, pattern2signature(pattern) AS signature FROM ( SELECT id, movie_id, jpeg2pattern(data) AS pattern FROM image ) x ); 计算不同视频的相似度 select t1.movie_id, t1.id, t1.signature<->t2.signature from pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) order by t1.signature<->t2.signature desc or select t1.movie_id, t1.id, t1.signature<->t2.signature from pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) where t1.signature<->t2.signature > 0.9 order by t1.signature<->t2.signature desc
In short, PostgreSQL is a very powerful database with highly customizable functions and does not require you to make any changes to the PostgreSQL kernel. Therefore, it is safe and reliable. The image search technology is an example of PostgreSQL feature extension and offers extremely high speed.
Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.
digoal - February 3, 2020
Alibaba Clouder - June 1, 2018
digoal - September 12, 2019
digoal - February 3, 2020
zhuodao - July 30, 2020
ApsaraDB - November 16, 2020
An intelligent image search service with product search and generic search features to help users resolve image search requests.Learn More
This technology can assist realizing quantitative analysis, speeding up CT image analytics, avoiding errors caused by fatigue and adjusting treatment plans in time.Learn More
Fully managed and less trouble database servicesLearn More
An online MPP warehousing service based on the Greenplum Database open source programLearn More
More Posts by digoal