×
Community Blog Data De-duplication in Image Search Services

Data De-duplication in Image Search Services

Image de-duplication requires Postgres as their database and Haar wavelet algorithm adds to the possibility of searching images on popular search engines.

Image processing technology, such as image search, has a multitude of applications in the real world. For example, Internet users may upload multiple versions of a video or image, each with different formatting, audio tracks, or compression ratios. This leads to a significant number of duplicate videos stored on the service end. However, this problem can be solved using data de-duplication. But how is this normally done?

When you use search engines to look for relevant images, the search engine will process the image and the tags related to the image. For example, when I search for a "snowman" image, a search engine may return me this result.

1

Pretty accurate right? Typically, PostgreSQL is behind the implementation of the image search and its Payment Gateway Application Programming Interface (API) extends the image search function.

PostgreSQL’s Image Search Plug-in Background Technology

PostgreSQL’s image search plug-in adopts the mainstream Haar wavelet technology to convert and store an image. The following figures briefly describe the Haar wavelet technology. For additional details, refer to the following Wikipedia link: https://en.wikipedia.org/wiki/Haar_wavelet

2

3

4

Steps to Install PostgreSQL Image Search Plug-in

Below are the steps to install PostgreSQL image search plug-in:

  1. Dependent on gd.h.

    # yum install -y gd-devel


  2. Download and install imgsmlr.
    $ git clone https://github.com/postgrespro/imgsmlr
    $ cd imgsmlr
    $ export PGHOME=/home/digoal/pgsql9.5
    $ export PATH=$PGHOME/bin:$PATH:.
    
    $ make USE_PGXS=1
    $ make USE_PGXS=1 install


  3. Install the plug-in.
    $ psql
    psql (9.5.3)
    Type "help" for help.
    postgres=# create extension imgsmlr;
    CREATE EXTENSION


  4. Two data types now exist in imgsmlr.
    Data Type Storage Length Description
    Pattern 16388 bytes Result of Haar wavelet transform on the image
    Signature 64 bytes Short representation of pattern for fast search using GiST indexes


  5. For similar image searching, use the gist index method (supporting the pattern and signature types) and KNN operator.
    Data Type Left Type Right Type Return Type Description
    <-> pattern pattern float8 Eucledian distance between two patterns
    <-> signature signature float8 Eucledian distance between two signatures

    This adds several functions.


  6. You can convert the binary images into the pattern type and convert the data stored in pattern into the signature type.
    Function Return Type Description
    jpeg2pattern(bytea) pattern Convert jpeg image to pattern
    png2pattern(bytea) pattern Convert png image to pattern
    gif2pattern(bytea) pattern Convert gif image to pattern
    pattern2signature(pattern) signature Create signature from pattern
    shuffle_pattern(pattern) pattern Shuffle pattern for less sensitivity to image shift

Steps to Perform PostgreSQL Image Search Plug-in Test:

Once you are done installing, carry out these steps to perform PostgreSQL image search plug-in test:

  1. Import images, such as the following (the more the better).

    5


  2. Create the image table (id serial, data bytea);
  3. Import the images to the database.
  4. Insert into image(data) select pg_read_binary_file;
  5. Convert the image to the pattern and signature type.
    CREATE TABLE pat AS (
        SELECT
            id,
            shuffle_pattern(pattern) AS pattern, 
            pattern2signature(pattern) AS signature 
        FROM (
            SELECT 
                id, 
                jpeg2pattern(data) AS pattern 
            FROM 
                image
        ) x 
    );


  6. Create an index.
    ALTER TABLE pat ADD PRIMARY KEY (id);
    
    CREATE INDEX pat_signature_idx ON pat USING gist (signature);


  7. Perform an approximation query, such as querying images that are similar to id = :id images and retrieving the top 10 items on the similarity ranking list.
    SELECT
        id,
        smlr
    FROM
    (
        SELECT
            id,
            pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr
        FROM pat
        WHERE id <> :id
        ORDER BY
            signature <-> (SELECT signature FROM pat WHERE id = :id)
        LIMIT 100
    ) x
    ORDER BY x.smlr ASC 
    LIMIT 10


  8. K – Nearest Neighbour (KNN) indexing is an option here and the result is output quickly based on similarity rankings.

Testing Our Image Search Engine

For the most part, our search engine works as expected.

6

However, sometimes the image search does not work too well.

7

This is because the computer "sees" the images differently from humans. It processes an object as a 2D matrix, and transform it to a signature, which is readable for computers.

8

Video De-duplication Service

For video de-duplication, you can extract key frames in a video to generate the Cartesian product through self-correlation. Remember to calculate the similarity of two images of different videos. When the similarity reaches a certain threshold, the services deem the two videos the same.

Example:

  1. Create the image table and import the key frames of all videos into the table (id serial8 primary key, movie_id int, data bytea);
  2. Import the image (assume it is in jpeg format).
  3. Skipped
  4. Generate the pattern and signature types
    CREATE TABLE pat AS (
        SELECT
            id, movie_id,
            shuffle_pattern(pattern) AS pattern, 
            pattern2signature(pattern) AS signature 
        FROM (
            SELECT 
                id, movie_id,
                jpeg2pattern(data) AS pattern 
            FROM 
                image
        ) x 
    );


  5. Calculate the similarity of different videos.
    select t1.movie_id, t1.id, t1.signature<->t2.signature from 
    pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) 
    order by t1.signature<->t2.signature desc 
    or 
    select t1.movie_id, t1.id, t1.signature<->t2.signature from 
    pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) 
    where t1.signature<->t2.signature > 0.9 
    order by t1.signature<->t2.signature desc 

    Summary

    Image de-duplication requires Postgres as their database and uses its API. PostgreSQL is a powerful database with customizable functions. It not only ensures image de-duplication effectively but is also safe and reliable. Video de-duplication is the additional feature that is possible using PostgreSQL. Haar wavelet algorithm adds to the possibility of searching images on popular search engines. The implementation of PostgreSQL and installation are aspects that are worth knowing.

1 1 2
Share on

Alibaba Clouder

2,599 posts | 763 followers

You may also like

Comments

Raja_KT February 19, 2019 at 4:21 am

Good one. Maybe you want to have a look at this one: https://www.alibabacloud.com/forum/read-3422?spm=a2c5q.11423524.0.0.225c53ceNAqfiRImage search using online and offline and using DL techniques are more optimized especially for huge amount of images...