×
Community Blog PostgreSQL md5hash Plug-in: 128-bit Storage for Space Saving and Efficiency

PostgreSQL md5hash Plug-in: 128-bit Storage for Space Saving and Efficiency

In this article, the author discusses md5hash plug-in and md5hash data types to create 128-bit hashes for storage space optimization and efficiency.

By digoal

Background

MD5 hash example: c4ca4238a0b923820dcc509a6f75849b. In databases, text or varchar is required to store MD5 hashes, which is space-wasting. The reason is that the text type requires a header of 1 to 4 bytes, and each character consumes 1 byte. So, 33 bytes are needed to store the above MD5 hash.

In fact, you can compress the hash to 128 bits for storage. This is what the md5hash plug-in does. It defines a md5hash data type, which is stored at the bottom with 128 bits.

This extension provides a simple data type storing 128-bit values (e.g., MD5 hashes) in a bit more efficient way - in a fixed-length column and inline.

The extension defines the 'md5hash' type itself, casts, operators, and an operator class for b-tree indexes. It's fairly straightforward and simple to add more operators or classes.

Example

Take the usage of this plug-in in PostgreSQL (PG) 12 as an example.

1) Install PG 12.

wget http://api.pgxn.org/dist/md5hash/1.0.1/md5hash-1.0.1.zip  
unzip md5hash-1.0.1.zip   
cd md5hash-1.0.1/  
export PATH=$PGHOME/bin:$PATH  
USE_PGXS=1 make  
USE_PGXS=1 make install  

2) Install the plug-in.

postgres=# create extension md5hash;  
CREATE EXTENSION 

3) Use the md5hash data type.

postgres=# create table ts(id md5hash primary key);  
CREATE TABLE  
  
postgres=# create table ts1(id md5hash, id1 text);  
CREATE TABLE  
  
postgres=# insert into ts1 select md5(i::text),md5(i::text) from generate_series(1,100) i;  
INSERT 0 100  
postgres=# select * from ts1 limit 10;  
                id                |               id1                  
----------------------------------+----------------------------------  
 c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b  
 c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c  
 eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3  
 a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c  
 e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5  
 1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc  
 8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543  
 c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d  
 45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26  
 d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820  
(10 rows)  

When comparing md5hash with text, compulsory conversion is needed.

postgres=# select * from ts1 where id::text=id1 limit 10;  
                id                |               id1                  
----------------------------------+----------------------------------  
 c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b  
 c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c  
 eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3  
 a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c  
 e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5  
 1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc  
 8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543  
 c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d  
 45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26  
 d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820  
(10 rows)  
  
postgres=# select * from ts1 where id=id1::md5hash limit 10;  
                id                |               id1                  
----------------------------------+----------------------------------  
 c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b  
 c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c  
 eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3  
 a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c  
 e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5  
 1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc  
 8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543  
 c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d  
 45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26  
 d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820  
(10 rows)  

The comparison of the storage space shows that md5hash saves space.

postgres=# select pg_column_size(id), pg_column_size(id1) from ts1 limit 10;  
 pg_column_size | pg_column_size   
----------------+----------------  
             16 |             33  
             16 |             33  
             16 |             33  
             16 |             33  
             16 |             33  
             16 |             33  
             16 |             33  
             16 |             33  
             16 |             33  
             16 |             33  
(10 rows)  

In addition to space savings, md5hash is more efficient in data querying.

CREATE TABLE test_md5hash (id md5hash       PRIMARY KEY);  
CREATE TABLE test_text    (id varchar(32)   PRIMARY KEY);  
  
INSERT INTO test_md5hash SELECT md5(i::text)  
                           FROM generate_series(1,1000000) s(i);  
  
INSERT INTO test_text SELECT md5(i::text)  
                           FROM generate_series(1,1000000) s(i);  
  
SELECT relname,  
    (pg_relation_size(oid)/1024) AS relation_size_kB,  
    (pg_total_relation_size(oid)/1024) AS total_size_kB  
  FROM pg_class WHERE relname LIKE 'test_%';  
  
      relname      | relation_size_kb | total_size_kb   
-------------------+------------------+---------------  
 test_md5hash      |            43248 |         82744  
 test_md5hash_pkey |            39464 |         39464  
 test_text         |            66672 |        141816  
 test_text_pkey    |            75096 |         75096  
(4 rows)  

Query acceleration

Now, let's see the difference when querying the column. To eliminate planning overhead, we've used prepared statements and executed 1.000.000 queries with 'WHERE id = $1' condition (using the PK index).

TEXT     129 seconds  
md5hash  117 seconds  

Summary

The md5hash plug-in stores MD5 hashes in 128 bits, saving the storage space and the usage of memories such as shared buffers. For queries on the pkey column, md5hash is more efficient than text.

References

0 0 0
Share on

digoal

277 posts | 24 followers

You may also like

Comments