×
Community Blog Designs of a PostgreSQL Global ID Assignment (Data Dictionary) Service

Designs of a PostgreSQL Global ID Assignment (Data Dictionary) Service

In this article, we will be designing a global ID assignment service on PostgreSQL and comparing the performance of three different design approaches.

By Digoal

The goal of this article is to design a global ID assignment service, using three different methods, in accordance with the following requirements:

Design Requirements

Functional Requirements

  1. Input strings, and output IDs
    1. Return the original IDs to strings that already exist in the system (strict requirement)
    2. Assign new IDs to new strings in an ascending order
  2. Input IDs, and output strings
  3. Each topic has an ID sequence
  4. Topics can be dynamically added and deleted
  5. HSF calls supported

Performance Requirements

  1. Support read access of above 1 million QPS
  2. Support bidirectional batch query read operations, and the query latency for a batch of 100 queries is 1 ms
  3. Support write access of about 10,000 QPS
  4. Support batch write operations, and the write latency for a batch of 100 operations is 10 ms

System Requirements

  1. Stable and reliable
  2. Strict data consistency
  3. No data loss
  4. IDs increment from 0, and avoid gaps (gaps should be less than 1%)
  5. Automatic disaster recovery

Now, let's proceed with three different designs in accordance with the above requirements.

Design 1: With Locally Unique Group ID

In this design, we have a group ID with one sequence for each group, where the sequence and text are unique within a single group.

  1. Design a UDF that automatically generates a sequence, which has a one-to-one relationship with the group ID, and returns the value of the sequence.
    create or replace function get_per_gp_id(    
      Text,   -- Sequence name prefix  
      int     -- The group ID serves as the sequence name suffix  
    ) returns int8 as 
    $$
        
    declare    
    begin    
      return nextval(($1||$2)::regclass);    
      exception when others then    
        execute 'create sequence if not exists '||$1||$2||' start with 0 minvalue 0' ;    
        return nextval(($1||$2)::regclass);    
    end;    
    
    $$
     language plpgsql strict;    

  2. Create a test table
    create table tbl1(    
      gid int,   -- Group ID  
      ts text,   -- Text  
      sn int8,   -- Auto-increment sequence value  
      unique(gid,ts),     
      unique(gid,sn)    
    );    

  3. Create a UDF. When the group ID and text are entered, if the text exists, the existing sequence is returned; if the text does not exist, a unique ID is assigned and returned.
    create or replace function ins1(    
      int,   -- Group ID  
      text   -- Text  
    ) returns int8 as 
    $$
        
    declare    
      res int8;    
    begin    
      -- Checks whether the text already exists in this group  
      select sn into res from tbl1 where gid=$1 and ts=$2;    
      if found then    
        return res;    
      else    
        -- If it does not exist, an ID is generated  
        insert into tbl1 (gid,ts,sn) values ($1, $2, get_per_gp_id('seq_', $1)) returning sn into res;    
        return res;    
      end if;    
      exception when others then    
        -- An exception may be thrown when another parallel session is generating the sequence. Query again, and return the SN.  
        select sn into res from tbl1 where gid=$1 and ts=$2;    
        if found then    
          return res;    
        else     
          raise ;    
        end if;    
    end;    
    
    $$
     language plpgsql strict;    

Design 2: Without Group ID

In this design, we will not have a group ID, and the text and sequence are globally unique.

  1. Create a sequence
    create sequence seq_tbl2_sn start with 0 minvalue 0;    

  2. Create a test table
    create table tbl2(    
      ts text unique,   -- Text  
      sn int8 default nextval('public.seq_tbl2_sn'::regclass) unique  -- Sequence  
    );    

  3. Create a UDF. When the text is entered, if the text already exists, the sequence corresponding to the text is returned; if the text does not exist, a unique sequence value is assigned and returned.
    create or replace function ins2(    
      text    
    ) returns int8 as 
    $$
        
    declare    
      res int8;    
    begin    
      -- Check whether the text already exists  
      select sn into res from tbl2 where ts=$1;    
      if found then    
        return res;    
      else    
        -- If it does not exist, an ID is generated  
        insert into tbl2 (ts) values ($1) returning sn into res;    
        return res;    
      end if;    
      exception when others then    
        -- An exception may be thrown when another parallel session is generating the sequence. Query again, and return the SN.  
        select sn into res from tbl2 where ts=$1;    
        if found then    
          return res;    
        else     
          raise ;    
        end if;    
    end;    
    
    $$
     language plpgsql strict;    

Design 3: With Globally Unique Group ID

The third approach uses a globally unique group ID. If the dictionary contains 4 billion values or less, use INT4. If the dictionary exceeds 4 billion values, we need to use INT8.

  1. Create a sequence, and set the starting value to the minimum value of INT4
    create sequence seq_tbl_dict minvalue -2147483648 start with -2147483648;  

  2. Create a test table
    create table tbl_dict(    
      gid int2,    -- Group ID  
      ts text,     -- Text  
      sn int4 default nextval('public.seq_tbl_dict'::regclass),  -- Sequence  
      unique (gid,ts),  
      unique (sn)  
    );    

  3. Create a UDF. When the text is entered, if the text already exists, the sequence corresponding to the text is returned; if the text does not exist, a unique sequence value is assigned and returned.
    create or replace function get_sn(int2, text) returns int as 
    $$
      
    declare  
      res int;  
    begin    
      -- Optimistic query  
      select sn into res from tbl_dict where gid=$1 and ts=$2;   
      if found then   
        return res;   
      end if;  
      
      -- Inserts one if not found  
      insert into tbl_dict values($1,$2,nextval('public.seq_tbl_dict'::regclass)) on conflict (gid,ts) do nothing returning sn into res;  
      if found then  
        return res;  
      -- In the case of insertion conflicts, it continues the query and returns sn  
      else  
        select sn into res from tbl_dict where gid=$1 and ts=$2;  
        return res;  
      end if;  
    end;  
    
    $$
     language plpgsql strict;  

Batch Operation

select ins1(gid, ts) from (values (),(),.....()) as t(gid, ts);    
    
select ins2(ts) from (values (),(),.....()) as t(ts);    

Example and performance: it takes about 2 milliseconds to assign IDs to 100 text values

select ins1(id, 'test'||id) from generate_series(1,100) t(id);    
    
...........    
       0    
(100 rows)    
    
Time: 1.979 ms    

Write Operation Stress Test

With group ID

vi test1.sql    
\set gid random(1,10)    
\set ts random(1,100000000)    
select ins1(:gid, md5(:ts::text));    
pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 56 -j 56 -T 120    
    
transaction type: ./test1.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 56    
number of threads: 56    
duration: 120 s    
number of transactions actually processed: 18082960    
latency average = 0.232 ms    
latency stddev = 0.517 ms    
tps = 150680.114138 (including connections establishing)    
tps = 150687.227354 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.001  \set gid random(1,10)    
         0.000  \set ts random(1,100000000)    
         0.230  select ins1(:gid, md5(:ts::text));    

Without group ID

vi test2.sql    
\set ts random(1,100000000)    
select ins2(md5(:ts::text));    
pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 56 -j 56 -T 120    
    
transaction type: ./test2.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 56    
number of threads: 56    
duration: 120 s    
number of transactions actually processed: 11515008    
latency average = 0.584 ms    
latency stddev = 0.766 ms    
tps = 95613.170828 (including connections establishing)    
tps = 95618.249995 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.001  \set ts random(1,100000000)    
         0.582  select ins2(md5(:ts::text));    

With globally unique group ID

vi test3.sql    
\set gid random(1,10)    
\set ts random(1,100000000)    
select get_sn(:gid, md5(:ts::text));    
pgbench -M prepared -n -r -P 1 -f ./test3.sql -c 56 -j 56 -T 120    
    
transaction type: ./test3.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 56  
number of threads: 56  
duration: 120 s  
number of transactions actually processed: 7665708  
latency average = 0.877 ms  
latency stddev = 0.666 ms  
tps = 63868.058538 (including connections establishing)  
tps = 63875.166407 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         0.001  \set gid random(1,10)    
         0.000  \set ts random(1,100000000)    
         0.875  select get_sn(:gid, md5(:ts::text));  
postgres=# select * from tbl_dict  limit 10;
 gid |                ts                |     sn      
-----+----------------------------------+-------------
   9 | 8021bdb598f73577a063b50bdf0cef31 | -2147483648
   3 | e1988c3c7a80dcd1b1c1bdcf2ac31fe7 | -2147483646
   7 | 6ee09b73df8ae9bb97a4ebd4c51bd212 | -2147483647
   1 | fa8303da6ea2b6e995a1e090fb9cd9f2 | -2147483645
   7 | ca1c614104f1ad3af92d8d9a2911a5b6 | -2147483643
   8 | 4641dd1162f46e8be5f643facc85df94 | -2147483644
   6 | 88250e10f0d27cdebbf5c5eb4a7032a3 | -2147483641
   2 | 5718da726fd20d8fd12d56e9bf2d7e9e | -2147483642
   1 | 687e553016fe6bd1dba3ca6126b8b5b8 | -2147483639
  10 | a4707645d604dd1ad9ba96ff303cf9d9 | -2147483638
(10 rows)

Gap Percentage

Cause of gaps: irreversible use of the sequence. Even if a transaction fails, the consumed sequence value cannot be returned.

It meets the requirements according to the test.

postgres=# select gid,count(*),min(sn),max(sn),((max(sn)+1)/count(*)::float8-1)*100||' %' from tbl1 group by gid;    
 gid |  count  | min |   max   | ?column?     
-----+---------+-----+---------+----------    
   1 | 1790599 |   0 | 1790598 | 0 %    
   2 | 1793384 |   0 | 1793383 | 0 %    
   3 | 1791533 |   0 | 1791532 | 0 %    
   4 | 1792755 |   0 | 1792754 | 0 %    
   5 | 1793897 |   0 | 1793896 | 0 %    
   6 | 1794786 |   0 | 1794785 | 0 %    
   7 | 1792282 |   0 | 1792281 | 0 %    
   8 | 1790630 |   0 | 1790629 | 0 %    
   9 | 1791303 |   0 | 1791302 | 0 %    
  10 | 1790307 |   0 | 1790306 | 0 %    
(10 rows)    
    
postgres=# select count(*),min(sn),max(sn),((max(sn)+1)/count(*)::float8-1)*100||' %' from tbl2;    
  count   | min |   max    |        ?column?            
----------+-----+----------+------------------------    
 10877124 |   0 | 10877128 | 4.59680334685686e-05 %    
(1 row)    

Read Operation Stress Test

As long as it is full, it just returns SN. So you can just slightly modify the stress test script

vi test1.sql    
\set gid random(1,10)    
\set ts random(1,10000)    
select ins1(:gid, md5(:ts::text));    
    
    
pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 56 -j 56 -T 120    
    
transaction type: ./test1.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 56    
number of threads: 56    
duration: 120 s    
number of transactions actually processed: 69229025    
latency average = 0.097 ms    
latency stddev = 0.040 ms    
tps = 574906.288558 (including connections establishing)    
tps = 575063.117108 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.002  \set gid random(1,10)    
         0.001  \set ts random(1,10000)    
         0.098  select ins1(:gid, md5(:ts::text));    

You can also use SELECT

vi test3.sql    
\set gid random(1,10)    
\set ts random(1,10000)    
select * from tbl1 where gid=:gid and ts=md5(:ts::text);    
    
pgbench -M prepared -n -r -P 1 -f ./test3.sql -c 56 -j 56 -T 120    
    
transaction type: ./test3.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 56    
number of threads: 56    
duration: 120 s    
number of transactions actually processed: 90985807    
latency average = 0.074 ms    
latency stddev = 0.009 ms    
tps = 758067.503368 (including connections establishing)    
tps = 758109.672642 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.001  \set gid random(1,10)    
         0.001  \set ts random(1,10000)    
         0.074  select * from tbl1 where gid=:gid and ts=md5(:ts::text);    

The database itself has not been optimized. The use of the ECS virtual machine environment provides room for performance improvement. You can also shard the database by GID, and easily achieve 1 million QPS.

Summary

Using UDF, sequence and other PostgreSQL functions, you can implement the "Global ID Assignment Service" design as required at the beginning of this article.

Note that the PG instance used here is an ECS virtual machine instance. Its read performance is only half that of physical machines. It can easily achieve a read performance of 1 million QPS when running on physical machines.

For the sake of future scalability, you can assign GID to different instances to achieve horizontal expansion. This allows you to achieve a read performance of 1 million QPS for each single PG instance, and 1 million*N TPS for multiple instances.

Why don't we use the hash function to generate a global dictionary? Wouldn't it be faster? The reason is still global uniqueness. There may be conflicts when using the HASH function. Even the INT 8 HASH function cannot guarantee global uniqueness and the one-to-one relationship between the group ID and the sequence. (Another reason is that the dictionary corresponds to the number of unique values, and does not cause any gaps in sequence values. Therefore, we can use a smaller integer (INT4 is used in this example). Smaller integer means smaller size and faster access speed).

Postgrespro introduced a kernel layer dictionary in json, which makes it unnecessary for the business layer to care about the dictionary.

0 1 0
Share on

digoal

277 posts | 24 followers

You may also like

Comments

digoal

277 posts | 24 followers

Related Products