
Range Types in PostgreSQL and GiST Indexes Increase the Speed of MySQL Index Combine Queries

This article describes the range types introduced in PostgreSQL 9.2 and indexes for range types that significantly improve query performance.

By Digoal

A friend in the Alibaba Cloud community complained that their database did not perform well when processing thousands of concurrent IP address library queries in MySQL. So, I asked him for his IP address library data, SQL query statements, and the table structure in MySQL, and told him I would transfer the data into PostgreSQL and run stress tests to see the query performance there.

This requirement, the need to process thousands of concurrent IP address library queries, is also common in other business scenarios, such as calculations involving age ranges, income ranges, frequent activity ranges, geographic location chunks, geometric chunks, and line segments. All of these scenarios share the same descriptive dimension: a range. With the development of the Internet of Things (IoT), this type of query will become more and more common in the years to come.

However, a lack of good indexing mechanisms leads to heavy CPU usage and performance bottlenecks. The range types introduced in PostgreSQL 9.2, together with the indexes that support them, address exactly this problem.

Test Procedure

The table structure in MySQL is as follows:

CREATE TABLE ip_address_pool (  
  id int(10) NOT NULL AUTO_INCREMENT COMMENT 'auto-increment primary key',  
  start_ip varchar(20) NOT NULL COMMENT 'starting IP',  
  end_ip varchar(20) NOT NULL COMMENT 'ending IP',  
  province varchar(128) NOT NULL COMMENT 'province name',  
  city varchar(128) NOT NULL COMMENT 'city',  
  region_name varchar(128) NOT NULL COMMENT 'region name',  
  company_name varchar(128) NOT NULL COMMENT 'company name',  
  start_ip_decimal bigint(10) DEFAULT NULL,  
  end_ip_decimal bigint(10) DEFAULT NULL,  
  PRIMARY KEY (id),  
  KEY idx_start_ip_Decimal (start_ip_decimal),  
  KEY idx_end_ip_Decimal (end_ip_decimal)  
) ENGINE=InnoDB AUTO_INCREMENT=436820 DEFAULT CHARSET=utf8 COMMENT='IP address table';  

The SQL query statement in MySQL is as follows:

select   
  province,  
  start_ip_Decimal as startIpDecimal,  
  end_ip_Decimal as endIpDecimal  
  from ip_address_pool  
  where  
  #{ip}>=start_ip_Decimal and  
  #{ip}<=end_ip_Decimal;  

A total of about 400,000 rows are involved.

Next, since MySQL has no native IP address type, the IP addresses are converted into numeric values to implement data storage and match a specific range of IP addresses. Specifically, the 32-bit binary IP addresses are converted into decimal numbers.
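To make the conversion concrete, here is a sketch (the sample address is illustrative; MySQL's built-in INET_ATON and PostgreSQL's inet subtraction both yield the same decimal value):

```sql
-- MySQL: INET_ATON('1.2.3.4') returns 16909060

-- PostgreSQL: subtracting two inet values yields a bigint,
-- so an address's decimal form is its offset from 0.0.0.0
SELECT inet '1.2.3.4' - inet '0.0.0.0';
-- 16909060, i.e. 1*2^24 + 2*2^16 + 3*2^8 + 4
```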

Note that PostgreSQL 9.2 adds several range types, for example, range types that can be used to store a range of IP addresses or integer values.
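For example, the built-in int8range type ships with containment (@>) and overlap (&&) operators out of the box (a minimal sketch with illustrative values):

```sql
SELECT int8range(10, 20) @> 15::int8;           -- true: 15 falls inside the range
SELECT int8range(10, 20) && int8range(15, 30);  -- true: the two ranges overlap
```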

Range Types You Can Use in PostgreSQL

  1. Like MySQL, use two fields to store the starting and ending IP numbers respectively.
  2. Use iprange to directly store the IP address range.
  3. Use int8range to store the converted numeric range.

Table Structures for these Methods

1.  Like MySQL, use two fields to store the starting and ending IP numbers respectively.

CREATE TABLE ip_address_pool (  
  id serial8 primary key,  
  start_ip inet NOT NULL ,  
  end_ip inet NOT NULL ,  
  province varchar(128) NOT NULL ,  
  city varchar(128) NOT NULL ,  
  region_name varchar(128) NOT NULL ,  
  company_name varchar(128) NOT NULL ,  
  start_ip_decimal bigint ,  
  end_ip_decimal bigint   
) ;  

For this method, only one of the following indexes is needed, because the B-tree index type in PostgreSQL supports the >=, >, <=, <, and = operators, and the CIDR blocks in this data set do not overlap.

Avoid Data Interleaving in Tables

One major question throughout this procedure is how to avoid overlapping ranges. For example, it is impossible to rule out overlaps when data is inserted into the ip_address_pool table and the table is updated in parallel.

Let me explain. Say you first query whether the CIDR blocks to be inserted already exist in the table, and then insert them if they don't. The problem occurs at the moment of the insert: in a concurrent insert scenario, multiple threads may each conclude that the data to be inserted does not yet exist, so all of them insert, and the concurrently inserted rows may overlap.

In MySQL, only a full table lock can avoid this problem. Full table locks are not needed in PostgreSQL, because PostgreSQL allows range types to be used in exclusion constraints, which reject overlapping rows at insert time. I have described this in two previous blog posts about range types.
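As a sketch of such an exclusion constraint (the table name here is hypothetical; requires PostgreSQL 9.2+):

```sql
CREATE TABLE ip_segments (
  id serial8 primary key,
  seg int8range NOT NULL,
  -- reject any new row whose range overlaps (&&) an existing row's range
  EXCLUDE USING gist (seg WITH &&)
);

INSERT INTO ip_segments (seg) VALUES (int8range(1, 100));   -- succeeds
-- the following overlapping insert fails with a conflict error,
-- even under concurrent load, without any table lock:
-- INSERT INTO ip_segments (seg) VALUES (int8range(50, 150));
```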

create index idx_ip_address_pool_sip on ip_address_pool (start_ip_decimal);  
create index idx_ip_address_pool_eip on ip_address_pool (end_ip_decimal);  

2.  Use iprange to directly store the IP address range and use the GiST index.

create type iprange as range (subtype=inet);  
  
CREATE TABLE ip_address_pool_2 (  
  id serial8 primary key,  
  ip_segment iprange NOT NULL ,  
  province varchar(128) NOT NULL ,  
  city varchar(128) NOT NULL ,  
  region_name varchar(128) NOT NULL ,  
  company_name varchar(128) NOT NULL  
) ;  

CREATE INDEX ip_address_pool_2_range ON ip_address_pool_2 USING gist (ip_segment);  

3.  Use int8range to store the converted numeric range

CREATE TABLE ip_address_pool_3 (  
  id serial8 primary key,  
  start_ip inet NOT NULL ,  
  end_ip inet NOT NULL ,  
  province varchar(128) NOT NULL ,  
  city varchar(128) NOT NULL ,  
  region_name varchar(128) NOT NULL ,  
  company_name varchar(128) NOT NULL ,  
  ip_decimal_segment int8range  
) ;  

Next, convert the data from the first table to the range type, store it in this table, and use the GiST index type.

insert into ip_address_pool_3
  (id, start_ip, end_ip, province, city, region_name, company_name, ip_decimal_segment)
select
  id, start_ip, end_ip, province, city, region_name, company_name,
  int8range(start_ip_decimal, end_ip_decimal + 1)
from ip_address_pool;
  
CREATE INDEX ip_address_pool_3_range ON ip_address_pool_3 USING gist (ip_decimal_segment);  
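A note on the `end_ip_decimal+1` in the conversion above: int8range is half-open (`[)`) by default, so the upper bound is excluded. Adding 1 keeps the ending IP itself inside the range (a sketch with illustrative values):

```sql
SELECT int8range(100, 200) @> 200::int8;      -- false: upper bound is excluded
SELECT int8range(100, 200 + 1) @> 200::int8;  -- true: +1 makes 200 inclusive
```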

Performance Testing for the First and Third Options in PostgreSQL

Option 1

Test script:

\setrandom ip 0 2094967294  
select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where :ip>=start_ip_Decimal and :ip<=end_ip_Decimal;  

Test result:

pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: simple  
number of clients: 8  
number of threads: 8  
duration: 60s  
number of transactions actually processed: 20389  
tps = 339.576580 (including connections establishing)  
tps = 339.618604 (excluding connections establishing)  

Why is only around 300 transactions per second (TPS) observed? The reason is that the index used is not a composite index. Because this is a range retrieval (not an equality match), the retrieval speed heavily depends on where the queried value falls. Let's retrieve three values (in ascending order) and see how much time each query takes.

postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1>=start_ip_Decimal and 1<=end_ip_Decimal;  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_ip_address_pool_sip on ip_address_pool  (cost=10000000000.00..10000000004.51 rows=1 width=22) (actual time=0.004..0.004 rows=1 loops=1)  
   Index Cond: (1 >= start_ip_decimal)  
   Filter: (1 <= end_ip_decimal)  
 Total runtime: 0.014 ms  
(4 rows)  
  
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1123371940>=start_ip_Decimal and 1123371940<=end_ip_Decimal;  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_ip_address_pool_sip on ip_address_pool  (cost=0.00..3899.49 rows=75277 width=22) (actual time=37.572..37.573 rows=1 loops=1)  
   Index Cond: (1123371940 >= start_ip_decimal)  
   Filter: (1123371940 <= end_ip_decimal)  
   Rows Removed by Filter: 96523  
 Total runtime: 37.604 ms  
(5 rows)  
  
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 4123371940>=start_ip_Decimal and 4123371940<=end_ip_Decimal;  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_ip_address_pool_sip on ip_address_pool  (cost=0.00..17557.23 rows=1251 width=22) (actual time=168.138..168.139 rows=1 loops=1)  
   Index Cond: (4123371940::bigint >= start_ip_decimal)  
   Filter: (4123371940::bigint <= end_ip_decimal)  
   Rows Removed by Filter: 436810  
 Total runtime: 168.165 ms  
(5 rows)  

Create a composite index.

create index idx_ip_address_pool_ip on ip_address_pool (start_ip_decimal,end_ip_decimal);  

After the index is created, still use three values to test the response time:

postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1>=start_ip_Decimal and 1<=end_ip_Decimal;  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_ip_address_pool_ip on ip_address_pool  (cost=0.00..4.61 rows=1 width=22) (actual time=0.004..0.005 rows=1 loops=1)  
   Index Cond: ((1 >= start_ip_decimal) AND (1 <= end_ip_decimal))  
 Total runtime: 0.014 ms  
(3 rows)  
  
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1123371940>=start_ip_Decimal and 1123371940<=end_ip_Decimal;  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_ip_address_pool_ip on ip_address_pool  (cost=0.00..8754.53 rows=75277 width=22) (actual time=5.995..5.996 rows=1 loops=1)  
   Index Cond: ((1123371940 >= start_ip_decimal) AND (1123371940 <= end_ip_decimal))  
 Total runtime: 6.017 ms  
(3 rows)  
  
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 4123371940>=start_ip_Decimal and 4123371940<=end_ip_Decimal;  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_ip_address_pool_ip on ip_address_pool  (cost=0.00..8737.49 rows=1251 width=22) (actual time=27.042..27.044 rows=1 loops=1)  
   Index Cond: ((4123371940::bigint >= start_ip_decimal) AND (4123371940::bigint <= end_ip_decimal))  
 Total runtime: 27.079 ms  
(3 rows)  

Let's see the TPS throughput in this case.

pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 8  
number of threads: 8  
duration: 60s  
number of transactions actually processed: 216400  
tps = 3606.368660 (including connections establishing)  
tps = 3606.821632 (excluding connections establishing)  

Despite a tenfold TPS increase, the performance improvements are still far from enough.

Option 3

Test script:

\setrandom ip 0 2094967294  
select province,ip_decimal_segment  from ip_address_pool_3 where ip_decimal_segment @> :ip::int8;  

Test result:

pg92@db-172-16-3-33-> pgbench -M simple -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: simple  
number of clients: 8  
number of threads: 8  
duration: 60s  
number of transactions actually processed: 3498195  
tps = 58301.468890 (including connections establishing)  
tps = 58307.865068 (excluding connections establishing)  

Using the prepared query mode enables variable binding and improves performance further. The following shows the result of a test performed on an HP DL360 8-core machine purchased around 2010.

pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 8  
number of threads: 8  
duration: 60s  
number of transactions actually processed: 4810415  
tps = 80171.925111 (including connections establishing)  
tps = 80180.458975 (excluding connections establishing)  

When the range type is used, the time used for querying the three values is evenly distributed.

postgres=# explain analyze select province,ip_decimal_segment  from ip_address_pool_3 where ip_decimal_segment @> int8 '1';  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ip_address_pool_3_range on ip_address_pool_3  (cost=0.00..862.55 rows=437 width=38) (actual time=0.034..0.035 rows=1 loops=1)  
   Index Cond: (ip_decimal_segment @> 1::bigint)  
 Total runtime: 0.045 ms  
(3 rows)  
  
postgres=# explain analyze select province,ip_decimal_segment  from ip_address_pool_3 where ip_decimal_segment @> int8 '1123371940';  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ip_address_pool_3_range on ip_address_pool_3  (cost=0.00..862.55 rows=437 width=38) (actual time=0.036..0.036 rows=1 loops=1)  
   Index Cond: (ip_decimal_segment @> 1123371940::bigint)  
 Total runtime: 0.052 ms  
(3 rows)  
  
postgres=# explain analyze select province,ip_decimal_segment  from ip_address_pool_3 where ip_decimal_segment @> int8 '4123371940';  
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ip_address_pool_3_range on ip_address_pool_3  (cost=0.00..862.55 rows=437 width=38) (actual time=0.058..0.059 rows=1 loops=1)  
   Index Cond: (ip_decimal_segment @> 4123371940::bigint)  
 Total runtime: 0.069 ms  
(3 rows)  

Using Functional Indexes without Modifying Original Table Structures

  1. PostgreSQL supports functional indexes. We can use functional indexes to speed up queries without modifying original table structures.

Example:

CREATE TABLE ip_address_pool (  
  id serial8 primary key,  
  start_ip inet NOT NULL ,  
  end_ip inet NOT NULL ,  
  province varchar(128) NOT NULL ,  
  city varchar(128) NOT NULL ,  
  region_name varchar(128) NOT NULL ,  
  company_name varchar(128) NOT NULL ,  
  start_ip_decimal bigint ,  
  end_ip_decimal bigint   
) ;  
  
create index idx_ip_address_1 on ip_address_pool using gist (int8range(start_ip_decimal, end_ip_decimal+1::int8));  
select * from ip_address_pool where int8range(start_ip_decimal, end_ip_decimal+1::int8) @> ? ;  

Additional Considerations

  1. Random numbers generated by pgbench are signed int4 values, so they cannot reach the maximum IPv4 value in this case (about 4 billion). Therefore, the test uses the range from 0 to about 2.1 billion. This does not affect the test result. To make pgbench support int8, the pgbench source code would need to be modified.
  2. In addition to using the GiST index type in PostgreSQL, range types in PostgreSQL can also be used to implement exclusion constraints, that is, to avoid data overlapping. In MySQL, only full table locks can be used to do the same thing.
  3. In MySQL, OSDB can be used to perform tests. See OSDB source code to learn more.
  4. When storing IP data in PostgreSQL, we can also use masking, which allows us to store a single field instead of two. We can also add a field that stores bits and use bit functions to handle the containment relationship. Another method is to run the bit operation in memory: store the IP bits and the IDs of the corresponding records in the database, then query the database after the IDs are obtained, that is, switch from a range query to a primary key query. This method can also improve efficiency.
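Regarding item 4, PostgreSQL's built-in cidr/inet types already support mask-based network containment, so a single network column can stand in for the start/end pair (a sketch with illustrative addresses):

```sql
-- '<<' tests whether an address is strictly contained in a network
SELECT inet '192.168.1.5' << cidr '192.168.1.0/24';   -- true
SELECT inet '192.168.2.5' << cidr '192.168.1.0/24';   -- false
```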