By Digoal
A friend in the Alibaba Cloud community complained that his database showed poor performance when processing thousands of concurrent IP address library queries in MySQL. So, I asked him for his IP address library data, SQL query statements, and table structure in MySQL, and told him I would load the data into PostgreSQL and run stress tests to see how the queries perform there.
This requirement, the need to process thousands of concurrent range lookups, is also common in other business scenarios, such as calculations involving age ranges, income ranges, frequent-activity ranges, geo-location chunks, geometric chunks, and line segments. All these scenarios share the same description dimension: a range. With the development of the Internet of Things (IoT), this type of query will become more and more common in the years to come.
However, a lack of good indexing mechanisms will lead to heavy CPU usage and performance bottlenecks. This article describes the range types introduced in PostgreSQL 9.2 and indexes for range types that significantly improve query performance.
The table structure in MySQL is as follows:
CREATE TABLE ip_address_pool (
id int(10) NOT NULL AUTO_INCREMENT COMMENT 'auto-increment primary key',
start_ip varchar(20) NOT NULL COMMENT 'starting IP',
end_ip varchar(20) NOT NULL COMMENT 'ending IP',
province varchar(128) NOT NULL COMMENT 'province name',
city varchar(128) NOT NULL COMMENT 'city',
region_name varchar(128) NOT NULL COMMENT 'region name',
company_name varchar(128) NOT NULL COMMENT 'company name',
start_ip_decimal bigint(10) DEFAULT NULL,
end_ip_decimal bigint(10) DEFAULT NULL,
PRIMARY KEY (id),
KEY idx_start_ip_Decimal (start_ip_decimal),
KEY idx_end_ip_Decimal (end_ip_decimal)
) ENGINE=InnoDB AUTO_INCREMENT=436820 DEFAULT CHARSET=utf8 COMMENT='IP address table';
The SQL query statement in MySQL is as follows:
select
province,
start_ip_Decimal as startIpDecimal,
end_ip_Decimal as endIpDecimal
from ip_address_pool
where
#{ip}>=start_ip_Decimal and
#{ip}<=end_ip_Decimal;
A total of about 400,000 rows are involved.
Next, since an IP address type was not available in MySQL, the IP addresses were converted into numeric values to implement data storage and match a specific range of IP addresses. Specifically, the 32-bit binary IP addresses were converted into decimal numbers.
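As a quick check of this conversion (the sample address is illustrative), PostgreSQL can do the arithmetic directly, since subtracting one inet value from another yields a bigint:

```sql
-- 1.2.3.4 = 1*2^24 + 2*2^16 + 3*2^8 + 4 = 16909060
select inet '1.2.3.4' - inet '0.0.0.0' as ip_decimal;  -- ip_decimal = 16909060
```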
Note that PostgreSQL 9.2 added several range types, including types that can store ranges of IP addresses and ranges of integer values.
1. Like MySQL, use two fields to store the starting and ending IP numbers respectively.
CREATE TABLE ip_address_pool (
id serial8 primary key,
start_ip inet NOT NULL ,
end_ip inet NOT NULL ,
province varchar(128) NOT NULL ,
city varchar(128) NOT NULL ,
region_name varchar(128) NOT NULL ,
company_name varchar(128) NOT NULL ,
start_ip_decimal bigint ,
end_ip_decimal bigint
) ;
For this method, only one of the following indexes is needed. The B-tree index type in PostgreSQL supports the >=, >, <=, <, and = operators, and the stored CIDR blocks do not overlap.
create index idx_ip_address_pool_sip on ip_address_pool (start_ip_decimal);
create index idx_ip_address_pool_eip on ip_address_pool (end_ip_decimal);
One major question throughout this procedure is how to avoid overlaps. It is impossible to rule out overlapping ranges when, for example, we insert data into the ip_address_pool table and update this table in parallel. Let me explain. Suppose you first query whether the CIDR block to be inserted already exists in the table, then insert it only if it does not. The issue occurs at the moment of insertion: in a concurrent insert scenario, multiple sessions may each conclude that their data does not yet exist in the table, so all of them insert it, and the concurrently inserted ranges may overlap. In MySQL, only a full table lock can avoid this problem. In PostgreSQL, a full table lock is not needed, because range types make it possible to create an exclusion constraint on the range column, as described in my two previous blog posts about range types.
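To enforce non-overlap at the database level, PostgreSQL can attach an exclusion constraint to a range column. A minimal sketch, assuming the int8range column ip_decimal_segment of the ip_address_pool_3 table defined later in this article:

```sql
-- Any insert or update whose range overlaps (&&) an existing row's range
-- fails with a constraint violation, even under concurrent inserts,
-- so no full table lock is required.
alter table ip_address_pool_3
  add constraint ip_decimal_segment_no_overlap
  exclude using gist (ip_decimal_segment with &&);
```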
2. Use iprange to directly store the IP address range and use the GiST index.
create type iprange as range (subtype=inet);
CREATE TABLE ip_address_pool_2 (
id serial8 primary key,
ip_segment iprange NOT NULL ,
province varchar(128) NOT NULL ,
city varchar(128) NOT NULL ,
region_name varchar(128) NOT NULL ,
company_name varchar(128) NOT NULL
) ;
CREATE INDEX ip_address_pool_2_range ON ip_address_pool_2 USING gist (ip_segment);
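A usage sketch for this table (the sample network block and the placeholder attribute values are illustrative): ranges are built with the automatically created iprange constructor, and the GiST index accelerates the containment operator @>.

```sql
-- store one block with inclusive bounds on both ends
insert into ip_address_pool_2 (ip_segment, province, city, region_name, company_name)
values (iprange(inet '1.0.0.0', inet '1.0.0.255', '[]'), 'p', 'c', 'r', 'n');

-- find the block containing a given address
select province, ip_segment
from ip_address_pool_2
where ip_segment @> inet '1.0.0.100';
```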
3. Use int8range to store the converted numeric range.
CREATE TABLE ip_address_pool_3 (
id serial8 primary key,
start_ip inet NOT NULL ,
end_ip inet NOT NULL ,
province varchar(128) NOT NULL ,
city varchar(128) NOT NULL ,
region_name varchar(128) NOT NULL ,
company_name varchar(128) NOT NULL ,
ip_decimal_segment int8range
) ;
Next, convert the data from the first table to the range type, store it in this table, and use the GiST index type.
insert into ip_address_pool_3 (id, start_ip, end_ip, province, city, region_name, company_name, ip_decimal_segment)
select id, start_ip, end_ip, province, city, region_name, company_name, int8range(start_ip_decimal, end_ip_decimal + 1)
from ip_address_pool;
CREATE INDEX ip_address_pool_3_range ON ip_address_pool_3 USING gist (ip_decimal_segment);
Test script:
\setrandom ip 0 2094967294
select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where :ip>=start_ip_Decimal and :ip<=end_ip_Decimal;
Test result:
pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 60s
number of transactions actually processed: 20389
tps = 339.576580 (including connections establishing)
tps = 339.618604 (excluding connections establishing)
Why is only around 300 transactions per second (TPS) observed? The reason is that the index used by the query is not a composite index. Because this is a range retrieval operation (not an equality match), the retrieval speed heavily depends on where the value falls in the range. Let's retrieve three values (in ascending order) and see how much time each query takes.
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1>=start_ip_Decimal and 1<=end_ip_Decimal;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_ip_address_pool_sip on ip_address_pool (cost=10000000000.00..10000000004.51 rows=1 width=22) (actual time=0.004..0.004 rows=1 loops=1)
Index Cond: (1 >= start_ip_decimal)
Filter: (1 <= end_ip_decimal)
Total runtime: 0.014 ms
(4 rows)
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1123371940>=start_ip_Decimal and 1123371940<=end_ip_Decimal;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_ip_address_pool_sip on ip_address_pool (cost=0.00..3899.49 rows=75277 width=22) (actual time=37.572..37.573 rows=1 loops=1)
Index Cond: (1123371940 >= start_ip_decimal)
Filter: (1123371940 <= end_ip_decimal)
Rows Removed by Filter: 96523
Total runtime: 37.604 ms
(5 rows)
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 4123371940>=start_ip_Decimal and 4123371940<=end_ip_Decimal;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_ip_address_pool_sip on ip_address_pool (cost=0.00..17557.23 rows=1251 width=22) (actual time=168.138..168.139 rows=1 loops=1)
Index Cond: (4123371940::bigint >= start_ip_decimal)
Filter: (4123371940::bigint <= end_ip_decimal)
Rows Removed by Filter: 436810
Total runtime: 168.165 ms
(5 rows)
Create a composite index.
create index idx_ip_address_pool_ip on ip_address_pool (start_ip_decimal,end_ip_decimal);
After the index is created, still use three values to test the response time:
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1>=start_ip_Decimal and 1<=end_ip_Decimal;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_ip_address_pool_ip on ip_address_pool (cost=0.00..4.61 rows=1 width=22) (actual time=0.004..0.005 rows=1 loops=1)
Index Cond: ((1 >= start_ip_decimal) AND (1 <= end_ip_decimal))
Total runtime: 0.014 ms
(3 rows)
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 1123371940>=start_ip_Decimal and 1123371940<=end_ip_Decimal;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_ip_address_pool_ip on ip_address_pool (cost=0.00..8754.53 rows=75277 width=22) (actual time=5.995..5.996 rows=1 loops=1)
Index Cond: ((1123371940 >= start_ip_decimal) AND (1123371940 <= end_ip_decimal))
Total runtime: 6.017 ms
(3 rows)
postgres=# explain analyze select province, start_ip_Decimal as startIpDecimal, end_ip_Decimal as endIpDecimal from ip_address_pool where 4123371940>=start_ip_Decimal and 4123371940<=end_ip_Decimal;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_ip_address_pool_ip on ip_address_pool (cost=0.00..8737.49 rows=1251 width=22) (actual time=27.042..27.044 rows=1 loops=1)
Index Cond: ((4123371940::bigint >= start_ip_decimal) AND (4123371940::bigint <= end_ip_decimal))
Total runtime: 27.079 ms
(3 rows)
Let's see the TPS throughput in this case.
pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 8
number of threads: 8
duration: 60s
number of transactions actually processed: 216400
tps = 3606.368660 (including connections establishing)
tps = 3606.821632 (excluding connections establishing)
Despite a tenfold TPS increase, the performance improvements are still far from enough.
Test script (querying the range-type table ip_address_pool_3):
\setrandom ip 0 2094967294
select province,ip_decimal_segment from ip_address_pool_3 where ip_decimal_segment @> :ip::int8;
Test result:
pg92@db-172-16-3-33-> pgbench -M simple -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 60s
number of transactions actually processed: 3498195
tps = 58301.468890 (including connections establishing)
tps = 58307.865068 (excluding connections establishing)
Using the prepared mode enables variable binding and yields a further performance improvement. The following shows the result of a test performed on an HP DL360 8-core machine purchased around 2010.
pg92@db-172-16-3-33-> pgbench -M prepared -c 8 -j 8 -f ./ip_test.sql -n -T 60 -h 127.0.0.1 -U postgres postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 8
number of threads: 8
duration: 60s
number of transactions actually processed: 4810415
tps = 80171.925111 (including connections establishing)
tps = 80180.458975 (excluding connections establishing)
When the range type is used, the time used for querying the three values is evenly distributed.
postgres=# explain analyze select province,ip_decimal_segment from ip_address_pool_3 where ip_decimal_segment @> int8 '1';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ip_address_pool_3_range on ip_address_pool_3 (cost=0.00..862.55 rows=437 width=38) (actual time=0.034..0.035 rows=1 loops=1)
Index Cond: (ip_decimal_segment @> 1::bigint)
Total runtime: 0.045 ms
(3 rows)
postgres=# explain analyze select province,ip_decimal_segment from ip_address_pool_3 where ip_decimal_segment @> int8 '1123371940';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ip_address_pool_3_range on ip_address_pool_3 (cost=0.00..862.55 rows=437 width=38) (actual time=0.036..0.036 rows=1 loops=1)
Index Cond: (ip_decimal_segment @> 1123371940::bigint)
Total runtime: 0.052 ms
(3 rows)
postgres=# explain analyze select province,ip_decimal_segment from ip_address_pool_3 where ip_decimal_segment @> int8 '4123371940';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ip_address_pool_3_range on ip_address_pool_3 (cost=0.00..862.55 rows=437 width=38) (actual time=0.058..0.059 rows=1 loops=1)
Index Cond: (ip_decimal_segment @> 4123371940::bigint)
Total runtime: 0.069 ms
(3 rows)
Alternatively, a functional index on the two decimal columns achieves the same effect without a separate range column. Example:
CREATE TABLE ip_address_pool (
id serial8 primary key,
start_ip inet NOT NULL ,
end_ip inet NOT NULL ,
province varchar(128) NOT NULL ,
city varchar(128) NOT NULL ,
region_name varchar(128) NOT NULL ,
company_name varchar(128) NOT NULL ,
start_ip_decimal bigint ,
end_ip_decimal bigint
) ;
create index idx_ip_address_1 on ip_address_pool using gist (int8range(start_ip_decimal, end_ip_decimal + 1::int8));
select * from ip_address_pool where int8range(start_ip_decimal, end_ip_decimal+1::int8) @> ? ;
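Note that the planner can use this functional index only when the query repeats the exact expression from the index definition, as the query above does. One way to verify this (the bound value is illustrative; plan output omitted):

```sql
explain
select province from ip_address_pool
where int8range(start_ip_decimal, end_ip_decimal + 1::int8) @> 16909060::int8;
```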