Evaluating the performance of a database usually involves either industry standard testing or or modeling testing based on the business model.
For example, PostgreSQL pgbench supports both tpc-b testing and custom modeling testing; benchmarksql supports tpc-c testing; and gp_tpch supports tpc-h testing.
These testing methods can only be used after the database has been built. Is there any way to evaluate performance before building the database?
These indicators have the biggest influence on the performance of a database:
• CPU clock speed
• CPU instruction set
• Number of CPU cores
• Memory clock speed, bus bandwidth
• Hard drive capacity - random IOPS performance
• Hard drive capacity - sequential IOPS performance
• Hard drive bandwidth
• Network bandwidth
For a Greenplum database, the main influencers are:
1.CPU clock speed
Determines the calculation speed of the database. But what operations involve calculations? For example, WHERE clause filtering, operator calculations in a SELECT sub-statement, aggregate calculations, sorting, etc.
2.CPU instruction set
The instruction set determines the performance of certain optimizations in the database. For example, vector computation.
3.Number of CPU cores
CPU clock speed determines the computing capacity of a single CPU core, while the number of CPU cores determines the parallel computing capacity of the database.
4.Memory clock speed, bus bandwidth
When reading/writing in the memory, the memory clock speed and bus bandwidth determine the total read/write capacity, which is a very important factor.
For example, for DDR 2 667, the bandwidth is 64bit×667MHz÷8≈5.3GB/s. If it is a dual channel memory, we have to multiply by 2, so the memory data bandwidth of a dual channel DDR 2 667 is 10.6GB/s.
For example, the maximum memory read/write bandwidth is
64*2*2400/8/1024= 37.5 GB/s dmidecode --type 17 Array Handle: 0x0034 Error Information Handle: Not Provided Total Width: 72 bits ## ECC, 64+8 Data Width: 72 bits Size: 32 GB Form Factor: DIMM Set: None Locator: CPU0_A0 Bank Locator: NODE 1 Type: DDR4 Type Detail: Speed: 2400 MHz Manufacturer: Serial Number: Asset Tag: Part Number: Rank: 2 Configured Clock Speed: 2133 MHz
Note: This is the technical hard cap of the memory and will not typically be reached by a single CPU core.
What is the computing speed of a single CPU core? We can find out with a simple test.
Memory speed #dd if=/dev/zero of=/dev/null bs=4k count=1024000000 ^C68517474+0 records in 68517473+0 records out 280647569408 bytes (281 GB) copied, 34.1855 s, 8.2 GB/s Block device speed #dd if=/dev/Block device name of=/dev/null bs=4k count=102300000 ^C2687957+0 records in 2687956+0 records out 11009867776 bytes (11 GB) copied, 4.6525 s, 2.4 GB/s
In reality, the speed of a single core can't possibly reach 8.2 GB/s when used in calculations for a database.
5.Hard drive capacity - random IOPS performance
Random IOPS performance will be involved in index access, and (concurrent) access to data in the same hard drive by multiple sessions or processes.
(Cached read can improve the performance of parallel access, nearing sequential IOPS performance.)
6.Hard drive capacity - sequential IOPS performance
Leaving alone concurrencies, an AP system generally reads/writes files in a sequential manner as long as it is not an index scan.
7.Hard drive bandwidth and interface speed
The bandwidth interface speeds of a hard drive determine its maximum data scanning speed.
For example, some manufacturers may present such data about read/write bandwidth:
Note: this is the technical limit of the hard drive. The computing capacity of a single CPU core cannot reach this limit in most of the cases.
Network bandwidth determines the data import speed, as well as the redistribution speed during JOIN operations.
A single host may have multiple NICs and data nodes. The network bandwidth is estimated based on the total output bandwidth. For example, if there are ten hosts in a GP cluster and each has two 10 GB NICs, then the total network bandwidth would be 200 GB.
The shortcoming of a distributed system is that the slowest node determines the overall processing time. This becomes a larger issue in case of data skew.
The above are the main influencing factors for the performance of a database. Then how can we evaluate the SQL response speed based on these factors?
The cost model of PostgreSQL includes some of these factors, so we can work out the final SQL running cost based on the cost calculation formula and statistical data. If we align the cost with the time, we will be able to see the SQL running time.
The above evaluation still requires a database, and data (or statistical information of data) to be imported into the database. How can we evaluate the SQL response time if we only have hardware and data indicators, instead of an actual database? We can take samples of the formula, and then use the database cluster and data indicators as well as our intended SQL requests to evaluate expected performance.
We can simplify the evaluation model, as the CPU has significant effect (e.g. LLVM, vector optimization, or other optimizations) on the results. Here, I choose to ignore the deviation introduced by the CPU. We will not take into account data skew either.
We will discuss how to conduct a performance evaluation with the following environment as an example.
• Hard drive – 2 hard drives, with a read/write bandwidth of 2 GB/s each, to be made into one hard drive through LVM. The total bandwidth is 4 GB/s
• Memory – 512 GB, with a read/write bandwidth of 37.5 GB/s
• CPU – 2.5 GHz, 32-Core
• NIC – Two 10 GB NICs
• Number of machines – 8
• Number of data nodes on each machine – 16 data nodes
Performance indicators obtained in another environment
Let’s take integer data for example:
postgres=# create table mmtest(id int) with (appendonly=true, blocksize=2097152, ORIENTATION=COLUMN); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE postgres=# insert into mmtest select generate_series(1,100000); INSERT 0 100000 insert into mmtest select * from mmtest ; ... postgres=# insert into mmtest select * from mmtest ; INSERT 0 409600000 postgres=# select pg_size_pretty(pg_total_relation_size('mmtest')); pg_size_pretty ---------------- 3133 MB (1 row) postgres=# select count(*) from mmtest ; count ----------- 819200000 (1 row) Time: 779.444 ms postgres=# select * from mmtest where id=0; id ---- (0 rows) Time: 422.538 ms
postgres=# create table mmtest1(id int) postgres-# ; NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE Time: 273.659 ms postgres=# insert into mmtest1 select * from mmtest; postgres=# select pg_size_pretty(pg_total_relation_size('mmtest1')); pg_size_pretty ---------------- 28 GB (1 row) postgres=# select count(*) from mmtest1 ; count ----------- 819200000 (1 row) Time: 1171.229 ms postgres=# select * from mmtest1 where id=0; id ---- (0 rows) Time: 452.582 ms
create unlogged table mmtest(id int); postgres=# insert into mmtest select generate_series(1,100000); INSERT 0 100000 insert into mmtest select * from mmtest ; ... postgres=# insert into mmtest select * from mmtest ; INSERT 0 409600000 postgres=# select pg_size_pretty(pg_total_relation_size('mmtest')); pg_size_pretty ---------------- 28 GB (1 row) postgres=# select * from mmtest where id=0; id ---- (0 rows) Time: 56410.222 ms (00:56.410) 32 Parallel computing 3.02 seconds
|GP column-store||Single core integer data filtering speed of 40 million rows/s
Entire machine integer data filtering speed of 1.88 billion rows/s
(Including scanning time)
|GP row-store||Single core integer data filtering speed of 37 million rows/s
Entire machine integer data filtering speed of 1.77 billion rows/s
(Including scanning time)
|PG row-store||Single core integer data filtering speed of 15 million rows/s
Entire machine integer data filtering speed of 264.9 million rows/s
(Including scanning time)
Alibaba Clouder - December 12, 2017
Michael Peng - September 24, 2019
Alibaba Clouder - May 20, 2020
Alibaba Clouder - July 1, 2020
Alibaba Clouder - June 5, 2019
ApsaraDB - April 28, 2020
An on-demand database hosting service for MySQL, SQL Server and PostgreSQL with automated monitoring, backup and disaster recovery capabilitiesLearn More
An online MPP warehousing service based on the Greenplum Database open source programLearn More
Supports data migration and data synchronization between data engines, such as relational database, NoSQL and OLAPLearn More
More Posts by Alibaba Clouder