This topic describes the results of performance tests, including index creation, single-term query, and multi-term query, based on the data of the Wikipedia language database.
Specifications of the test instance
Engine version: AnalyticDB for PostgreSQL V7.0.
Minor version: V7.1.0.0.
Edition: High-availability Edition.
Node specifications: 4 cores and 32 GB memory.
Number of compute nodes: 4.
Node storage capacity: 200 GB.
Disk type: PL1 Enterprise SSD (ESSD).
Coordinator node resources: 8 compute units (CUs).
Test dataset
The test data is taken from a snapshot of the Wikipedia language database and contains 5 million rows of data. Download the packages "wiki-articles-500w-aa" to "wiki-articles-500w-az" from softhuafei/wiki-dataset on GitHub. In this example, the test data is successively imported 20 times to create a dataset that contains 0.1 billion rows of data.
The dataset contains the url, title, and body fields.
The dataset is in the JSON format.
The dataset is 2.34 GB in size.
You can run the cat wiki-articles-500w* > wiki-articles-500w.json.bz2 command to merge the test data files, and then run the bzip2 -dk wiki-articles-500w.json.bz2 command to decompress the package.
Performance test results
The amount of time required to create a BM25 index is approximately 63% of the amount of time required to create a GIN(tsvector) index.
In terms of single-term queries, the queries per second (QPS) performance of a BM25 index is six to seven times the QPS performance of a GIN(tsvector) index.
In terms of multi-term queries, the QPS performance of a BM25 index is six to eleven times the QPS performance of a GIN(tsvector) index.
In terms of single-term AND point queries, the QPS performance of a BM25 index is six to nine times the QPS performance of GIN(tsvector) and B-tree indexes.
In terms of single-term AND range queries, the QPS performance of a BM25 index is one to nine times the QPS performance of GIN(tsvector) and B-tree indexes.
Create an index
Index | Amount of time required to import data | Amount of time required to create an index | Amount of time required for stream importing (with indexes) |
BM25 | 24 minutes | 36 minutes | 84 minutes |
GIN(tsvector) | 137 minutes | 57 minutes | N/A |
Perform a single-term query
Query the top 20 results that contain a term in the body column. Example: body:Gabriel Basso. Among the top 20 results, the average occurrence of the term is 56,685. The following table describes the maximum QPS values and the corresponding response time (RT) values.
Concurrency | BM25 | GIN(tsquery) | ||
QPS | RT (ms) | QPS | RT (ms) | |
1 | 158.86 | 6.45 | 25.97 | 38.35 |
5 | 544.43 | 9.19 | 73.92 | 64.35 |
10 | 738.21 | 12.63 | 89.90 | 128.65 |
20 | 863.07 | 20.69 | 172.83 | 262.89 |
30 | 925.02 | 31.05 | 239.74 | 378.18 |
40 | 1,234.45 | 39.64 | 203.91 | 616.93 |
Perform a multi-term query
Query the top 20 results that contain one or more terms in the body column. Example: body:handsworth OR body:cern. All terms are extracted from the test dataset. The terms whose occurrence is less than 100 are not included. The number of terms contained in each query conforms to the weight distribution parameter [0.03, 0.15, 0.25, 0.25, 0.15, 0.08, 0.04, 0.03, 0.02]. In this case, 3% of queries contain one term and 15% of queries contain two terms.
Concurrency | BM25 | GIN(tsvector) | ||
QPS | RT (ms) | QPS | RT (ms) | |
1 | 216.88 | 4.54 | 31.97 | 31.49 |
5 | 842.12 | 5.86 | 104.89 | 47.27 |
10 | 1,161.79 | 8.58 | 112.89 | 87.40 |
20 | 1,332.61 | 15.03 | 118.87 | 159.25 |
30 | 1,380.56 | 21.66 | 120.93 | 264.10 |
40 | 1,470.47 | 26.73 | 131.87 | 341.69 |
Perform a single-term AND point query
Only a single row of data is matched based on the point query conditions.
SELECT * FROM articles ORDER BY body @@@ pgsearch.config('body:"David" AND id:80000000');
SELECT * FROM articles_gin WHERE body_ts @@ to_tsquery('David') AND id = '80000000';Concurrency | BM25 | GIN(tsvector) and B-tree | ||
QPS | RT (ms) | QPS | RT (ms) | |
1 | 342.92 | 2.81 | 2.00 | 985.66 |
5 | 1,434.90 | 3.30 | 4.99 | 2,401.95 |
10 | 2,293.54 | 4.21 | 6.99 | 4,306.61 |
20 | 3,084.70 | 6.32 | 13.98 | 8,536.26 |
30 | 2,835.97 | 10.38 | 19.98 | 12,565.44 |
40 | 3,081.71 | 12.78 | 24.97 | 17,342.22 |
Perform a single-term AND range query
Ten thousand rows of data are matched and 99.99% of data is filtered out based on the range query conditions.
SELECT * FROM articles WHERE body @@ pgsearch.config('body:"David" AND id:[110000 TO 120000]');
SELECT * FROM articles_gin WHERE body_ts @@ to_tsquery('David') AND id >= '110000' AND id <= '120000';Concurrency | BM25 | GIN(tsvector) and B-tree | ||
QPS | RT (ms) | QPS | RT (ms) | |
1 | 9.99 | 110.55 | 1.00 | 1,189.95 |
5 | 22.97 | 229.57 | 4.99 | 2,074.63 |
10 | 25.97 | 451.81 | 8.99 | 4,236.33 |
20 | 32.97 | 951.86 | 14.99 | 8,674.06 |
30 | 29.97 | 1,366.22 | 19.98 | 13,458.88 |
40 | 39.97 | 2,071.66 | 32.96 | 18,251.81 |