Hologres has supported inverted indexes since V4.0, enabling high-performance full-text search. This article describes how to run a full-text search performance test on Hologres using the PMC dataset and presents the results.
The pmc dataset originates from PubMed Central (PMC), a repository of scientific papers. It contains the full text of approximately 574,000 academic papers, with a raw data size of 23.3 GB (5.9 GB compressed). Each record includes fields such as name (article identifier), journal (journal name), date (publication date), volume (volume number), issue (issue number), accession (PMC number), timestamp, pmid (PubMed ID), and body (full text of the paper). This dataset is widely used as a benchmark to evaluate the performance of search engines and databases in full-text search and academic literature analysis.
Prepare the test environment
Test resources:
- Hologres:
  - Compute resources: 48 CU
  - Version: V4.1.12
  - Number of shards: 6. If you increase the number of compute nodes, increase the number of shards linearly.
- ECS:
  - Instance type: ecs.c9i.16xlarge or ecs.g9i.16xlarge
  - Operating system: Debian 13.2 64-bit
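The shard guidance can be sketched as a small helper. This is only an illustration: it assumes that "linearly" means keeping the ratio used in this test (6 shards for 48 CU), and the function name `suggested_shards` is hypothetical, not part of Hologres or the benchmark tool.

```python
BASELINE_CU = 48      # compute resources used in this test
BASELINE_SHARDS = 6   # shard count used in this test


def suggested_shards(compute_cu: int) -> int:
    """Scale the shard count in proportion to compute (assumed rule)."""
    if compute_cu <= 0:
        raise ValueError("compute_cu must be positive")
    # 48 CU : 6 shards is one shard per 8 CU; round to the nearest whole shard.
    return max(1, round(compute_cu * BASELINE_SHARDS / BASELINE_CU))


if __name__ == "__main__":
    for cu in (48, 96, 192):
        print(cu, "CU ->", suggested_shards(cu), "shards")
```

Treat the result as a starting point and check it against the shard recommendations for your instance specification.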
Environment setup:
- Prepare a Hologres instance
  - Purchase a Hologres instance (V4.1) and create a database.
  - Create a user. For more information, see user management.
- Prepare an ECS instance
  - Purchase an ECS instance.
  - Install dependencies

        # Update the apt cache
        sudo apt update
        # Install the PostgreSQL client to connect to the database
        sudo apt install -y postgresql-client

  - Prepare the dataset: Download and decompress the pmc dataset from the official source:

        mkdir ~/data && cd ~/data
        wget http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/pmc/documents.json.bz2
        bunzip2 -k documents.json.bz2
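After decompression, it can be useful to look at a few records before importing them. The sketch below assumes that documents.json holds one JSON object per line and that the field names match the dataset description above. The helper names are illustrative only.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator

# Field names taken from the dataset description (assumed to match the file).
EXPECTED_FIELDS = {"name", "journal", "date", "volume", "issue",
                   "accession", "timestamp", "pmid", "body"}


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line (newline-delimited JSON)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def missing_fields(record: dict) -> set:
    """Return the expected fields that are absent from a record."""
    return EXPECTED_FIELDS - set(record)


if __name__ == "__main__":
    path = Path.home() / "data" / "documents.json"
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            # Inspect only the first three records; the file is large.
            for rec in islice(iter_records(fh), 3):
                print(rec.get("name"), "missing:", sorted(missing_fields(rec)))
```

Reading the file lazily keeps memory use small even though the raw data is about 23.3 GB.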
Run the performance test
An open-source benchmark tool developed by Hologres automates the performance test process described in this article, including table creation, data import, and index creation. For details about the tool, see the Git project alibabacloud-hologres-benchmark. For a table creation example, see the appendix.
- Install the benchmark tool

      # Create an isolated environment
      sudo apt install -y python3-venv
      python3 -m venv .venv
      # Activate the isolated environment
      source .venv/bin/activate
      python3 -m pip install -U pip
      # Install dependencies
      git clone https://github.com/aliyun/alibabacloud-hologres-benchmark.git
      cd alibabacloud-hologres-benchmark/fulltext_search/pmc
      pip3 install -r requirements.txt

- Modify the configuration file config.json

      {
        "host": "<hologres_endpoint>",
        "port": <hologres_port>,
        "database": "<database_name>",
        "username": "<user_name>",
        "password": "<password>",
        "table_name": "pmc"
      }

- Run the benchmark script

      # Run from the directory where you cloned the repository.
      # Skip this cd if you are already in fulltext_search/pmc.
      cd alibabacloud-hologres-benchmark/fulltext_search/pmc
      # This script runs the full benchmark process, including data import and queries.
      # If the data already exists, the import step is skipped.
      python3 hologres_benchmark.py \
        --config config.json \
        --queries-config benchmark_queries.yaml \
        --data-dir ~/data
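Before starting a long run, a quick check of the configuration file can catch placeholders that were never replaced. The following sketch is not part of the benchmark tool. It assumes the six keys shown in the configuration example are all required, and `check_config` is a hypothetical helper.

```python
import json
from pathlib import Path

# Keys taken from the configuration example (assumed to be required).
REQUIRED_KEYS = ("host", "port", "database", "username", "password", "table_name")


def check_config(cfg: dict) -> list:
    """Return a list of human-readable problems found in the config."""
    problems = []
    for key in REQUIRED_KEYS:
        value = cfg.get(key)
        if value is None or value == "":
            problems.append(f"missing value for '{key}'")
        elif isinstance(value, str) and value.startswith("<") and value.endswith(">"):
            problems.append(f"placeholder not replaced for '{key}'")
    port = cfg.get("port")
    if port is not None and not isinstance(port, int):
        problems.append("'port' should be a number")
    return problems


if __name__ == "__main__":
    path = Path("config.json")
    if path.exists():
        # json.loads raises an error if "port" is still the bare <hologres_port> placeholder.
        cfg = json.loads(path.read_text(encoding="utf-8"))
        for problem in check_config(cfg) or ["config looks complete"]:
            print(problem)
```

The check only looks at the shape of the file. It does not verify that the endpoint is reachable or that the credentials are valid.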
Test results
Results overview
| Metric | Unit | Hologres result |
| --- | --- | --- |
| Data import time | s | 135.361 |
| Index creation time | s | 303.154 |
| Total data preparation time | s | 438.516 |
| Data + index storage | GB | 16 |
| Total query time (100 runs) | s | 8.522 |
Note: Total query time is the time taken to run six different queries, 100 consecutive times each.
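A few derived figures make the overview easier to compare with other systems. The sketch below uses only the numbers reported above and the approximate dataset size from the introduction. It assumes decimal units (1 GB = 1000 MB) and a document count of about 574,000, so the outputs are rough estimates rather than measured values.

```python
IMPORT_S = 135.361       # data import time (s)
INDEX_S = 303.154        # index creation time (s)
TOTAL_PREP_S = 438.516   # reported total data preparation time (s)
RAW_GB = 23.3            # raw dataset size (GB)
STORED_GB = 16           # data + index storage (GB)
DOCS = 574_000           # approximate number of papers


def derived_metrics() -> dict:
    """Compute rough throughput and storage figures from the reported results."""
    return {
        "prep_sum_s": IMPORT_S + INDEX_S,              # should be close to TOTAL_PREP_S
        "import_docs_per_s": DOCS / IMPORT_S,          # approximate import rate
        "import_mb_per_s": RAW_GB * 1000 / IMPORT_S,   # decimal MB per second
        "storage_ratio": STORED_GB / RAW_GB,           # stored size vs raw size
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.3f}")
```

The import and index times sum to within about 0.001 s of the reported total, which is consistent with rounding of the individual figures.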
Performance details: The following table shows the average response time in milliseconds for each query, covering both full-text search queries (such as term and phrase) and aggregate analytics scenarios. Five of the six queries average under 10 ms, and the slowest averages 49 ms.
| Query name | Average time (ms) |
| --- | --- |
| | 6 |
| | 7 |
| | 7 |
| | 7 |
| | 9 |
| | 49 |
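The per-query averages can be cross-checked against the total query time in the overview. The sketch below assumes that the total is simply the sum of all six queries over 100 runs each, as the note under the overview table describes. The averages are rounded to whole milliseconds, so a small gap is expected.

```python
AVG_MS = [6, 7, 7, 7, 9, 49]   # average time per query (ms), from the table
RUNS = 100                     # consecutive runs per query
REPORTED_TOTAL_S = 8.522       # total query time from the overview (s)


def estimated_total_s(avg_ms: list, runs: int) -> float:
    """Estimate total query time in seconds from per-query averages."""
    return sum(avg_ms) * runs / 1000


if __name__ == "__main__":
    estimate = estimated_total_s(AVG_MS, RUNS)
    print(f"estimated: {estimate} s, reported: {REPORTED_TOTAL_S} s")
    print(f"difference: {abs(estimate - REPORTED_TOTAL_S):.3f} s")
```

The estimate of 8.5 s is close to the reported 8.522 s, so the two tables agree within rounding.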
Appendix: Hologres table and index creation
- Create the test table in Hologres

      -- Create a new table group with 6 shards.
      CALL HG_CREATE_TABLE_GROUP ('tg_6', 6);

      -- Create the core table.
      CREATE TABLE pmc (
          id BIGINT PRIMARY KEY,
          name TEXT,
          journal TEXT,
          "date" TEXT,
          volume TEXT,
          issue TEXT,
          accession TEXT,
          "timestamp" timestamptz NOT NULL,
          pmid INTEGER,
          body TEXT
      ) WITH (
          table_group = 'tg_6',
          bitmap_columns = 'journal,"date",volume,issue,accession',
          segment_key = '"timestamp"',
          clustering_key = '"timestamp"',
          distribution_key = 'id'
      );

- Create full-text indexes in Hologres: Create inverted indexes on multiple fields.

      -- Create full-text indexes.
      CREATE INDEX pmc_accession_idx ON public.pmc USING fulltext(accession)
      WITH (
          tokenizer = 'keyword',
          analyzer_params = '{"tokenizer":{"type":"keyword"}}'
      );
      CREATE INDEX pmc_body_idx ON public.pmc USING fulltext(body)
      WITH (
          tokenizer = 'standard',
          analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
      );
      CREATE INDEX pmc_date_idx ON public.pmc USING fulltext(date)
      WITH (
          tokenizer = 'standard',
          analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
      );
      CREATE INDEX pmc_issue_idx ON public.pmc USING fulltext(issue)
      WITH (
          tokenizer = 'standard',
          analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
      );
      CREATE INDEX pmc_journal_idx ON public.pmc USING fulltext(journal)
      WITH (
          tokenizer = 'standard',
          analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
      );
      CREATE INDEX pmc_name_idx ON public.pmc USING fulltext(name)
      WITH (
          tokenizer = 'keyword',
          analyzer_params = '{"tokenizer":{"type":"keyword"}}'
      );
      CREATE INDEX pmc_volume_idx ON public.pmc USING fulltext(volume)
      WITH (
          tokenizer = 'standard',
          analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
      );

      -- Perform a full index build.
      VACUUM pmc;
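The seven index statements follow one pattern: identifier-like columns (accession, name) use the keyword tokenizer, and the remaining columns use the standard tokenizer with a lowercase filter. As a sketch of that pattern, the following script regenerates the same statements from a column-to-tokenizer mapping. The helper `create_index_sql` is hypothetical and only builds SQL strings. It does not connect to Hologres.

```python
import json

# Analyzer settings copied from the statements above.
STANDARD = {"filter": ["lowercase"],
            "tokenizer": {"max_token_length": 255, "type": "standard"}}
KEYWORD = {"tokenizer": {"type": "keyword"}}

# Column-to-tokenizer mapping, mirroring the appendix.
COLUMN_TOKENIZERS = {
    "accession": "keyword",
    "body": "standard",
    "date": "standard",
    "issue": "standard",
    "journal": "standard",
    "name": "keyword",
    "volume": "standard",
}


def create_index_sql(table: str, column: str, tokenizer: str) -> str:
    """Build one CREATE INDEX statement in the same shape as the appendix."""
    params = KEYWORD if tokenizer == "keyword" else STANDARD
    # Compact separators reproduce the JSON formatting used in the appendix.
    params_json = json.dumps(params, separators=(",", ":"))
    return (
        f"CREATE INDEX {table}_{column}_idx ON public.{table} "
        f"USING fulltext({column}) "
        f"WITH (tokenizer = '{tokenizer}', analyzer_params = '{params_json}');"
    )


if __name__ == "__main__":
    for column, tokenizer in COLUMN_TOKENIZERS.items():
        print(create_index_sql("pmc", column, tokenizer))
    print("VACUUM pmc;")
```

Keeping the mapping in one place makes it easier to try a different tokenizer on a single column and rerun the benchmark.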