Full-text search performance test on the PMC dataset - Hologres

Hologres has supported inverted indexes since V4.0, enabling high-performance full-text search. This article describes how to run a full-text search performance test on Hologres using the PMC dataset and presents the results.

The PMC dataset originates from PubMed Central (PMC), a repository of scientific papers. It contains the full text of approximately 574,000 academic papers, with a raw data size of 23.3 GB (5.9 GB compressed). Each record includes the following fields: name (article identifier), journal (journal name), date (publication date), volume (volume number), issue (issue number), accession (PMC number), timestamp, pmid (PubMed ID), and body (full text of the paper). This dataset is widely used as a benchmark to evaluate the performance of search engines and databases in full-text search and academic literature analysis.

Prepare the test environment

Test resources:

Hologres:
- Compute resources: 48 CU
- Version: V4.1.12
- Number of shards: 6. If you increase the number of compute nodes, increase the number of shards linearly.
ECS:
- Instance type: ecs.c9i.16xlarge or ecs.g9i.16xlarge
- Operating system: Debian 13.2 64-bit
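
The shard guidance above amounts to proportional scaling. The following is a minimal sketch, assuming the 48 CU and 6 shards of this test as the baseline and assuming that compute nodes scale with CU. `scaled_shard_count` is a hypothetical helper, not part of Hologres or the benchmark tool.

```python
def scaled_shard_count(compute_cu: int, base_cu: int = 48, base_shards: int = 6) -> int:
    """Scale the shard count linearly with compute, from the 48 CU / 6 shards baseline."""
    if compute_cu <= 0:
        raise ValueError("compute_cu must be positive")
    # Integer arithmetic, rounded to the nearest shard; never go below one shard.
    return max(1, (base_shards * compute_cu + base_cu // 2) // base_cu)


if __name__ == "__main__":
    for cu in (48, 96, 192):
        print(cu, "CU ->", scaled_shard_count(cu), "shards")
```

Treat the result as a starting point only; the appropriate shard count for a given instance may also depend on data volume and workload.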

Environment setup:

Prepare a Hologres instance
- Purchase a Hologres instance (V4.1) and create a database.
- Create a user. For more information, see user management.

Prepare an ECS instance

Purchase an ECS instance.

Install dependencies

# Update the apt cache
sudo apt update
# Install the PostgreSQL client to connect to the database
sudo apt install -y postgresql-client

Prepare the dataset: Download and decompress the PMC dataset from the official source:

mkdir -p ~/data && cd ~/data
wget http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/pmc/documents.json.bz2
bunzip2 -k documents.json.bz2
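
Before importing, it can help to confirm that the decompressed file has the expected shape. The following is a minimal sketch, assuming that documents.json holds one JSON object per line and uses the field names listed in the dataset description. `check_records` is a hypothetical helper, not part of the benchmark tool.

```python
import itertools
import json
import os

# Field names taken from the dataset description in this article.
EXPECTED_FIELDS = {
    "name", "journal", "date", "volume", "issue",
    "accession", "timestamp", "pmid", "body",
}


def check_records(lines, limit=5):
    """Parse up to `limit` non-empty lines and return the missing fields per record."""
    report = []
    non_empty = (line for line in lines if line.strip())
    for line in itertools.islice(non_empty, limit):
        record = json.loads(line)
        report.append(sorted(EXPECTED_FIELDS - set(record)))
    return report


if __name__ == "__main__":
    path = os.path.expanduser("~/data/documents.json")
    with open(path, encoding="utf-8") as handle:
        for index, missing in enumerate(check_records(handle)):
            print(f"record {index}: missing fields = {missing}")
```

An empty list for every sampled record suggests that the file matches the schema used for the test table.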

Run the performance test

An open-source benchmark tool developed by Hologres automates the performance test process described in this article, including table creation, data import, and index creation. For details about the tool, see the Git project alibabacloud-hologres-benchmark. For a table creation example, see the appendix.

Install the benchmark tool

# Create an isolated environment
sudo apt install -y python3-venv
python3 -m venv .venv

# Activate the isolated environment
source .venv/bin/activate
python3 -m pip install -U pip

# Install dependencies
git clone https://github.com/aliyun/alibabacloud-hologres-benchmark.git
cd alibabacloud-hologres-benchmark/fulltext_search/pmc
pip3 install -r requirements.txt

Modify the configuration file

Replace the placeholders in config.json with the connection information of your Hologres instance:

{
  "host": "<hologres_endpoint>",
  "port": <hologres_port>,
  "database": "<database_name>",
  "username": "<user_name>",
  "password": "<password>",
  "table_name": "pmc"
}
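
A quick pre-flight check can catch placeholders that were left in the file. The following is a minimal sketch, assuming the keys shown above and that the file is saved as config.json (the name passed to --config when the script is run). `validate_config` is a hypothetical helper, not part of the benchmark tool.

```python
import json

REQUIRED_KEYS = ("host", "port", "database", "username", "password", "table_name")


def validate_config(config: dict) -> list:
    """Return a list of problems found in the config; an empty list means it looks usable."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if value is None or value == "":
            problems.append(f"missing value for '{key}'")
        elif isinstance(value, str) and value.startswith("<") and value.endswith(">"):
            problems.append(f"placeholder not replaced for '{key}'")
    port = config.get("port")
    if port is not None and not (isinstance(port, int) and 0 < port < 65536):
        problems.append("'port' must be an integer between 1 and 65535")
    return problems


if __name__ == "__main__":
    # json.load raises an error if <hologres_port> was left unquoted in the file.
    with open("config.json", encoding="utf-8") as handle:
        for problem in validate_config(json.load(handle)):
            print(problem)
```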

Run the benchmark script

# Skip this step if you are still in this directory after the installation step.
cd alibabacloud-hologres-benchmark/fulltext_search/pmc

# This script runs the full benchmark process, including data import and queries.
# If the data already exists, the import step is skipped.
python3 hologres_benchmark.py \
    --config config.json \
    --queries-config benchmark_queries.yaml \
    --data-dir ~/data

Test results

Results overview

Metric	Unit	Hologres result
Data import time	s	135.361
Index creation time	s	303.154
Total data preparation time	s	438.516
Data + index storage	GB	16
Total query time (100 runs)	s	8.522

Note: Total query time is the time taken to run six different queries, 100 consecutive times each.
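
The overview figures can be turned into rough throughput numbers. The following sketch uses only values stated in this article. The document count (about 574,000) and the raw size (23.3 GB) are the approximate dataset figures from the introduction, so the derived rates are approximate as well.

```python
# Values copied from the results overview and the dataset description.
IMPORT_SECONDS = 135.361
INDEX_SECONDS = 303.154
REPORTED_PREP_SECONDS = 438.516
RAW_SIZE_GB = 23.3
STORED_SIZE_GB = 16
DOCUMENT_COUNT = 574_000  # approximate, as stated in the introduction


def derived_metrics() -> dict:
    """Compute rough import throughput and the storage ratio from the reported figures."""
    return {
        "prep_seconds_sum": IMPORT_SECONDS + INDEX_SECONDS,
        "docs_per_second": DOCUMENT_COUNT / IMPORT_SECONDS,
        # Assumes decimal units (1 GB = 1000 MB).
        "raw_mb_per_second": RAW_SIZE_GB * 1000 / IMPORT_SECONDS,
        "stored_to_raw_ratio": STORED_SIZE_GB / RAW_SIZE_GB,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.3f}")
```

The import and index creation times sum to 438.515 s, which is 1 ms less than the reported total; the difference is presumably due to rounding of the individual timings.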

Performance details: The following table shows the average response time in milliseconds for each query. The full-text search queries (term and phrase) and both aggregate queries average less than 10 ms. The scroll query is the slowest, at 49 ms.

Query name	Average time (ms)
`default` (match_all)	6
`term`	7
`phrase`	7
`articles_monthly_agg_cached`	7
`articles_monthly_agg_uncached`	9
`scroll`	49
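
As a sanity check, the per-query averages can be related back to the total query time in the results overview. The following sketch assumes that the total is each query's average multiplied by 100 runs and then summed. The averages in the table are rounded to whole milliseconds, so only approximate agreement is expected.

```python
# Average response times in milliseconds, copied from the table above.
AVERAGE_MS = {
    "default": 6,
    "term": 7,
    "phrase": 7,
    "articles_monthly_agg_cached": 7,
    "articles_monthly_agg_uncached": 9,
    "scroll": 49,
}
RUNS_PER_QUERY = 100
REPORTED_TOTAL_SECONDS = 8.522


def estimated_total_seconds(average_ms: dict, runs: int) -> float:
    """Estimate total query time as sum(average * runs), converted to seconds."""
    return sum(average_ms.values()) * runs / 1000


if __name__ == "__main__":
    estimate = estimated_total_seconds(AVERAGE_MS, RUNS_PER_QUERY)
    print(f"estimated: {estimate:.3f} s, reported: {REPORTED_TOTAL_SECONDS} s")
```

The estimate is 8.5 s, which is close to the reported 8.522 s.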

Appendix: Hologres table and index creation

Create the test table in Hologres

-- Create a new table group with 6 shards.
CALL HG_CREATE_TABLE_GROUP ('tg_6', 6);

-- Create the core table.
CREATE TABLE pmc (
  id BIGINT PRIMARY KEY,
  name TEXT,
  journal TEXT,
  "date" TEXT,
  volume TEXT,
  issue TEXT,
  accession TEXT,
  "timestamp" timestamptz NOT NULL,
  pmid INTEGER,
  body TEXT
) WITH (
  table_group = 'tg_6',
  bitmap_columns = 'journal,"date",volume,issue,accession',
  segment_key = '"timestamp"',
  clustering_key = '"timestamp"',
  distribution_key = 'id'
);

Create full-text indexes in Hologres: Create inverted indexes on multiple fields.

-- Create full-text indexes.
CREATE INDEX pmc_accession_idx ON public.pmc USING fulltext(accession)
WITH (
  tokenizer = 'keyword',
  analyzer_params = '{"tokenizer":{"type":"keyword"}}'
);

CREATE INDEX pmc_body_idx ON public.pmc USING fulltext(body)
WITH (
  tokenizer = 'standard',
  analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
);

CREATE INDEX pmc_date_idx ON public.pmc USING fulltext(date)
WITH (
  tokenizer = 'standard',
  analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
);

CREATE INDEX pmc_issue_idx ON public.pmc USING fulltext(issue)
WITH (
  tokenizer = 'standard',
  analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
);

CREATE INDEX pmc_journal_idx ON public.pmc USING fulltext(journal)
WITH (
  tokenizer = 'standard',
  analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
);

CREATE INDEX pmc_name_idx ON public.pmc USING fulltext(name)
WITH (
  tokenizer = 'keyword',
  analyzer_params = '{"tokenizer":{"type":"keyword"}}'
);

CREATE INDEX pmc_volume_idx ON public.pmc USING fulltext(volume)
WITH (
  tokenizer = 'standard',
  analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
);

-- Perform a full index build.
VACUUM pmc;
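
The seven index statements above differ only in the column and the tokenizer. The following sketch generates equivalent statements from a column-to-tokenizer mapping. The helper name and the mapping structure are illustrative, and the output is plain SQL text to review and run with psql; the benchmark tool is not known to work this way.

```python
import json

# analyzer_params per tokenizer, matching the statements in this appendix.
ANALYZER_PARAMS = {
    "keyword": {"tokenizer": {"type": "keyword"}},
    "standard": {
        "filter": ["lowercase"],
        "tokenizer": {"max_token_length": 255, "type": "standard"},
    },
}

# Column -> tokenizer, as used in this appendix.
COLUMN_TOKENIZERS = {
    "accession": "keyword",
    "body": "standard",
    "date": "standard",
    "issue": "standard",
    "journal": "standard",
    "name": "keyword",
    "volume": "standard",
}


def build_index_sql(table: str, column: str, tokenizer: str) -> str:
    """Render one CREATE INDEX ... USING fulltext statement as text."""
    params = json.dumps(ANALYZER_PARAMS[tokenizer], separators=(",", ":"))
    return (
        f"CREATE INDEX {table}_{column}_idx ON public.{table} USING fulltext({column})\n"
        f"WITH (\n"
        f"  tokenizer = '{tokenizer}',\n"
        f"  analyzer_params = '{params}'\n"
        f");"
    )


if __name__ == "__main__":
    for column, tokenizer in COLUMN_TOKENIZERS.items():
        print(build_index_sql("pmc", column, tokenizer))
        print()
```

Keeping the mapping in one place makes it easier to see at a glance which columns use the keyword tokenizer (accession and name) and which use the standard tokenizer.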