Hologres: Full-text search performance test on the PMC dataset

Last Updated: May 07, 2026

Hologres has supported inverted indexes since V4.0, enabling high-performance full-text search. This article describes how to run a full-text search performance test on Hologres using the PMC dataset and presents the results.

The pmc dataset originates from PubMed Central (PMC), a repository of scientific papers. It contains the full text of approximately 574,000 academic papers, with a raw data size of 23.3 GB (5.9 GB compressed). Each record includes fields such as name (article identifier), journal (journal name), date (publication date), volume (volume number), issue (issue number), accession (PMC number), timestamp, pmid (PubMed ID), and body (full text of the paper). This dataset is widely used as a benchmark to evaluate the performance of search engines and databases in full-text search and academic literature analysis.

Prepare the test environment

Test resources:

  • Hologres:

    • Compute resources: 48 CU

    • Version: V4.1.12

    • Number of shards: 6. If you add compute nodes, increase the number of shards proportionally.

  • ECS:

    • Instance type: ecs.c9i.16xlarge or ecs.g9i.16xlarge

    • Operating system: Debian 13.2 64-bit

Environment setup:

  • Prepare a Hologres instance

  • Prepare an ECS instance

    • Purchase an ECS instance.

    • Install dependencies

      # Update the apt cache
      sudo apt update
      # Install the PostgreSQL client to connect to the database,
      # plus the tools used later to download the dataset and clone the benchmark tool
      sudo apt install -y postgresql-client wget bzip2 git
    • Prepare the dataset: Download and decompress the pmc dataset from the official source:

      mkdir -p ~/data && cd ~/data
      wget http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/pmc/documents.json.bz2
      bunzip2 -k documents.json.bz2
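To get a feel for the record layout before importing, the decompressed file can be inspected with a few lines of Python. This is a minimal sketch that assumes documents.json is newline-delimited JSON (one record per line), which the text above does not state. The helper name iter_records is illustrative and not part of any tool mentioned here.

```python
import itertools
import json
from pathlib import Path


def iter_records(path):
    """Yield one dict per non-empty line of a newline-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed location: the directory used in the download step.
    data_file = Path.home() / "data" / "documents.json"
    if data_file.exists():
        for record in itertools.islice(iter_records(data_file), 3):
            # Print only the field names and the body length,
            # because the full text of a paper can be very long.
            print(sorted(record), len(record.get("body", "")))
```

Reading lazily with a generator keeps memory use flat, which matters for a 23.3 GB file.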

Run the performance test

An open-source benchmark tool developed by Hologres automates the performance test process described in this article, including table creation, data import, and index creation. For details about the tool, see the Git project alibabacloud-hologres-benchmark. For a table creation example, see the appendix.

  • Install the benchmark tool

    # Create an isolated environment
    sudo apt install -y python3-venv
    python3 -m venv .venv
    
    # Activate the isolated environment
    source .venv/bin/activate
    python3 -m pip install -U pip
    
    # Install dependencies
    git clone https://github.com/aliyun/alibabacloud-hologres-benchmark.git
    cd alibabacloud-hologres-benchmark/fulltext_search/pmc
    pip3 install -r requirements.txt
  • Modify the configuration file config.json, replacing each placeholder with the connection details of your Hologres instance

    {
      "host": "<hologres_endpoint>",
      "port": <hologres_port>,
      "database": "<database_name>",
      "username": "<user_name>",
      "password": "<password>",
      "table_name": "pmc"
    }
  • Run the benchmark script

    # Skip this step if you are still in this directory from the installation step.
    cd alibabacloud-hologres-benchmark/fulltext_search/pmc
    
    # This script runs the full benchmark process, including data import and queries. 
    # If the data already exists, the import step is skipped.
    python3 hologres_benchmark.py \
        --config config.json \
        --queries-config benchmark_queries.yaml \
        --data-dir ~/data
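A placeholder left in the configuration file is an easy way to lose time before the import even starts. The following minimal sketch checks the file before the benchmark is launched. It assumes the file is the config.json passed to --config above and contains the six keys shown in the configuration step. The function name load_config and the validation rules are illustrative and are not part of the benchmark tool.

```python
import json

# Keys taken from the configuration example in this article.
REQUIRED_KEYS = ("host", "port", "database", "username", "password", "table_name")


def load_config(path):
    """Load the benchmark configuration and fail early on obvious mistakes."""
    with open(path, "r", encoding="utf-8") as fh:
        # json.load raises an error if an unquoted placeholder
        # such as <hologres_port> is still in the file.
        config = json.load(fh)

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(config["port"], bool) or not isinstance(config["port"], int):
        raise ValueError("port must be an integer")

    # Detect quoted placeholders such as "<hologres_endpoint>".
    unfilled = [
        key for key in REQUIRED_KEYS
        if isinstance(config[key], str)
        and config[key].startswith("<")
        and config[key].endswith(">")
    ]
    if unfilled:
        raise ValueError("placeholders not replaced: " + ", ".join(unfilled))
    return config
```

Calling load_config("config.json") from the tool directory returns the parsed dictionary or raises an error that names the offending keys.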

Test results

Results overview

Metric                            Unit    Hologres result
Data import time                  s       135.361
Index creation time               s       303.154
Total data preparation time       s       438.516
Data + index storage              GB      16
Total query time (100 runs)       s       8.522

Note: Total query time is the time taken to run six different queries 100 consecutive times each.
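The figures in the results overview can be turned into a few derived numbers with simple arithmetic. The sketch below uses only values stated in this article. The record count of 574,000 is the approximate figure from the dataset description, so the throughput is an estimate. The function name summarize is illustrative.

```python
# Figures copied from the results overview and the dataset description.
IMPORT_SECONDS = 135.361
INDEX_SECONDS = 303.154
REPORTED_TOTAL_SECONDS = 438.516
DOCUMENTS = 574_000      # approximate record count
RAW_SIZE_GB = 23.3       # uncompressed dataset size
STORED_SIZE_GB = 16      # data plus index storage in Hologres


def summarize():
    """Return derived metrics for the data preparation phase."""
    prep = IMPORT_SECONDS + INDEX_SECONDS
    return {
        # Import plus index creation, rounded like the table values.
        "prep_seconds": round(prep, 3),
        # Difference between the reported total and the sum of its parts.
        "rounding_gap_seconds": round(REPORTED_TOTAL_SECONDS - prep, 3),
        "docs_per_second": DOCUMENTS / IMPORT_SECONDS,
        "raw_gb_per_second": RAW_SIZE_GB / IMPORT_SECONDS,
        "storage_ratio": STORED_SIZE_GB / RAW_SIZE_GB,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

The two phases sum to 438.515 s, which is within 0.001 s of the reported total. The gap is consistent with rounding of the individual timings. The import rate works out to roughly 4,200 records per second, and the stored data plus indexes occupy about 69% of the raw data size.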

Performance details: The following table shows the average response time in milliseconds for each query. Every query except scroll averages less than 10 ms, covering both full-text search queries (such as term and phrase) and aggregate analytics queries. The scroll query averages 49 ms.

Query name                        Average time (ms)
default (match_all)               6
term                              7
phrase                            7
articles_monthly_agg_cached       7
articles_monthly_agg_uncached     9
scroll                            49
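The per-query averages can be cross-checked against the total query time of 8.522 s reported in the results overview. The sketch below multiplies each average by the 100 runs and sums the results. The averages are rounded to whole milliseconds, so the estimate is only expected to be close to the reported total. The function names are illustrative.

```python
# Average response times in milliseconds, copied from the table above.
AVERAGE_MS = {
    "default (match_all)": 6,
    "term": 7,
    "phrase": 7,
    "articles_monthly_agg_cached": 7,
    "articles_monthly_agg_uncached": 9,
    "scroll": 49,
}
RUNS_PER_QUERY = 100
REPORTED_TOTAL_SECONDS = 8.522


def estimated_total_seconds(average_ms, runs):
    """Estimate total wall time from per-query averages and run count."""
    return sum(average_ms.values()) * runs / 1000


def share_of_total(average_ms, query_name):
    """Fraction of the summed average time spent in one query."""
    return average_ms[query_name] / sum(average_ms.values())


if __name__ == "__main__":
    estimate = estimated_total_seconds(AVERAGE_MS, RUNS_PER_QUERY)
    print(f"estimated total: {estimate} s, reported: {REPORTED_TOTAL_SECONDS} s")
    print(f"scroll share: {share_of_total(AVERAGE_MS, 'scroll'):.0%}")
```

The estimate is 8.5 s, which agrees with the reported 8.522 s to within rounding. The scroll query accounts for more than half of the total query time.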

Appendix: Hologres table and index creation

  • Create the test table in Hologres

    -- Create a new table group with 6 shards.
    CALL HG_CREATE_TABLE_GROUP ('tg_6', 6);
    
    -- Create the core table.
    CREATE TABLE pmc (
      id BIGINT PRIMARY KEY,
      name TEXT,
      journal TEXT,
      "date" TEXT,
      volume TEXT,
      issue TEXT,
      accession TEXT,
      "timestamp" timestamptz NOT NULL,
      pmid INTEGER,
      body TEXT
    ) WITH (
      table_group = 'tg_6',
      bitmap_columns = 'journal,"date",volume,issue,accession',
      segment_key = '"timestamp"',
      clustering_key = '"timestamp"',
      distribution_key = 'id'
    );
  • Create full-text indexes in Hologres: Create inverted indexes on multiple fields.

    -- Create full-text indexes.
    CREATE INDEX pmc_accession_idx ON public.pmc USING fulltext(accession)
     WITH (
    tokenizer = 'keyword',
    analyzer_params = '{"tokenizer":{"type":"keyword"}}'
    );
    
    CREATE INDEX pmc_body_idx ON public.pmc USING fulltext(body)
     WITH (
    tokenizer = 'standard',
    analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
    );
    
    CREATE INDEX pmc_date_idx ON public.pmc USING fulltext(date)
     WITH (
    tokenizer = 'standard',
    analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
    );
    
    CREATE INDEX pmc_issue_idx ON public.pmc USING fulltext(issue)
     WITH (
    tokenizer = 'standard',
    analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
    );
    
    CREATE INDEX pmc_journal_idx ON public.pmc USING fulltext(journal)
     WITH (
    tokenizer = 'standard',
    analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
    );
    
    CREATE INDEX pmc_name_idx ON public.pmc USING fulltext(name)
     WITH (
    tokenizer = 'keyword',
    analyzer_params = '{"tokenizer":{"type":"keyword"}}'
    );
    
    CREATE INDEX pmc_volume_idx ON public.pmc USING fulltext(volume)
     WITH (
    tokenizer = 'standard',
    analyzer_params = '{"filter":["lowercase"],"tokenizer":{"max_token_length":255,"type":"standard"}}'
    );
    
    -- Perform a full index build.
    VACUUM pmc;
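The seven CREATE INDEX statements above differ only in the column name and the tokenizer, so they can be generated from a small mapping. The following sketch reproduces the statements in this appendix as strings. It does not connect to Hologres, and it is not necessarily how the benchmark tool builds them. The names INDEXED_COLUMNS, ANALYZERS, and create_index_sql are illustrative.

```python
import json

# Analyzer settings per tokenizer, copied from the statements above.
ANALYZERS = {
    "keyword": {"tokenizer": {"type": "keyword"}},
    "standard": {
        "filter": ["lowercase"],
        "tokenizer": {"max_token_length": 255, "type": "standard"},
    },
}

# Column-to-tokenizer mapping used in this appendix.
INDEXED_COLUMNS = {
    "accession": "keyword",
    "body": "standard",
    "date": "standard",
    "issue": "standard",
    "journal": "standard",
    "name": "keyword",
    "volume": "standard",
}


def create_index_sql(table, column, tokenizer):
    """Build one CREATE INDEX statement in the form shown in this appendix."""
    # Compact separators and sorted keys reproduce the JSON text used above.
    params = json.dumps(ANALYZERS[tokenizer], separators=(",", ":"), sort_keys=True)
    return (
        f"CREATE INDEX {table}_{column}_idx ON public.{table} USING fulltext({column})\n"
        f" WITH (\n"
        f"tokenizer = '{tokenizer}',\n"
        f"analyzer_params = '{params}'\n"
        f");"
    )


if __name__ == "__main__":
    for column, tokenizer in INDEXED_COLUMNS.items():
        print(create_index_sql("pmc", column, tokenizer))
        print()
```

Keeping the mapping in one place makes it easier to see at a glance that accession and name are indexed as whole keywords, while the remaining columns use the standard tokenizer with lowercasing.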