Full-text search index performance test based on http_logs - Hologres

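Starting from V4.0, Hologres supports full-text inverted indexes, which enable high-performance full-text search. This topic describes the method and results of a full-text search index performance test that Hologres ran on the http_logs dataset.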

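The http_logs dataset is derived from the server access logs of the official 1998 World Cup website. It contains 247 million records, and the raw data is about 32 GB in size. Each record contains fields such as @timestamp (timestamp), clientip (client IP address), request (HTTP request), status (status code), and size (response size). The dataset is widely used as a benchmark for evaluating the full-text search and analytics performance of search engines and databases.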

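Test environment preparation

Test resources:

Hologres:
- Computing resources: 48 CU
- Version: V4.1.6
- Number of shards: 6. If you add compute nodes, we recommend increasing the number of shards linearly to match.
ECS:
- Instance type: ecs.c9i.16xlarge or ecs.g9i.16xlarge
- Operating system: Debian 13.2 64-bit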

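Environment preparation:

Prepare the Hologres instance
- Purchase a Hologres instance of V4.1 and create a database.
- Create a user. For more information, see User management.

Prepare the ECS instance

Purchase an ECS instance.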

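Install dependencies

# Update the apt package index
sudo apt update
# Install the PostgreSQL client, which is used to connect to the database
sudo apt install -y postgresql-client

Dataset preparation: download the http_logs dataset from the official source and decompress it: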

mkdir ~/data && cd ~/data
wget https://rally-tracks.elastic.co/http_logs/documents-181998.json.bz2 && bunzip2 documents-181998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-191998.json.bz2 && bunzip2 documents-191998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-201998.json.bz2 && bunzip2 documents-201998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-211998.json.bz2 && bunzip2 documents-211998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-221998.json.bz2 && bunzip2 documents-221998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-231998.json.bz2 && bunzip2 documents-231998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-241998.json.bz2 && bunzip2 documents-241998.json.bz2
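The seven download commands above differ only in the file suffix, so they can also be driven from a loop. The following is a minimal Python sketch of that idea, not part of the benchmark tooling: the function names `http_logs_urls` and `download_and_decompress` are hypothetical, and the sketch uses only the standard library in place of wget and bunzip2.

```python
import bz2
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://rally-tracks.elastic.co/http_logs"
# File suffixes taken from the wget commands above.
SUFFIXES = ["181998", "191998", "201998", "211998", "221998", "231998", "241998"]


def http_logs_urls():
    """Return the download URLs of the seven compressed files."""
    return [f"{BASE_URL}/documents-{suffix}.json.bz2" for suffix in SUFFIXES]


def download_and_decompress(url, data_dir):
    """Download one .bz2 archive and write the decompressed JSON next to it."""
    data_dir = Path(data_dir).expanduser()
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / url.rsplit("/", 1)[1]
    target = archive.with_suffix("")  # drops the trailing .bz2
    urllib.request.urlretrieve(url, archive)
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    archive.unlink()  # like bunzip2, remove the archive once it is decompressed
    return target


if __name__ == "__main__":
    for url in http_logs_urls():
        print(download_and_decompress(url, "~/data"))
```

Streaming the decompression with `shutil.copyfileobj` keeps memory use flat, which matters because the raw data is about 32 GB in total.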

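Performance test

In this topic, the entire test process, including Hologres table creation, data import, and index building, is carried out by an open-source test loader developed by the Hologres team, so no manual processing is required. For details about the loader, see the Git project alibabacloud-hologres-benchmark. For a table creation example, see the appendix.

Install the test loader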

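# Create an isolated environment
sudo apt install -y python3-venv
python3 -m venv .venv

# Activate the isolated environment
source .venv/bin/activate
python3 -m pip install -U pip

# Clone the repository and install its dependencies
git clone https://github.com/aliyun/alibabacloud-hologres-benchmark
cd alibabacloud-hologres-benchmark/fulltext_search/http_logs
pip3 install -r requirements.txt

Edit the configuration file (the run command passes it as config.json)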

{
  "host": "<hologres_endpoint>",
  "port": <hologres_port>,
  "database": "<database_name>",
  "username": "<user_name>",
  "password": "<password>",
  "table_name": "http_logs"
}
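Before starting a long run, it can help to check that the configuration file is complete. The sketch below is an illustration only and is not the loader's own code: `load_config` and `to_libpq_env` are hypothetical helpers, and the key list is simply the set of keys shown in the configuration above. The mapping uses the standard libpq environment variable names.

```python
import json

# Keys shown in the configuration file above.
REQUIRED_KEYS = ("host", "port", "database", "username", "password", "table_name")


def load_config(text):
    """Parse the JSON config and fail early if a key is missing."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing config keys: {missing}")
    if not isinstance(config["port"], int):
        # The template leaves the port unquoted, so it must be a JSON number.
        raise ValueError("port must be a JSON number, not a string")
    return config


def to_libpq_env(config):
    """Map the config keys to the standard libpq environment variables."""
    return {
        "PGHOST": config["host"],
        "PGPORT": str(config["port"]),
        "PGDATABASE": config["database"],
        "PGUSER": config["username"],
        "PGPASSWORD": config["password"],
    }


if __name__ == "__main__":
    with open("config.json", encoding="utf-8") as f:
        print(sorted(to_libpq_env(load_config(f.read()))))
```

Note that the port placeholder in the template is not quoted, so the value you fill in must be a bare number for the file to remain valid JSON.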

Run the test script

cd alibabacloud-hologres-benchmark/fulltext_search/http_logs

# Runs the full workflow, including data import and the query benchmark; if the data already exists, the import step is skipped
python3 hologres_benchmark.py \
    --config config.json \
    --queries-config benchmark_queries.yaml \
    --data-dir ~/data

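Test results

Results overview

Metric	Unit	Hologres result
Data import time	Seconds	203.583
Data and index storage	GB	6.105
Total query time	Seconds	36.392

Note: The total query time is the total time taken to run each of the 20 queries 10 times consecutively.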
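The measurement scheme in the note (each query run 10 times in a row, with a per-query average and an overall total) can be sketched as follows. This is an illustration under stated assumptions, not the loader's implementation: `run_benchmark` is a hypothetical function, and `run_query` stands for any callable that executes one SQL statement against the database.

```python
import time


def run_benchmark(queries, run_query, repeats=10):
    """Run each query `repeats` times in a row.

    Returns (average milliseconds per query name, total seconds overall).
    """
    averages_ms = {}
    total_s = 0.0
    for name, sql in queries.items():
        elapsed = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)
            elapsed.append(time.perf_counter() - start)
        averages_ms[name] = 1000 * sum(elapsed) / repeats
        total_s += sum(elapsed)
    return averages_ms, total_s


if __name__ == "__main__":
    # Stand-in runner that does nothing; replace it with a real database call.
    averages, total = run_benchmark({"demo": "SELECT 1"}, lambda sql: None)
    print(averages, total)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.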

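Performance details

The following table shows the average response time of each query, in milliseconds. The simplest queries, such as term, range, and default, average 7 to 12 milliseconds. The complex aggregation (hourly_agg) and most sort queries respond within a few hundred milliseconds; the two slowest queries, sort_size_asc (727 ms) and sort_status_asc (1442 ms), take longer.

Query name	Average time (ms)	Query name	Average time (ms)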
`sort_status_asc`	1442	`desc_sort_timestamp`	37
`sort_size_asc`	727	`desc_sort_timestamp_can_match_shortcut`	35
`sort_numeric_no_can_match_shortcut`	251	`desc_sort_timestamp_no_can_match_shortcut`	33
`terms_enum`	251	`term`	12
`sort_numeric_can_match_shortcut`	240	`range`	11
`hourly_agg`	197	`200s-in-range`	10
`sort_size_desc`	139	`400s-in-range`	9
`sort_status_desc`	103	`asc_sort_with_after_timestamp`	9
`desc_sort_with_after_timestamp`	63	`default`	7
`scroll`	40	`asc_sort_timestamp`	7
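As a cross-check, the per-query averages can be related to the reported total query time: the total covers 20 queries run 10 times each, so it should be close to 10 times the sum of the averages. The sketch below does that arithmetic with the values from the table; the names `AVERAGE_MS` and `estimated_total_seconds` are introduced here for illustration only.

```python
# Average response times in milliseconds, copied from the table above.
AVERAGE_MS = {
    "sort_status_asc": 1442, "sort_size_asc": 727,
    "sort_numeric_no_can_match_shortcut": 251, "terms_enum": 251,
    "sort_numeric_can_match_shortcut": 240, "hourly_agg": 197,
    "sort_size_desc": 139, "sort_status_desc": 103,
    "desc_sort_with_after_timestamp": 63, "scroll": 40,
    "desc_sort_timestamp": 37, "desc_sort_timestamp_can_match_shortcut": 35,
    "desc_sort_timestamp_no_can_match_shortcut": 33, "term": 12, "range": 11,
    "200s-in-range": 10, "400s-in-range": 9,
    "asc_sort_with_after_timestamp": 9, "default": 7, "asc_sort_timestamp": 7,
}
RUNS_PER_QUERY = 10


def estimated_total_seconds(average_ms, runs):
    """Estimate the total query time from per-query averages."""
    return sum(average_ms.values()) * runs / 1000


if __name__ == "__main__":
    print(len(AVERAGE_MS))                                       # 20
    print(sum(AVERAGE_MS.values()))                              # 3623
    print(estimated_total_seconds(AVERAGE_MS, RUNS_PER_QUERY))   # 36.23
```

The estimate of 36.23 seconds is close to the reported 36.392 seconds. The source does not explain the remaining gap; rounding of the per-query averages to whole milliseconds may account for part of it.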

Appendix: Hologres table creation and index building

Create the test table in Hologres

-- Create a new table group with 6 shards
CALL HG_CREATE_TABLE_GROUP ('tg_6', 6);

-- Create the main table
CREATE TABLE http_logs (
  id BIGINT,
  "@timestamp" BIGINT NOT NULL,
  clientip TEXT,
  request TEXT,
  status INTEGER,
  size INTEGER
) WITH (
  table_group = 'tg_6',          -- Specify the table group
  bitmap_columns = 'status',     -- Build a bitmap index on the status column to speed up equality and range queries
  segment_key = '"@timestamp"',  -- Segment by timestamp to make time-range queries more efficient
  clustering_key = '"@timestamp"'-- Cluster storage by timestamp to further optimize range scans
);

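Convert the raw data file format on ECS: Hologres uses the standard COPY protocol for high-speed data import. Because the raw data is in NDJSON format, we recommend converting it to CSV before importing it into Hologres.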
```
python3 ndjson_to_csv.py ~/data ~/csv
```
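The conversion script itself is not reproduced in this topic. As a rough idea of what such a conversion involves, the following sketch turns NDJSON lines into CSV rows in the column order of the http_logs table. It is an assumption-laden illustration and may differ from the real ndjson_to_csv.py: the function name, the running `id` generated for the first column, and the sample record are all hypothetical.

```python
import csv
import io
import json

# Source fields, in the order of the table columns that follow id.
COLUMNS = ["@timestamp", "clientip", "request", "status", "size"]


def ndjson_to_csv(lines, out, start_id=1):
    """Write one CSV row per non-empty NDJSON line; return the next unused id."""
    writer = csv.writer(out, lineterminator="\n")
    next_id = start_id
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: id is a running number, because the raw records have no id field.
        writer.writerow([next_id] + [record.get(col, "") for col in COLUMNS])
        next_id += 1
    return next_id


if __name__ == "__main__":
    # Illustrative record, not copied from the dataset.
    sample = ['{"@timestamp": 898459201, "clientip": "10.0.0.1", '
              '"request": "GET /index.html HTTP/1.0", "status": 200, "size": 1024}']
    buffer = io.StringIO()
    ndjson_to_csv(sample, buffer)
    print(buffer.getvalue(), end="")
```

Using the `csv` module rather than joining strings by hand ensures that request values containing commas or quotes are quoted correctly for `COPY ... WITH (FORMAT CSV)`.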

After the conversion, import the data on ECS with the psql COPY command

# Set environment variables
export PGHOST=<hologres_endpoint>
export PGPORT=<hologres_port>
export PGUSER=<user_name>
export PGPASSWORD='<password>'
export PGDATABASE=<database_name>

# Import the data into Hologres with COPY
cd ~/csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-181998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-191998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-201998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-211998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-221998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-231998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-241998.csv
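The seven COPY commands above follow one pattern, so they can be wrapped in a loop that stops at the first failure. The sketch below is one way to do this from Python by calling psql as a subprocess; `copy_command` and `import_csv_files` are hypothetical names, and the connection settings are assumed to come from the PG* environment variables set above.

```python
import subprocess
from pathlib import Path


def copy_command(table):
    """Build the psql invocation used above for one CSV file."""
    return ["psql", "-c", f"COPY {table} FROM STDIN WITH (FORMAT CSV)"]


def import_csv_files(csv_dir, table="http_logs"):
    """Feed every documents-*.csv file to psql, in name order."""
    files = sorted(Path(csv_dir).expanduser().glob("documents-*.csv"))
    for path in files:
        with open(path, "rb") as csv_file:
            # check=True raises CalledProcessError if psql exits with an error.
            subprocess.run(copy_command(table), stdin=csv_file, check=True)
    return files


if __name__ == "__main__":
    for imported in import_csv_files("~/csv"):
        print(f"imported {imported.name}")
```

Passing the open file as `stdin` mirrors the shell redirection `< documents-181998.csv`, so psql streams the file without loading it into Python's memory.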

Build the full-text index in Hologres: create a full-text inverted index on the request column

-- Create the full-text index
CREATE INDEX http_logs_request_idx
  ON http_logs
  USING FULLTEXT (request)
  WITH (tokenizer = 'keyword');

-- Run a full build of the index
VACUUM http_logs;
