HGraph Index User Guide - Hologres

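Usage notes

Hologres supports the HGraph vector search algorithm starting from V4.0.
Only column-oriented tables and row-column hybrid tables support vector indexes; row-oriented tables do not.
If a table with a vector index goes through drop-and-recreate operations such as Insert Overwrite, enabling the recycle bin is not recommended for now, because tables in the recycle bin still occupy some memory.
After a vector index is created, the index files are built during the Compaction that follows data import.
Data in the memory table (Mem Table) has no vector index; a vector search request computes distances for that portion of the data by brute force.
Serverless Computing resources are recommended for bulk data import, because they complete Compaction and vector index building synchronously during the import. See "Use Serverless Computing to run read and write tasks" and "Use Serverless Computing to run Compaction tasks".
If you do not use Serverless resources, run the following command manually after a bulk import or an index modification to trigger Compaction. See "Compaction (Beta)".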

SELECT hologres.hg_full_compact_table('<SCHEMA_NAME>.<TABLE_NAME>', 'max_file_size_mb=4096');

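Vector search queries can also run on Serverless Computing resources.

Manage vector indexes

Create an index

Syntax: create the vector index when you create the table.

Note: In Hologres, a vector is represented as a float4 array, and the vector dimension is the length of the one-dimensional array, that is, array_length in the syntax below.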

CREATE TABLE <TABLE_NAME> (
    <VECTOR_COLUMN_NAME> float4[] CHECK (array_ndims(<VECTOR_COLUMN_NAME>) = 1 AND array_length(<VECTOR_COLUMN_NAME>, 1) = <DIM>)
)
WITH (
    vectors = '{
    "<VECTOR_COLUMN_NAME>": {
        "algorithm": "<ALGORITHM>",
        "distance_method": "<DISTANCE_METHOD>",
        "builder_params": {
            "<BUILDER_PARAMETERS_NAME>": <VALUE>
            [, ...]
        }
    }
    [ , "<VECTOR_COLUMN_NAME_2>": { ... } ]
  }'
);

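Parameters:

Parameter	Description
table_name	Name of the target table.
vector_column_name	Name of the target vector column.
dim	Vector dimension of the target column.

The value of the vector index parameter vectors must meet the following requirements:

Only a JSON-formatted string is supported. The top level accepts only vector_column_name keys, each naming a vector column on which to build a vector index.
The value of each vector_column_name key is a JSON object that configures the vector index. It supports the following keys.

Key	Description
algorithm	Vector index algorithm. Required. Only HGraph is supported.
distance_method	Vector distance calculation method. Required. Valid values: Euclidean (Euclidean distance; ascending order only, that is, ORDER BY distance ASC); InnerProduct (inner product distance; descending order only, that is, ORDER BY distance DESC); Cosine (cosine distance; descending order only, that is, ORDER BY distance DESC). Note: the distance function used in a vector search must correspond to the distance method of the vector index and must satisfy the matching sort order; otherwise the vector index cannot be used.
builder_params	Vector index build parameters. Only JSON format is supported. The parameters are described below: max_degree, ef_construction, base_quantization_type, use_reorder, precise_quantization_type, precise_io_type.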
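As a hedged illustration of the pairing rule in the distance_method row, the sketch below shows one query per distance method. It assumes a hypothetical table feature_tb with an id column and a 4-dimensional feature column; each query applies only when the index on feature was built with the distance_method named in its comment. The function names are the approximate search functions listed in the distance function table of this guide.

```sql
-- Index built with "distance_method": "Euclidean": sort ascending.
SELECT id, approx_euclidean_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance ASC
LIMIT 10;

-- Index built with "distance_method": "InnerProduct": sort descending.
SELECT id, approx_inner_product_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance DESC
LIMIT 10;

-- Index built with "distance_method": "Cosine": sort descending.
SELECT id, approx_cosine_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance DESC
LIMIT 10;
```

A query that mixes a function with a different distance_method, or that sorts in the other direction, would still run but would not be served by the vector index.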

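The index build parameter builder_params supports the following parameters:

Parameter	Description
max_degree	During index construction, each vertex tries to connect to its `max_degree` nearest vertices. Optional; default 64. A larger value widens the search range of each vertex and improves search efficiency, but raises graph construction and storage costs. Values above 96 are generally not recommended.
ef_construction	Controls the search depth during index construction. Optional; default 400. A larger value considers more candidates as nearest neighbors of a vertex during construction and improves index precision, but increases build time and computational cost. Values above 600 are generally not recommended.
base_quantization_type	Quantization method for the HGraph low-precision index. Required. Supported methods: sq8, sq8_uniform, fp16, fp32, rabitq.
use_reorder	Whether to use the HGraph high-precision index. Optional; default FALSE.
precise_quantization_type	Quantization method for the HGraph high-precision index. Optional; takes effect only when use_reorder is TRUE. Default fp32; changing it is not recommended. Supported methods (choose one with higher precision than base_quantization_type): sq8, sq8_uniform, fp16, fp32.
precise_io_type	Storage medium for the HGraph hybrid (high-precision + low-precision) index. Optional; takes effect only when use_reorder is TRUE. Default block_memory_io. Valid values: block_memory_io (both the low-precision and high-precision indexes are stored in memory); reader_io (the low-precision index is stored in memory and the high-precision index on disk).
builder_thread_count	Optional; default 4. Controls the number of threads that build the vector index during writes. It usually needs no adjustment: raising it can drive CPU usage too high, so changing it is not recommended in general. Changing this parameter does not trigger an index rebuild.
graph_storage_type	Optional; default flat. Controls whether the in-memory graph index is compressed. Valid values: flat (default; the graph index is not compressed); compressed (the graph index is compressed, which can save 50% of memory while reducing peak QPS by only about 5%). Note: this parameter is supported from Hologres V4.0.10.
extra_columns	Attaches column data to the vector index. Supported from V4.1.1. Optional. Only INT, BIGINT, and SMALLINT columns are supported. At search time the column values are read directly from the index, without reading the corresponding column of the target table, which improves vector search performance. Example setting: `"extra_columns": "id"`

Modify an index

Syntax: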

ALTER TABLE <TABLE_NAME>
SET (
    vectors = '{
    "<VECTOR_COLUMN_NAME>": {
        "algorithm": "<ALGORITHM>",
        "distance_method": "<DISTANCE_METHOD>",
        "builder_params": {
            "<BUILDER_PARAMETERS_NAME>": <VALUE>
            [, ...]
        }
    }
  }'
);
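For illustration only, a filled-in version of the template might look like the following. The table name feature_tb, the column name feature, and the parameter values are assumptions chosen for the sketch; here ef_construction is raised from its default of 400 to 500 while the other parameters use values that appear elsewhere in this guide.

```sql
-- Hypothetical table feature_tb with a float4[] column named feature.
ALTER TABLE feature_tb
SET (
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "max_degree": 64,
            "ef_construction": 500
        }
    }
  }'
);
```

Because the whole vectors value is replaced, the statement restates every key that should remain in effect, not only the one being changed.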

Delete an index

-- Delete the vector indexes on all columns of the table
ALTER TABLE <TABLE_NAME>
SET (
    vectors = '{}'
);

-- If both col1 and col2 have vector indexes and you want to delete the index on col2, run ALTER TABLE and keep only the index on col1
ALTER TABLE <TABLE_NAME>
SET (
    vectors = '{
    "col1": { ... }
  }'
);

View indexes

Hologres provides the hologres.hg_table_properties system table, which you can query to view the vector indexes that have been created.

SELECT
    *
FROM
    hologres.hg_table_properties
WHERE 
    table_name = '<TABLE_NAME>'
    AND property_key = 'vectors';
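A narrower variant of this lookup is sketched below. It assumes a hypothetical table public.feature_tb and assumes that hg_table_properties exposes table_namespace and property_value columns alongside the table_name and property_key columns used above; verify the column names against your instance before relying on them.

```sql
-- Return only the stored vectors JSON for public.feature_tb (hypothetical table).
SELECT
    property_value
FROM
    hologres.hg_table_properties
WHERE
    table_namespace = 'public'
    AND table_name = 'feature_tb'
    AND property_key = 'vectors';
```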

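Vector search with vector indexes

Vector distance functions

Hologres vector search supports approximate search and exact search. Only the approximate search functions can use an existing vector index to accelerate queries (the function must correspond to the distance_method of the vector index); the exact search functions cannot use a vector index.

Note: vector distance functions do not support calls in which all arguments are constants.

Function	Search type	Arguments	Return value	Description
approx_euclidean_distance	Approximate	float4[], float4[]	float4	Approximate search function for Euclidean distance.
approx_inner_product_distance	Approximate	float4[], float4[]	float4	Approximate search function for inner product distance.
approx_cosine_distance	Approximate	float4[], float4[]	float4	Approximate search function for cosine distance.
euclidean_distance	Exact	float4[], float4[]	float4	Exact search function for Euclidean distance.
inner_product_distance	Exact	float4[], float4[]	float4	Exact search function for inner product distance.
cosine_distance	Exact	float4[], float4[]	float4	Exact search function for cosine distance.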
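The note on constant arguments can be illustrated with a short sketch. It assumes a hypothetical table feature_tb with an id column and a 4-dimensional feature column; the commented-out statement shows the shape of call that the note describes as unsupported.

```sql
-- Accepted shape: the first argument is a column, the second is a constant vector.
SELECT id, cosine_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance DESC
LIMIT 10;

-- Unsupported shape according to the note above: both arguments are constants.
-- SELECT cosine_distance('{0.1,0.2,0.3,0.4}', '{0.4,0.3,0.2,0.1}');
```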

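Verifying vector index usage

You can check the execution plan to see whether a SQL statement uses the vector index. If "Vector Filter" appears in the plan, the index is being used. See "EXPLAIN and EXPLAIN ANALYZE".

Sample SQL: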

SELECT
    id,
    approx_euclidean_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance
LIMIT 40;

Execution plan:

Limit  (cost=0.00..182.75 rows=40 width=12)
  ->  Sort  (cost=0.00..182.75 rows=160 width=12)
        Sort Key: (VectorDistanceRef)
        ->  Gather  (cost=0.00..181.95 rows=160 width=12)
              ->  Limit  (cost=0.00..181.94 rows=160 width=12)
                    ->  Sort  (cost=0.00..181.94 rows=40000 width=12)
                          Sort Key: (VectorDistanceRef)
                          ->  Local Gather  (cost=0.00..91.53 rows=40000 width=12)
                                ->  Limit  (cost=0.00..91.53 rows=40000 width=12)
                                      ->  Sort  (cost=0.00..91.53 rows=40000 width=12)
                                            Sort Key: (VectorDistanceRef)
                                            ->  Project  (cost=0.00..1.12 rows=40000 width=12)
                                                  ->  Index Scan using Clustering_index on feature_tb  (cost=0.00..1.00 rows=40000 width=8)
                                                        Vector Filter: VectorCond => KNN: '40'::bigint distance_method: approx_euclidean_distance search_params: {NULL} args: {feature'{0.100000001,0.200000003,0.300000012,0.400000006}'::real[]}
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0
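A plan of this kind can be requested by prefixing the sample query with EXPLAIN, as sketched below. The sketch assumes that the index on feature_tb.feature was built with "distance_method": "Euclidean"; with any other distance method the approx_euclidean_distance call would not match the index and "Vector Filter" would not be expected in the plan.

```sql
-- Show the plan without running the query; look for "Vector Filter" under the scan node.
EXPLAIN
SELECT
    id,
    approx_euclidean_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance
LIMIT 40;
```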

Usage example

Create the table.

-- Create a Table Group with Shard Count = 4
CALL HG_CREATE_TABLE_GROUP ('test_tg_shard_4', 4);

-- Create the table
CREATE TABLE feature_tb (
    id bigint,
    feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    table_group = 'test_tg_shard_4',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "extra_columns": "id",
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

Import data.

-- (Optional) Serverless Computing is recommended for large offline imports and ETL jobs; Compaction and index building complete synchronously during the import
SET hg_computing_resource = 'serverless';
SET hg_serverless_computing_run_compaction_before_commit_bulk_load = on;

INSERT INTO feature_tb SELECT i, array[random(), random(), random(), random()]::float4[] FROM generate_series(1, 100000) i;

-- Reset the setting so that other SQL statements do not use serverless resources unnecessarily.
RESET hg_computing_resource;

Approximate vector search.
```
-- Top 40 by cosine distance
SELECT
    id,
    approx_cosine_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance DESC
LIMIT 40;
```
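Note
When the target table sets "extra_columns": "id", this approximate search example reads the id values directly from the vector index, without reading the id column of the target table. The vector_index_extra_columns_used field in the EXPLAIN ANALYZE output shows how many vector index files served values through extra_columns.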
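One way to inspect that field is to run the same query under EXPLAIN ANALYZE, as in the sketch below. It assumes the feature_tb table from this example; the exact position of vector_index_extra_columns_used in the output is not shown here and may vary by version.

```sql
-- Executes the query and prints runtime statistics;
-- search the output for vector_index_extra_columns_used.
EXPLAIN ANALYZE
SELECT
    id,
    approx_cosine_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance DESC
LIMIT 40;
```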

Exact vector search.

-- Exact search does not use the vector index, so the distance function does not need to match the distance_method of the index
SELECT
    id,
    cosine_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance DESC
LIMIT 40;

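Performance tuning

Use vector indexes judiciously

When the data volume is small (for example, tens of thousands of rows) or the instance has ample compute resources, consider skipping the vector index and computing distances by brute force. Add a vector index only when brute-force computation cannot meet latency or throughput requirements, for the following reasons:

A vector index is lossy, so result accuracy (recall) cannot reach 100%.
A vector index may return fewer rows than requested, for example only 500 rows for LIMIT 1000.

When you do use a vector index, the following configurations are recommended (taking a single table with a single 768-dimensional vector column as an example):

Latency-sensitive: use a pure in-memory index with sq8_uniform or rabitq quantization, and keep each Shard at no more than 5 million rows.
Latency-insensitive or large data volume: use a memory + disk hybrid index with rabitq quantization, and keep each Shard at no more than 30 to 50 million rows.
Note: when several columns carry vector indexes, scale the recommended rows per Shard down proportionally. The vector dimension also affects the recommended value.

Examples: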

-- Complete hybrid index example
CREATE TABLE feature_tb (
    id bigint,
    feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    table_group = 'test_tg_shard_4',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "precise_io_type": "reader_io",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);


-- Complete in-memory index example
CREATE TABLE feature_tb (
    id bigint,
    feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    table_group = 'test_tg_shard_4',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "sq8_uniform",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

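Improving recall

This section uses the VectorDBBench dataset as an example to show how to improve recall.

Recall usually depends on several factors. With the index parameters below, default recall generally exceeds 95%:

Index parameters:
- base_quantization_type is rabitq or sq8_uniform
- precise_quantization_type is fp32
- max_degree is 64
- ef_construction is 400
Query parameter (GUC):
- hg_vector_ef_search: the default value 80 is recommended. This parameter controls the size of the candidate list during search and balances precision against speed. A larger value gives higher accuracy at a higher resource cost.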

To raise recall above 99%, keep the other parameters unchanged and run SET hg_vector_ef_search = 400;. Higher recall comes with correspondingly higher query latency and compute resource usage.
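A session-level sketch of this adjustment follows. It assumes the feature_tb table and Cosine index from the usage example in this guide; the RESET at the end returns the session to the default so that later queries are not affected.

```sql
-- Widen the search candidate list for this session only.
SET hg_vector_ef_search = 400;

SELECT
    id,
    approx_cosine_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance DESC
LIMIT 40;

-- Return to the default value for subsequent queries.
RESET hg_vector_ef_search;
```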

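To raise recall further to 99.5% ~ 99.7%, adjust max_degree, ef_construction, and hg_vector_ef_search together. Query latency, query resource consumption, index build time, and index build resource consumption all increase accordingly. For example:

max_degree = 96.
ef_construction = 500 or 600.
hg_vector_ef_search = 500 or 600.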
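Put together, a higher-recall setup might look like the sketch below. The table name feature_tb_high_recall is hypothetical, the 4-dimensional column mirrors the other examples in this guide, and the values 96 and 500 are taken from the suggestions above; actual recall depends on the dataset.

```sql
-- Hypothetical table tuned for higher recall at the cost of build time and memory.
CREATE TABLE feature_tb_high_recall (
    id bigint,
    feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "max_degree": 96,
            "ef_construction": 500,
            "precise_quantization_type": "fp32",
            "use_reorder": true
        }
    }
    }'
);

-- Query-side setting for the current session.
SET hg_vector_ef_search = 500;
```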

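Set an appropriate Shard Count

The more Shards there are, the more index files are actually built, which lowers the throughput of approximate vector queries. Set the Shard Count deliberately, usually along the following lines:

Choose a compute specification that fits the vector data volume. For 768-dimensional vectors, use the guidelines below. For other dimensions, see the instance specification recommendations for vector computing.
1. Full in-memory index: 5 million vectors per Worker.
2. Memory + disk hybrid index: 100 million vectors per Worker.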

Determine the Shard Count from the compute specification. The Shard Count can usually equal the number of Workers; for example, for a 64 CU instance you can set shard_count to 4.
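The sizing arithmetic can be sketched as a one-line calculation. The workload of 20 million 768-dimensional vectors is a hypothetical figure; the divisor is the full in-memory guideline of 5 million vectors per Worker stated above.

```sql
-- Hypothetical workload: 20 million vectors with a full in-memory index.
-- 20,000,000 / 5,000,000 per Worker = 4 Workers.
SELECT ceil(20000000 / 5000000.0) AS workers_needed;  -- returns 4
```

With 4 Workers, which matches the 64 CU instance in the example above, a Shard Count of 4 follows from the rule of one Shard per Worker.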

Sample SQL:

-- Create the vector table and place it in a Table Group with Shard Count = 4
CALL HG_CREATE_TABLE_GROUP ('test_tg_shard_4', 4);

CREATE TABLE feature_tb (
    id bigint,
    feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    table_group = 'test_tg_shard_4',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "sq8_uniform",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

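Hybrid vector + scalar query scenarios

Vector search with filter conditions falls into a few common scenarios:

Query scenario 1: a string column is the filter condition

A sample query follows. A typical case is looking up vector data within one organization, such as face data within a class.

SELECT xx_distance(feature, '{1,2,3,4}') AS d FROM feature_tb WHERE uuid = 'x' ORDER BY d LIMIT 10;

The following optimizations are recommended:

Set uuid as the Distribution Key, so that rows with the same filter value are stored in the same Shard and each query touches only one Shard.
Set uuid as the Clustering Key of the table, so that data is sorted by the Clustering Key within each file.

Query scenario 2: a time field is the filter condition

A sample query follows; it typically filters vector data by a time field. Set the time field time_field as the Segment Key of the table so that the files containing the data can be located quickly.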

SELECT xx_distance(feature, '{1,2,3,4}') AS d FROM feature_tb WHERE time_field BETWEEN '2020-08-30 00:00:00' AND '2020-08-30 12:00:00' ORDER BY d LIMIT 10;

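For vector search with such filter conditions, the CREATE TABLE statement therefore typically looks like this:

-- Note: if you do not filter by time, the index related to time_field can be removed.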
CREATE TABLE feature_tb (
    time_field timestamptz NOT NULL,
    uuid text NOT NULL,
    feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    distribution_key = 'uuid',
    segment_key = 'time_field',
    clustering_key = 'uuid',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "sq8_uniform",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);
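Against a table defined this way, the two scenarios could be queried as sketched below. The sketch substitutes approx_cosine_distance for the xx_distance placeholder and sorts in descending order, because the index above uses "distance_method": "Cosine"; the filter values are the illustrative ones from the sample queries.

```sql
-- Scenario 1: filter on the string column (Distribution Key and Clustering Key).
SELECT approx_cosine_distance(feature, '{1,2,3,4}') AS d
FROM feature_tb
WHERE uuid = 'x'
ORDER BY d DESC
LIMIT 10;

-- Scenario 2: filter on the time column (Segment Key).
SELECT approx_cosine_distance(feature, '{1,2,3,4}') AS d
FROM feature_tb
WHERE time_field BETWEEN '2020-08-30 00:00:00' AND '2020-08-30 12:00:00'
ORDER BY d DESC
LIMIT 10;
```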

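Rebuild indexes with Serverless resources

Modifying table properties may trigger Compaction and an index rebuild, which consumes a large amount of CPU. For the property changes below, follow the steps described:

Modifying bitmap_columns, dictionary_encoding_columns, or a vector index triggers Compaction and an index rebuild. Rather than using the ALTER TABLE xxx SET syntax, run the following command, which uses the Rebuild syntax and Serverless Computing resources. See "REBUILD" for details.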

ASYNC REBUILD TABLE <table_name> 
WITH (
    rebuild_guc_hg_computing_resource = 'serverless'
)
SET (
    bitmap_columns = '<col1>,<col2>',
    dictionary_encoding_columns = '<col1>:on,<col2>:off',
    vectors = '{
    "<col_vector>": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

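Modifying columnar JSONB columns or full-text index columns also triggers Compaction and an index rebuild. The Rebuild syntax does not yet support these changes, so make them through a temporary table, as in the following steps: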

BEGIN ;
-- Clean up any leftover temporary table
DROP TABLE IF EXISTS <table_new>;
-- Create the temporary table
SET hg_experimental_enable_create_table_like_properties=on;
CALL HG_CREATE_TABLE_LIKE ('<table_new>', 'select * from <table>');
COMMIT ;

-- Enable columnar JSONB on the target column
ALTER TABLE <table_new> ALTER COLUMN <column_name> SET (enable_columnar_type = ON);
-- Create a full-text index on the target column
CREATE INDEX <idx_name> ON <table_new> USING FULLTEXT (<column_name>);

-- Insert data into the temporary table using Serverless resources; index building completes synchronously
SET hg_computing_resource = 'serverless';
INSERT INTO <table_new> SELECT * FROM <table>;
ANALYZE <table_new>;

BEGIN ;
-- Drop the old table
DROP TABLE IF EXISTS  <table>;
-- Rename the temporary table
ALTER TABLE <table_new> RENAME TO <table>;
COMMIT ;

For changes to other properties, such as distribution_key, clustering_key, segment_key, or the storage format, the Rebuild syntax with Serverless Computing resources is likewise recommended.

FAQ

Q: Error "Writting column: feature with array size: 5 violates fixed size list (4) constraint declared in schema".

A: Cause: the dimension of the data written to the feature vector column does not match the dimension defined on the table. Check the source data for dirty rows.

Q: Error "The size of two array must be the same in DistanceFunction, size of left array: 4, size of right array: x".

A: Cause: in xx_distance(left, right), the dimension of left differs from the dimension of right.
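Both errors come down to a dimension mismatch, so a quick check of the source data can help. The sketch below assumes a hypothetical staging table feature_staging whose feature column has no CHECK constraint, and a target dimension of 4 as in the examples of this guide.

```sql
-- List staged rows whose vector is not a one-dimensional array of length 4.
-- feature_staging is a hypothetical source table.
SELECT
    id,
    array_ndims(feature) AS ndims,
    array_length(feature, 1) AS dim
FROM
    feature_staging
WHERE
    array_ndims(feature) <> 1
    OR array_length(feature, 1) <> 4;
```

Rows with a NULL or empty feature value are not returned by this filter, because both functions yield NULL for them; check those separately if they can occur.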

Q: How do I write vector data from Java?

A: Sample Java code:

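import java.sql.Array;
import java.sql.Connection;
import java.sql.PreparedStatement;

// Inserts 100 rows into feature_tb (id, feature); every row carries the same 4-dimensional vector.
private static void insertIntoVector(Connection conn) throws Exception {
    try (PreparedStatement stmt = conn.prepareStatement("INSERT INTO feature_tb VALUES(?,?);")) {
        for (int i = 0; i < 100; ++i) {
            stmt.setInt(1, i);
            // The array length must match the dimension declared on the column (4 here).
            Float[] featureVector = {0.1f,0.2f,0.3f,0.4f};
            Array array = conn.createArrayOf("FLOAT4", featureVector);
            stmt.setArray(2, array);
            stmt.execute();
        }
    }
}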

Q: How do I change a Proxima Graph index to an HGraph index?

A: Complete the following two steps in order:
- Step 1: Delete the existing Proxima Graph index from the table with the following SQL command:
```
CALL set_table_property ('<TABLE_NAME>', 'proxima_vectors', '{}');
```
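Replace <TABLE_NAME> with the actual table name.
- Step 2: After the original Proxima Graph index has been deleted, create the HGraph index. For details, see the section on creating an index.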
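The two steps might be combined as in the sketch below. The table name feature_tb and the column name feature are hypothetical, and the sketch assumes that the ALTER TABLE ... SET (vectors = ...) syntax from the index modification section can add an HGraph index to an existing table; if that does not hold in your version, recreate the table with the CREATE TABLE syntax instead.

```sql
-- Step 1: remove the existing Proxima Graph index (hypothetical table feature_tb).
CALL set_table_property ('feature_tb', 'proxima_vectors', '{}');

-- Step 2: once the old index has been removed, add an HGraph index.
-- Assumption: ALTER TABLE ... SET (vectors = ...) is accepted on a table without an HGraph index.
ALTER TABLE feature_tb
SET (
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "max_degree": 64,
            "ef_construction": 400
        }
    }
  }'
);
```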