HGraph index guide - Hologres - Alibaba Cloud Documentation Center

Vector computing in Hologres supports scenarios such as similarity search, image retrieval, and scene recognition. Use vector computing to improve data processing and analysis, and to build more precise search and recommendation features. This topic describes how to use vector computing in Hologres and provides complete examples.

Notes

Hologres supports the HGraph vector retrieval algorithm starting from V4.0.
You can create vector indexes only on column-oriented tables and hybrid row-column tables. Row-oriented tables are not supported.
If you delete or rebuild a table that has a vector index, for example, using `INSERT OVERWRITE`, do not enable the recycle bin feature. Tables in the recycle bin still consume some memory.
After you create a vector index, the index file is built during the compaction process that follows data import.
Data in memory tables (Mem Tables) does not have a vector index. When you run a vector retrieval request, this data is processed using brute-force calculation.
Use Serverless Computing resources to perform batch data imports. Serverless resources synchronously complete compaction and index building during data import. For more information, see Use Serverless Computing to run read and write jobs and Use Serverless Computing to run compaction jobs.
If you do not use Serverless resources, manually run the following command to trigger compaction after you import data in batches or modify an index. For more information, see Compaction (Beta).

SELECT hologres.hg_full_compact_table('<SCHEMA_NAME>.<TABLE_NAME>', 'max_file_size_mb=4096');

You can use Serverless Computing resources to run vector retrieval queries.

Manage vector indexes

Create an index

Syntax: Create a vector index when you create a table.

Description: In Hologres, a vector is represented by a float4 array. The vector dimension is represented by the length of the one-dimensional array, which is `array_length` in the following syntax.

CREATE TABLE <TABLE_NAME> (
    <VECTOR_COLUMN_NAME> float4[] CHECK (array_ndims(<VECTOR_COLUMN_NAME>) = 1 AND array_length(<VECTOR_COLUMN_NAME>, 1) = <DIM>)
)
WITH (
    vectors = '{
    "<VECTOR_COLUMN_NAME>": {
        "algorithm": "<ALGORITHM>",
        "distance_method": "<DISTANCE_METHOD>",
        "builder_params": {
            "<BUILDER_PARAMETERS_NAME>": <VALUE>
            [, ...]
        }
    }
    [ , "<VECTOR_COLUMN_NAME_2>": { ... } ]
  }'
);

Parameter description:

Parameter	Description
table_name	The name of the target table.
vector_column_name	The name of the target vector field.
dim	The vector dimension of the target column.

The value of the vector index parameter `vectors` must meet the following requirements:

Only JSON-formatted strings are supported. At the top level, only one `vector_column_name` key is supported. This key specifies the name of the vector field for which to build a vector index.
The value of the `vector_column_name` key is a JSON object used to configure vector index parameters. The following keys are supported.

Key	Description
algorithm	The vector index algorithm. This parameter is required. Only HGraph is supported.
distance_method	The vector distance calculation method. This parameter is required. The following values are supported: Euclidean: Euclidean distance. Only ascending order is supported, which is `ORDER BY distance ASC`. InnerProduct: Inner product distance. Only descending order is supported, which is `ORDER BY distance DESC`. Cosine: Cosine distance. Only descending order is supported, which is `ORDER BY distance DESC`. Note: The distance calculation function used for vector retrieval must correspond to the distance calculation method used for the vector index, and the sorting requirements must be met. Otherwise, the vector index cannot be used.
builder_params	The vector index building parameters. Only JSON-formatted strings are supported. For more information about the parameters, see the following section. max_degree ef_construction base_quantization_type use_reorder precise_quantization_type precise_io_type

The `builder_params` parameter supports the following parameters:

Parameter	Description
max_degree	During the index construction process, each vertex tries to connect to its nearest `max_degree` vertices. Optional. Default value is 64. Larger values increase each vertex's search scope and search efficiency, but also increase the costs of graph construction and storage. We generally do not recommend exceeding 96.
ef_construction	Controls the search depth during index building. This parameter is optional. The default value is 400. A larger value includes more candidates for neighboring vectors of a vertex during index building. This improves index accuracy but also increases the time and computational complexity of index building. We recommend that you do not set this value to more than 600.
base_quantization_type	The quantization method for the HGraph low-precision index. This parameter is required. The following methods are supported: sq8 sq8_uniform fp16 fp32 rabitq
use_reorder	Specifies whether to use the HGraph high-precision index. This parameter is optional. The default value is FALSE.
precise_quantization_type	The quantization method for the HGraph high-precision index. This parameter is optional and takes effect only when `use_reorder` is TRUE. The default value is fp32. We recommend that you do not change this value. The following methods are supported. We recommend that you select a quantization method with higher precision than `base_quantization_type`. sq8 sq8_uniform fp16 fp32
precise_io_type	The storage medium for the hybrid HGraph index that uses both high-precision and low-precision indexes. This parameter is optional and takes effect only when `use_reorder` is TRUE. The default value is `block_memory_io`. The following values are supported: block_memory_io: Both low-precision and high-precision indexes are stored in memory. reader_io: The low-precision index is stored in memory, and the high-precision index is stored on disk.
builder_thread_count	This parameter is optional. The default value is 4. It controls the number of threads for building the vector index during data writes. You do not typically need to adjust this parameter. Increasing this value may cause high CPU usage. Therefore, we recommend that you do not change this parameter in most scenarios. Modifying this parameter does not trigger reindexing.
graph_storage_type	This parameter is optional. The default value is `flat`. It controls the compression of the graph index in memory. The following values are supported: flat (default): Does not compress the graph index. compressed: Compresses the graph index. This can save 50% of memory, and the maximum QPS decreases by only about 5%. Note You can set this parameter in Hologres V4.0.10 and later.
extra_columns	Attaches column information to the vector index. This feature is supported in V4.1.1 and later. This parameter is optional. Only columns of the INT, BIGINT, and SMALLINT types are supported. During retrieval, you can directly obtain column values from the index without querying the corresponding columns in the target table. This improves vector retrieval performance. Example: `"extra_columns": "id"`

Modify an index

Syntax:

ALTER TABLE <TABLE_NAME>
SET (
    vectors = '{
    "<VECTOR_COLUMN_NAME>": {
        "algorithm": "<ALGORITHM>",
        "distance_method": "<DISTANCE_METHOD>",
        "builder_params": {
            "<BUILDER_PARAMETERS_NAME>": <VALUE>
            [, ...]
        }
    }
  }'
);

Delete an index

-- Delete the vector indexes of all columns in a table.
ALTER TABLE <TABLE_NAME>
SET (
    vectors = '{}'
);

-- If a table has vector indexes built on both the col1 and col2 columns, and you need to delete the index on the col2 column, run the ALTER TABLE statement to retain only the index on the col1 column.
ALTER TABLE <TABLE_NAME>
SET (
    vectors = '{
    "col1": { ... }
  }'
);

View an index

Hologres provides the `hologres.hg_table_properties` system table. You can use this table to view created vector indexes.

SELECT
    *
FROM
    hologres.hg_table_properties
WHERE 
    table_name = '<TABLE_NAME>'
    AND property_key = 'vectors';

Use a vector index for vector retrieval

Vector distance calculation functions

Vector retrieval in Hologres supports approximate retrieval and exact retrieval. Only approximate retrieval functions can use a created vector index to accelerate queries. The function must correspond to the `distance_method` of the vector index. Exact retrieval functions cannot use vector indexes.

Note: Vector distance calculation functions do not support all-constant input parameters.

Function	Retrieval type	Input parameters	Return value	Description
approx_euclidean_distance	Approximate retrieval	float4[], float4[]	float4	An approximate retrieval function for Euclidean distance.
approx_inner_product_distance	Approximate retrieval	float4[], float4[]	float4	An approximate retrieval function for inner product distance.
approx_cosine_distance	Approximate retrieval	float4[], float4[]	float4	An approximate retrieval function for cosine distance.
euclidean_distance	Exact retrieval	float4[], float4[]	float4	An exact retrieval function for Euclidean distance.
inner_product_distance	Exact retrieval	float4[], float4[]	float4	An exact retrieval function for inner product distance.
cosine_distance	Exact retrieval	float4[], float4[]	float4	An exact retrieval function for cosine distance.

Verify the use of a vector index

You can view the execution plan to check whether the SQL statement uses a vector index. If the plan contains "Vector Filter", the index is used successfully. For more information, see EXPLAIN and EXPLAIN ANALYZE.

Example SQL:

SELECT
    id,
    approx_euclidean_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance
LIMIT 40;

Execution plan:

Limit  (cost=0.00..182.75 rows=40 width=12)
  ->  Sort  (cost=0.00..182.75 rows=160 width=12)
        Sort Key: (VectorDistanceRef)
        ->  Gather  (cost=0.00..181.95 rows=160 width=12)
              ->  Limit  (cost=0.00..181.94 rows=160 width=12)
                    ->  Sort  (cost=0.00..181.94 rows=40000 width=12)
                          Sort Key: (VectorDistanceRef)
                          ->  Local Gather  (cost=0.00..91.53 rows=40000 width=12)
                                ->  Limit  (cost=0.00..91.53 rows=40000 width=12)
                                      ->  Sort  (cost=0.00..91.53 rows=40000 width=12)
                                            Sort Key: (VectorDistanceRef)
                                            ->  Project  (cost=0.00..1.12 rows=40000 width=12)
                                                  ->  Index Scan using Clustering_index on feature_tb  (cost=0.00..1.00 rows=40000 width=8)
                                                        Vector Filter: VectorCond => KNN: '40'::bigint distance_method: approx_euclidean_distance search_params: {NULL} args: {feature'{0.100000001,0.200000003,0.300000012,0.400000006}'::real[]}
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0

Examples

Create a table.

-- Create a table group with a shard count of 4.
CALL HG_CREATE_TABLE_GROUP ('test_tg_shard_4', 4);

-- Create a table.
CREATE TABLE feature_tb (
    id bigint,
  	feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    table_group = 'test_tg_shard_4',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "extra_columns": "id",
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

Import data.

-- (Optional) Use Serverless Computing to run large-scale offline data import and ETL jobs. Compaction and index building are synchronously completed during the import.
SET hg_computing_resource = 'serverless';
SET hg_serverless_computing_run_compaction_before_commit_bulk_load = on;

INSERT INTO feature_tb SELECT i, array[random(), random(), random(), random()]::float4[] FROM generate_series(1, 100000) i;

-- Reset the configuration to ensure that unnecessary SQL statements do not use Serverless resources.
RESET hg_computing_resource;

Perform approximate vector retrieval.
```
-- Calculate the top 40 results for cosine distance.
SELECT
    id,
    approx_cosine_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance DESC
LIMIT 40;
```
Note
When the "extra_columns": "id" parameter is set for the target table, this approximate vector search example can retrieve values from the id column directly from the vector index without querying the id column of the target table. You can check the vector_index_extra_columns_used parameter in the EXPLAIN ANALYZE result to view the number of vector manifest files that retrieve values using `extra_columns`.

Perform exact vector retrieval.

-- Exact retrieval does not use a vector index. Therefore, the distance calculation function does not need to be the same as the distance_method of the vector index.
SELECT
    id,
    cosine_distance (feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM
    feature_tb
ORDER BY
    distance DESC
LIMIT 40;

Performance tuning

Use vector indexes efficiently

When the data volume is small (for example, tens of thousands of rows) or the instance has sufficient computing resources, do not set a vector index. Use brute-force calculation instead. Use a vector index only when brute-force calculation cannot meet your requirements for latency, throughput, or other metrics. This is because:

Vector indexes are lossy, so the accuracy (recall rate) cannot reach 100%.
A vector index may return fewer items than requested. For example, a query with `LIMIT 1000` might return only 500 items.

When you choose to use a vector index, configure it as follows (using a non-partitioned table with a single 768-dimension vector field as an example):

Latency-sensitive scenarios: Use an in-memory-only index. For the index quantization method, we recommend that you use `sq8_uniform` or `rabitq`. The recommended data volume is no more than 5 million rows per shard.
Latency-insensitive or large data volume scenarios: Use a hybrid memory-disk index. For the index quantization method, we recommend that you use `rabitq`. The recommended data volume is no more than 30 million to 50 million rows per shard.
Note: To set vector indexes on multiple columns, the recommended data volume per shard must be reduced proportionally. The vector dimension also affects this recommendation.

Examples:

-- Complete example of a hybrid index
CREATE TABLE feature_tb (
    id bigint,
  	feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    table_group = 'test_tg_shard_4',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "precise_io_type": "reader_io",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);


-- Complete example of an in-memory-only index
CREATE TABLE feature_tb (
    id bigint,
  	feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    table_group = 'test_tg_shard_4',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "sq8_uniform",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

Improve the recall rate

This section uses the VectorDBBench dataset as an example to describe how to improve the recall rate.

Multiple factors can affect the recall rate. With the following index parameter settings, the default recall rate of the system can typically reach over 95%:

Index parameters:
- `base_quantization_type` is `rabitq` or `sq8_uniform`
- `precise_quantization_type` is `fp32`
- `max_degree` is `64`
- `ef_construction` is `400`
Query parameters (GUC):
- `hg_vector_ef_search`: The recommended value is the default value of 80. This parameter controls the size of the candidate list during retrieval to balance accuracy and speed. A larger value provides higher accuracy but also increases resource overhead.

To further improve the recall rate to over 99%, you can run SET hg_vector_ef_search = 400; while keeping other parameters unchanged. However, improving the recall rate increases query latency and computing resource usage.

To further improve the recall rate to 99.5% to 99.7%, you can adjust the `max_degree`, `ef_construction`, and `hg_vector_ef_search` parameters. This increases query latency, query resource consumption, index building time, and index building resource consumption. For example:

`max_degree` = 96.
`ef_construction` = 500 or 600.
`hg_vector_ef_search` = 500 or 600.

Set an appropriate shard count

A higher shard count results in more index files being built, which reduces the throughput of approximate vector queries. Therefore, set an appropriate shard count for your application. You can typically follow these steps:

Select appropriate instance specifications based on the volume of vector data. The following examples are for 768-dimension vectors. For information about other dimensions, see Recommended instance types for vector computing.
1. In-memory-only index: 5 million vectors per worker.
2. Hybrid memory-disk index: 100 million vectors per worker.

Determine the shard count based on the instance specifications. Typically, you can set the shard count to be equal to the number of workers. For example, for a 64 CU instance, you can set `shard_count` to 4.

The following is an SQL example:

-- Create a vector table and place it in a table group with a shard count of 4.
CALL HG_CREATE_TABLE_GROUP ('test_tg_shard_4', 4);

CREATE TABLE feature_tb (
    id bigint,
  	feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    table_group = 'test_tg_shard_4',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "sq8_uniform",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

Vector and scalar hybrid query scenarios

For vector retrieval with filter conditions, the following are common filtering scenarios:

Query scenario 1: A string column is used as a filter condition

The following is an example query. A common scenario is to find corresponding vector data within an organization, such as finding face data in a class.

SELECT(feature, '{1,2,3,4}') AS d FROM feature_tb WHERE uuid = 'x' ORDER BY d LIMIT 10;

We recommend the following optimizations:

Set `uuid` as the distribution key. This ensures that data with the same filter condition is stored in the same shard. A query then only accesses a single shard.
Set `uuid` as the clustering key for the table. Data is sorted within files based on the clustering key.

Query scenario 2: A time field is used as a filter condition

The following is an example query. This is typically used to filter vector data based on a time field. We recommend that you set the time field `time_field` as the segment key for the table. This lets you quickly locate the files where the data resides.

SELECT xx_distance(feature, '{1,2,3,4}') AS d FROM feature_tb WHERE time_field BETWEEN '2020-08-30 00:00:00' AND '2020-08-30 12:00:00' ORDER BY d LIMIT 10;

Therefore, for vector retrieval with any filter conditions, the `CREATE TABLE` statement is typically as follows:

-- Note: If you do not filter by time, you can remove the index related to time_field.
CREATE TABLE feature_tb (
    time_field timestamptz NOT NULL,
    uuid text NOT NULL,
    feature float4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
)
WITH (
    distribution_key = 'uuid',
    segment_key = 'time_field',
    clustering_key = 'uuid',
    vectors = '{
    "feature": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "sq8_uniform",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

Reindex using Serverless resources

If you modify table properties, compaction may be triggered and the index may be rebuilt, which consumes a large amount of CPU resources. To modify the following table properties, perform the steps below:

Modifying `bitmap_columns`, `dictionary_encoding_columns`, or vector indexes triggers compaction and reindexing. Therefore, do not use the `ALTER TABLE xxx SET` syntax. Instead, run the following command to use the `REBUILD` syntax with Serverless Computing resources. For more information, see REBUILD.

ASYNC REBUILD TABLE <table_name> 
WITH (
    rebuild_guc_hg_computing_resource = 'serverless'
)
SET (
    bitmap_columns = '<col1>,<col2>',
    dictionary_encoding_columns = '<col1>:on,<col2>:off',
    vectors = '{
    "<col_vector>": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

Modifying column-oriented JSONB columns or full-text index columns also triggers compaction and reindexing. The `REBUILD` syntax is not supported for these modifications. Instead, create a temporary table by following these steps:

BEGIN ;
-- Clean up potential temporary tables.
DROP TABLE IF EXISTS <table_new>;
-- Create a temporary table.
SET hg_experimental_enable_create_table_like_properties=on;
CALL HG_CREATE_TABLE_LIKE ('<table_new>', 'select * from <table>');
COMMIT ;

-- Enable column-oriented storage for JSON-formatted data for the corresponding column.
ALTER TABLE <table_new> ALTER COLUMN <column_name> SET (enable_columnar_type = ON);
-- Create a full-text index on the corresponding column.
CREATE INDEX <idx_name> ON <table_new> USING FULLTEXT (column_name);

-- Insert data into the temporary table, use Serverless resources to run the job, and synchronously complete index building.
SET hg_computing_resource = 'serverless';
INSERT INTO <table_new> SELECT * FROM <table>;
ANALYZE <table_new>;

BEGIN ;
-- Delete the old table.
DROP TABLE IF EXISTS  <table>;
-- Rename the temporary table.
ALTER TABLE <table_new> RENAME TO <table>;
COMMIT ;

To modify other properties, such as `distribution_key`, `clustering_key`, `segment_key`, and storage format, use the `REBUILD` syntax with Serverless Computing resources.

FAQ

Q: I receive the error "Writting column: feature with array size: 5 violates fixed size list (4) constraint declared in schema".

A: This error occurs because the dimension of the data written to the feature vector field is inconsistent with the dimension defined in the table. Check for dirty data.
Q: I receive the error "The size of two array must be the same in DistanceFunction, size of left array: 4, size of right array: x".

A: This error occurs because the dimension of `left` is different from the dimension of `right` in `xx_distance(left, right)`.

Q: How do I write vector data using Java?

A: The following is a Java code example:

private static void insertIntoVector(Connection conn) throws Exception {
    try (PreparedStatement stmt = conn.prepareStatement("INSERT INTO feature_tb VALUES(?,?);")) {
        for (int i = 0; i < 100; ++i) {
           stmt.setInt(1, i);
           Float[] featureVector = {0.1f,0.2f,0.3f,0.4f};
           Array array = conn.createArrayOf("FLOAT4", featureVector);
           stmt.setArray(2, array);
           stmt.execute();
        }
    }
}

Q: How do I change a Proxima Graph index to an HGraph index?

A: To change a Proxima Graph index to an HGraph index, perform the following two steps in order:
- Step 1: Delete the existing Proxima Graph index from the table. The SQL command is as follows:
```
CALL set_table_property ('<TABLE_NAME>', 'proxima_vectors', '{}');
```
Replace <TABLE_NAME> with the actual table name.
- Step 2: After the original Proxima Graph index is deleted, create a new HGraph index. For more information, see Create an index.