Build a Proxima Graph index and perform vector queries - Hologres

Hologres integrates Proxima to support approximate nearest neighbor (ANN) search directly in your database. This enables similarity searches, image retrieval, and scene recognition over large vector datasets without exporting data to an external vector store.

This guide covers how to set up a vector table with a Proxima index, load data, run queries, and tune performance.

Prerequisites

Before you begin, ensure that you have:

A Hologres instance with superuser access (required to install the Proxima extension)
A connected development tool. For connection instructions, see Connect to a development tool
(Required for Java Database Connectivity) A JDBC connection configured in Prepare Statement mode

Set up Proxima

Step 1: Install the Proxima extension

A superuser must install the Proxima extension before any other steps. The extension is installed per database — run this once per database:

CREATE EXTENSION proxima;

To uninstall the extension:

DROP EXTENSION proxima;

Important

Never run DROP EXTENSION <extension_name> CASCADE. The CASCADE option removes the extension along with all dependent objects, including PostGIS data, RoaringBitmap data, Proxima data, binary logging (Binlog) data, BSI data, metadata, tables, views, and server data.

Step 2: Create a vector table and index

Vectors in Hologres are stored as FLOAT4 arrays. Vector indexes are supported only on column-oriented tables and hybrid row-column tables — not on row-oriented tables.

Choose the approach that matches your data loading pattern:

Option A: Create the index before importing data (real-time ingestion)

-- Single vector column
BEGIN;
CREATE TABLE feature_tb (
    id BIGINT,
    feature_col FLOAT4[] CHECK(array_ndims(feature_col) = 1 AND array_length(feature_col, 1) = <dimension>)
);
CALL set_table_property(
    'feature_tb',
    'proxima_vectors',
    '{"<feature_col>":{"algorithm":"Graph",
      "distance_method":"<SquaredEuclidean|Euclidean|InnerProduct>",
      "builder_params":{"min_flush_proxima_row_count": 1000,
                        "min_compaction_proxima_row_count": 1000,
                        "max_total_size_to_merge_mb": 2000}}}');
COMMIT;

Option B: Import data first, then build the index (batch analytics)

Starting from Hologres V2.0.11, importing data before creating the index avoids building the index during compaction, which reduces total index creation time.

-- Create the table (no index yet)
BEGIN;
CREATE TABLE feature_tb (
    id BIGINT,
    feature_col FLOAT4[] CHECK(array_ndims(feature_col) = 1 AND array_length(feature_col, 1) = <dimension>)
);
COMMIT;

-- (Optional) Use Serverless Computing for large-volume batch imports
SET hg_computing_resource = 'serverless';

-- Import data
INSERT INTO feature_tb ...;
VACUUM feature_tb;

-- Build the index after data is loaded
CALL set_table_property(
    'feature_tb',
    'proxima_vectors',
    '{"<feature_col>":{"algorithm":"Graph",
      "distance_method":"<SquaredEuclidean|Euclidean|InnerProduct>",
      "builder_params":{"min_flush_proxima_row_count": 1000,
                        "min_compaction_proxima_row_count": 1000,
                        "max_total_size_to_merge_mb": 2000}}}');

-- Reset to avoid routing unrelated queries to serverless resources
RESET hg_computing_resource;

Starting from Hologres V2.1.17, Serverless Computing is available for batch importing or querying large volumes of vector data. It uses additional serverless resources instead of your instance's resources, so you don't need to reserve extra compute capacity. You are charged only for the tasks you run. For details, see Serverless Computing overview and Use Serverless Computing.

Vector column parameters

Category	Parameter	Description	Example
Vector column	`feature_col`	Name of the vector column	`feature`
	`array_ndims`	Number of array dimensions. Only `1` (one-dimensional) is supported	`array_ndims(feature) = 1`
	`array_length`	Length of the vector (number of elements). Maximum: 1,000,000	`array_length(feature, 1) = 4`
Index	`algorithm`	Index-building algorithm. Only `Graph` is supported	`Graph`
	`distance_method`	Distance metric for the index. Must match the query UDF	`SquaredEuclidean` (recommended)
	`min_flush_proxima_row_count`	Minimum rows required to build an index when data is written to disk	`1000`
	`min_compaction_proxima_row_count`	Minimum rows required to build an index when data is merged on disk	`1000`
	`max_total_size_to_merge_mb`	Maximum file size for data merging on disk. Unit: MB	`2000`
	`proxima_builder_thread_count`	Threads used to build the index during data writes. Default: 4	`4`

Choose a distance method

The distance_method value in the index definition must match the UDF used in queries. Mismatches prevent index usage.

Distance method	`distance_method` value	Query UDF	Sort order	Recommendation
Squared Euclidean	`SquaredEuclidean`	`pm_approx_squared_euclidean_distance`	`ASC`	Recommended. Highest query efficiency. Produces the same top K results as Euclidean but skips the square root calculation.
Euclidean	`Euclidean`	`pm_approx_euclidean_distance`	`ASC`	Acceptable, but slower than squared Euclidean.
Inner product	`InnerProduct`	`pm_approx_inner_product_distance`	`DESC`	Avoid unless required. Internally converted to a Euclidean calculation, adding overhead to both index building and queries.

To verify the table and index were created, run a \d feature_tb describe command or query pg_indexes. If the table exists and no error was raised during COMMIT, setup is complete.

Step 3: Import vector data

Import data using batch or real-time mode. After a batch import, run VACUUM and then ANALYZE to optimize query performance:

-- Compact backend files for more efficient queries
VACUUM <table_name>;

-- Collect statistics for the Query Optimizer (QO)
ANALYZE <table_name>;

VACUUM compacts backend files into larger files. It consumes CPU resources and runtime scales with data volume. Wait for a running VACUUM to finish before starting another.
ANALYZE collects performance statistics that the Query Optimizer (QO) uses to generate better execution plans.

Step 4: Query vector data

Hologres supports two query modes:

Mode	UDF prefix	Uses index	Best for
Approximate	`pm_approx_*`	Yes	Large datasets. Default recall rate is over 99%.
Exact	`pm_*`	No	Small datasets where 100% recall is required.

The vector index works only in single-table queries. Avoid JOIN operations when querying vectors.

Approximate queries (index-enabled)

-- Top K by squared Euclidean distance (requires distance_method: SquaredEuclidean)
SELECT pm_approx_squared_euclidean_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance ASC
LIMIT 10;

-- Top K by Euclidean distance (requires distance_method: Euclidean)
SELECT pm_approx_euclidean_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance ASC
LIMIT 10;

-- Top K by inner product distance (requires distance_method: InnerProduct)
SELECT pm_approx_inner_product_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance DESC
LIMIT 10;

The second parameter must be a constant value in approximate queries. Squared Euclidean and Euclidean require ORDER BY distance ASC; inner product requires ORDER BY distance DESC. Approximate queries are lossy — precision loss is possible, but the default recall rate is typically over 99%.

Exact queries (no index)

-- Top K by squared Euclidean distance (full scan)
SELECT pm_squared_euclidean_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance ASC
LIMIT 10;

-- Top K by Euclidean distance (full scan)
SELECT pm_euclidean_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance ASC
LIMIT 10;

-- Top K by inner product distance (full scan)
SELECT pm_inner_product_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance DESC
LIMIT 10;

Exact queries scan all rows in the vector column and sort results. Use them when data volume is small and recall rate is critical.

Verify index usage

To confirm the index is being used, run EXPLAIN on your query and check the plan output for Proxima filter: xxxx. If that string is absent, the index is not in use — typically because the table's distance_method doesn't match the query UDF.

Complete example

This example retrieves the 40 nearest vectors from a table of 100,000 4-dimensional vectors using squared Euclidean distance.

1. Create the table

CREATE EXTENSION proxima;

BEGIN;
-- Place the table in a table group with shard_count = 4
CALL HG_CREATE_TABLE_GROUP('test_tg_shard_4', 4);

CREATE TABLE feature_tb (
    id BIGINT,
    feature FLOAT4[] CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = 4)
);

CALL set_table_property('feature_tb', 'table_group', 'test_tg_shard_4');
CALL set_table_property('feature_tb', 'proxima_vectors',
    '{"feature":{"algorithm":"Graph","distance_method":"SquaredEuclidean",
      "builder_params":{"min_flush_proxima_row_count": 1000,
                        "min_compaction_proxima_row_count": 1000,
                        "max_total_size_to_merge_mb": 2000}}}');
COMMIT;

2. Import data

-- (Optional) Use Serverless Computing for large-volume batch imports
SET hg_computing_resource = 'serverless';

INSERT INTO feature_tb
SELECT
    i,
    ARRAY[random(), random(), random(), random()]::FLOAT4[]
FROM generate_series(1, 100000) i;

VACUUM feature_tb;
ANALYZE feature_tb;

-- Reset to avoid routing unrelated queries to serverless resources
RESET hg_computing_resource;

3. Run a query

-- (Optional) Use Serverless Computing for large-volume vector queries
SET hg_computing_resource = 'serverless';

SELECT
    pm_approx_squared_euclidean_distance(feature, '{0.1,0.2,0.3,0.4}') AS distance
FROM feature_tb
ORDER BY distance
LIMIT 40;

-- Reset to avoid routing unrelated queries to serverless resources
RESET hg_computing_resource;

To confirm the query used the index, run EXPLAIN on the query and look for Proxima filter: in the output.

Performance tuning

When to use a vector index

For small datasets (tens of thousands of rows) or when your instance has enough resources, direct distance calculation without an index may be sufficient. Consider using a Proxima index when direct calculation can't meet your latency or throughput requirements.

Keep the following trade-offs in mind before enabling an index:

The Proxima index is lossy — results may have a small accuracy bias.
The index may return fewer results than the LIMIT value. For example, a LIMIT 1000 query might return only 500 rows.
A Proxima index can be difficult to use.

Set an appropriate shard count

Each shard builds its own set of Proxima index files. More shards mean more index files, which reduces query throughput. Set the shard count to the number of workers as a starting point. For a 64-core instance, a shard count of 4 is typical.

Fewer shards lower single-query latency but also reduce write throughput.

BEGIN;
CALL HG_CREATE_TABLE_GROUP('test_tg_shard_4', 4);
CREATE TABLE proxima_test (
    id BIGINT NOT NULL,
    vectors FLOAT4[] CHECK(array_ndims(vectors) = 1 AND array_length(vectors, 1) = 128),
    PRIMARY KEY (id)
);
CALL set_table_property('proxima_test', 'proxima_vectors',
    '{"vectors":{"algorithm":"Graph","distance_method":"SquaredEuclidean",
      "builder_params":{}, "searcher_init_params":{}}}');
CALL set_table_property('proxima_test', 'table_group', 'test_tg_shard_4');
COMMIT;

Optimize for filter conditions

WHERE filter conditions can prevent the index from being used. The general strategy: minimize the number of shards a filtered query must touch.

No filter conditions (recommended): The index works most efficiently. Aim for one vector index file per shard so each query runs on a single shard.
Filter conditions present: Design the table so that filtered queries are routed to as few shards as possible.

Queries without filter conditions

BEGIN;
CREATE TABLE feature_tb (
    uuid TEXT,
    feature FLOAT4[] NOT NULL CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = N)
);
CALL set_table_property('feature_tb', 'shard_count', '<shard_count>');
CALL set_table_property('feature_tb', 'proxima_vectors',
    '{"feature":{"algorithm":"Graph","distance_method":"InnerProduct"}}');
END;

Queries with filter conditions

When filter conditions are unavoidable, match the table design to the filter type:

Filtering by a string column (for example, finding vectors within a specific organization):

SELECT pm_xx_distance(feature, '{1,2,3,4}') AS d
FROM feature_tb
WHERE uuid = 'x'
ORDER BY d
LIMIT 10;

Set uuid as both the distribution key and the clustering key. Rows with the same uuid are colocated on one shard, and data is sorted within each file — so the query runs on a single shard and scans less data.

Filtering by a time field (for example, finding vectors within a time window):

SELECT pm_xx_distance(feature, '{1,2,3,4}') AS d
FROM feature_tb
WHERE time_field BETWEEN '2020-08-30 00:00:00' AND '2020-08-30 12:00:00'
ORDER BY d
LIMIT 10;

Set time_field as the segment key. This lets the query quickly locate the files that contain data in the specified time range.

Table design for both filter types:

BEGIN;
CREATE TABLE feature_tb (
    time_field TIMESTAMPTZ NOT NULL,
    uuid TEXT,
    feature FLOAT4[] NOT NULL CHECK(array_ndims(feature) = 1 AND array_length(feature, 1) = N)
);
CALL set_table_property('feature_tb', 'distribution_key', 'uuid');
CALL set_table_property('feature_tb', 'segment_key', 'time_field');
CALL set_table_property('feature_tb', 'clustering_key', 'uuid');
CALL set_table_property('feature_tb', 'proxima_vectors',
    '{"feature":{"algorithm":"Graph","distance_method":"InnerProduct"}}');
COMMIT;

-- If you don't need time-based filtering, remove the segment_key setting.

Distance function reference

Hologres supports three distance functions. The distance_method value set at index creation determines which approximate UDF (pm_approx_*) you must use in queries.

Squared Euclidean distance (SquaredEuclidean)

Euclidean distance (Euclidean)

Inner product distance (InnerProduct)

Squared Euclidean distance requires one fewer square root calculation than Euclidean distance but produces the same top K results. Use squared Euclidean distance when it meets your requirements — it offers better performance.

FAQ

`ERROR: function pm_approx_inner_product_distance(real[], unknown) does not exist`

The Proxima extension is not initialized in this database. The UDFs are registered during extension installation, so they don't exist until you run:

CREATE EXTENSION proxima;

`Writing column: feature with array size: 5 violates fixed size list (4) constraint declared in schema`

The dimension of the data being written doesn't match the dimension defined in the table schema. This typically occurs when input data contains rows with incorrect vector lengths. Check your input data for rows with unexpected dimensions (dirty data).

`The size of two arrays must be the same in DistanceFunction, size of left array: 4, size of right array:`

The two arrays passed to pm_xx_distance(left, right) have different dimensions. This typically occurs when the query vector dimension doesn't match the table's vector column dimension. Make sure both arrays have the same length before querying.

`BackPressure Exceed Reject Limit ctxId: XXXXXXXX, tableId: YY, shardId: ZZ`

The real-time write job is building too many small index files. This typically occurs when min_flush_proxima_row_count is set too low and the write rate is high — index-building overhead accumulates and eventually blocks the write pipeline. Increase min_flush_proxima_row_count to reduce the frequency of index builds during real-time ingestion.

How do I write vector data using Java?

Use PreparedStatement with conn.createArrayOf("FLOAT4", ...) to pass FLOAT4 arrays:

private static void insertIntoVector(Connection conn) throws Exception {
    try (PreparedStatement stmt = conn.prepareStatement("INSERT INTO feature_tb VALUES(?,?);")) {
        for (int i = 0; i < 100; ++i) {
            stmt.setInt(1, i);
            Float[] featureVector = {0.1f, 0.2f, 0.3f, 0.4f};
            Array array = conn.createArrayOf("FLOAT4", featureVector);
            stmt.setArray(2, array);
            stmt.execute();
        }
    }
}

How do I check whether a query is using the Proxima index?

Run EXPLAIN on your query. If the plan output contains Proxima filter: xxxx, the index is being used. If that string is absent, the index is not in use — usually because the distance_method set at table creation doesn't match the UDF used in the query.