Index types and related operations in AnalyticDB for PostgreSQL - AnalyticDB

Indexes in a distributed database

In most traditional databases, indexes speed up data retrieval significantly. In a distributed analytical database like AnalyticDB for PostgreSQL, the tradeoff is different. Each segment node already scans a small slice of the total dataset using fast sequential scans. Indexes add a random seek pattern that can sometimes cost more than the sequential scan it replaces.

Use indexes selectively:

Indexes are most effective for queries that return small result sets, such as online transaction processing (OLTP)-style point lookups.
Indexes help on compressed append-optimized (AO) tables because only the relevant rows are decompressed during a scan.
AnalyticDB for PostgreSQL does not support hash indexes.

Start by running your queries without any indexes. Add indexes only when query performance on the target columns improves measurably.

Index types

B-tree index

B-tree indexes suit columns with high selectivity — a large proportion of distinct values relative to total rows. Selectivity is calculated as the number of distinct values divided by the total number of rows. For example, a column with 800 distinct values in a 1,000-row table has a selectivity of 0.8, which is considered good for a B-tree index.

B-tree indexes work well for equality and range queries (=, <, <=, >, >=) and for columns used in WHERE clauses and JOIN conditions.

Bitmap index

Bitmap indexes store a compact bitmap for each distinct key value. They occupy less storage space than conventional indexes and perform best when:

The column has between 100 and 100,000 distinct values (low to medium cardinality)
The column is frequently queried together with other indexed columns
The workload is read-heavy rather than write-heavy (data warehouse queries rather than high-concurrency OLTP)

When not to use bitmap indexes: Avoid bitmap indexes on columns with high cardinality (unique values such as customer IDs or phone numbers). The performance and storage advantages diminish above 100,000 distinct values.

BRIN index

Block Range INdex (BRIN) records the minimum and maximum values for each block range in the table. BRIN indexes work best when:

Data is large and sequentially distributed (for example, an auto-incrementing timestamp or primary key)
Queries filter with range conditions: <, <=, =, >=, >
Applicable only to heap tables

For large datasets, BRIN indexes provide similar query performance to B-tree indexes but occupy far less space.

VACUUM for indexed data collection applies only to BRIN indexes.

GIN index

Generalized Inverted Index (GIN) is suited for columns that contain multiple values per row, such as arrays or full text search vectors (tsvector). Use GIN when queries need to search inside composite values.

GiST index

Generalized Search Tree (GiST) supports geometric and full text search workloads. Use GiST for columns that require nearest-neighbor search or complex matching predicates.

Choose the right index type

Column characteristic	Recommended index
High cardinality (many distinct values), range or equality queries	B-tree
Low to medium cardinality (100–100,000 distinct values), combined predicates	Bitmap
Large table, sequentially distributed data, range filters	BRIN
Array columns, full text search (tsvector)	GIN
Geometric data, full text search (alternative)	GiST

Choose the right columns

Join columns: Index columns used as join keys or foreign keys. This gives the query optimizer more join strategies and improves join performance.
Predicate columns: Index columns frequently referenced in WHERE clauses.
Avoid frequently updated columns: Every update to an indexed column increases read and write overhead.

Avoid redundant indexes. In a composite index, any index whose leading column matches the leading column of another index is redundant. For example, if you already have an index on (store_id, time), a separate index on (store_id) alone is redundant.

When designing composite indexes, order columns from equality predicates to range predicates. For example, place store_id = 'x' columns before time > '09:30' columns so the index can efficiently narrow the result.

Create an index

Use CREATE INDEX to add an index to a table.

-- B-tree index on the gender column of the employee table
CREATE INDEX gender_idx ON employee (gender);

-- Bitmap index on the title column of the films table
CREATE INDEX title_bmp_idx ON films USING bitmap (title);

-- BRIN index on the c_custkey column of the customer table
CREATE INDEX c_custkey_brin_idx ON customer USING brin(c_custkey) WITH (pages_per_range=2);

-- GIN index for full text search on the l_comment column of the lineitem table
CREATE INDEX lineitem_idx ON lineitem USING gin(to_tsvector('english', l_comment));

-- GIN index on the intarray column of the arrayt table
CREATE INDEX arrayt_idx ON arrayt USING gin(intarray);

-- GiST index for full text search on the c_comment column of the customer table
CREATE INDEX customer_idx ON customer USING gist(to_tsvector('english', c_comment));

After creating an index, run ANALYZE on the table so the query optimizer has up-to-date statistics:

ANALYZE <table_name>;

Rebuild an index

Use REINDEX to rebuild a corrupted or bloated index.

-- Rebuild a specific index
REINDEX INDEX my_index;

-- Rebuild all indexes on a table
REINDEX TABLE my_table;

Drop an index

Use DROP INDEX to remove an index.

DROP INDEX title_idx;

Drop all indexes on a table before bulk-loading data. Recreate the indexes after the load completes. Maintaining indexes during a large data load is slower than rebuilding them afterward.

Collect indexed data (BRIN only)

Run VACUUM to update BRIN index metadata after inserts or deletes:

VACUUM customer;

Indexed data collection via VACUUM applies only to BRIN indexes.

Best practices

Test before committing. Compare query performance with and without the index on representative data.
Drop indexes before bulk loads. Drop all indexes, load the data, then recreate the indexes.
Run ANALYZE after creating indexes. Keep optimizer statistics current.
Avoid redundant indexes. Multiple indexes with the same leading column waste storage and slow writes.

References

For more information, see the Pivotal Greenplum documentation.