Dictionary encoding compresses low-cardinality string columns into compact integer keys, reducing storage footprint and accelerating filter, aggregate, and sort queries on those columns.
Supported versions
Dictionary encoding requires an AnalyticDB for PostgreSQL V7.0.x instance in elastic storage mode.
Dictionary encoding is generally available starting from V7.0.6.2, which resolves issues found during the public preview. Update your instance to V7.0.6.2 or later before using this feature in production.
How it works
Dictionary encoding assigns each unique string value in a column a short integer key. The query engine then operates on these compact keys rather than the original strings. A Decode operator in the execution plan translates keys back to strings when results are returned.
This approach works best on low-cardinality columns — columns that hold a bounded, repeating set of values. Good candidates include country names, status codes, and category labels (for example, ALGERIA, ARGENTINA, BRAZIL). Poor candidates are columns with mostly unique values, such as free-text comments or UUIDs, because the dictionary grows too large to be effective.
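The mechanism can be illustrated with a short conceptual sketch in Python. This is not the Beam engine's implementation; it only shows why an order-preserving dictionary lets range filters run on integer keys (the idea behind the `ordered_encode()` function that appears in execution plans later in this topic):

```python
# Conceptual sketch of dictionary encoding, not the engine's actual code.
def build_dictionary(values):
    # Assign keys in sorted order so that integer-key order matches
    # string order; range filters can then compare keys directly.
    return {v: i for i, v in enumerate(sorted(set(values)))}

column = ["BRAZIL", "ALGERIA", "BRAZIL", "CANADA", "ALGERIA"]
dictionary = build_dictionary(column)      # ALGERIA=0, BRAZIL=1, CANADA=2
encoded = [dictionary[v] for v in column]  # column stored as small integers

# A filter such as n_name > 'ALGERIA' becomes an integer comparison.
key = dictionary["ALGERIA"]
matches = [code for code in encoded if code > key]

# A decode step maps keys back to strings for the result set.
reverse = {i: v for v, i in dictionary.items()}
decoded = [reverse[code] for code in matches]
```

Because the keys are assigned in sorted order, the integer comparison selects exactly the rows the string comparison would, and decoding is only needed once, when results are returned.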
Apply dictionary encoding
Apply encoding to a table
To enable dictionary encoding on all eligible string columns in a table, set compresstype='gdict' in the USING beam WITH (...) clause:
```sql
CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL,
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'gdict')
DISTRIBUTED BY (n_nationkey);
```

Non-string columns are not compressed when you use table-level dictionary encoding.
Apply encoding to a specific column
To apply dictionary encoding to one column while using a different algorithm for the rest of the table, use the ENCODING clause on the column definition:
```sql
CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL ENCODING (compresstype = 'gdict'),
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'lz4', compresslevel = 9)
DISTRIBUTED BY (n_nationkey);
```

In this example, n_name uses dictionary encoding while the rest of the table uses LZ4 compression.
Compression results
The following table shows the compression effect on the n_name column (type CHARACTER(25)) from the TPC-H nation table. The CHARACTER(25) type pads values shorter than 25 characters with spaces, so each entry occupies a fixed 25 bytes before compression.
| String value | Original size (bytes) | Dictionary value | Compressed size (bytes) |
|---|---|---|---|
| ALGERIA | 25 | 0 | 2 |
| ARGENTINA | 25 | 1 | 2 |
| BRAZIL | 25 | 2 | 2 |
| CANADA | 25 | 3 | 2 |
| CHINA | 25 | 4 | 2 |
| EGYPT | 25 | 5 | 2 |
| ETHIOPIA | 25 | 6 | 2 |
| FRANCE | 25 | 7 | 2 |
| Total | 200 | / | 16 |
Dictionary encoding compresses each n_name value from 25 bytes to 2 bytes, a 12.5x reduction (200 bytes down to 16 bytes for the eight values shown).
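The arithmetic behind the table can be checked in a few lines. The 2-byte key width is taken from the table above; actual on-disk key width depends on the engine and dictionary size:

```python
# Back-of-the-envelope check of the compression table: eight fixed-width
# CHARACTER(25) values versus 2-byte dictionary keys (key width assumed
# from the table above, not a guarantee of the storage format).
values = ["ALGERIA", "ARGENTINA", "BRAZIL", "CANADA",
          "CHINA", "EGYPT", "ETHIOPIA", "FRANCE"]

original_bytes = len(values) * 25    # fixed 25 bytes per padded value
compressed_bytes = len(values) * 2   # 2-byte dictionary key per value
ratio = original_bytes / compressed_bytes
```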
Accelerate queries with dictionary encoding
After applying dictionary encoding, enable the query acceleration feature so the optimizer can exploit the encoded representation during query execution. In filtering, aggregating, and sorting scenarios on low-cardinality string columns, this can improve query performance by 10% to 200% depending on the SQL statement.
Set adbpg_enable_encode_optimize to ON to enable acceleration. The default is OFF.
```sql
SET adbpg_enable_encode_optimize TO ON;
```

With acceleration enabled, the optimizer automatically generates execution plans that operate on the encoded integer values. The following example query filters and aggregates on n_name:
```sql
SELECT
    n_name,
    max(n_regionkey)
FROM
    nation
WHERE
    n_name > 'ALGERIA'
    AND n_name < 'FRANCE'
GROUP BY
    n_name;
```

Run EXPLAIN to inspect the execution plan:
```
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=76.72..78.27 rows=81 width=36)
   ->  GroupAggregate  (cost=76.72..77.19 rows=27 width=36)
         Group Key: n_name
         ->  Sort  (cost=76.72..76.79 rows=28 width=36)
               Sort Key: n_name
               ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..76.05 rows=28 width=36)
                     Hash Key: n_name
                     ->  Decode  (cost=0.00..75.50 rows=28 width=36)
                           Params: dictKeyIdx[1], dictID[0]
                           ->  Seq Scan on nation  (cost=0.00..75.50 rows=28 width=36)
                                 Filter: ((ordered_encode(n_name) > ('ALGERIA'::text)::integer with dict ID 0) AND (ordered_encode(n_name) < ('FRANCE'::text)::integer with dict ID 0))
 Optimizer: Postgres query optimizer
(12 rows)
```

The filter expression uses ordered_encode() to compare integer keys instead of strings. The Decode operator translates the keys back to the original string values before results are returned.
Use automatic compression
If you are not sure whether dictionary encoding is the right choice for a column, set compresstype='auto' when creating a Beam table. The Beam storage engine analyzes each column and selects the most appropriate compression algorithm automatically.
```sql
CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL,
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'auto')
DISTRIBUTED BY (n_nationkey);
```

The following table shows how automatic compression selects encodings for the nation table:
| Column | Data type | Encoding method | Notes |
|---|---|---|---|
| n_nationkey | Integer | LZ4 or integer encoding | Default for integer columns |
| n_name | Character(25) | Dictionary encoding | Fixed-length strings with a bounded set of country names — a good fit for dictionary encoding |
| n_regionkey | Integer | LZ4 or integer encoding | Default for integer columns |
| n_comment | Character varying(152) | Dictionary encoding or LZ4 | The n_comment column contains mostly unique strings. If the number of distinct values exceeds the encoding threshold of 255, LZ4 encoding is used for this column instead of dictionary encoding. |
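The selection logic can be sketched as a simple heuristic. Everything here except the 255 threshold (stated in the table above) is illustrative; the engine's actual analysis is internal and may weigh more factors:

```python
# Hypothetical sketch of the kind of decision 'auto' compression makes.
# The 255 threshold comes from the table above; the rest is illustrative.
ENCODING_THRESHOLD = 255

def choose_compression(column_type, distinct_count):
    if column_type == "integer":
        # Integer columns keep their default encoding.
        return "lz4-or-integer-encoding"
    # String columns: dictionary encoding pays off only when the value
    # set is bounded; otherwise fall back to LZ4.
    if distinct_count <= ENCODING_THRESHOLD:
        return "dictionary"
    return "lz4"

choose_compression("character(25)", 25)               # n_name: 'dictionary'
choose_compression("character varying(152)", 100000)  # n_comment: 'lz4'
```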
Delta support for dictionary encoding
The Beam storage engine uses two complementary storage layers:
- Delta storage (row-oriented): handles real-time writes.
- Base storage (PAX-based, column-oriented): handles batch writes and large-scale scans.
Hybrid row-column Base storage is supported on V7.0.2.3 and later. Row-oriented Delta storage for dictionary encoding is supported on V7.0.4.0 and later.
The following example creates a Beam table with automatic compression, inserts data into Delta storage, and enables dictionary encoding acceleration. The EXPLAIN output shows that the Decode operator is applied even to data in Delta storage.
```sql
CREATE TABLE test_auto(a int, b text)
USING beam WITH (compresstype = 'auto', compresslevel = 5);
INSERT INTO test_auto VALUES (1,'adbpg1'),(2,'adbpg2'),(3,'adbpg3'),(10,'adbpg10');
SET adbpg_enable_encode_optimize TO ON;
EXPLAIN SELECT * FROM test_auto WHERE b = 'adbpg1';
```

```
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..2.04 rows=1 width=36)
   ->  Decode  (cost=0.00..2.02 rows=1 width=36)
         Params: dictKeyIdx[2], dictID[0]
         ->  Seq Scan on test_auto  (cost=0.00..2.02 rows=1 width=36)
               Filter: (encode(b) = ('adbpg1'::text)::integer with dict ID 0)
 Optimizer: Postgres-based planner
```

The filter uses the encode() function to match the encoded integer key, and the Decode operator restores the original string values in the output.