Dictionary encoding compresses low-cardinality string columns into compact integer keys, reducing storage footprint and accelerating filter, aggregate, and sort queries on those columns.
Supported versions
Dictionary encoding requires an AnalyticDB for PostgreSQL V7.0.x instance in elastic storage mode.
Dictionary encoding is generally available starting from V7.0.6.2, which resolves issues found during the public preview. Update your instance to V7.0.6.2 or later before using this feature in production.
How it works
Dictionary encoding assigns each unique string value in a column a short integer key. The query engine then operates on these compact keys rather than the original strings. A Decode operator in the execution plan translates keys back to strings when results are returned.
This approach works best on low-cardinality columns — columns that hold a bounded, repeating set of values. Good candidates include country names, status codes, and category labels (for example, ALGERIA, ARGENTINA, BRAZIL). Poor candidates are columns with mostly unique values, such as free-text comments or UUIDs, because the dictionary grows too large to be effective.
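The mechanism can be illustrated with a short conceptual sketch in Python. This is not the Beam engine's implementation; it only shows why an order-preserving dictionary lets range filters run on integer keys (the idea behind the `ordered_encode()` function that appears in execution plans later in this topic):

```python
# Conceptual sketch of dictionary encoding, not the engine's actual code.
def build_dictionary(values):
    # Assign keys in sorted order so that integer-key order matches
    # string order; range filters can then compare keys directly.
    return {v: i for i, v in enumerate(sorted(set(values)))}

column = ["BRAZIL", "ALGERIA", "BRAZIL", "CANADA", "ALGERIA"]
dictionary = build_dictionary(column)      # ALGERIA=0, BRAZIL=1, CANADA=2
encoded = [dictionary[v] for v in column]  # column stored as small integers

# A filter such as n_name > 'ALGERIA' becomes an integer comparison.
key = dictionary["ALGERIA"]
matches = [code for code in encoded if code > key]

# A decode step maps keys back to strings for the result set.
reverse = {i: v for v, i in dictionary.items()}
decoded = [reverse[code] for code in matches]
```

Because the keys are assigned in sorted order, the integer comparison selects exactly the rows the string comparison would, and decoding is only needed once, when results are returned.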
Apply dictionary encoding
Apply encoding to a table
To enable dictionary encoding on all eligible string columns in a table, set compresstype='gdict' in the USING beam WITH (...) clause:
```sql
CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL,
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'gdict')
DISTRIBUTED BY (n_nationkey);
```

Non-string columns are not compressed when you use table-level dictionary encoding.
Apply encoding to a specific column
To apply dictionary encoding to one column while using a different algorithm for the rest of the table, use the ENCODING clause on the column definition:
```sql
CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL ENCODING (compresstype = 'gdict'),
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'lz4', compresslevel = 9)
DISTRIBUTED BY (n_nationkey);
```

In this example, n_name uses dictionary encoding while the rest of the table uses LZ4 compression.
Compression results
The following table shows the compression effect on the n_name column (type CHARACTER(25)) from the TPC-H nation table. The CHARACTER(25) type pads values shorter than 25 characters with spaces, so each entry occupies a fixed 25 bytes before compression.
| String value | Original size (bytes) | Dictionary value | Compressed size (bytes) |
|---|---|---|---|
| ALGERIA | 25 | 0 | 2 |
| ARGENTINA | 25 | 1 | 2 |
| BRAZIL | 25 | 2 | 2 |
| CANADA | 25 | 3 | 2 |
| CHINA | 25 | 4 | 2 |
| EGYPT | 25 | 5 | 2 |
| ETHIOPIA | 25 | 6 | 2 |
| FRANCE | 25 | 7 | 2 |
| Total | 200 | / | 16 |
Dictionary encoding compresses each n_name value from 25 bytes to 2 bytes, a 12.5x reduction (200 bytes down to 16 bytes for the eight values shown).
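The arithmetic behind the table can be checked in a few lines. The 2-byte key width is taken from the table above; actual on-disk key width depends on the engine and dictionary size:

```python
# Back-of-the-envelope check of the compression table: eight fixed-width
# CHARACTER(25) values versus 2-byte dictionary keys (key width assumed
# from the table above, not a guarantee of the storage format).
values = ["ALGERIA", "ARGENTINA", "BRAZIL", "CANADA",
          "CHINA", "EGYPT", "ETHIOPIA", "FRANCE"]

original_bytes = len(values) * 25    # fixed 25 bytes per padded value
compressed_bytes = len(values) * 2   # 2-byte dictionary key per value
ratio = original_bytes / compressed_bytes
```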
Accelerate queries with dictionary encoding
After applying dictionary encoding, enable the query acceleration feature so the optimizer can exploit the encoded representation during query execution. In filtering, aggregating, and sorting scenarios on low-cardinality string columns, this can improve query performance by 10% to 200% depending on the SQL statement.
Set adbpg_enable_encode_optimize to ON to enable acceleration. The default is OFF.
```sql
SET adbpg_enable_encode_optimize TO ON;
```

With acceleration enabled, the optimizer automatically generates execution plans that operate on the encoded integer values. The following example query filters and aggregates on n_name:
```sql
SELECT
    n_name,
    max(n_regionkey)
FROM
    nation
WHERE
    n_name > 'ALGERIA'
    AND n_name < 'FRANCE'
GROUP BY
    n_name;
```

Run EXPLAIN to inspect the execution plan:
```
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=76.72..78.27 rows=81 width=36)
   ->  GroupAggregate  (cost=76.72..77.19 rows=27 width=36)
         Group Key: n_name
         ->  Sort  (cost=76.72..76.79 rows=28 width=36)
               Sort Key: n_name
               ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..76.05 rows=28 width=36)
                     Hash Key: n_name
                     ->  Decode  (cost=0.00..75.50 rows=28 width=36)
                           Params: dictKeyIdx[1], dictID[0]
                           ->  Seq Scan on nation  (cost=0.00..75.50 rows=28 width=36)
                                 Filter: ((ordered_encode(n_name) > ('ALGERIA'::text)::integer with dict ID 0) AND (ordered_encode(n_name) < ('FRANCE'::text)::integer with dict ID 0))
 Optimizer: Postgres query optimizer
(12 rows)
```

The filter expression uses ordered_encode() to compare integer keys instead of strings. The Decode operator translates the keys back to the original string values before results are returned.
Use automatic compression
If you are not sure whether dictionary encoding is the right choice for a column, set compresstype='auto' when creating a Beam table. The Beam storage engine analyzes each column and selects the most appropriate compression algorithm automatically.
```sql
CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL,
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'auto')
DISTRIBUTED BY (n_nationkey);
```

The following table shows how automatic compression selects encodings for the nation table:
| Column | Data type | Encoding method | Notes |
|---|---|---|---|
| n_nationkey | Integer | LZ4 or integer encoding | Default for integer columns |
| n_name | Character(25) | Dictionary encoding | Fixed-length strings with a bounded set of country names — a good fit for dictionary encoding |
| n_regionkey | Integer | LZ4 or integer encoding | Default for integer columns |
| n_comment | Character varying(152) | Dictionary encoding or LZ4 | The n_comment column contains mostly unique strings. If the number of distinct values exceeds the encoding threshold of 255, LZ4 encoding is used for this column instead of dictionary encoding. |
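The selection logic can be sketched as a simple heuristic. Everything here except the 255 threshold (stated in the table above) is illustrative; the engine's actual analysis is internal and may weigh more factors:

```python
# Hypothetical sketch of the kind of decision 'auto' compression makes.
# The 255 threshold comes from the table above; the rest is illustrative.
ENCODING_THRESHOLD = 255

def choose_compression(column_type, distinct_count):
    if column_type == "integer":
        # Integer columns keep their default encoding.
        return "lz4-or-integer-encoding"
    # String columns: dictionary encoding pays off only when the value
    # set is bounded; otherwise fall back to LZ4.
    if distinct_count <= ENCODING_THRESHOLD:
        return "dictionary"
    return "lz4"

choose_compression("character(25)", 25)               # n_name: 'dictionary'
choose_compression("character varying(152)", 100000)  # n_comment: 'lz4'
```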
Delta support for dictionary encoding
The Beam storage engine uses two complementary storage layers:
- Delta storage (row-oriented): handles real-time writes.
- Base storage (PAX-based, column-oriented): handles batch writes and large-scale scans.
Hybrid row-column Base storage is supported on V7.0.2.3 and later. Row-oriented Delta storage for dictionary encoding is supported on V7.0.4.0 and later.
The following example creates a Beam table with automatic compression, inserts data into Delta storage, and enables dictionary encoding acceleration. The EXPLAIN output shows that the Decode operator is applied even to data in Delta storage.
```sql
CREATE TABLE test_auto(a int, b text)
USING beam WITH (compresstype = 'auto', compresslevel = 5);
INSERT INTO test_auto VALUES (1,'adbpg1'),(2,'adbpg2'),(3,'adbpg3'),(10,'adbpg10');
SET adbpg_enable_encode_optimize TO ON;
EXPLAIN SELECT * FROM test_auto WHERE b = 'adbpg1';
```

```
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..2.04 rows=1 width=36)
   ->  Decode  (cost=0.00..2.02 rows=1 width=36)
         Params: dictKeyIdx[2], dictID[0]
         ->  Seq Scan on test_auto  (cost=0.00..2.02 rows=1 width=36)
               Filter: (encode(b) = ('adbpg1'::text)::integer with dict ID 0)
 Optimizer: Postgres-based planner
```

The filter uses the encode() function to match the encoded integer key, and the Decode operator restores the original string values in the output.