Configure collation to manage data sorting for PostgreSQL - ApsaraDB RDS

Different languages order text differently: German sorts accented letters alongside their unaccented counterparts, while Swedish places them at the end of the alphabet. Configuring the right collation ensures that queries and ORDER BY results reflect the language expectations of your application.

On ApsaraDB RDS for PostgreSQL, you can set collations at three levels:

Database level — set when you create the database; applies to all columns by default
Column level — override the database default for individual columns
Query level — override the column collation for a single query or comparison

Locale variables

PostgreSQL uses POSIX locale categories to control locale-sensitive behavior. The following categories are most relevant to text handling:

Variable	Controls
`LC_COLLATE`	String sort order
`LC_CTYPE`	Character classification (what counts as a letter, and what is its uppercase equivalent)
`LC_MESSAGES`	Message language
`LC_MONETARY`	Currency format
`LC_NUMERIC`	Numeric value format
`LC_TIME`	Date and time format

Supported character sets

See Character Set Support in the PostgreSQL documentation.

Query supported collations

Run the following query to see which collations are available for a given character set. The results come from the pg_collation system table.

select pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype from pg_collation ;

If the encoding column is empty, the collation supports all character sets.

Sample output (869 rows total):

  encoding  |       collname        |      collcollate      |       collctype
------------+-----------------------+-----------------------+-----------------------
            | default               |                       |
            | C                     | C                     | C
            | POSIX                 | POSIX                 | POSIX
 UTF8       | aa_DJ                 | aa_DJ.utf8            | aa_DJ.utf8
 LATIN1     | aa_DJ                 | aa_DJ                 | aa_DJ
 LATIN1     | aa_DJ.iso88591        | aa_DJ.iso88591        | aa_DJ.iso88591
 UTF8       | aa_DJ.utf8            | aa_DJ.utf8            | aa_DJ.utf8
 UTF8       | aa_ER                 | aa_ER                 | aa_ER
 UTF8       | aa_ER.utf8            | aa_ER.utf8            | aa_ER.utf8
.......
 EUC_CN     | zh_CN                 | zh_CN                 | zh_CN
 UTF8       | zh_CN                 | zh_CN.utf8            | zh_CN.utf8
 EUC_CN     | zh_CN.gb2312          | zh_CN.gb2312          | zh_CN.gb2312
 UTF8       | zh_CN.utf8            | zh_CN.utf8            | zh_CN.utf8
 UTF8       | zh_HK                 | zh_HK.utf8            | zh_HK.utf8
 UTF8       | zh_HK.utf8            | zh_HK.utf8            | zh_HK.utf8
 EUC_CN     | zh_SG                 | zh_SG                 | zh_SG
 UTF8       | zh_SG                 | zh_SG.utf8            | zh_SG.utf8
 EUC_CN     | zh_SG.gb2312          | zh_SG.gb2312          | zh_SG.gb2312
 UTF8       | zh_SG.utf8            | zh_SG.utf8            | zh_SG.utf8
 EUC_TW     | zh_TW                 | zh_TW.euctw           | zh_TW.euctw
 UTF8       | zh_TW                 | zh_TW.utf8            | zh_TW.utf8
 EUC_TW     | zh_TW.euctw           | zh_TW.euctw           | zh_TW.euctw
 UTF8       | zh_TW.utf8            | zh_TW.utf8            | zh_TW.utf8
 UTF8       | zu_ZA                 | zu_ZA.utf8            | zu_ZA.utf8
 LATIN1     | zu_ZA                 | zu_ZA                 | zu_ZA
 LATIN1     | zu_ZA.iso88591        | zu_ZA.iso88591        | zu_ZA.iso88591
 UTF8       | zu_ZA.utf8            | zu_ZA.utf8            | zu_ZA.utf8
(869 rows)

Set collations at the column level

Prerequisites

Before you begin, ensure that you have:

An ApsaraDB RDS for PostgreSQL instance
The encoding of the target database (run the query below to check)

Query the encoding of your databases:

select datname,pg_encoding_to_char(encoding) as encoding from pg_database;

Sample output:

      datname       | encoding
--------------------+-----------
 template1          | UTF8
 template0          | UTF8
 db                 | SQL_ASCII
 db1                | EUC_CN
 contrib_regression | UTF8
 test01             | UTF8
 test02             | UTF8
 postgres           | UTF8
(8 rows)

Identify the encoding of your database, then use pg_collation to confirm which collations that encoding supports.

Assign collations when creating a table

Specify a collation per column in CREATE TABLE:

CREATE TABLE test1 (
 a text COLLATE "de_DE",
 b text COLLATE "es_ES"
);

Change the collation of an existing column

Use ALTER TABLE to update a column's collation:

alter table test1 alter a type text COLLATE "en_US";

Note

When you modify the collation of a column in a table, the table is rewritten. Proceed with caution if the table is large.

Override collation at query time

Override the collation for a single ORDER BY clause without changing table or column definitions:

select * from <table_name> order by <column_name> collate "C";

Override collation in a comparison operator:

select * from <table_name> where <column_name> > 'Tom' collate "C";

Use a collation index for better performance

An index can accelerate collation-based sorts, but only when the collation in the ORDER BY clause matches the collation of the index. Create the index with a specific collation:

create index idxa on <table_name>(<column_name> collate "en_US");
explain select * from <table_name> order by <column_name> collate "en_US";

Sample output:

QUERY PLAN
------------------------------------------------------------------------
 Index Only Scan using idxa on <table_name> (cost=0.15..31.55 rows=1360 width=64)
(1 row)

Sort results in alphabetical order

Four methods are available. Choose based on whether you can modify the data or the schema.

Method	Modifies data?	Modifies schema?	Best for
Per-query `COLLATE` in `ORDER BY`	No	No	One-off queries
Column-level collation (`ALTER TABLE`)	Yes (table rewrite)	Yes	Permanent, column-scoped sorting
Collation index + `ORDER BY`	No	Yes (index only)	High-frequency sorted queries
Database-level collation at creation	No	Yes (database-wide)	Uniform sorting across all columns

Method 1: Per-query collation (no schema or data changes)

select * from <table_name> order by <column_name> collate "en_US";

Method 2: Column-level collation (permanent, requires table rewrite)

alter table <table_name> alter <column_name> type text COLLATE "en_US";

Method 3: Collation index (no data changes, index only)

create index idxa on <table_name>(<column_name> collate "en_US");
explain select * from <table_name> order by <column_name> collate "en_US";

Sample output:

QUERY PLAN
  ------------------------------------------------------------------------
   Index Only Scan using idxa on <table_name>  (cost=0.15..31.55 rows=1360 width=64)
  (1 row)

Method 4: Database-level collation at creation (applies to all columns)

Set lc_collate and lc_ctype when creating the database. All text data in the database is then sorted by that collation by default.

create database test03 encoding 'UTF8' lc_collate 'en_US.utf8' lc_ctype 'en_US.utf8'  template template0;
\c test03
select * from (values ('Alice'),('Tom')) as a(c1) order by c1 ;

Sample output:

  c1
--------
 Alice
 Tom
(2 rows)

Note

Chinese characters can have multiple pronunciations. For example, the city name Chongqing may be sorted under "Z" (Zhongqing) in some pronunciation-based collations. Exercise caution when using pronunciation-based collations for Chinese text.

Sort alphabetically in Greenplum

Greenplum does not support per-column collation settings. As a workaround, convert strings to a target encoding and sort in binary order to approximate alphabetical results:

select * from (values ('Alice'), ('Tom')) t(id) order by byteain(textout(convert(id,'UTF8','EUC_CN')));

Sample output:

   id
--------
 Alice
 Tom
(2 rows)

References

PostgreSQL 15 documentation - Chapter 23. Localization