edit-icon download-icon

Set the localized information and string sorting rules for a database

Last Updated: Mar 31, 2018

When initializing a database cluster, you can set the following parameters to specify the string sorting rules, the string classification method, and the formats of numeric values, dates, time, and currencies. LC_COLLATE and LC_CTYPE are introduced in database initialization to accommodate the needs of international users.

LC_COLLATE String sort order
LC_CTYPE Character classification (What is a letter? Its upper-case equivalent?)
LC_MESSAGES Language of messages
LC_MONETARY Formatting of currency amounts
LC_NUMERIC Formatting of numbers
LC_TIME Formatting of dates and times

You can use those features to output orders or formats based on localization requirements. This article describes how to set the localized information for a database and how to sort output results in the Chinese Pinyin alphabetic order.

Character set types supported by PostgreSQL

The following table lists the character set types supported by PostgreSQL. Server=Yes indicates that the character set can be used by the CREATE DATABASE command, and Server=No indicates that the character set can only be used on a client. For more information, see the PostgreSQL official document.

Name Description Language Server? Bytes/Char Aliases
BIG5 Big Five Traditional Chinese No 1-2 WIN950, Windows950
EUC_CN Extended UNIX Code-CN Simplified Chinese Yes 1-3 -
EUC_JP Extended UNIX Code-JP Japanese Yes 1-3 -
EUC_JIS_2004 Extended UNIX Code-JP, JIS X 0213 Japanese Yes 1-3 -
EUC_KR Extended UNIX Code-KR Korean Yes 1-3 -
EUC_TW Extended UNIX Code-TW Traditional Chinese, Taiwanese Yes 1-3 -
GB18030 National Standard Chinese No 1-4 -
GBK Extended National Standard Simplified Chinese No 1-2 WIN936, Windows936
ISO_8859_5 ISO 8859-5, ECMA 113 Latin/Cyrillic Yes 1 -
ISO_8859_6 ISO 8859-6, ECMA 114 Latin/Arabic Yes 1 -
ISO_8859_7 ISO 8859-7, ECMA 118 Latin/Greek Yes 1 -
ISO_8859_8 ISO 8859-8, ECMA 121 Latin/Hebrew Yes 1 -
JOHAB JOHAB Korean (Hangul) No 1-3 -
KOI8R KOI8-R Cyrillic (Russian) Yes 1 KOI8
KOI8U KOI8-U Cyrillic (Ukrainian) Yes 1 -
LATIN1 ISO 8859-1, ECMA 94 Western European Yes 1 ISO88591
LATIN2 ISO 8859-2, ECMA 94 Central European Yes 1 ISO88592
LATIN3 ISO 8859-3, ECMA 94 South European Yes 1 ISO88593
LATIN4 ISO 8859-4, ECMA 94 North European Yes 1 ISO88594
LATIN5 ISO 8859-9, ECMA 128 Turkish Yes 1 ISO88599
LATIN6 ISO 8859-10, ECMA 144 Nordic Yes 1 ISO885910
LATIN7 ISO 8859-13 Baltic Yes 1 ISO885913
LATIN8 ISO 8859-14 Celtic Yes 1 ISO885914
LATIN9 ISO 8859-15 LATIN1 with Euro and accents Yes 1 ISO885915
LATIN10 ISO 8859-16, ASRO SR 14111 Romanian Yes 1 ISO885916
MULE_INTERNAL Mule internal code Multilingual Emacs Yes 1-4 -
SJIS Shift JIS Japanese No 1-2 Mskanji, ShiftJIS, WIN932, Windows932
SHIFT_JIS_2004 Shift JIS, JIS X 0213 Japanese No 1-2 -
SQL_ASCII unspecified (see text) any Yes 1 -
UHC Unified Hangul Code Korean No 1-2 WIN949, Windows949
UTF8 Unicode, 8-bit all Yes 1-4 Unicode
WIN866 Windows CP866 Cyrillic Yes 1 ALT
WIN874 Windows CP874 Thai Yes 1 -
WIN1250 Windows CP1250 Central European Yes 1 -
WIN1251 Windows CP1251 Cyrillic Yes 1 WIN
WIN1252 Windows CP1252 Western European Yes 1 -
WIN1253 Windows CP1253 Greek Yes 1 -
WIN1254 Windows CP1254 Turkish Yes 1 -
WIN1255 Windows CP1255 Hebrew Yes 1 -
WIN1256 Windows CP1256 Arabic Yes 1 -
WIN1257 Windows CP1257 Baltic Yes 1 -
WIN1258 Windows CP1258 Vietnamese Yes 1 ABC, TCVN, TCVN5712, VSCII

Query the LC_COLLATE and LC_CTYPE supported by character sets

Query the pg_collation system table by running the following SQL statement to obtain the LC_COLLATE and LC_CTYPE supported by character sets.

  1. test=> select pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype from pg_collation ;

The returned results are as follows. When encoding is null, the collation supports all character sets.

  1. encoding | collname | collcollate | collctype
  2. ------------+-----------------------+-----------------------+-----------------------
  3. | default | |
  4. | C | C | C
  5. | POSIX | POSIX | POSIX
  6. UTF8 | aa_DJ | aa_DJ.utf8 | aa_DJ.utf8
  7. LATIN1 | aa_DJ | aa_DJ | aa_DJ
  8. LATIN1 | aa_DJ.iso88591 | aa_DJ.iso88591 | aa_DJ.iso88591
  9. UTF8 | aa_DJ.utf8 | aa_DJ.utf8 | aa_DJ.utf8
  10. UTF8 | aa_ER | aa_ER | aa_ER
  11. UTF8 | aa_ER.utf8 | aa_ER.utf8 | aa_ER.utf8
  12. .......
  13. EUC_CN | zh_CN | zh_CN | zh_CN
  14. UTF8 | zh_CN | zh_CN.utf8 | zh_CN.utf8
  15. EUC_CN | zh_CN.gb2312 | zh_CN.gb2312 | zh_CN.gb2312
  16. UTF8 | zh_CN.utf8 | zh_CN.utf8 | zh_CN.utf8
  17. UTF8 | zh_HK | zh_HK.utf8 | zh_HK.utf8
  18. UTF8 | zh_HK.utf8 | zh_HK.utf8 | zh_HK.utf8
  19. EUC_CN | zh_SG | zh_SG | zh_SG
  20. UTF8 | zh_SG | zh_SG.utf8 | zh_SG.utf8
  21. EUC_CN | zh_SG.gb2312 | zh_SG.gb2312 | zh_SG.gb2312
  22. UTF8 | zh_SG.utf8 | zh_SG.utf8 | zh_SG.utf8
  23. EUC_TW | zh_TW | zh_TW.euctw | zh_TW.euctw
  24. UTF8 | zh_TW | zh_TW.utf8 | zh_TW.utf8
  25. EUC_TW | zh_TW.euctw | zh_TW.euctw | zh_TW.euctw
  26. UTF8 | zh_TW.utf8 | zh_TW.utf8 | zh_TW.utf8
  27. UTF8 | zu_ZA | zu_ZA.utf8 | zu_ZA.utf8
  28. LATIN1 | zu_ZA | zu_ZA | zu_ZA
  29. LATIN1 | zu_ZA.iso88591 | zu_ZA.iso88591 | zu_ZA.iso88591
  30. UTF8 | zu_ZA.utf8 | zu_ZA.utf8 | zu_ZA.utf8
  31. (869 rows)

Set the localized information (COLLATE) for a database

For instructions on how to specify character sets and to set LC_COLLATE and LC_CTYPE, see CREATE DATABASE usage instructions.

Set field localization

Prerequisites

Run the following SQL command to query the character set (encoding) type for the database and identify the COLLATE that is compatible with the character set of the database.

  1. postgres=# select datname,pg_encoding_to_char(encoding) as encoding from pg_database;

The returned results are as follows:

  1. datname | encoding
  2. --------------------+-----------
  3. template1 | UTF8
  4. template0 | UTF8
  5. db | SQL_ASCII
  6. db1 | EUC_CN
  7. contrib_regression | UTF8
  8. test01 | UTF8
  9. test02 | UTF8
  10. postgres | UTF8
  11. (8 rows)

Procedure

  1. When creating a table, run the following command to specify the COLLATE that is compatible with the current character set.

    1. CREATE TABLE test1 (
    2. a text COLLATE "de_DE",
    3. b text COLLATE "es_ES",
    4. ...
    5. );
  2. Run the following command to modify COLLATE.

    Note: The table is rewritten when COLLATE is modified. Exercise with caution when the table is large.

    1. alter table a alter c1 type text COLLATE "zh_CN";

Perform localization in SQL

  • Perform localization to change the sorting of the results returned by the ORDER BY clause.

    1. test=# select * from a order by c1 collate "C";
    2. c1
    3. --------
    4. Liu Shaoqi
    5. Liu Dehua
    6. (2 rows)
    7. test=# select * from a order by c1 collate "zh_CN";
    8. c1
    9. --------
    10. Liu Dehua
    11. Liu Shaoqi
    12. (2 rows)
  • Perform localization to change the operator results.

    1. test=# select * from a where c1 > 'Liu Shaoqi' collate "C";
    2. c1
    3. --------
    4. Liu Dehua
    5. (1 row)
    6. test=# select * from a where c1 > 'Liu Shaoqi' collate "zh_CN";
    7. c1
    8. ----
    9. (0 rows)

Use localized indexes for sorting

To use a specific index for sorting, make sure that the sorting statement and the index have the same COLLATE value.

  1. postgres=# create index idxa on a(c1 collate "zh_CN");
  2. CREATE INDEX
  3. postgres=# explain select * from a order by c1 collate "zh_CN";
  4. QUERY PLAN
  5. ------------------------------------------------------------------------
  6. Index Only Scan using idxa on a (cost=0.15..31.55 rows=1360 width=64)
  7. (1 row)

Sort output results in the Chinese Pinyin alphabetic order

Use any of the following four methods for sorting in the Chinese Pinyin alphabetic order:

  • Perform localization in SQL. This method does not require modifying original data.

    1. test=# select * from a order by c1 collate "zh_CN";
    2. c1
    3. --------
    4. Liu Dehua
    5. Liu Shaoqi
    6. (2 rows)
  • Use localized fields. This method requires adjusting existing data.

    1. alter table a alter c1 type text COLLATE "zh_CN";
  • Use localized indexes and perform localization in SQL. This method does not require modifying original data.

    1. postgres=# create index idxa on a(c1 collate "zh_CN");
    2. CREATE INDEX
    3. postgres=# explain select * from a order by c1 collate "zh_CN";
    4. QUERY PLAN
    5. ------------------------------------------------------------------------
    6. Index Only Scan using idxa on a (cost=0.15..31.55 rows=1360 width=64)
    7. (1 row)
  • Set COLLATE of the database to zh_CN, and you can sort data in the Chinese Pinyin alphabetic order based on that COLLATE.

    1. test02=# create database test03 encoding 'UTF8' lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8' template template0;
    2. CREATE DATABASE
    3. test02=# \c test03
    4. You are now connected to database "test03" as user "postgres".
    5. test03=# select * from (values ('Liu Dehua'),('Liu Shaoqi')) as a(c1) order by c1 ;
    6. c1
    7. --------
    8. Liu Dehua
    9. Liu Shaoqi
    10. (2 rows)

Sort output results in the Chinese Pinyin alphabetic order in Greenplum

Greenplum does not support single-column COLLATE setting, which results in a slight difference in Pinyin-based sorting.

In Greenplum, you can perform conversion using a character set to sort data in binary mode, which has the same effect of Pinyin-based sorting. See the following example.

  1. postgres=# select * from (values ('Liu Dehua'), ('Liu Shaoqi')) t(id) order by byteain(textout(convert(id,'UTF8','EUC_CN')));
  2. id
  3. --------
  4. Liu Dehua
  5. Liu Shaoqi
  6. (2 rows)

References

Thank you! We've received your feedback.