PolarDB:Parsers

Last Updated:Mar 28, 2026

A text search parser splits a raw document into tokens and assigns a type to each token. The parser does not modify the text; it only identifies plausible word boundaries.

In the full-text search pipeline, parsing is the first step:

document → parser → tokens → dictionary → lexemes → tsvector

Because parsers have a narrow role, custom parsers are rarely needed. Custom dictionaries are far more common. PolarDB for PostgreSQL provides one built-in parser: pg_catalog.default.
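
As a rough illustration of the pipeline above, the built-in ts_parse function exposes the parser stage on its own, while to_tsvector runs every stage. The sample sentence and the english configuration are assumptions chosen for illustration, not values taken from this page.

```sql
-- Stage 1 only: the parser returns (tokid, token) pairs and leaves the text unchanged.
SELECT tokid, token
FROM ts_parse('default', 'The quick brown foxes');

-- Full pipeline: parser, then dictionaries, then the final tsvector.
-- 'english' is an assumed configuration; substitute the one your database uses.
SELECT to_tsvector('english', 'The quick brown foxes');
```

Comparing the two results shows which changes come from the dictionaries (normalization, stop-word removal) rather than from the parser.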

Token types

pg_catalog.default recognizes 23 token types.

 Alias           | Description                              | Example
-----------------+------------------------------------------+----------------------------------------------------------
 asciiword       | Word, all ASCII letters                  | elephant
 word            | Word, all letters                        | mañana
 numword         | Word, letters and digits                 | beta1
 asciihword      | Hyphenated word, all ASCII               | up-to-date
 hword           | Hyphenated word, all letters             | lógico-matemática
 numhword        | Hyphenated word, letters and digits      | postgresql-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | postgresql in postgresql-beta1
 hword_part      | Hyphenated word part, all letters        | lógico or matemática in lógico-matemática
 hword_numpart   | Hyphenated word part, letters and digits | beta1 in postgresql-beta1
 email           | Email address                            | foo@example.com
 protocol        | Protocol head                            | http://
 url             | URL                                      | example.com/stuff/index.html
 host            | Host                                     | example.com
 url_path        | URL path                                 | /stuff/index.html (in URL context)
 file            | File or path name                        | /usr/local/foo.txt (if not within a URL)
 sfloat          | Scientific notation                      | -1.234e56
 float           | Decimal notation                         | -1.234
 int             | Signed integer                           | -1234
 uint            | Unsigned integer                         | 1234
 version         | Version number                           | 8.3.0
 tag             | XML tag                                  | <a href="dictionaries.html">
 entity          | XML entity                               | &amp;
 blank           | Space symbols                            | (any whitespace or punctuation not otherwise recognized)

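
The token types in the table can also be listed from the database itself. The sketch below uses the built-in ts_token_type function, then joins it to ts_parse to label each token with its alias. The sample input string is an assumption made for illustration.

```sql
-- List every token type known to the default parser (tokid, alias, description).
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;

-- Label raw parser output with the alias of each token.
-- The input text is a made-up sample.
SELECT t.alias, p.token
FROM ts_parse('default', 'PostgreSQL 8.3.0 beta1') AS p
JOIN ts_token_type('default') AS t ON t.tokid = p.tokid;
```
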
Note

The email token type does not support all valid email characters as defined by RFC 5322. In email user names, the only non-alphanumeric characters supported are the period, dash, and underscore. The parser's notion of a "letter" is determined by the database locale setting (lc_ctype). Words that contain only basic ASCII letters are reported as the separate asciiword type; for most European languages, treat word and asciiword alike.
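
One way to check how these limits affect your own data is to run sample addresses through ts_debug and look at the reported alias. Both addresses below are made-up test inputs, and no output is shown because the result depends on how the parser splits each string.

```sql
-- User name uses only period, underscore, and dash: the supported characters.
SELECT alias, token
FROM ts_debug('first.last_name-x@example.com');

-- User name contains '+', which the note above says is not supported;
-- compare the aliases returned here with the previous query.
SELECT alias, token
FROM ts_debug('first+tag@example.com');

-- Show the lc_ctype setting of the current database, which drives
-- the parser's notion of a letter.
SELECT datctype
FROM pg_database
WHERE datname = current_database();
```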

Overlapping tokens

The parser can produce overlapping tokens from the same piece of text. A hyphenated word is reported both as the whole compound and as each of its parts:

SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1

This behavior allows full-text search to match both the full compound word and its individual components.
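
A minimal sketch of that matching behavior follows. It assumes a configuration named english whose mappings keep both the hyphenated-word token types and their parts; with a different configuration the results may differ.

```sql
-- Inspect the lexemes stored for the compound and for its parts.
SELECT to_tsvector('english', 'foo-bar-beta1');

-- Search for a single component of the compound.
SELECT to_tsvector('english', 'foo-bar-beta1')
       @@ to_tsquery('english', 'bar') AS matches_part;

-- Search for the whole compound.
SELECT to_tsvector('english', 'foo-bar-beta1')
       @@ to_tsquery('english', 'foo-bar-beta1') AS matches_compound;
```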

URLs are similarly decomposed into overlapping subtokens:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
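
Because each subtoken carries its own alias, a single component can be picked out by filtering the ts_debug output. The following sketch reuses the URL from the example above to extract only the host.

```sql
-- Keep only the host subtoken of the URL.
SELECT token
FROM ts_debug('http://example.com/stuff/index.html')
WHERE alias = 'host';
```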