A text search parser splits a raw document into tokens and assigns a type to each token. The parser does not modify the text; it only identifies plausible word boundaries.
In the full-text search pipeline, parsing is the first step:
    document → parser → tokens → dictionary → lexemes → tsvector

Because parsers have a narrow role, custom parsers are rarely needed. Custom dictionaries are far more common. PolarDB for PostgreSQL provides one built-in parser: pg_catalog.default.
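To see the pipeline stages on a concrete input, the built-in functions ts_parse, ts_debug, and to_tsvector can be run against the same string. The following is a minimal sketch: the sample sentence is made up for illustration, and the english configuration is assumed to be installed, as it is in stock PostgreSQL.

```sql
-- Stage 1: the parser alone. Returns one row per token (tokid, token).
SELECT * FROM ts_parse('default', 'The quick brown foxes');

-- Stages 1-2: parser plus dictionaries. Shows each token's type,
-- the dictionary that recognized it, and the lexemes it produced.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'The quick brown foxes');

-- Full pipeline: the resulting tsvector.
SELECT to_tsvector('english', 'The quick brown foxes');
```

Comparing the three results shows which part of the output comes from the parser (token boundaries and types) and which comes from the dictionaries (normalization and stop-word removal).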
Token types
pg_catalog.default recognizes 23 token types.
| Alias | Description | Example |
|---|---|---|
| asciiword | Word, all ASCII letters | elephant |
| word | Word, all letters | mañana |
| numword | Word, letters and digits | beta1 |
| asciihword | Hyphenated word, all ASCII | up-to-date |
| hword | Hyphenated word, all letters | lógico-matemática |
| numhword | Hyphenated word, letters and digits | postgresql-beta1 |
| hword_asciipart | Hyphenated word part, all ASCII | postgresql in postgresql-beta1 |
| hword_part | Hyphenated word part, all letters | lógico or matemática in lógico-matemática |
| hword_numpart | Hyphenated word part, letters and digits | beta1 in postgresql-beta1 |
| email | Email address | foo@example.com |
| protocol | Protocol head | http:// |
| url | URL | example.com/stuff/index.html |
| host | Host | example.com |
| url_path | URL path | /stuff/index.html (in URL context) |
| file | File or path name | /usr/local/foo.txt (if not within a URL) |
| sfloat | Scientific notation | -1.234e56 |
| float | Decimal notation | -1.234 |
| int | Signed integer | -1234 |
| uint | Unsigned integer | 1234 |
| version | Version number | 8.3.0 |
| tag | XML tag | `<a href="dictionaries.html">` |
| entity | XML entity | `&amp;` |
| blank | Space symbols | (any whitespace or punctuation not otherwise recognized) |
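The token types in the table above can also be read from the database itself with the built-in ts_token_type function. This is a small sketch that assumes the parser is addressed by the name default, as in the other examples on this page.

```sql
-- List every token type the default parser can emit.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;

-- Count the token types; the result should agree with the number of rows in the table.
SELECT count(*) FROM ts_token_type('default');
```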
The email token type does not support all valid email characters as defined by RFC 5322. The only non-alphanumeric characters supported in email user names are period, dash, and underscore.

The parser's notion of a "letter" is determined by the database locale setting (lc_ctype). Words that contain only basic ASCII letters are reported as a separate token type, asciiword. For most European languages, the word and asciiword token types should be treated alike.
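One way to treat word and asciiword alike is to map both token types to the same dictionary in a text search configuration. The sketch below is illustrative only: my_config is a hypothetical configuration name, and english_stem is assumed to be the appropriate dictionary for the data. The first query reads the locale from the pg_database catalog.

```sql
-- Check the character-classification locale of the current database.
SELECT datctype FROM pg_database WHERE datname = current_database();

-- my_config is a hypothetical name; it starts as a copy of the built-in english configuration.
CREATE TEXT SEARCH CONFIGURATION my_config (COPY = pg_catalog.english);

-- Route both token types to the same dictionary so that they are treated alike.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, word WITH english_stem;
```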
Overlapping tokens
The parser can produce overlapping tokens from the same piece of text. A hyphenated word is reported both as the whole compound and as each of its parts:
    SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
          alias      |               description                |     token
    -----------------+------------------------------------------+---------------
     numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
     hword_asciipart | Hyphenated word part, all ASCII          | foo
     blank           | Space symbols                            | -
     hword_asciipart | Hyphenated word part, all ASCII          | bar
     blank           | Space symbols                            | -
     hword_numpart   | Hyphenated word part, letters and digits | beta1

This behavior allows full-text search to match both the full compound word and its individual components.
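The effect on matching can be checked directly with to_tsvector and to_tsquery. This sketch assumes the built-in simple configuration, chosen here only because it applies no stemming or stop-word removal, so the stored lexemes correspond closely to the parser's tokens.

```sql
-- Lexemes stored for the compound: the whole word and each of its parts.
SELECT to_tsvector('simple', 'foo-bar-beta1');

-- Search for a single part of the compound.
SELECT to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'bar');

-- Search for the whole compound.
SELECT to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'foo-bar-beta1');
```

Both match queries should succeed as long as the configuration maps the hyphenated-word token types to a dictionary, which is the case for the built-in configurations.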
URLs are similarly decomposed into overlapping subtokens:
    SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
      alias   |  description  |            token
    ----------+---------------+------------------------------
     protocol | Protocol head | http://
     url      | URL           | example.com/stuff/index.html
     host     | Host          | example.com
     url_path | URL path      | /stuff/index.html
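Because each subtoken carries its own type, the ts_debug output can be filtered by alias to pick out a single component. The following sketch reuses the URL from the example above and keeps only the host token.

```sql
-- Keep only the host component of the URL.
SELECT token AS host
FROM ts_debug('http://example.com/stuff/index.html')
WHERE alias = 'host';
```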