Token types for the default full-text search parser

A text search parser splits a raw document into tokens and assigns a type to each token. The parser does not modify the text; it only identifies plausible word boundaries.

In the full-text search pipeline, parsing is the first step:

document → parser → tokens → dictionary → lexemes → tsvector

Because parsers have a narrow role, custom parsers are rarely needed. Custom dictionaries are far more common. PolarDB for PostgreSQL provides one built-in parser: pg_catalog.default.
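The pipeline above can be observed one stage at a time. The following is a minimal sketch, assuming a session on a database where the stock `english` configuration is available; the sample sentence is purely illustrative.

```sql
-- List the parsers known to the database.
SELECT prsname FROM pg_catalog.pg_ts_parser;

-- Parser stage only: raw tokens with their numeric token type IDs.
SELECT * FROM ts_parse('default', 'The up-to-date elephants');

-- Full pipeline: parser, then dictionaries, ending in a tsvector.
SELECT to_tsvector('english', 'The up-to-date elephants');
```

Comparing the second and third results shows which tokens the dictionaries normalize or discard after the parser has done its work.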

Token types

pg_catalog.default recognizes 23 token types.

Alias	Description	Example
`asciiword`	Word, all ASCII letters	`elephant`
`word`	Word, all letters	`mañana`
`numword`	Word, letters and digits	`beta1`
`asciihword`	Hyphenated word, all ASCII	`up-to-date`
`hword`	Hyphenated word, all letters	`lógico-matemática`
`numhword`	Hyphenated word, letters and digits	`postgresql-beta1`
`hword_asciipart`	Hyphenated word part, all ASCII	`postgresql` in `postgresql-beta1`
`hword_part`	Hyphenated word part, all letters	`lógico` or `matemática` in `lógico-matemática`
`hword_numpart`	Hyphenated word part, letters and digits	`beta1` in `postgresql-beta1`
`email`	Email address	`foo@example.com`
`protocol`	Protocol head	`http://`
`url`	URL	`example.com/stuff/index.html`
`host`	Host	`example.com`
`url_path`	URL path	`/stuff/index.html` (in URL context)
`file`	File or path name	`/usr/local/foo.txt` (if not within a URL)
`sfloat`	Scientific notation	`-1.234e56`
`float`	Decimal notation	`-1.234`
`int`	Signed integer	`-1234`
`uint`	Unsigned integer	`1234`
`version`	Version number	`8.3.0`
`tag`	XML tag	`<a href="dictionaries.html">`
`entity`	XML entity	`&amp;`
`blank`	Space symbols	(any whitespace or punctuation not otherwise recognized)
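The same list can be retrieved from a running database with `ts_token_type`. A short sketch, assuming the parser is referenced by its unqualified name `default`:

```sql
-- One row per token type: numeric ID, alias, and description.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;

-- The row count should agree with the number of token types listed above.
SELECT count(*) AS token_type_count
FROM ts_token_type('default');
```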

Note

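The email token type does not support all valid email characters as defined by RFC 5322. Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore. The parser's notion of a "letter" is determined by the database locale setting (lc_ctype). Words containing only basic ASCII letters are reported as a separate token type (asciiword), since it is sometimes useful to distinguish them. In most European languages, the word and asciiword token types should be treated alike.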
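To see how these rules apply in a particular database, the queries below may help. This is a sketch under assumptions: the sample strings are made up for illustration, and the token types actually reported depend on the database's ctype setting and encoding.

```sql
-- Show the ctype setting of the current database.
SELECT datname, datctype
FROM pg_database
WHERE datname = current_database();

-- Compare how an ASCII-only word and a word with a non-ASCII letter are classified.
SELECT alias, token
FROM ts_debug('elephant mañana');

-- An address using only period, dash, and underscore in the user name.
SELECT alias, token
FROM ts_debug('first.last_name-1@example.com');

-- An address using another character (here, a plus sign) in the user name.
SELECT alias, token
FROM ts_debug('first+tag@example.com');
```

Comparing the last two results shows whether an address is kept as a single email token or broken into other token types.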

Overlapping tokens

The parser can produce overlapping tokens from the same piece of text. A hyphenated word is reported both as the whole compound and as each of its parts:

SELECT alias, description, token FROM ts_debug('foo-bar-beta1');

      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1

This behavior allows full-text search to match both the full compound word and its individual components.
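The effect on matching can be checked directly. The sketch below assumes the built-in `simple` configuration, chosen so that stemming and stop-word removal do not interfere with the comparison.

```sql
-- Inspect the lexemes produced for the compound and its parts.
SELECT to_tsvector('simple', 'foo-bar-beta1');

-- Search for a single component of the compound.
SELECT to_tsvector('simple', 'foo-bar-beta1')
       @@ to_tsquery('simple', 'bar') AS matches_part;

-- Search for the whole compound.
SELECT to_tsvector('simple', 'foo-bar-beta1')
       @@ to_tsquery('simple', 'foo-bar-beta1') AS matches_compound;
```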

URLs are similarly decomposed into overlapping subtokens:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');

  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
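Because the subtokens overlap, a configuration that indexes all of them stores the same text more than once. One possible way to keep only the host is sketched below; `my_config` is a hypothetical configuration name introduced for illustration, and copying from `pg_catalog.english` is an assumption, not a recommendation from this page.

```sql
-- Pick out a single subtoken type from the parser output.
SELECT token
FROM ts_debug('http://example.com/stuff/index.html')
WHERE alias = 'host';

-- Hypothetical configuration that ignores whole URLs and URL paths.
CREATE TEXT SEARCH CONFIGURATION my_config (COPY = pg_catalog.english);
ALTER TEXT SEARCH CONFIGURATION my_config
    DROP MAPPING FOR url, url_path;

-- Token types without a mapping are not indexed.
SELECT to_tsvector('my_config', 'http://example.com/stuff/index.html');
```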