Operators functions and syntax for pattern matching - PolarDB

PolarDB for PostgreSQL (Compatible with Oracle) supports three pattern matching methods: LIKE, SIMILAR TO, and POSIX regular expressions. Each method offers different levels of power and complexity.

Warning

Regular expressions can be crafted to consume arbitrary amounts of time and memory. When accepting patterns from untrusted sources, impose a statement timeout. SIMILAR TO carries the same security risk as POSIX regular expressions. LIKE is the safest option for untrusted pattern sources.

Tip: For pattern matching beyond what these built-in methods offer, write a user-defined function in Perl or Tcl.

Choose a method

Method	Matches entire string	Wildcards	Regex features	Security risk
`LIKE`	Yes	`_`, `%`	None	Low
`SIMILAR TO`	Yes	`_`, `%`	Alternation, quantifiers	Medium
POSIX regex	No (unless anchored)	`.`, `.*`	Full	Medium

LIKE

string LIKE pattern [ESCAPE escape-character]
string NOT LIKE pattern [ESCAPE escape-character]

LIKE returns true when the string matches the pattern. NOT LIKE returns true when there is no match.

Two wildcards are available:

_ matches any single character.
% matches any sequence of zero or more characters.

LIKE always matches against the entire string. To match a substring anywhere in the string, wrap the pattern in %.

Examples:

'abc' LIKE 'abc'    -- true
'abc' LIKE 'a%'     -- true
'abc' LIKE '_b_'    -- true
'abc' LIKE 'c'      -- false

Escape characters

The default escape character is the backslash (\). Precede a wildcard with the escape character to match it literally. To match the escape character itself, write it twice.

Use the ESCAPE clause to specify a different escape character:

'50%' LIKE '50!%' ESCAPE '!'   -- true

Disable the escape mechanism entirely with ESCAPE ''. This makes it impossible to match a literal underscore or percent sign.

Important

If standard_conforming_strings is off, double all backslashes in literal string constants.

The SQL standard specifies that omitting ESCAPE means there is no escape character (not that it defaults to a backslash), and prohibits zero-length ESCAPE values. PostgreSQL behavior differs slightly from the standard in this regard.

Case-insensitive matching

ILIKE is a case-insensitive variant of LIKE, based on the active locale. It is a PostgreSQL extension not in the SQL standard.

Operator aliases:

Operator	Equivalent
`~~`	`LIKE`
`~~*`	`ILIKE`
`!~~`	`NOT LIKE`
`!~~*`	`NOT ILIKE`

These operator names appear in EXPLAIN output because the parser translates LIKE and its variants into these operators internally. LIKE, ILIKE, NOT LIKE, and NOT ILIKE are treated as operators in PostgreSQL syntax and can appear in expression operator ANY (subquery) constructs, although an ESCAPE clause cannot be included in that context.

For simple prefix matching, the ^@ operator and the starts_with function are more direct alternatives.

SIMILAR TO regular expressions

string SIMILAR TO pattern [ESCAPE escape-character]
string NOT SIMILAR TO pattern [ESCAPE escape-character]

SIMILAR TO returns true when the pattern matches the entire string (unlike POSIX regular expressions, which can match anywhere). It combines LIKE notation with a subset of POSIX regex syntax.

SIMILAR TO uses _ and % as wildcards (equivalent to . and .* in POSIX). Note that . is not a metacharacter in SIMILAR TO.

Additional metacharacters borrowed from POSIX:

Metacharacter	Meaning
`\|`	Alternation (either of two alternatives)
`*`	Previous item zero or more times
`+`	Previous item one or more times
`?`	Previous item zero or one time
`{m}`	Previous item exactly `m` times
`{m,}`	Previous item `m` or more times
`{m,n}`	Previous item at least `m` and at most `n` times
`()`	Group items into a single logical item
`[...]`	Bracket expression (character class)

Escape rules are the same as for LIKE: a backslash disables the special meaning of the following metacharacter. Use ESCAPE to set a different escape character, or ESCAPE '' to disable escaping. (PostgreSQL's behavior here is slightly nonstandard, for the same reasons as with LIKE.)

Examples:

'abc' SIMILAR TO 'abc'          -- true
'abc' SIMILAR TO 'a'            -- false
'abc' SIMILAR TO '%(b|d)%'      -- true
'abc' SIMILAR TO '(b|c)%'       -- false
'-abc-' SIMILAR TO '%\mabc\M%'  -- true
'xabcy' SIMILAR TO '%\mabc\M%'  -- false

Extract a substring with SIMILAR TO

The substring function extracts a portion of a string that matches a SQL regular expression pattern. Two calling conventions are supported:

substring(string from pattern for escape-character)
substring(string, pattern, escape-character)

The pattern must match the entire string, or the function returns null. To specify which part of the match to return, include exactly two occurrences of the escape character followed by " in the pattern. The text between these separators is extracted.

The pattern is internally split into three independent regular expressions at the separator positions. The first and third sections match as little text as possible (non-greedy). PostgreSQL also allows one separator (the third regex is taken as empty) or no separators (both the first and third are taken as empty), as an extension to the SQL standard.

Example with #" as the separator:

substring('foobar' from '%#"o_b#"%' for '#')   -- oob
substring('foobar' from '#"o_b#"%' for '#')    -- NULL

POSIX regular expressions

POSIX regular expressions offer the most expressive pattern matching. Unlike LIKE and SIMILAR TO, a POSIX pattern matches anywhere within a string unless explicitly anchored to the beginning (^) or end ($).

Match operators

Operator	Returns	Description	Example
`text ~ text`	`boolean`	Matches, case-sensitive	`'thomas' ~ 't.*ma'` → `t`
`text ~* text`	`boolean`	Matches, case-insensitive	`'thomas' ~* 'T.*ma'` → `t`
`text !~ text`	`boolean`	Does not match, case-sensitive	`'thomas' !~ 't.*max'` → `t`
`text !~* text`	`boolean`	Does not match, case-insensitive	`'thomas' !~* 'T.*ma'` → `f`

Examples:

'abc' ~ 'abc'    -- true
'abc' ~ '^a'     -- true, ^ anchors to start of string
'abc' ~ '(b|d)'  -- true, | means OR
'abc' ~ '^(b|c)' -- false

regexp_replace

regexp_replace replaces substrings that match a POSIX pattern:

regexp_replace(source, pattern, replacement [, flags])

In the replacement string, \n (where n is 1-9) inserts the substring matched by the nth parenthesized subexpression. \& inserts the entire matched substring. Use \\ to insert a literal backslash.

Flags: i for case-insensitive matching, g to replace all matches instead of only the first.

regexp_replace('foobarbaz', 'b..', 'X')           -- fooXbaz
regexp_replace('foobarbaz', 'b..', 'X', 'g')      -- fooXX
regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g') -- fooXarYXazY

regexp_match

regexp_match returns a text array of captured substrings from the first match:

regexp_match(string, pattern [, flags])

No match: returns NULL.
Match with no parenthesized subexpressions: returns a single-element array containing the whole matched substring.
Match with parenthesized subexpressions: returns an array where each element is the substring matched by the corresponding subexpression (non-capturing parentheses are not counted).

SELECT regexp_match('foobarbequebaz', 'bar.*que');
-- {barbeque}

SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
-- {bar,beque}

To get just the matched string (not an array), index the result:

SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
-- barbeque

Tip: regexp_match is available in PostgreSQL version 10 and later. For earlier versions, use regexp_matches in a sub-select to get equivalent behavior: ``sql SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab; ``

regexp_matches

regexp_matches has the same syntax as regexp_match but returns a set of text arrays—one row per match. Use the g flag to return all matches:

SELECT regexp_matches('foo', 'not there');
-- (0 rows)

SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
-- {bar,beque}
-- {bazil,barf}

Tip: When you only want the first match, regexp_match is simpler and more efficient than regexp_matches.

regexp_split_to_table and regexp_split_to_array

These functions split a string using a POSIX pattern as a delimiter:

regexp_split_to_table(string, pattern [, flags])  -- returns rows
regexp_split_to_array(string, pattern [, flags])  -- returns a text array

If there is no match to the pattern, the function returns the original string unchanged. Both functions ignore zero-length matches at the start or end of the string, or immediately following a previous match.

SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '\s+') AS foo;
-- the / quick / brown / fox / jumps / over / the / lazy / dog  (9 rows)

SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '\s+');
-- {the,quick,brown,fox,jumps,over,the,lazy,dog}

Extract a substring with POSIX regex

substring with two parameters extracts the first portion of the string that matches a POSIX pattern:

substring(string from pattern)

Returns null if there is no match. If the pattern contains parentheses, returns the portion matched by the first parenthesized subexpression:

substring('foobar' from 'o.b')    -- oob
substring('foobar' from 'o(.)b')  -- o

Regular expression details

PostgreSQL's regular expression engine was written by Henry Spencer and implements the POSIX 1003.2 specification.

RE forms

PostgreSQL always assumes ARE (advanced regular expression) rules by default. ERE (extended regular expression, similar to egrep) or BRE (basic regular expression, similar to ed) rules can be chosen by prepending an embedded option to the pattern.

Structure

A regular expression consists of one or more branches separated by |. Each branch is a sequence of quantified atoms or constraints. An atom is matched as a unit; a constraint matches an empty string under specific conditions.

Atoms

Atom	Description
`(re)`	Matches `re`; the match is noted for reporting
`(?:re)`	Matches `re`; the match is not noted (non-capturing); AREs only
`.`	Any single character
`[chars]`	A bracket expression matching any one of the listed characters
`\k` (non-alphanumeric `k`)	That character literally; e.g., `\\` matches a backslash
`\c` (alphanumeric `c`)	An escape; AREs only (in EREs and BREs, matches `c` literally)
`{` (followed by non-digit)	A literal left brace
`x` (any other single character)	That character literally

An RE cannot end with a backslash.

Important

If standard_conforming_strings is off, double all backslashes in literal string constants.

Quantifiers

Quantifier	Matches
`*`	0 or more
`+`	1 or more
`?`	0 or 1
`{m}`	Exactly `m`
`{m,}`	`m` or more
`{m,n}`	`m` through `n` inclusive; `m` cannot exceed `n`
`*?`	Non-greedy `*`
`+?`	Non-greedy `+`
`??`	Non-greedy `?`
`{m}?`	Non-greedy `{m}`
`{m,}?`	Non-greedy `{m,}`
`{m,n}?`	Non-greedy `{m,n}`

The bound values m and n are unsigned decimal integers from 0 to 255 inclusive. Non-greedy quantifiers are available in AREs only.

Important

A quantifier cannot immediately follow another quantifier, begin an expression or subexpression, or follow ^ or |.

Constraints

Constraint	Description
`^`	Beginning of string
`$`	End of string
`(?=re)`	Positive lookahead: matches where a substring matching `re` begins; AREs only
`(?!re)`	Negative lookahead: matches where no substring matching `re` begins; AREs only
`(?<=re)`	Positive lookbehind: matches where a substring matching `re` ends; AREs only
`(?<!re)`	Negative lookbehind: matches where no substring matching `re` ends; AREs only

Lookahead and lookbehind constraints cannot contain back references, and all parentheses within them are non-capturing.

Bracket expressions

A bracket expression [...] matches any single character listed within the brackets. If the list starts with ^, it matches any character not in the list. A range like [0-9] matches all characters between the two endpoints in the collating sequence.

Rules for including special characters:

To include ], make it the first character (after ^ if present).
To include -, make it the first or last character, or the second endpoint of a range.

Within a bracket expression:

A collating element enclosed in [. and .] stands for its character sequence, treated as a single element.
An equivalence class enclosed in [= and =] stands for all characters equivalent to that element in the current collation.
A character class enclosed in [: and :] stands for all characters belonging to that POSIX class.

Important

PostgreSQL does not currently support multi-character collating elements.

POSIX standard character classes: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit. PostgreSQL adds word (letters, digits, and underscore) and ascii (the 7-bit ASCII set).

The bracket expressions [[:<:]] and [[:>:]] are word-boundary constraints, matching empty strings at the beginning and end of a word respectively.

Regular expression escapes

Escapes start with \ followed by an alphanumeric character. In AREs, an invalid escape is an error. In EREs, \ is not special outside bracket expressions.

Character-entry escapes

Escape	Description
`\a`	Alert (bell)
`\b`	Backspace
`\B`	Backslash (synonym, reduces need for doubling)
`\cX`	Character with low-order 5 bits same as `X`
`\e`	ESC character (octal `033`)
`\f`	Form feed
`\n`	Newline
`\r`	Carriage return
`\t`	Horizontal tab
`\v`	Vertical tab
`\uwxyz`	Unicode character `0xwxyz` (exactly 4 hex digits)
`\Ustuvwxyz`	Unicode character `0xstuvwxyz` (exactly 8 hex digits)
`\xhhh`	Character with hexadecimal value `0xhhh`
`\0`	Null byte
`\xy`	Character with octal value `0xy` (exactly 2 octal digits, not a back reference)
`\xyz`	Character with octal value `0xyz` (exactly 3 octal digits, not a back reference)

Escape values outside the ASCII range (0-127) are encoding-dependent. In UTF-8, they map to Unicode code points (for example, \u1234 means U+1234). An escape value that does not correspond to a valid character in the database encoding never matches any data and raises no error.

Class-shorthand escapes

Escape	Equivalent
`\d`	`[[:digit:]]`
`\s`	`[[:space:]]`
`\w`	`[[:word:]]`
`\D`	`[^[:digit:]]`
`\S`	`[^[:space:]]`
`\W`	`[^[:word:]]`

These shorthands also work inside bracket expressions; for example, [a-c\d] is equivalent to [a-c[:digit:]].

Constraint escapes

Escape	Description
`\A`	Beginning of string only
`\m`	Beginning of a word
`\M`	End of a word
`\y`	Beginning or end of a word
`\Y`	Not the beginning or end of a word
`\Z`	End of string only

Constraint escapes are not valid inside bracket expressions.

Back references

Escape	Description
`\m`	Back reference to the `m`th subexpression (nonzero digit)
`\mnn`	Back reference to the `mnn`th subexpression, where `mnn` is within the valid range

For example, ([bc])\1 matches bb or cc but not bc or cb.

Important

Octal character-entry escapes and back references can be ambiguous. A leading zero always indicates an octal escape. A single non-zero digit not followed by another digit is always a back reference. A multi-digit sequence not starting with zero is treated as a back reference if the number is within the valid range, and as octal otherwise.

Regular expression metasyntax

Director prefixes

***: — the rest of the RE is an ARE.
***= — the rest of the RE is a literal string; all characters are ordinary.

Embedded options

An ARE can start with (?xyz) where xyz is one or more option letters:

Option	Description
`b`	Rest of RE is a BRE
`c`	Case-sensitive matching (overrides operator type)
`e`	Rest of RE is an ERE
`i`	Case-insensitive matching
`m`	Synonym for `n`
`n`	Newline-sensitive matching
`p`	Partial newline-sensitive matching
`q`	Rest of RE is a literal string
`s`	Non-newline-sensitive matching (default)
`t`	Tight syntax (default)
`w`	Inverse partial newline-sensitive ("weird") matching
`x`	Expanded syntax

Embedded options take effect at the closing ) and can only appear at the start of an ARE (after the ***: director, if any).

Expanded syntax

With the x option, whitespace and #-to-newline comments in the RE are ignored, allowing you to break up and annotate complex expressions. Three exceptions: a \-preceded whitespace or # is retained; whitespace or # inside a bracket expression is retained; whitespace and comments cannot appear inside multi-character symbols like (?:.

Comments can also be written as (?#ttt) (deprecated; use expanded syntax instead).

Matching rules

When a pattern can match more than one substring, the RE matches the one starting earliest in the string. When multiple matches start at the same position, the match length depends on whether the RE is greedy or non-greedy.

Greediness rules:

Most atoms and all constraints have no greediness attribute.
Adding parentheses does not change greediness.
A quantified atom with {m} or {m}? inherits the atom's greediness.
A quantified atom with other normal quantifiers is greedy (prefers the longest match).
A quantified atom with non-greedy quantifiers (e.g., *?) is non-greedy (prefers the shortest match).
A branch has the same greediness as its first quantified atom with a greediness attribute.
An RE with top-level | is always greedy.

Greediness applies to the whole RE, not just individual atoms. Once the overall match length is determined, subexpression match lengths are resolved based on their own greediness, with earlier subexpressions taking priority.

Example:

SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
-- Result: 123  (Y* is greedy; matches Y123; captures 123)

SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
-- Result: 1    (Y*? is non-greedy; matches Y1; captures 1)

Use {1,1} to force greediness or {1,1}? to force non-greediness on a subexpression when the whole RE needs a different greediness from what its elements imply:

SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
-- Result: {abc0123,4,xyz}  -- first .* eats too much

SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
-- Result: {abc,0,""}       -- RE is non-greedy overall; ends too soon

SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
-- Result: {abc,01234,xyz}  -- force whole RE to be greedy

Match lengths are measured in characters. An empty string is considered longer than no match at all.

Newline-sensitive matching: with the n option, . and ^-negated bracket expressions do not match newlines, and ^/$ also match at newline boundaries. \A and \Z continue to match only the start and end of the entire string. \D and \W match newlines regardless of this mode (unlike behavior before PostgreSQL 14).

Limits and compatibility

No length limit is imposed on regular expressions in this implementation. For portability, avoid REs longer than 256 bytes (POSIX-compliant implementations may refuse them).

The only true incompatibility between AREs and POSIX EREs is that \ retains its special significance inside bracket expressions in AREs. All other ARE features use syntax that is either illegal or undefined in POSIX EREs.

Incompatibilities introduced in version 7.4:

In AREs, \ followed by an alphanumeric character is either an escape or an error (previously it was just the alphanumeric character).
In AREs, \ is special inside []; a literal backslash inside a bracket expression must be written as \\.

Differences from XQuery (LIKE_REGEX)

SQL:2008 defines a LIKE_REGEX operator based on the XQuery regular expression standard. PostgreSQL does not yet implement LIKE_REGEX. The regexp_match function provides similar behavior, since XQuery regular expressions are close to the ARE syntax described here.

Key differences between POSIX-based and XQuery regular expressions:

XQuery character class subtraction ([a-z-[aeiou]]) is not supported.
XQuery class shorthands \c, \C, \i, \I are not supported.
XQuery Unicode property classes (\p{UnicodeProperty}, \P{UnicodeProperty}) are not supported.
POSIX interprets character classes (like \w) by the active locale; XQuery uses Unicode properties.
XQuery newline handling is broader: it treats \r, \r\n, and certain Unicode characters (e.g., LINE SEPARATOR U+2028) as newlines in addition to \n. Notably, . and \s count \r\n as one character, not two, according to SQL.
XQuery character-entry escapes: only \n, \r, and \t are supported.
XQuery does not support [::] character class syntax in bracket expressions.
XQuery has no lookahead, lookbehind, constraint escapes, or metasyntax forms.
XQuery flag letters differ from POSIX: while the i and q options behave the same, others do not. XQuery's s (dot matches newline) and m (^ and $ match at newlines) differ from POSIX s and m; XQuery's x (ignore whitespace) differs from POSIX expanded mode in that POSIX also allows # comments and preserves backslash-escaped whitespace.