Regular Expression

Last Updated: Jun 10, 2016

The regular expressions in ODPS SQL use PCRE standard, matched by character. The metacharacter to be supported are shown as follows:

  1. Metacharacter Description
  2. ^ Top of line (TOL)
  3. $ End of line
  4. . Any character
  5. * Matches for zero or multiple times
  6. + Matches for once or multiple times
  7. ? Matches for zero time or once
  8. ? Matches modifier. When this character follows any other constraints (*, +,? {n}, {n, {n, m},}, the match mode is non greedy. Non greedy mode matches strings as little as possible, while the default greedy mode matches strings as more as possible.
  9. A|B A or B
  10. (abc)* Matches abc for zero or multiple times
  11. {n} or {m,n} Matching times
  12. [ab] Matches any character in the brackets. In the example, it is to match a or b.
  13. [a-d] Matches any character in a, b, c and d.
  14. [^ab] ^ indicats non’, to match any character which is not a and b.
  15. [::] Refer to POSIX character group in next table.
  16. \ Escape character
  17. \n N is a digit from 1 to 9 and is backward referenced.
  18. \d digits
  19. \D Non-number

POSIX character group

  1. POSIX Character Group Description Range
  2. [[:alnum:]] letter and digit characters [a-zA-Z0-9]
  3. [[:alpha:]] letter [a-zA-Z]
  4. [[:ascii:]] ASCII character [\x00-\x7F]
  5. [[:blank:]] Space character and tabs [ \t]
  6. [[:cntrl:]] Control character [\x00-\x1F\x7F]
  7. [[:digit:]] Digit character [0-9]
  8. [[:graph:]] Characters except whitespace characters [\x21-\x7E]
  9. [[:lower:]] Lowercase characters [a-z]
  10. [[:print:]] [:graph:] and whitespace characters [\x20-\x7E]
  11. [[:punct:]] punctuation [][!”#$%&’()*+,./:;<=>? @\^_`{|}~-]
  12. [[:space:]] Whitespace characters [ \t\r\n\v\f]
  13. [[:upper:]] Uppercase characters [A-Z]
  14. [[:xdigit:]] hexadecimal character [A-Fa-f0-9]

Because system uses a backslash “ \” as an escape character, all “ \” which appear in the regular expression pattern will do two escapes. For example, the regular expression needs to match the string “a+b”. “+” is a special character in regular expression and should be expressed by escape. The expression in regular engine is “a\+b”, because system needs to explain a layer of escape, the expression which can match this string is “a\\+b”. Suppose that the table test_dual is existent:

  1. select 'a+b' rlike 'a\\\+b' from test_dual;
  2. +------+
  3. | _c1 |
  4. +------+
  5. | true |
  6. +------+

In extreme case, to match the character “ \”, because “ \” is a special character in regular engine, it needs to be expressed by “\”, while system will do an escape for it again, it will be written to be “\\”.

  1. select 'a\\b', 'a\\b' rlike 'a\\\\b' from test_dual;
  2. +-----+------+
  3. | _c0 | _c1 |
  4. +-----+------+
  5. | a\b | true |
  6. +-----+------+

Note:

  • To write ”a\\b” in ODPS SQL,and the output result is ’a\b’.

If TAB exists in a string, when system reads these two characters ‘ \t’, they are already saved as one character by system. So in regular expression, it is a general character.

  1. select 'a\tb', 'a\tb' rlike 'a\tb' from test_dual;
  2. +---------+------+
  3. | _c0 | _c1 |
  4. +---------+------+
  5. | a b | true |
  6. +---------+------+
Thank you! We've received your feedback.