MATCH_RECOGNIZE is a complex event processing (CEP) statement that identifies events from input data streams based on the specified rules and generates output events based on the specified method.
Syntax
SELECT [ ALL | DISTINCT ]
{ * | projectItem [, projectItem ]* }
FROM tableExpression
[MATCH_RECOGNIZE (
[PARTITION BY {partitionItem [, partitionItem]*}]
[ORDER BY {orderItem [, orderItem]*}]
[MEASURES {measureItem AS col [, measureItem AS col]*}]
[ONE ROW PER MATCH|ALL ROWS PER MATCH|ONE ROW PER MATCH WITH TIMEOUT ROWS|ALL ROWS PER MATCH WITH TIMEOUT ROWS]
[AFTER MATCH SKIP]
PATTERN (patternVariable[quantifier] [ patternVariable[quantifier]]*) WITHIN intervalExpression
DEFINE {patternVariable AS patternDefinationExpression [, patternVariable AS patternDefinationExpression]*}
)];
Clause | Description |
---|---|
PARTITION BY | Optional. You can use the clause to divide the rows of an input table into partitions based on one or more partition key columns. |
ORDER BY | Optional. You can use the clause to sort the rows of a partition based on one or more
columns. If you use multiple columns to sort the rows, specify EVENT TIME or PROCESS TIME as the first column.
|
MEASURES | You can use the clause to define the output events that are generated based on the matched input events. |
ONE ROW PER MATCH | If you use the clause, only one output event is generated for each match. |
ONE ROW PER MATCH WITH TIMEOUT ROWS | If you use the clause, an output event is generated for each match or time-out. The
time-out period is specified by the WITHIN clause in PATTERN.
Note Blink V3.6.0 and later do not support the ONE ROW PER MATCH WITH TIMEOUT ROWS clause due to Calcite upgrades.
|
ALL ROWS PER MATCH | If you use the clause, an output event is generated for each input event upon each match. |
ALL ROWS PER MATCH WITH TIMEOUT ROWS | If you use the clause, an output event is generated for each input event upon each
match or time-out. The time-out period is specified by the WITHIN clause in PATTERN.
Note Blink V3.6.0 and later do not support the ALL ROWS PER MATCH WITH TIMEOUT ROWS clause due to Calcite upgrades.
|
[ONE ROW PER MATCH|ALL ROWS PER MATCH|ONE ROW PER MATCH WITH TIMEOUT ROWS|ALL ROWS PER MATCH WITH TIMEOUT ROWS] | Optional. By default, ONE ROW PER MATCH is used.
|
AFTER MATCH SKIP TO NEXT ROW | If you use the clause, the next match starts to be performed from the event that follows the first event in the sequence of matched events. |
AFTER MATCH SKIP PAST LAST ROW | If you use the clause, the next match starts to be performed from the event that follows the last event in the sequence of matched events. |
AFTER MATCH SKIP TO FIRST patternItem | If you use the clause, the next match starts to be performed from the first event that corresponds to patternItem in the sequence of matched events. |
AFTER MATCH SKIP TO LAST patternItem | If you use the clause, the next match starts to be performed from the last event that corresponds to patternItem in the sequence of matched events. |
PATTERN | You can use the clause to specify the rule that the sequence of events to be identified
must meet. You must enclose the rule in parentheses () . The rule is specified by a set of custom patternVariable variables.
Note
|
Parameter description
- quantifier
The quantifier parameter specifies the number of occurrences of events that meet the patternVariable definition.
Value Description * Zero or multiple times. + One or more times. ? Zero or once. {n} n times. {n,} At least n times. {n, m} Ranges from n times to m times. {,m} At most m times. By default, greedy matching is performed. For example, if the PATTERN clause isA -> B+ -> C
and the input isa bc1 bc2 c
, the output isa bc1 bc2 c
. In the input, bc1 and bc2 means that the results must match B and C. To perform reluctant matching, append the quantifier with a question mark (?). You can use the following reluctant quantifiers:*?
+?
{n}?
{n,}?
{n, m}?
{,m}?
Note Blink V3.x and later do not support (e1 e2+) greedy matching. You can use e1 e2+ e3 e3 as not e2 as an alternative. In the alternative method, at least one e3 entry must be included to make sure that the output data is returned as expected.In this case, the output that is generated for the input and the PATTERN setting in the preceding example changes to a bc1 bc2,a bc1 bc2 c.- The WITHIN clause defines the maximum time span of the events that meet the specified rule in an event sequence.
- The format of static windows is
INTERVAL 'string' timeUnit [ TO timeUnit ]
, for example,INTERVAL '10' SECOND, INTERVAL '45' DAY, INTERVAL '10:20' MINUTE TO SECOND, INTERVAL '10:20.10' MINUTE TO SECOND, INTERVAL '10:20' HOUR TO MINUTE, INTERVAL '1-5' YEAR TO MONTH
. - The format of dynamic windows is
INTERVAL intervalExpression
, for example,INTERVAL A.windowTime + 10
. In this example, A is the first patternVariable variable in the PATTERN clause. When you specify intervalExpression, you can use only the first patternVariable variable in the PATTERN clause. When you specify the intervalExpression parameter, you can use user defined functions (UDFs). The intervalExpression result indicates the window size that is measured in milliseconds. The data type of the result must be LONG. - The DEFINE statement specifies the meanings of patternVariable variables in the PATTERN clause. If a patternVariable variable is not defined in the DEFINE statement, the patternVariable variable is valid for each event.
- Functions in MEASURES and DEFINE statements
Function Description Row Pattern Column References This function uses the format of patternVariable.col
. This function is used to access the specified column of an event that corresponds to a patternVariable variable.PREV This function can be used in only the DEFINE statement. In most cases, this function is used in conjunction with the Row Pattern Column References
function. This function is used to access the specified column of a previous event that has a specified offset. The event occurs before the events that are specified by the PATTERN clause.For example, you can useDOWN AS DOWN.price < PREV(DOWN.price)
. In this example,PREV(A.price)
specifies theprice
column value for the event that occurs before the current event.NoteDOWN.price
is equivalent toPREV(DOWN.price, 0)
.PREV(DOWN.price)
is equivalent toPREV(DOWN.price, 1)
.
FIRST or LAST In most cases, the FIRST or the LAST function is used in conjunction with the Row Pattern Column References
function. You can use the FIRST or the LAST function to access the event that has a specified offset in the sequence of events. The sequence of events is specified by the PATTERN clause. The following examples are used to explain the functions:FIRST(A.price, 3)
specifies the fourth event in the sequence of events that match thepattern A
.LAST(A.price, 3)
specifies the last but three event in the sequence of events that match thepattern A
.
- Output columns
Function Output column ONE ROW PER MATCH The columns that are specified in PARTITION BY
andMEASURES
are included. The columns that are specified inPARTITION BY
do not need to be specified inMEASURES
again.ONE ROW PER MATCH WITH TIMEOUT ROWS An output event is generated for each match or time-out. The time-out period is specified by the WITHIN clause in PATTERN. Note- When you define the PATTERN clause, we recommend that you define the WITHIN clause. If the WITHIN clause is not defined, the state size may grow larger.
- The first column that is specified in the ORDER BY clause must be EVENT TIME or PROCESS TIME.
Examples
- Syntax in the example
SELECT * FROM Ticker MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY tstamp MEASURES STRT.tstamp AS start_tstamp, LAST(DOWN.tstamp) AS bottom_tstamp, LAST(UP.tstamp) AS end_tstamp ONE ROW PER MATCH AFTER MATCH SKIP TO NEXT ROW PATTERN (STRT DOWN+ UP+) WITHIN INTERVAL '10' SECOND DEFINE DOWN AS DOWN.price < PREV(DOWN.price), UP AS UP.price > PREV(UP.price) );
- Test data
timestamp (TIMESTAMP) card_id(VARCHAR) location (VARCHAR) action (VARCHAR) 2018-04-13 12:00:00
1 Beijing Consumption 2018-04-13 12:05:00
1 Shanghai Consumption 2018-04-13 12:10:00
1 Shenzhen Consumption 2018-04-13 12:20:00
1 Beijing Consumption - Syntax in the test
Each credit card is identified by a unique card ID that is specified by the card_id parameter. If payments with a credit card are made within 10 minutes at two different locations, an alert is triggered. This helps you monitor unauthorized operations of credit cards.
CREATE TABLE datahub_stream ( `timestamp` TIMESTAMP, card_id VARCHAR, location VARCHAR, `action` VARCHAR, WATERMARK wf FOR `timestamp` AS withOffset(`timestamp`, 1000) ) WITH ( type = 'datahub' ... ); CREATE TABLE rds_out ( start_timestamp TIMESTAMP, end_timestamp TIMESTAMP, card_id VARCHAR, event VARCHAR ) WITH ( type= 'rds' ... ); --Define the computational logic. insert into rds_out select `start_timestamp`, `end_timestamp`, card_id, `event` from datahub_stream MATCH_RECOGNIZE ( PARTITION BY card_id --Partition the table by card ID. The data that has the same card ID is allocated to the same compute node. ORDER BY `timestamp` --Sort events by time in a window. MEASURES --Define how to generate output events based on the input events that are matched. e2.`action` as `event`, e1.`timestamp` as `start_timestamp`, --Specify the time of the first event as the start_timestamp value. LAST(e2.`timestamp`) as `end_timestamp` --Specify the time of the last event as the end_timestamp value. ONE ROW PER MATCH --The system generates an output event for each match. AFTER MATCH SKIP TO NEXT ROW --The system performs the next match from the next row after each match. PATTERN (e1 e2+) WITHIN INTERVAL '10' MINUTE --Define two events: e1 and e2. DEFINE --Define the meanings of patternVariable variables in the PATTERN clause. e1 as e1.action = 'Consumption', --Mark the action of the e1 event as Consumption. e2 as e2.action = 'Consumption' and e2.location <> e1.location --Mark the action of the e2 event as Consumption. The locations of e1 and e2 are different. );
Note In some scenarios, some data meets the CEP condition but no outputs are returned. This occurs because only data that meets the watermark > e2.ts condition is processed. No input data flows into the system after e2 and the watermark is always e2.ts -1000. As a result, the e2 data cannot be processed. Therefore, no outputs are returned. - Test result
start_timestamp (TIMESTAMP) end_timestamp (TIMESTAMP) card_id (VARCHAR) event (VARCHAR) 2018-04-13 12:00:00.0
2018-04-13 12:05:00.0
1 Consumption 2018-04-13 12:05:00.0
2018-04-13 12:10:00.0
1 Consumption