Data quality monitoring typically involves checking only the data that has changed, not an entire table. This topic describes how to configure data filtering.
Data filtering configuration example
In most scenarios, you filter changed data based on a time field. For example, you can use time partitions in MaxCompute or filter by the transaction creation time in a MySQL business transaction table.
To filter data, set the filter parameter in your monitoring rules. The following code provides a configuration example:
datasets:
  - type: Table
    tables:
      - tb_d_spec_demo
    filter: "dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]'"
    dataSource:
      name: odps_first
      envType: Dev
rules:
  - assertion: "row_count > 0"
    # Add a filter at the rule level. It works together with the filter in Scan.dataset.
    # The actual filter executed is (dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]') AND (id IS NOT NULL)
    filter: "id IS NOT NULL"
computeResource:
  id: 2001
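As the comment in the example above notes, a rule-level filter does not replace the dataset-level filter: both are wrapped in parentheses and joined with AND. A minimal Python sketch of that combination (the `combine_filters` helper is illustrative only, not part of the product):

```python
def combine_filters(dataset_filter: str, rule_filter: str = "") -> str:
    # Illustrative only: wrap both filters in parentheses and join them
    # with AND, mirroring the combination behavior described above.
    if not rule_filter:
        return dataset_filter
    return f"({dataset_filter}) AND ({rule_filter})"

combined = combine_filters(
    "dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]'",
    "id IS NOT NULL",
)
print(combined)
# (dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]') AND (id IS NOT NULL)
```

Because each filter is parenthesized before joining, operator precedence inside either filter cannot leak into the other.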
Filter syntax
Basic syntax
The filter syntax largely follows the WHERE clause of the SQL-92 standard, except that the MATCH and OVERLAPS predicates are not supported. The syntax is as follows:
<search condition> ::=
    <boolean term> | <search condition> OR <boolean term>
<boolean term> ::=
    <boolean factor> | <boolean term> AND <boolean factor>
<boolean factor> ::=
    [ NOT ] <boolean test>
<boolean test> ::=
    <predicate> | ( <search condition> )
<predicate> ::=
    <comparison predicate> | <between predicate> |
    <in predicate> | <like predicate> | <null predicate> |
    <quantified comparison predicate> | <exists predicate>
<comparison predicate> ::=
    <expression> <comparison_operator> <expression>
<between predicate> ::=
    <expression> [NOT] BETWEEN <lower_bound> AND <upper_bound>
<in predicate> ::=
    <expression> [NOT] IN (value_list | subquery)
<like predicate> ::=
    <expression> [NOT] LIKE pattern [ESCAPE escape_character]
<null predicate> ::=
    <expression> IS [NOT] NULL
<quantified comparison predicate> ::=
    <expression> <comparison_operator> { ALL | ANY | SOME } (subquery)
<exists predicate> ::=
    EXISTS (subquery)
Configuration examples
Comparison predicate
salary > 50000
Between predicate
price BETWEEN 10 AND 50
In predicate
department IN ('Sales', 'Marketing')
Like predicate
name LIKE 'Jo%'
Null predicate
department IS NOT NULL
Quantified comparison predicate
salary > ALL (SELECT salary FROM employees WHERE department = 'Sales')
Exists predicate
EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
More complex search condition
-- Connect two predicates with the AND logical operator
salary > ALL (SELECT salary FROM employees WHERE department = 'Sales')
AND EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
-- Connect with the OR logical operator
department IS NOT NULL OR name LIKE 'Jo%'
Use a native query to define a filter
If the query for the underlying data storage cannot be expressed in standard SQL-92 because it uses features such as RLIKE, geolocation search, or full-text search, you can write a native query directly in the filter. Start the native query with the query: keyword.
For example, to filter data in a MaxCompute table using RLIKE, use the following configuration:
datasets:
  - type: Table
    tables:
      - ods_d_customers
    dataSource:
      name: odps_first
      envType: Dev
rules:
  - assertion: "row_count > 0"
    filter: "query: email_address RLIKE '^info@'"
Use time variables in a filter
If you use time partitions, Data Quality typically scans the partition for the current day or the previous day (T-1), so each run must resolve the correct date automatically. To achieve this, use the time offset expressions of the scheduling system.
For example, if you use a time expression such as $[yyyymmdd-1] in the filter, the Data Quality system automatically calculates the offset based on the instance's scheduled time and replaces the expression with the calculated value.
If Data Quality monitoring is not triggered by a scheduling instance, specify triggerTime in the trigger parameters of CreateDataQualityScanRun. The system uses triggerTime as the base for the offset.
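The substitution can be pictured with a simplified sketch that handles only the two expression shapes used above: `$[yyyymmdd-1]`-style day offsets and `$[hh24-1/24]`-style hour offsets. The real scheduling system supports a much richer expression language; `expand_time_expressions` is a hypothetical helper for illustration only.

```python
import re
from datetime import datetime, timedelta

def expand_time_expressions(filter_text: str, trigger_time: datetime) -> str:
    """Simplified sketch: expand $[yyyymmdd±N] and $[hh24±N/24] expressions
    relative to the trigger time. Illustrative, not the product implementation."""
    def repl(match: re.Match) -> str:
        fmt, offset = match.group(1), match.group(2)
        t = trigger_time
        if offset:
            if offset.endswith("/24"):      # fraction of a day, e.g. -1/24 = one hour
                t += timedelta(hours=int(offset[:-3]))
            else:                           # whole-day offset, e.g. -1 = yesterday
                t += timedelta(days=int(offset))
        if fmt == "yyyymmdd":
            return t.strftime("%Y%m%d")
        return t.strftime("%H")             # fmt == "hh24"
    return re.sub(r"\$\[(yyyymmdd|hh24)([+-]\d+(?:/24)?)?\]", repl, filter_text)

# A run whose base time is 2024-06-02 00:30 resolves T-1 and "one hour ago":
trigger = datetime(2024, 6, 2, 0, 30)
print(expand_time_expressions("dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]'", trigger))
# dt = '20240601' AND hh = '23'
```

Note that an hour offset can cross a day boundary, as in the example: subtracting one hour from 00:30 yields hour 23 of the previous day, which is why the offset is applied to the full timestamp before formatting.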