All Products
Search
Document Center

DataWorks:Data filtering

Last Updated:Feb 28, 2026

Data quality monitoring typically involves checking only the data that has changed, not an entire table. This topic describes how to configure data filtering.

Data filtering configuration example

In most scenarios, you filter changed data based on a time field. For example, you can use time partitions in MaxCompute or filter by the transaction creation time in a MySQL business transaction table.

To filter data, set the filter parameter in your monitoring rules. The following code provides a configuration example:

datasets:
  - type: Table
    tables:
      - tb_d_spec_demo
    filter: "dt = '$[yyyymmdd]' AND hh = '$[hh24-1/24]'"
    dataSource:
      name: odps_first
      envType: Dev
rules:
  - assertion: "row_count > 0"
    # Add a filter at the rule level. It works with the filter in Scan.dataset.
    # The actual filter executed is (dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]') AND (id IS NOT NULL)
    filter: "id IS NOT NULL"
computeResource:
  id: 2001

filter syntax

Basic syntax

The filter syntax is mostly the same as the WHERE clause in the SQL-92 standard, except for the match and overlaps predicates. The syntax is as follows:

<search condition> ::=
    <boolean term> | <search condition> OR <boolean term>

<boolean term> ::=
    <boolean factor> | <boolean term> AND <boolean factor>

<boolean factor> ::=
    [ NOT ] <boolean test>

<boolean test> ::=
    <predicate> | ( <search condition> )

<predicate> ::=
    <comparison predicate> | <between predicate> |
    <in predicate> | <like predicate> | <null predicate> |
    <quantified comparison predicate> | <exists predicate>

<comparison predicate> ::=
    <expression> <comparison_operator> <expression>

<between predicate> ::=
    <expression> [NOT] BETWEEN <lower_bound> AND <upper_bound>

<in predicate> ::=
    <expression> [NOT] IN (value_list | subquery)

<like predicate> ::=
    <expression> [NOT] LIKE pattern [ESCAPE escape_character]

<null predicate> ::=
    <expression> IS [NOT] NULL

<quantified comparison predicate> ::=
    <expression> <comparison_operator> { ALL | ANY | SOME } (subquery)

<exists predicate> ::=
    EXISTS (subquery)

Configuration examples

  • Comparison predicate

    salary > 50000
  • Between predicate

    price BETWEEN 10 AND 50
  • In predicate

    department IN ('Sales', 'Marketing')
  • Like predicate

    name LIKE 'Jo%'
  • Null predicate

    department IS NOT NULL
  • Quantified comparison predicate

    salary > ALL (SELECT salary FROM employees WHERE department = 'Sales')
  • Exists predicate

    EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
  • More complex search condition

    -- Connect two predicates with the AND logical operator
    salary > ALL (SELECT salary FROM employees WHERE department = 'Sales') AND EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
    -- Connect with the OR logical operator
    department IS NOT NULL OR name LIKE 'Jo%'

Use a native query to define a filter

If the query for the underlying data storage cannot be expressed in standard SQL-92 because it uses features such as RLIKE, geolocation search, or full-text search, you can write a native query directly in the filter. Start the native query with the query: keyword.

For example, to filter data in a MaxCompute table using RLIKE, use the following configuration:

datasets:
  - type: Table
    indices:
      - ods_d_customers
    dataSource:
      name: odps_first
      envType: Dev
rules:
  - assertion: "row_count > 0"
    filter: "query: email_address RLIKE '^info@'"

Use time variables in a filter

If you use time partitions, Data Quality scans the partition for the current day or the previous day (T-1). The system must automatically determine the correct date for each daily run. To do this, you can use the time offset expressions of the scheduling system.

For example, if you use a time expression such as $[yyyymmdd-1] in the filter, the Data Quality system automatically calculates the offset based on the instance's scheduled time and replaces the expression with the calculated value.

Note

If Data Quality monitoring is not triggered by a scheduling instance, specify triggerTime in the trigger parameters of CreateDataQualityScanRun. The system uses triggerTime as the base for the offset.