All Products
Search
Document Center

Dataphin:Introduction to partition filter expression

Last Updated:Jan 21, 2025

Dataphin Data Quality partition filter expressions are utilized to tailor quality rules based on the data production date. The system offers both standard partition filter expressions and the ability to create custom expressions. This topic covers the syntax of partition filter expressions, their calculation logic, the various types of built-in expressions, and typical business use cases.

Usage recommendations

  • For situations where specifying the validation partition is unnecessary and the goal is to validate partitions updated by code, the code check trigger schedule - task update partition is recommended. This approach eliminates the need to configure a validation partition filter expression, as the system automatically identifies the updated data partitions.

    Note: The task update partition pattern does not support triggering volatility validation rules that require specific partitions, such as partition size, partition row count, or field statistics.

  • For cases where the trigger date is not restricted and validation of specified partitions is required upon the arrival of a new partition or on a daily basis, the business date ds='${yyyyMMdd}', execution date ds='$[yyyyMMdd]', or monthly ds='${yyyyMM}' partition can be used directly.

  • For scenarios that require validation on specific dates, such as the first or last day of the month or business days, conditional scheduling can be enabled. Quality validation can be determined based on the conditions of the public calendar (month, week, day, business day, holiday, tag, etc.) and the task type (timed scheduling, manual scheduling).

Format of partition filter expression

The partition filter expression syntax is defined as follows:

column1=${FORMAT[OPERATE NUM DELTA]} and columnN=$[FORMAT[OPERATE NUM DELTA]];

Example:

ds=${yyyyMMdd-1M} and city='cn-hangzhou' and hour=${HH:mm:ss};

Format Description:

  • ${expression}:

    • The business date of data production serves as the basis for calculation. For instance, data produced on 20220101 will be calculated using that date.

    • For various partition filter expressions, basic expressions like ${yyyy}, ${MM}, ${dd}, etc., must be combined. For example, the expression ds=${yyyy}-${MM}-${dd} will yield the partition condition ds=2022-01-01 for data produced on 20220101. OPERATE can be applied to each basic expression.

  • $[expression]:

    • The current execution date is used for calculations, making it suitable for near real-time tasks at the hour and minute level. For example, if the execution date is 20220101, the calculation will be based on that date.

    • For different format partition filter expressions, basic expressions such as $[yyyy], $[MM], $[dd], etc., must be concatenated. For instance, the expression ds=$[yyyy]-$[MM]-$[dd] will produce the partition condition ds=2022-01-01 for data produced on 20220101. OPERATE can be applied to each basic expression.

  • column1……columnN:

    • Partition fields. Multi-level partitions can be linked with "and" to create multi-level "and" type partition filter expressions. Multi-level "or" type partition filter expressions are not currently supported.

  • FORMAT: The partition format string, which supports the following format strings:

    • yyyy: Year.

    • MM: Month.

    • dd: Day.

    • ld: Last day of the month.

    • HH: 24-hour format.

    • hh: 12-hour format.

    • mm: Minute.

    • ss: Second.

  • OPERATE: The operator, supporting the following:

    • +: Increase. For example, +1y means adding one year, +1M means adding one month.

    • -: Decrease. For example, -1y means subtracting one year, -1M means subtracting one month.

    • ~: Specify. For example, ~3M means specifying the third month, ~1q means specifying the last month of the first quarter.

    • %: Modulus. For example, %1m means resetting the month to zero, %1h means resetting the hour to zero.

  • NUM: The number used to control the quantity of DELTA.

  • DELTA: The identifier, supporting the following:

    Identifier

    Meaning

    Type

    Example

    y

    Year

    Number

    2009

    M

    Month in year

    Text & Number

    July & 07

    d

    Day in month

    Number

    10

    h

    Hour in am/pm (1-12)

    Number

    12

    H

    Hour in day (0-23)

    Number

    0

    m

    Minute in hour

    Number

    30

    s

    Second in minute

    Number

    55

    S

    Millisecond

    Number

    978

    E

    Day in week

    Text

    • 1 (Sunday)

    • 2 (Monday)

    • 3 (Tuesday)

    • 4 (Wednesday)

    • 5 (Thursday)

    • 6 (Friday)

    • 7 (Saturday)

    D

    Day in year

    Number

    189

    F

    Day of week in month

    Number

    2 (2nd Wed in July)

    w

    Week in year

    Number

    27

    W

    Week in month

    Number

    2

    a

    AM/PM marker

    Text

    PM

    k

    Hour in day (1-24)

    Number

    24

    K

    Hour in am/pm (0-11)

    Number

    0

    z

    Time zone

    Text

    Pacific Standard Time

    Q

    Quarter, first month

    Number

    Month=1,4,7,10

    q

    Quarter, last month

    Number

    Month=3,6,9,12

Calculation logic of partition filter expression

Calculation example 1: Date expression, execution condition, and trigger condition description

ds=${yyyyMMld-1d};

Date expression description: When partition data for May is produced, the following calculations are performed:

  1. Step 1: Calculate yyyyMMld, which is the last day of the given month. For May, this is the 31st, resulting in: 20220531.

  2. Step 2: Apply the DELTA, which in this case is subtracting one day. Therefore, 20220531 minus one day is 20220530.

    The final result is 20220530.

Calculation example 2:

ds=$[yyyyMMdd-1M];

When using $[], the calculation is based on the execution date, not the business date of the data. Suppose today is May 3, 2022, and data for April 3 is being backfilled. The following calculations will be performed:

  • Step 1: Calculate yyyyMMdd. Based on the current execution time of May 3, the result is 20220503.

  • Step 2: Apply the DELTA, which is -1M, resulting in the same day of the previous month. The result is 20220403.

When backfilling data for April 3, the rule will be triggered. If backfilling data for May 3 or March 3, the rule will not be triggered. $[] is primarily used for partition expressions related to the execution date.

The rule triggers when backfilling data for April 3. However, it does not trigger for data backfill on May 3 or March 3. The $[] syntax is primarily used for creating partition expressions that pertain to the execution date.

Types of built-in partition filter expressions

  • Business Date (Time): Calculated based on the business date of data production. For example, for data produced on 20220101, the calculation will be performed using that date.

  • Execution Date (Time): The trigger schedule uses the task's timed schedule execution time (actual execution time may be delayed due to upstream dependencies or resource issues, but the task's scheduled time is used); the timed schedule uses the scheduled time T.

Partition Filter Expression

New Name

ds=${yyyyMMdd}

Business Date

ds=${yyyyMMdd} and hour=${HH}

Business Date Time

ds=${HHmmss}

Business Time

ds=$[yyyyMMdd]

Execution Date

ds=$[yyyyMMdd] and hour $[HH]

Execution Date Time

ds=$[HHmmss]

Execution Time

ds=$[yyyyMMdd - 1d]

Day Before Execution Date

full table

Full Table Scan

Common business examples

Produce last month's salary on a specific day each month

For instance, if salaries are issued on the 15th of each month, summarizing the data for the previous month with the partition field formatted as yyyyMM, the following configurations are necessary:

  • Schedule condition configuration: Date - belongs to - 15th.

    image.png

  • Partition filter expression configuration:

ds=$[yyyyMM-1M];

Produce this month's salary on the last day of each month

If salaries are distributed on the last day of each month, summarizing the data for that month with the partition field formatted as yyyyMM, the following configurations are required:

  • Schedule condition configuration: Date - belongs to - last day of the month.

    image.png

  • Partition filter expression configuration:

ds=$[yyyyMM];

Minute-level task checks at the last minute of each hour

For minute-level tasks, typically formatted as yyyyMMdd HH:mm, the last minute of each hour is the 59th minute. This can be expressed with the following partition filter expression: ds=$[yyyyMMdd HH:59];