This specification defines resources for Data Quality monitoring and rules in the DataWorks Data Quality module. The specification is written in YAML and is used as a request parameter for OpenAPI to manage resources in the DataWorks Data Quality product.
Basic definitions
YAML example
A simple Data Quality monitoring rule is defined as follows:
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  dataSource:
    name: odps_first
    envType: Dev
rules:
- assertion: row_count > 0
Property description
The preceding specification example defines a simple Data Quality monitoring rule:
datasets: The monitored object for Data Quality. It includes the following three properties:
type: The type of the monitored object. Currently, only the Table enumeration is supported.
dataSource: The data source of the monitored object. You can set name and envType to identify the data source. You can call the ListDataSources operation to obtain the data source name. Note: Currently, only some data sources are supported. For more information, see List of supported data source types.
tables: If the monitored object is a `Table`, you can configure the table names. Note: If the data source is attached at the database level and you need to monitor a table in a non-default schema, you can configure the table in the schema.table format.
rules: The rules that the data is expected to meet. A Data Quality monitoring configuration can contain multiple rules. When Data Quality monitoring runs, it scans the data in the monitored object, calculates the metric value for each rule, and compares the value with the expected threshold. This comparison determines whether the rule passes.
Usually, a data expectation is described by an assertion statement. The statement includes a metric type, such as row_count, a comparison symbol, such as >, and a threshold, such as 0. The check result can be one of the following three states:
pass: The check passed. The collected metric value is within the range defined by the threshold.
fail: The check failed. The collected metric value is outside the range defined by the threshold.
error: An error occurred. Other exceptions occurred during the check, such as a syntax error.
Note: If you explicitly specify a warn-level threshold, the check result can also be warn. For more information, see Add multi-level threshold definitions.
You can view the supported built-in system metric types for the assertion statement. When you create a quality monitoring rule, you can also use custom metric rules.
Threshold definition methods
The assertion statement supports the following threshold definition methods:
Fixed thresholds
This method is suitable for scenarios where you need to directly compare a metric value with a fixed value. A fixed threshold definition includes the following parts:
a metric
an argument (optional)
a comparison symbol (optional)
a threshold (optional)
Configuration example
rules:
# The number of data rows must be greater than 0.
- assertion: row_count > 0
# The maximum value of the size field must be less than or equal to 500.
- assertion: max(size) <= 500
The following table describes the four parts of a fixed threshold expression.
Expression part | Example (from max(size) <= 500)
a metric | max
an argument (optional) | size (shown as / when the metric takes no argument)
a comparison symbol (optional) | <=
a threshold (optional) | 500
For more information about supported comparison symbols, see Supported comparison symbols.
Fluctuation thresholds
This method is suitable for scenarios where you compare the current metric value with historical values of the same metric. For example, you can ensure that the difference between the number of users today and yesterday is within 100, or that the fluctuation of today's revenue compared to the average revenue of the last seven days is within 10%.
Usually, you add a change ... for clause before the metric to describe a fluctuation threshold. A fluctuation threshold definition includes the following parts:
change (keyword)
an aggregate type (optional)
a time window
percent (keyword, optional)
for (keyword)
a metric
an argument (optional)
a comparison symbol (optional)
a threshold (optional)
The combined expression format is: change [aggregate_type] [time_window] [percent] for metric [argument] [comparison_symbol threshold].
Configuration examples
Example 1: Compare the difference with the check result of a specified time window
You can add a time window between change ... for to explicitly specify which historical metrics to compare with the current metric value:
rules:
# The difference between the current number of data rows and the number of data rows from the check 7 days ago must be within 10,000.
- assertion: change 7 days ago for row_count < 10000
Example 2: Aggregate the check results of a specified time window and then compare the difference
You can add an aggregation method (aggregate type) between change and the time window. The system uses the specified aggregation method to calculate an intermediate result from the check records within the time window. This result is then used as the reference value for the current check:
rules:
# The difference between the current number of data rows and the average number of data rows over the last 7 days must be within 10,000.
- assertion: change average last 7 days for row_count < 10000
The following two aggregation methods are supported:
avg: Mean.
var: Variance.
If you do not specify an aggregation method, the system compares the current metric value with all historical check records within the time window. The system then uses the status with the highest severity level as the final check status.
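Based on this description, a fluctuation rule that omits the aggregation method can be sketched as follows (the window and threshold values are illustrative):

```yaml
rules:
# Compare the current row count with every check record from the last 7 days.
# The most severe per-record result becomes the final check status.
- assertion: change last 7 days for row_count < 10000
```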
Example 3: Compare the percentage of fluctuation with historical check results
You can add percent between change ... for to indicate that the fluctuation percentage should be calculated before comparing it with the threshold:
rules:
# The fluctuation percentage between the current number of data rows and the number of data rows from the check 7 days ago must be within 50%.
- assertion: change 7 days ago percent for row_count < 50%
You can add a % symbol at the end of the threshold to improve readability.
Assume the current metric value is `c` and the previous value is `cl`. Then, percent = (c - cl) / cl:
If `cl` is 0 and `c` is also 0, the calculated percentage is 0.
If `cl` is 0 and `c` is not 0, the percentage cannot be calculated, and the check result is `error`.
The result can be a negative number. Note this when you define the threshold. You can use between...and..., as described below, to define the range for the fluctuation percentage.
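Because the percentage can be negative, a symmetric range is usually what you want. A minimal sketch that applies between...and... directly in the assertion (the threshold values are illustrative):

```yaml
rules:
# The day-over-day row count fluctuation must stay within [-10%, 10%].
- assertion: change 1 day ago percent for row_count between -10% and 10%
```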
Range thresholds
You can use between...and... to define a threshold range.
Define a closed interval threshold
An interval defined with between...and... is a closed interval by default. For example, between 10 and 15 passes the check when the value falls within the closed interval [10, 15]. In the following configuration, a warn is triggered if the day-over-day row count fluctuation falls outside the [-1%, 1%] range, and a fail is triggered if the fluctuation falls outside the [-5%, 5%] range.
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  filter: dt='$[yyyymmdd]' AND hh='$[hh24-1/24]'
  dataSource:
    name: odps_first
    envType: Dev
rules:
- assertion: change 1 day ago percent for row_count
  warn:
  - when not between -1% and 1%
  fail:
  - when not between -5% and 5%
Add multi-level threshold definitions
In addition to defining the expected threshold in the rule's assertion statement, you can also omit the expected threshold from the assertion statement and set the rule's warn or fail properties to define more granular levels of thresholds. For example:
rules:
- assertion: duplicate_count(phone)
  warn: when between 1 and 10
  fail: when > 10
In the preceding code snippet, two threshold levels, warn and fail, are defined:
The check passes if the number of duplicate rows for the phone field is 0.
The check result is warn if the number of duplicate rows for the phone field is between 1 and 10.
The check fails if the number of duplicate rows for the phone field is greater than 10.
Handling policy for overlapping warn and fail thresholds
The threshold ranges for warn and fail can overlap. If a metric value falls into this overlapping range, the system uses the more severe status. The result is fail.
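To make the policy concrete, consider a sketch in which the warn and fail ranges overlap (the threshold values are illustrative):

```yaml
rules:
- assertion: duplicate_count(phone)
  # The ranges [1, 20] and (10, +inf) overlap on (10, 20].
  warn: when between 1 and 20
  fail: when > 10
```

A metric value of 15 falls into both ranges, so the system reports the more severe status and the check result is fail.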
Use not between...and... to define the complement of an interval
You can use not before between...and... to obtain the complement of the interval. For example:
rules:
- assertion: duplicate_count(phone)
  warn:
  - when not between -1% and 1%
  fail:
  - when not between -5% and 5%
The interval defined in the preceding code snippet is shown in the following figure:
Set a unique identity for a rule
You can specify an identity for a rule. This serves as the unique identifier for the rule within the same Data Quality monitoring scope.
If you do not specify an identity when creating a rule, the system automatically assigns an id to the rule. You must ensure that the identity is unique within the same Data Quality monitoring configuration. Otherwise, updates may fail or mistakenly update other rules. We recommend using a readable string for easy management. For rules with fluctuation thresholds (change...for...) and anomaly detection rules (anomaly detection for ...), the rule's identity is used to query the check history.
When the system receives a request to update a quality monitoring configuration, it processes the request as follows:
First, it traverses each rule in the update request. It uses the rule's id to match and update existing rules in the quality monitoring configuration.
It deletes existing rules in the quality monitoring configuration that do not have a matching id.
It creates the remaining rules from the update request as new rules in the quality monitoring configuration.
The following example shows how to specify an identity:
rules:
- assertion: row_count > 0
  name: Number of data rows is greater than 0
  # Specify a unique identifier within the Data Quality monitoring scope.
  identity: table-not-empty
Define the business severity level of a rule
You can set the severity of a rule to mark the severity of its impact on your business. This helps with subsequent management. For example:
rules:
- assertion: row_count > 0
  severity: High
The severity property supports two levels:
High
Normal (default)
Set the enabling status of a rule
You can specify the enabled flag for a rule to manage its enabling status. For rules that you do not want to use temporarily but do not want to delete, you can set enabled to false. This temporarily disables the rule. The rule will not be triggered during monitoring execution.
rules:
- assertion: row_count > 0
  # The enabling status of the rule. The default value is true.
  enabled: false
Set prerequisite statements
In some business scenarios, you may need to execute SET statements to adjust parameters before executing the SQL to calculate metrics. This ensures that the metric calculation SQL can run correctly or guarantees performance. You can add a settingConfig setting in the rule. For example:
rules:
- assertion: row_count > 0
  # Set the prerequisite SET statements to be executed.
  settingConfig: SET odps.sql.udf.timeout=600s; SET odps.sql.python.version=cp27;
Set the switch to retain problematic data
You can enable the switch to retain problematic data at the rule level. When a rule check does not pass (in a warn or fail state), the system automatically filters the data that caused the failure and saves it to another table in the same database as the monitored object's table. To enable the switch to retain problematic data, use the following configuration:
rules:
- assertion: duplicate_count(phone) = 0
  collectFailedRows: true
You can set collectFailedRows to true to enable the retention of problematic data. For custom SQL rules, you must also specify failedRowsQuery to explicitly configure the filter statement for retaining problematic data:
rules:
- assertion: id_null_count = 0
  id_null_count:
    expression: id IS NULL
  collectFailedRows: true
  failedRowsQuery: SELECT * FROM tb_d_spec_demo WHERE dt = '$[yyyymmdd-1]' AND id IS NULL
In addition to custom SQL rules, only the following metrics support retaining problematic data. Enabling collectFailedRows is not supported for other metrics at this time:
missing_count
missing_percent
duplicate_count
duplicate_percent
distinct_count
distinct_percent
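For these built-in metrics, enabling the switch requires only collectFailedRows, with no failedRowsQuery. A minimal sketch using one of the supported metrics:

```yaml
rules:
- assertion: missing_count(phone) = 0
  # Rows that cause the check to fail are saved automatically
  # to another table in the same database as the monitored table.
  collectFailedRows: true
```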
Set the rule name and description
You can set a name and description for a rule for easy management.
For example, you can set a name for the entire monitoring configuration and a separate name for each rule. These names can be used in subsequent check results and on the UI page for easy management.
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  filter: dt='$[yyyymmdd]' AND hh='$[hh24-1/24]'
  dataSource:
    name: odps_first
    envType: Dev
rules:
- assertion: row_count > 0
  # Rule name and description
  name: Number of data rows is greater than 0
  description: The output data cannot be empty
Set data filtering
Set data filtering for a rule
If you only need to use a subset of data for metric statistics when checking a rule, you can add a filter setting to the rule. For example, to count only the data where the id field is not NULL:
rules:
- assertion: row_count > 0
  filter: id IS NOT NULL
Set data filtering for quality monitoring
You can also add a filter setting in Scan.Dataset. In this case, the filter is effective for all rules in the quality monitoring configuration during the check process.
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  filter: dt='$[yyyymmdd]' AND hh='$[hh24-1/24]'
  dataSource:
    name: odps_first
rules:
- assertion: row_count > 0
In the preceding code snippet, a filter setting is added to the dataset. When each rule is checked, the data is first filtered using this filter before the subsequent check is performed.
In the example filter definition, time expressions such as $[yyyymmdd] are used. The system replaces each expression in the filter with a value calculated based on the triggerTime parameter and a time offset. For more information about the supported parameter reference methods, see Data filtering configurations.
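As a sketch, a dataset filter can use a time expression with an offset. Assuming a daily-partitioned table whose dt column holds the partition date, the previous day's partition could be selected as:

```yaml
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  # $[yyyymmdd-1] is replaced with the date one day before triggerTime.
  filter: dt='$[yyyymmdd-1]'
  dataSource:
    name: odps_first
```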
Appendix
Supported comparison symbols
>
>=
<
<=
=
!=
between ... and ...
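These comparison symbols can be combined with any supported metric in an assertion. A few illustrative assertions (the field names email, price, and the thresholds are hypothetical):

```yaml
rules:
- assertion: row_count >= 1000
- assertion: missing_count(email) = 0
- assertion: avg(price) between 10 and 500
- assertion: table_size != 0
```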
List of supported data source types
The following data source types are supported:
maxcompute
hologres
emr
mysql
analyticdb_for_mysql
analyticdb_for_postgresql
cdh
starrocks
Supported built-in system metric types
avg
row_count
sum
min
max
distinct_count
distinct_percent
table_size
missing_count
missing_percent
duplicate_percent
duplicate_count
group_by
invalid_count
invalid_distinct_count
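Most of these metrics take a field as an argument, as in the duplicate_count(phone) examples above, while table-level metrics such as row_count and table_size take none. A sketch (the field names user_id and age are hypothetical):

```yaml
rules:
# Table-level metric: no argument.
- assertion: table_size > 0
# Field-level metrics: the field is passed as an argument.
- assertion: distinct_count(user_id) > 0
- assertion: min(age) >= 0
```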
Time window definition methods
DataWorks Data Quality supports multiple methods for defining time windows. The basic formats are as follows:
n time units ago: n (minute[s]|hour[s]|day[s]|week[s]|month[s]) ago, for example, n months ago, n days ago, or n hours ago. Examples:
1 day ago
7 days ago
1 month ago
8 hours ago
15 minutes ago
Relative time range: last n (minute[s]|hour[s]|day[s]|week[s]|month[s]), such as last 15 minutes, last 7 days, and last 1 month. Examples:
last 15 minutes
last 24 hours
Note: Starts from 24 hours before the current time and ends at the current time. A data point is collected every hour.
last 7 days
last 1 month
Day of a period: 1/2/3/.../-3/-2/-1 of (current|last|n) (months|weeks) (ago). Examples:
Same time on the 1st of the current month: 1 of current month
Same time on the last day of the last month: -1 of last month
Same time on Tuesday 3 weeks ago: 2 of 3 weeks ago
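Combining these window formats with the fluctuation syntax described earlier, the following illustrative assertions show each window style in use (the thresholds are hypothetical):

```yaml
rules:
# "n units ago" window with a percentage comparison.
- assertion: change 1 month ago percent for row_count < 20%
# "last n units" window with an aggregation method, as in Example 2 above.
- assertion: change average last 24 hours for row_count < 5000
```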