This specification defines resources for Data Quality monitoring and rules in the DataWorks Data Quality module. The specification is written in YAML and is used as a request parameter for OpenAPI to manage resources in the DataWorks Data Quality product.
Basic definitions
YAML example
A simple Data Quality monitoring rule is defined as follows:
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  dataSource:
    name: odps_first
    envType: Dev
rules:
- assertion: row_count > 0
Property description
The preceding specification example defines a simple Data Quality monitoring rule:
datasets: The monitored object for Data Quality. It includes the following three properties:
type: The type of the monitored object. Currently, only the Table enumeration is supported.
dataSource: The data source of the monitored object. You can set name and envType to identify the data source. You can call the ListDataSources operation to obtain the data source name. Note: Currently, only some data sources are supported. For more information, see List of supported data source types.
tables: If the monitored object is a `Table`, you can configure the table names. Note: If the data source is attached at the database level and you need to monitor a table in a non-default schema, you can configure the table in the schema.table format.
rules: The rules that the data is expected to meet. A Data Quality monitoring configuration can contain multiple rules. When Data Quality monitoring runs, it scans the data in the monitored object, calculates the metric value for each rule, and compares the value with the expected threshold. This comparison determines whether the rule passes.
Usually, a data expectation is described by an assertion statement. The statement includes a metric type, such as row_count, a comparison symbol, such as >, and a threshold, such as 0. The check result can be one of the following three states:
pass: The check passed. The collected metric value is within the range defined by the threshold.
fail: The check failed. The collected metric value is outside the range defined by the threshold.
error: An error occurred. Other exceptions occurred during the check, such as a syntax error.
Note: If you explicitly specify a warn-level threshold, the check result can also be warn. For more information, see Add multi-level threshold definitions.
You can view the supported built-in system metric types for the assertion statement. When you create a quality monitoring rule, you can also use custom metric rules.
Threshold definition methods
The assertion statement supports the following threshold definition methods:
Fixed thresholds
This method is suitable for scenarios where you need to directly compare a metric value with a fixed value. A fixed threshold definition includes the following parts:
a metric
an argument (optional)
a comparison symbol (optional)
a threshold (optional)
Configuration example
rules:
# The number of data rows must be greater than 0.
- assertion: row_count > 0
# The maximum value of the size field must be less than or equal to 500.
- assertion: max(size) <= 500
The following table describes the four parts of a fixed threshold expression.
Expression part | Example (from max(size) <= 500)
a metric | max
an argument (optional) | size (shown as / when the metric takes no argument)
a comparison symbol (optional) | <=
a threshold (optional) | 500
For more information about supported comparison symbols, see Supported comparison symbols.
Fluctuation thresholds
This method is suitable for scenarios where you compare the current metric value with historical values of the same metric. For example, you can ensure that the difference between the number of users today and yesterday is within 100, or that the fluctuation of today's revenue compared to the average revenue of the last seven days is within 10%.
Usually, you add a change ... for clause before the metric to describe a fluctuation threshold. A fluctuation threshold definition includes the following parts:
change (keyword)
an aggregate type (optional)
a time window
percent (keyword, optional)
for (keyword)
a metric
an argument (optional)
a comparison symbol (optional)
a threshold (optional)
The combined expression format is: change [aggregate_type] [time_window] [percent] for metric [argument] [comparison_symbol threshold].
Configuration examples
Example 1: Compare the difference with the check result of a specified time window
You can add a time window between change ... for to explicitly specify which historical metrics to compare with the current metric value:
rules:
# The difference between the current number of data rows and the number of data rows from the check 7 days ago must be within 10,000.
- assertion: change 7 days ago for row_count < 10000
Example 2: Aggregate the check results of a specified time window and then compare the difference
You can add an aggregation method (aggregate type) between change and the time window. The system uses the specified aggregation method to calculate an intermediate result from the check records within the time window. This result is then used as the reference value for the current check:
rules:
# The difference between the current number of data rows and the average number of data rows over the last 7 days must be within 10,000.
- assertion: change average last 7 days for row_count < 10000
The following two aggregation methods are supported:
avg: Mean.
var: Variance.
If you do not specify an aggregation method, the system compares the current metric value with all historical check records within the time window. The system then uses the status with the highest severity level as the final check status.
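Based on this description, a fluctuation rule that omits the aggregation method can be sketched as follows (the window and threshold values are illustrative):

```yaml
rules:
# Compare the current row count with every check record from the last 7 days.
# The most severe per-record result becomes the final check status.
- assertion: change last 7 days for row_count < 10000
```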
Example 3: Compare the percentage of fluctuation with historical check results
You can add percent between change ... for to indicate that the fluctuation percentage should be calculated before comparing it with the threshold:
rules:
# The fluctuation percentage between the current number of data rows and the number of data rows from the check 7 days ago must be within 50%.
- assertion: change 7 days ago percent for row_count < 50%
You can add a % symbol at the end of the threshold to improve readability.
Assume the current metric value is `c` and the previous value is `cl`. Then, percent = (c - cl) / cl:
If `cl` is 0 and `c` is also 0, the calculated percentage is 0.
If `cl` is 0 and `c` is not 0, the percentage cannot be calculated, and the check result is `error`.
The result can be a negative number. Note this when you define the threshold. You can use between...and..., as described below, to define the range for the fluctuation percentage.
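Because the percentage can be negative, a symmetric range is usually what you want. A minimal sketch that applies between...and... directly in the assertion (the threshold values are illustrative):

```yaml
rules:
# The day-over-day row count fluctuation must stay within [-10%, 10%].
- assertion: change 1 day ago percent for row_count between -10% and 10%
```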
Range thresholds
You can use between...and... to define a threshold range.
Define a closed interval threshold
An interval defined with between...and... is a closed interval by default. For example, between 10 and 15 passes the check when the value falls within the closed interval [10, 15]. In the following configuration, a warn is triggered if the day-over-day row count fluctuation falls outside the [-1%, 1%] range, and a fail is triggered if the fluctuation falls outside the [-5%, 5%] range.
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  filter: dt='$[yyyymmdd]' AND hh='$[hh24-1/24]'
  dataSource:
    name: odps_first
    envType: Dev
rules:
- assertion: change 1 day ago percent for row_count
  warn:
  - when not between -1% and 1%
  fail:
  - when not between -5% and 5%
Add multi-level threshold definitions
In addition to defining the expected threshold in the rule's assertion statement, you can also omit the expected threshold from the assertion statement and set the rule's warn or fail properties to define more granular levels of thresholds. For example:
rules:
- assertion: duplicate_count(phone)
  warn: when between 1 and 10
  fail: when > 10
In the preceding code snippet, two threshold levels, warn and fail, are defined:
The check passes if the number of duplicate rows for the phone field is 0.
The check result is warn if the number of duplicate rows for the phone field is between 1 and 10.
The check fails if the number of duplicate rows for the phone field is greater than 10.
Handling policy for overlapping warn and fail thresholds
The threshold ranges for warn and fail can overlap. If a metric value falls into this overlapping range, the system uses the more severe status. The result is fail.
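To make the policy concrete, consider a sketch in which the warn and fail ranges overlap (the threshold values are illustrative):

```yaml
rules:
- assertion: duplicate_count(phone)
  # The ranges [1, 20] and (10, +inf) overlap on (10, 20].
  warn: when between 1 and 20
  fail: when > 10
```

A metric value of 15 falls into both ranges, so the system reports the more severe status and the check result is fail.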
Use not between...and... to define the complement of an interval
You can use not before between...and... to obtain the complement of the interval. For example:
rules:
- assertion: duplicate_count(phone)
  warn:
  - when not between -1% and 1%
  fail:
  - when not between -5% and 5%
The interval defined in the preceding code snippet is shown in the following figure:
Set a unique identity for a rule
You can specify an identity for a rule. This serves as the unique identifier for the rule within the same Data Quality monitoring scope.
If you do not specify an identity when creating a rule, the system automatically assigns an id to the rule. You must ensure that the identity is unique within the same Data Quality monitoring configuration. Otherwise, updates may fail or mistakenly update other rules. We recommend using a readable string for easy management. For rules with fluctuation thresholds (change...for...) and anomaly detection rules (anomaly detection for ...), the rule's identity is used to query the check history.
When the system receives a request to update a quality monitoring configuration, it processes the request as follows:
First, it traverses each rule in the update request. It uses the rule's id to match and update existing rules in the quality monitoring configuration.
It deletes existing rules in the quality monitoring configuration that do not have a matching id.
It creates the remaining rules from the update request as new rules in the quality monitoring configuration.
The following example shows how to specify an identity:
rules:
- assertion: row_count > 0
  name: Number of data rows is greater than 0
  # Specify a unique identifier within the Data Quality monitoring scope.
  identity: table-not-empty
Define the business severity level of a rule
You can set the severity of a rule to mark the severity of its impact on your business. This helps with subsequent management. For example:
rules:
- assertion: row_count > 0
  severity: High
The severity property supports two levels:
High
Normal (default)
Set the enabling status of a rule
You can specify the enabled flag for a rule to manage its enabling status. For rules that you do not want to use temporarily but do not want to delete, you can set enabled to false. This temporarily disables the rule. The rule will not be triggered during monitoring execution.
rules:
- assertion: row_count > 0
  # The enabling status of the rule. The default value is true.
  enabled: false
Set prerequisite statements
In some business scenarios, you may need to execute SET statements to adjust parameters before executing the SQL to calculate metrics. This ensures that the metric calculation SQL can run correctly or guarantees performance. You can add a settingConfig setting in the rule. For example:
rules:
- assertion: row_count > 0
  # Set the prerequisite SET statements to be executed.
  settingConfig: SET odps.sql.udf.timeout=600s; SET odps.sql.python.version=cp27;
Set the switch to retain problematic data
You can enable the switch to retain problematic data at the rule level. When a rule check does not pass (in a warn or fail state), the system automatically filters the data that caused the failure and saves it to another table in the same database as the monitored object's table. To enable the switch to retain problematic data, use the following configuration:
rules:
- assertion: duplicate_count(phone) = 0
  collectFailedRows: true
You can set collectFailedRows to true to enable the retention of problematic data. For custom SQL rules, you must also specify failedRowsQuery to explicitly configure the filter statement for retaining problematic data:
rules:
- assertion: id_null_count = 0
  id_null_count:
    expression: id IS NULL
  collectFailedRows: true
  failedRowsQuery: SELECT * FROM tb_d_spec_demo WHERE dt = '$[yyyymmdd-1]' AND id IS NULL
In addition to custom SQL rules, only the following metrics support retaining problematic data. Enabling collectFailedRows is not supported for other metrics at this time:
missing_count
missing_percent
duplicate_count
duplicate_percent
distinct_count
distinct_percent
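For these built-in metrics, enabling the switch requires only collectFailedRows, with no failedRowsQuery. A minimal sketch using one of the supported metrics:

```yaml
rules:
- assertion: missing_count(phone) = 0
  # Rows that cause the check to fail are saved automatically
  # to another table in the same database as the monitored table.
  collectFailedRows: true
```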
Set the rule name and description
You can set a name and description for a rule for easy management.
For example, you can set a name for the entire monitoring configuration and a separate name for each rule. These names can be used in subsequent check results and on the UI page for easy management.
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  filter: dt='$[yyyymmdd]' AND hh='$[hh24-1/24]'
  dataSource:
    name: odps_first
    envType: Dev
rules:
- assertion: row_count > 0
  # Rule name and description
  name: Number of data rows is greater than 0
  description: The output data cannot be empty
Set data filtering
Set data filtering for a rule
If you only need to use a subset of data for metric statistics when checking a rule, you can add a filter setting to the rule. For example, to count only the data where the id field is not NULL:
rules:
- assertion: row_count > 0
  filter: id IS NOT NULL
Set data filtering for quality monitoring
You can also add a filter setting in Scan.Dataset. In this case, the filter is effective for all rules in the quality monitoring configuration during the check process.
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  filter: dt='$[yyyymmdd]' AND hh='$[hh24-1/24]'
  dataSource:
    name: odps_first
rules:
- assertion: row_count > 0
In the preceding code snippet, a filter setting is added to the dataset. When each rule is checked, the data is first filtered using this filter before the subsequent check is performed.
In the example filter definition, time expressions such as $[yyyymmdd] are used. The system replaces each expression in the filter with a value calculated based on the triggerTime parameter and a time offset. For more information about the supported parameter reference methods, see Data filtering configurations.
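As a sketch, a dataset filter can use a time expression with an offset. Assuming a daily-partitioned table whose dt column holds the partition date, the previous day's partition could be selected as:

```yaml
datasets:
- type: Table
  tables:
  - tb_d_spec_demo
  # $[yyyymmdd-1] is replaced with the date one day before triggerTime.
  filter: dt='$[yyyymmdd-1]'
  dataSource:
    name: odps_first
```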
Appendix
Supported comparison symbols
>
>=
<
<=
=
!=
between ... and ...
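These comparison symbols can be combined with any supported metric in an assertion. A few illustrative assertions (the field names email, price, and the thresholds are hypothetical):

```yaml
rules:
- assertion: row_count >= 1000
- assertion: missing_count(email) = 0
- assertion: avg(price) between 10 and 500
- assertion: table_size != 0
```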
List of supported data source types
The following data source types are supported:
maxcompute
hologres
emr
mysql
analyticdb_for_mysql
analyticdb_for_postgresql
cdh
starrocks
Supported built-in system metric types
avg
row_count
sum
min
max
distinct_count
distinct_percent
table_size
missing_count
missing_percent
duplicate_percent
duplicate_count
group_by
invalid_count
invalid_distinct_count
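Most of these metrics take a field as an argument, as in the duplicate_count(phone) examples above, while table-level metrics such as row_count and table_size take none. A sketch (the field names user_id and age are hypothetical):

```yaml
rules:
# Table-level metric: no argument.
- assertion: table_size > 0
# Field-level metrics: the field is passed as an argument.
- assertion: distinct_count(user_id) > 0
- assertion: min(age) >= 0
```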
Time window definition methods
DataWorks Data Quality supports multiple methods for defining time windows. The basic formats are as follows:
n time units ago: n (minute[s]|hour[s]|day[s]|week[s]|month[s]) ago, for example, n months ago, n days ago, or n hours ago. Examples:
1 day ago
7 days ago
1 month ago
8 hours ago
15 minutes ago
Relative time range: last n (minute[s]|hour[s]|day[s]|week[s]|month[s]), such as last 15 minutes, last 7 days, and last 1 month. Examples:
last 15 minutes
last 24 hours
Note: Starts from 24 hours before the current time and ends at the current time. A data point is collected every hour.
last 7 days
last 1 month
Day of a period: 1/2/3/.../-3/-2/-1 of (current|last|n) (months|weeks) (ago). Examples:
Same time on the 1st of the current month: 1 of current month
Same time on the last day of the last month: -1 of last month
Same time on Tuesday 3 weeks ago: 2 of 3 weeks ago
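Combining these window formats with the fluctuation syntax described earlier, the following illustrative assertions show each window style in use (the thresholds are hypothetical):

```yaml
rules:
# "n units ago" window with a percentage comparison.
- assertion: change 1 month ago percent for row_count < 20%
# "last n units" window with an aggregation method, as in Example 2 above.
- assertion: change average last 24 hours for row_count < 5000
```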