Dataphin supports the creation of quality rules to validate metrics, enhancing the convenience of metric quality monitoring. This topic describes how to configure metric quality rules.
Prerequisites
Quality rules can be configured only after you have added monitored objects. For more information, see Add monitored objects.
Permission description
Super administrators, quality administrators, custom global roles with Quality Rule - Manage permissions, custom project roles with Project Quality Management - Quality Rule Management permissions for the project where the table resides, and metric business owners can configure scheduling, alerts, and more for quality rules.
Quality owners and regular users who need additional read permissions on logical table fields can request them. For more information, see how to apply for, renew, and return table permissions.
The permissions supported for different objects vary. For details, see Quality Rule Operation Permissions.
Validation rule description
When a metric is validated against quality rules, the outcome depends on the rule strength. If a weak rule is triggered, the system sends alert messages so that you can promptly identify and address anomalies. If a strong rule is triggered, the system sends the same alert messages and also automatically interrupts the tasks associated with the table to prevent dirty data from flowing downstream.
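The weak and strong rule behavior can be summarized as a small decision function. This is only an illustrative Python sketch, not Dataphin code: `RuleStrength`, `Outcome`, and `handle_validation` are hypothetical names, and the `has_downstream_tasks` flag is an assumption that stands in for whether the triggering schedule has tasks that can be blocked.

```python
from dataclasses import dataclass
from enum import Enum


class RuleStrength(Enum):
    WEAK = "weak"
    STRONG = "strong"


@dataclass(frozen=True)
class Outcome:
    send_alert: bool
    block_downstream: bool


def handle_validation(strength: RuleStrength, abnormal: bool,
                      has_downstream_tasks: bool) -> Outcome:
    """Return what happens after a rule is validated (hypothetical model)."""
    if not abnormal:
        # A passing validation neither alerts nor blocks.
        return Outcome(send_alert=False, block_downstream=False)
    # Any triggered rule sends an alert. Only a strong rule blocks,
    # and only when there is something downstream to block.
    block = strength is RuleStrength.STRONG and has_downstream_tasks
    return Outcome(send_alert=True, block_downstream=block)
```

In this model a triggered weak rule always yields an alert without blocking, while a triggered strong rule blocks only when downstream tasks exist.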
Differences between trial runs and executions
The differences between trial runs and executions lie in the execution method and displayed results. Trial runs refer to running a quality rule in a test mode to check its correctness and execution status. The results of trial runs are not displayed in quality reports. Executions refer to checking quality rules within a specific time frame. The results of executions are output to quality reports for users to view and analyze.
Quality rule configuration
On the Dataphin home page, in the top menu bar, select Administration > Data Quality.
Click Quality Rule in the left-side navigation pane. On the Metrics page, click the name of the target object to enter the Quality Rule Details page and configure the quality rule.
On the Quality Rule Details page, click the Create Quality Rule button.
In the Create Quality Rule dialog box, configure the parameters.
Parameter
Description
Basic Information
Rule Name
The name of the custom quality rule.
Rule Strength
Supports Weak Rule and Strong Rule.
If you select Weak Rule, an alert is triggered when the quality rule validation result is abnormal but it does not block downstream task nodes.
If you select Strong Rule, an alert is triggered when the quality rule validation result is abnormal. If downstream tasks exist (code inspection scheduling or task trigger scheduling), they are blocked to prevent polluted data from spreading. If no downstream tasks exist (for example, periodic quality scheduling), only the alert is triggered.
Description
Description of the custom quality rule. No more than 128 characters.
Configuration Method
Template Creation: Quickly create quality rules using general system templates and custom business templates.
System Template: Built-in parameters in the template can be configured, suitable for general rule creation.
Custom Template: Preset parameters in the template do not require configuration, generally used for rule creation with business logic.
SQL: Flexibly customize quality monitoring rules through SQL, suitable for flexible and complex scenarios.
Rule Template
Select a rule template from the drop-down list: Uniqueness, Stability, or SQL.
Uniqueness: Includes Field Group Count Validation and Duplicate Value Count Validation.
Stability: Includes Column Stability Validation and Column Volatility Validation.
SQL: Includes Custom Statistic Validation.
For more information, see the referenced document.
Rule Type
The rule type is related to the template and is the most basic property of the template. It can be used for description and filtering functions.
Rule Configuration
Rule Configuration
When Rule Template is selected as Uniqueness, the corresponding parameters are as follows.
Field Group Count Validation/Duplicate Value Count Validation:
Validation Table Data Filtering: Disabled by default. When enabled, you can configure filter conditions for the validation table, either partition filters or general data filters. The filter conditions are appended directly to the validation SQL. If the validation table needs partition filtering, it is recommended to configure the partition expression in the scheduling configuration. The quality report then uses the validation partition as its minimum granularity. Example filter conditions:
id = 12 --single table
T1.id=12 and T2.name = "Zhang San" --double table
When Rule Template is selected as Stability, the corresponding parameters are as follows.
Column Stability Validation/Column Volatility Validation:
Statistical Method: It is recommended to choose the statistical method based on the business scenario.
Validation Table Data Filtering: Disabled by default. When enabled, you can configure filter conditions for the validation table, either partition filters or general data filters. The filter conditions are appended directly to the validation SQL. If the validation table needs partition filtering, it is recommended to configure the partition expression in the scheduling configuration. The quality report then uses the validation partition as its minimum granularity. Example filter conditions:
id = 12 --single table
T1.id=12 and T2.name = "Zhang San" --double table
When Rule Template is selected as SQL, the corresponding parameters are as follows.
Custom Statistic Validation:
SQL: Supports select query statements. The query object must include the primary table. For example:
select sum(sale) from tableA where ds=${bizdate};
Validation Configuration
Rule Validation
After a data quality rule is validated, the result is compared with the abnormal validation configuration. If the configured conditions are met, the validation fails and triggers alerts and other subsequent processes.
The indicators available for abnormal validation are determined by the template and its configuration. Multiple conditions can be combined with and or or. In practice, it is recommended to configure fewer than three conditions.
For more information, see the referenced document.
Business Property Configuration
Property Information
The specification for filling in business properties depends on the configuration of the quality rule properties. For example:
The field value type corresponding to the department in charge is an enumeration value (multiple choice) whose selectable values are Big Data Department, Business Department, and Technical Department. Therefore, when you create a quality rule, this property is a multi-select drop-down list that offers those three values.
The field value type corresponding to the rule owner is custom input, and the property field length is 256. Therefore, when creating a quality rule, this property value can be entered with up to 256 characters.
If the method for filling in the property field is Range Interval, the configuration method is as follows:
Range Interval: Commonly used when the value range is continuous numbers or dates. You can choose from four symbols: >, >=, <, <=. For more property configurations, see the referenced document.
Scheduling Property Configuration
Scheduling Method
Supports selecting a configured schedule. If the scheduling method is not yet decided, you can configure it after creating the quality rule. If you need to create a new one, see the referenced document.
Click Save to complete the rule configuration.
You can click Preview SQL to compare the current configuration with the last saved configuration, which helps in viewing SQL changes.
Note: If the key information is not completely filled in, the SQL preview is unavailable.
The left side shows the SQL preview of the last saved configuration. If not configured, it is empty. The right side shows the SQL preview of the current configuration.
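The abnormal validation step described under Rule Validation compares a rule's result with one or more conditions joined by and or or. A minimal Python sketch of that comparison follows. The names `Condition` and `is_abnormal` and the set of comparison symbols are assumptions made for illustration. The indicators and operators actually available depend on the template.

```python
import operator
from dataclasses import dataclass
from typing import Sequence

# Comparison symbols mapped to Python operators (illustrative set only).
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


@dataclass(frozen=True)
class Condition:
    symbol: str       # one of the keys in _OPS
    threshold: float  # value the rule result is compared against


def is_abnormal(value: float, conditions: Sequence[Condition],
                combine: str = "and") -> bool:
    """Return True when the result meets the abnormal conditions."""
    if not conditions:
        return False  # nothing configured, so nothing can be abnormal
    results = [_OPS[c.symbol](value, c.threshold) for c in conditions]
    if combine == "and":
        return all(results)
    if combine == "or":
        return any(results)
    raise ValueError(f"unknown combinator: {combine!r}")
```

For example, a duplicate value count rule could be treated as failed when the count is greater than 0 by passing `[Condition(">", 0)]`.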
Rule configuration list
You can view the configured metric rule information on the rule configuration list page and perform operations such as view, edit, trial run, run, or delete.
Area
Description
①Filter and search area
Supports quick search by object or rule name.
Supports filtering by rule type, rule template, rule strength, trial run status, or active status.
Note: If a quality rule property is configured as a searchable and filterable business property and is enabled, you can search or filter by that property.
②List area
Displays the object type/name, rule name/ID, trial run status, active status, rule type, rule template, rule strength, schedule type, and related knowledge base document information of the rule configuration list. Click the icon before the refresh icon to select the rule list fields to display.
Active Status: It is recommended to perform a trial run before activating a rule and to activate only rules that pass the trial run, so that incorrect rules do not block online tasks.
After activating the status, the selected rules will automatically execute according to the configured schedule.
After deactivating the status, the selected rules will not automatically execute but can be manually executed.
Related Knowledge Base Document: Click View Details to view the knowledge base information associated with the rule. This includes table name, validation object, rule, and related knowledge base document information. You can also perform search, view, edit, or delete operations on the knowledge base. For more information, see the referenced document.
③Operation area
You can perform view, clone, edit, trial run, run, schedule configuration, associate knowledge base document, or delete operations.
View: View the details of the rule configuration.
Clone: Quickly clone a rule.
Edit: After editing a rule, a trial run is required again.
Trial Run: Supports selecting Existing Schedule or Custom Validation Range to trial run the rule. After the trial run, click the icon to view the trial run log.
Run: Supports selecting Existing Schedule or Custom Validation Range to run the rule. After running, you can view the validation results in Quality Record.
Scan Configuration: Supports filtering schedule types or quick searching schedules by schedule name in the dialog box. Also supports editing schedules.
Associate Knowledge Base Document: After associating a rule with a knowledge base, you can view the associated knowledge in the quality rule and administration workbench. Supports selecting unassociated knowledge bases. For creation, see create and manage knowledge base.
Delete: Deleting this quality rule object deletes all quality rules under the object. This action cannot be undone. Proceed with caution.
④Batch operation area
You can perform batch trial run, run, schedule configuration, enable, shutdown, modify business properties, associate knowledge base document, or delete operations.
Trial Run: Supports selecting Existing Schedule or Custom Validation Range to batch trial run rules. After the trial run, click the icon to view the trial run log.
Run: Supports selecting Existing Schedule or Custom Validation Range to batch run rules. After running, you can view the validation results in Quality Record.
Scan Configuration: Supports filtering schedule types or quick searching schedules by schedule name in the dialog box. Also supports editing schedules to batch configure schedules for quality rules. Only supports modifying selected rules that are editable on the quality rule list page.
Enable: After batch enabling the active status, the selected rules will automatically execute according to the configured schedule. Only supports enabling selected rules that are editable on the quality rule list page.
Shutdown: After batch deactivating the active status, the selected rules will not automatically execute but can be manually executed. Only supports deactivating selected rules that are editable on the quality rule list page.
Modify Business Properties: When the field value type corresponding to the business property is single or multiple choice, batch modification of business properties is supported.
When the field value type corresponding to the business property is multiple choice, appending or modifying property values is supported.
When the field value type corresponding to the business property is single choice, direct modification of property values is supported.
Associate Knowledge Base Document: After associating rules with knowledge, you can view the associated knowledge in the quality rule and administration workbench. Supports batch configuration of knowledge bases for monitored objects. For creation, see create and manage knowledge base.
Delete: Supports batch deletion of quality rule objects. This action cannot be undone. Proceed with caution. Only supports deleting selected rules that are editable on the quality rule list page.
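The Modify Business Properties behavior in the batch operation area (append or modify for multiple choice, direct modification for single choice) can be sketched as follows. This is a hypothetical Python model only: `modify_property`, the `value_type` strings, and the `mode` strings are names invented for this example.

```python
from typing import List


def modify_property(current: List[str], new_values: List[str],
                    value_type: str, mode: str = "replace") -> List[str]:
    """Return the property values after a batch modification."""
    if value_type == "single":
        # Single choice: only a direct replacement with one value makes sense.
        if mode != "replace" or len(new_values) != 1:
            raise ValueError("single choice requires replacing with one value")
        return list(new_values)
    if value_type == "multiple":
        if mode == "replace":
            # dict.fromkeys removes duplicates while keeping order.
            return list(dict.fromkeys(new_values))
        if mode == "append":
            return list(dict.fromkeys([*current, *new_values]))
        raise ValueError(f"unknown mode: {mode!r}")
    raise ValueError(f"unknown value type: {value_type!r}")
```

Appending to a multiple-choice property keeps the existing values and adds only the new ones, whereas replacing discards the existing values.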
New scheduling
When setting up scheduling rules, you can swiftly create configurations using existing schedules, with a maximum of 20 rules per table.
A maximum of 10 schedules can be configured for the same rule.
Automatic deduplication is supported when the scheduling configuration is fully consistent.
The validation scope will be issued as a filter condition in the quality validation statement to control the scope of each quality validation. The validation scope will also serve as the basic unit for subsequent quality reports and other downstream processes. Viewing quality reports will use the validation scope as the smallest viewing granularity.
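The limit of 10 schedules per rule and the automatic deduplication of fully consistent configurations can be illustrated with a short Python sketch. The `Schedule` fields and the `attach_schedules` helper are assumptions for this example. The document does not state which fields Dataphin compares when it decides that two configurations are fully consistent.

```python
from dataclasses import dataclass
from typing import Iterable, List

MAX_SCHEDULES_PER_RULE = 10  # limit stated in this topic


@dataclass(frozen=True)
class Schedule:
    # Hypothetical fields; equality of all fields models "fully consistent".
    schedule_type: str
    trigger: str
    validation_scope: str


def attach_schedules(existing: Iterable[Schedule],
                     new: Iterable[Schedule]) -> List[Schedule]:
    """Merge new schedules into a rule, dropping exact duplicates."""
    merged = list(dict.fromkeys(existing))  # order-preserving dedup
    seen = set(merged)
    for schedule in new:
        if schedule in seen:
            continue  # identical configuration, silently deduplicated
        if len(merged) >= MAX_SCHEDULES_PER_RULE:
            raise ValueError("a rule supports at most 10 schedules")
        merged.append(schedule)
        seen.add(schedule)
    return merged
```

Attaching the same configuration twice leaves the rule with a single schedule, and an eleventh distinct schedule is rejected.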
On the Quality Rule Details page, click the Scan Configuration tab, and then click the New Scheduling button to enter the New Scheduling dialog box.
In the New Scheduling dialog box, configure the parameters.
Parameter
Description
Schedule Name
Custom schedule name.
Schedule Type
Supports Recurrency Triggered, Data Update Triggered, and Task Triggered.
Recurrency Triggered: Supports scheduled, periodic quality checks on data based on the set schedule time. Suitable for scenarios where data output time is relatively fixed.
Recurrence: Running quality rules will occupy certain computing resources. It is recommended to avoid concurrent execution of multiple quality rules at the same time to prevent affecting the normal operation of production tasks. The scheduling cycle includes five types: Day, Week, Month, Hour, and Minute.
Data Update Triggered: Each time a code task finishes running, the system parses whether that run updated the specified validation scope of the current table. Suitable for tables whose modification tasks are not fixed or tables that require focused monitoring, that is, tables for which every change must be monitored.
Note: It is recommended to select the partition updated by the task as the validation scope (non-partitioned tables are validated as a whole). The system automatically detects all data changes and performs validation to avoid omissions.
Task Triggered: Runs the configured quality rules before the specified task runs or after it runs successfully. The triggering task can be of types such as engine SQL, offline pipeline, Python, Shell, Virtual, Datax, Spark_jar, Hive_MR, and database SQL node. Suitable for scenarios where table modification tasks are fixed.
Note: Fixed task triggers can only select production environment tasks. If the rule strength is Strong Rule, a failed validation may affect online tasks. Operate cautiously according to business needs.
Trigger Timing: Select the trigger timing for quality checks. Supports selecting Trigger After All Tasks Run Successfully, Trigger After Each Task Runs Successfully, and Trigger Before Each Task Runs.
Triggering Task: Supports selecting production task nodes for which the current user has maintenance permissions. You can search by node output name.
Note: When the trigger timing is Trigger After All Tasks Run Successfully, it is recommended to select tasks with the same scheduling cycle. Different scheduling cycles can delay rule execution and quality check results.
Schedule Condition
Disabled by default. When enabled, it will first determine whether the scheduling conditions are met before the quality rule is officially scheduled. If the conditions are met, it will be officially scheduled. If not, this schedule will be ignored.
Data Timestamp/Executed On: If the schedule type is selected as Recurrency Triggered (timed scheduling does not support execution date), Data Update Triggered, or Task Triggered, date configuration is supported. You can choose Regular Calendar or Custom Calendar. For how to customize a calendar, see Create a public calendar.
If you choose Regular Calendar, the conditions can be Month, Week, or Date. For example, see the figure below:
If you choose Custom Calendar, the conditions can be Date Type or Tag. For example, see the figure below:
Instance Type: If the schedule type is selected as Data Update Triggered or Task Triggered, instance type configuration is supported. You can choose Recurring Instance, Data Backfill Instance, or One-time Instance. For example, see the figure below:
Note: At least one rule must be configured. To add a rule, click the + Add Rule button.
A maximum of 10 scheduling conditions can be configured.
The relationship between scheduling conditions can be configured as and or or.
Validation Scope
When the schedule type is Recurrency Triggered or Task Triggered, the validation scope supports Custom Validation Scope. When the schedule type is Data Update Triggered, the validation scope supports Updated Partition and Custom Validation Scope.
Updated Partition: If a partition is updated in the inspection task, the task will be issued directly according to the updated partition.
Note: In dynamic partition scenarios, the partition might not be parsed, in which case quality validation is not performed.
Volatility validation rules (such as checks on partition size, partition row count, or field statistics) require a specified partition and do not support the Updated Partition validation scope.
If there is data update in a non-partitioned table, the entire table will be validated.
Custom Validation Scope: For scenarios that cannot be parsed, you can use a custom validation scope to specify the validation scope expression based on the data timestamp or execution date.
Validation Scope Expression: An editable drop-down list. You can directly enter the scope to validate, such as ds='${yyyyMMdd}', or select a built-in validation scope expression and then modify it for quick configuration. For details on partition expressions, see Built-in partition expression types.
Note: To validate against multiple conditions, connect them with and or or, such as province="Zhejiang" and ds<=${yyyyMMdd}.
If a filter condition is configured in the quality rule, the relationship between the validation scope expression and the filter condition is AND. When validating data, both conditions will be filtered together.
The validation scope expression supports full table scan.
Note: A full table scan consumes significant resources, and full table scans are not supported in some cases. It is recommended to configure partition expressions to avoid full table scans.
Validation Scope Budget: The default is the current day's data timestamp.
Click OK to complete the scheduling configuration.
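The AND relationship between the validation scope expression and a rule's filter condition can be sketched in Python as follows. This is an illustration under stated assumptions, not Dataphin's implementation: `resolve_scope` and `build_where` are hypothetical helpers, and replacing `${yyyyMMdd}` with the data timestamp formatted as year, month, and day is an assumed reading of the placeholder.

```python
from datetime import date
from typing import Optional


def resolve_scope(expression: str, data_timestamp: date) -> str:
    """Substitute the ${yyyyMMdd} placeholder (assumed semantics)."""
    return expression.replace("${yyyyMMdd}", data_timestamp.strftime("%Y%m%d"))


def build_where(scope_expression: str, filter_condition: Optional[str],
                data_timestamp: date) -> str:
    """Combine the validation scope and the rule filter with AND."""
    scope = resolve_scope(scope_expression, data_timestamp)
    # Parentheses keep any OR inside one part from leaking into the other.
    parts = [f"({scope})"]
    if filter_condition:
        parts.append(f"({filter_condition})")
    return " AND ".join(parts)
```

With the scope expression ds='${yyyyMMdd}', the filter id = 12, and a data timestamp of January 31, 2024, this sketch yields (ds='20240131') AND (id = 12).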
Scheduling configuration list
After the scheduling is created, you can view, edit, clone, or delete it in the scheduling configuration list.
Area | Description |
①Filter and Search Area | Supports quick search by schedule name. Supports filtering by Recurrency Triggered, Data Update Triggered, Task Triggered. |
②List Area | Displays the Schedule Name, Schedule Type, Last Updated By, and Last Updated Time information of the rule configuration list. |
③Operation Area | You can edit, clone, or delete the schedule. |
Alert configuration
You can configure different alert methods for different rules to distinguish alerts. For example, configure phone alerts for strong rule abnormalities and text message alerts for weak rule abnormalities. If a rule matches multiple alert configurations at the same time, you can set the effective policy for the alert.
A single monitored object supports creating no more than 20 alert configurations.
On the Quality Rule Details page, click the Alert Configuration tab, then click the New Alert Configuration button to enter the New Alert Configuration dialog box.
In the New Alert Configuration dialog box, configure the parameters.
Parameter
Description
Coverage
Supports selecting All Rules, All Strong Rules, All Soft Rules, and Custom.
Note: Under a single monitored object, each of the three ranges (All Rules, All Strong Rules, and All Soft Rules) supports one alert configuration. Newly added rules automatically match the corresponding alert configuration based on rule strength. To change one of these alert configurations, modify the existing configuration.
The custom range can select all configured rules under the current monitored object, not exceeding 200.
Alert Configuration Name
The alert configuration name under a single monitored object is unique and does not exceed 256 characters.
Alert Recipient
Configure the alert recipient and alert method. You need to select at least one alert recipient and alert method.
Alert Recipient: Supports selecting three types of alert recipients: custom, shift schedule, and quality owner.
Supports configuring no more than 5 custom alert recipients and no more than 3 shift schedules.
Alert Method: Supports selecting different receiving methods such as phone, email, text message, DingTalk, Lark, WeCom, and custom channel. This receiving method can be controlled through configure channel settings.
Click OK to complete the alert configuration.
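How a rule matches alert configurations through the Coverage setting can be sketched as follows. This Python model is hypothetical: `Rule`, `AlertConfig`, `matching_alerts`, and the coverage strings are invented for illustration. It only collects the matching configurations and does not model the effective policy that decides among them.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Rule:
    name: str
    strength: str  # "strong" or "weak"


@dataclass(frozen=True)
class AlertConfig:
    name: str
    coverage: str  # "all", "strong", "soft", or "custom"
    custom_rules: FrozenSet[str] = frozenset()  # used only for "custom"


def matching_alerts(rule: Rule, configs: Iterable[AlertConfig]) -> List[str]:
    """Return the names of the alert configurations that cover the rule."""
    matched = []
    for config in configs:
        covers = (
            config.coverage == "all"
            or (config.coverage == "strong" and rule.strength == "strong")
            or (config.coverage == "soft" and rule.strength == "weak")
            or (config.coverage == "custom" and rule.name in config.custom_rules)
        )
        if covers:
            matched.append(config.name)
    return matched
```

A strong rule that is also listed in a custom configuration matches several configurations at once, which is the situation the effective policy is meant to resolve.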
Alert configuration list
After completing the alert configuration, you can sort, edit, or delete alert configurations in the alert configuration list.
Area | Description |
① Sort area | Supports configuring the alert effective policy when a quality rule meets multiple alert configurations. |
② List area | Displays the name of the alert configuration, the effective range, the specific recipients of each alert type, and the corresponding alert receiving method. Effective Range: Custom alerts support viewing the configured object name and rule name. If the rule is deleted, the object name cannot be viewed. It is recommended to update the alert configuration. |
③ Operation area | You can edit or delete the configured alerts. |
View quality report
Click Quality Report to view the Rule Validation Overview and Rule Validation Details of the current quality rules.
You can quickly filter validation details based on abnormal results, partition time, or keywords in the names of rules or objects.
In the operation column of the rule validation details list, click the details icon to view the rule validation details of the quality rules.
In the operation column of the rule validation details list, click the log icon to view the execution log of the quality rules.
Set quality rule permission management
Click Permission Management and configure View Details, which specifies the members who can view validation records, quality rule details, and quality reports.
View Details: You can select All Members or Only Members With Current Object Quality Management Permissions.
Click Confirm to complete the permission management configuration.
What to do next
Once you have configured the quality rule, you can view it on the metric rule list page. For more information, see the monitored object list.