Dataphin supports the creation of quality rules for data table validation, streamlining the process of data table quality monitoring. This topic describes the steps to configure quality rules for global data tables.
Prerequisites
Before configuring quality rules, you must add the monitored object. For more information, see Add Monitored Objects.
Permission description
Super administrators, quality administrators, and custom global roles with Quality Rule-Management permissions can configure scheduling, alerts, exception archive tables, scoring weights, and more for quality rules.
Quality owners can configure scheduling, alerts, exception archive tables, scoring weights, and more for quality rules associated with their monitored objects.
Quality owners and regular users require additional read-through permissions for the data source of the global data table. To apply, see Apply for Data Source Permissions.
Operation permissions vary for different objects. For details, see Quality Rule Operation Permissions.
Validation rule description
When a data table is subject to quality rule validation, triggering a Weak Monitoring Rule will prompt the system to send an alert message, aiding in the prompt identification and resolution of exceptions. If a Strong Monitoring Rule is triggered, the system will automatically halt the task involving the table to prevent the propagation of dirty data downstream and will send an alert message.
Differences between trial run and run
A Trial Run simulates the execution of a quality rule to verify its correctness and operational status, with results not displayed in the quality report. A Run executes the quality rule within a specified timeframe, with results documented in the quality report for user review and analysis.
Quality rule configuration
On the Dataphin home page, navigate to Administration > Data Quality via the top menu bar.
Click Quality Rule in the left-side navigation pane. On the Global Data Table page, select the target object to access the Quality Rule Details page and configure the quality rule.
On the Quality Rule Details page, click Create Quality Rule.
In the Create Quality Rule dialog box, set the necessary parameters.
Parameter
Description
Basic Information
Rule Name
Custom name for the quality rule, not exceeding 256 characters.
Rule Strength
Supports Weak Rule and Strong Rule.
Weak Rule: If you select Weak Rule, the quality rule validation result will alert if abnormal but will not block downstream task nodes.
Strong Rule: If you select Strong Rule, the quality rule validation result will alert if abnormal. If there are downstream tasks (code check scheduling, task trigger scheduling), it will block downstream tasks to prevent data contamination. If there are no downstream tasks (such as periodic quality scheduling), it will only alert.
Description
Custom description of the quality rule. Not exceeding 128 characters.
Configuration Method
Template Creation: Use general system templates and custom business templates to quickly create quality rules.
System Template: Built-in parameters in the template can be configured, suitable for general rule creation.
Custom Template: Pre-set parameters in the template, no configuration required, generally used for rules with business logic.
SQL: You can flexibly customize quality monitoring rules through SQL, suitable for flexible and complex scenarios.
Rule Template
Dropdown to select rule templates, including Integrity, Uniqueness, Timeliness, Validity, Consistency, Stability, Custom SQL.
Integrity: Includes Field Null Value Check and Field Empty String Check.
Uniqueness: Includes Field Uniqueness Check, Field Group Count Check, and Field Duplicate Value Count Check.
Timeliness: Includes Time Function Comparison, Single Table Time Field Comparison, and Two Table Time Field Comparison.
Validity: Includes Field Format Check, Field Length Check, Field Range Check, Lookup Table Reference Comparison, and Data Standard Lookup Table Reference Comparison (requires Data Standard module activation).
Consistency: Includes Single Table Field Value Consistency Comparison, Single Table Field Statistical Value Consistency Comparison, Single Field Business Logic Consistency Comparison, Two Table Field Value Consistency Comparison, Two Table Field Statistical Value Consistency Comparison, Two Table Field Business Logic Consistency Comparison, and Cross-source Two Table Field Statistical Value Consistency Comparison.
Stability: Includes Table Stability Check, Table Volatility Check, Field Stability Check, and Field Volatility Check.
Custom SQL: Includes Custom Statistical Indicator Check and Custom Data Detail Check.
For details, see Template type description.
Rule Type
The rule type is related to the template and is the most basic attribute of the template, which can be used for description and filtering functions.
Monitoring Granularity
When the configuration type is Custom SQL, monitoring granularity can be configured. You can choose Full Table or specify specific monitoring Fields.
Template Configuration
Template Information
When selecting a quality rule template, the configuration information of the template is displayed for you. If you need to modify the configuration information, you can modify it in Quality Rule Template.
Rule Configuration
Rule Configuration
The rule configuration changes depending on the selected rule template. For details, see Data Table Parameter Configuration.
Data Filtering For Validation Table: Disabled by default. After enabling, you can configure filter conditions for the validation table, partition filtering, or regular data filtering. The filter conditions will be directly appended to the validation SQL. If there is a partition filtering requirement for the validation table, it is recommended to configure the partition filter expression in the scheduling configuration. After configuration, the quality report will be viewed with the validation partition as the smallest granularity.
When the rule template is selected as Consistency/Two Table Field Statistical Value Consistency Comparison or Consistency/Cross-source Two Table Field Statistical Value Consistency Comparison, you can choose whether to enable Data Filtering For Comparison Table. After enabling, you can configure filter conditions for the comparison table, partition filtering, or regular data filtering. The filter conditions will be directly appended to the validation SQL.
Validation Configuration
Rule Validation
After data quality rule validation, the results will be compared with the configuration of exception validation. If the conditions are met, the validation result will be failed. At the same time, it will trigger alerts and other subsequent processes.
The available indicators for exception validation are determined by the template and configuration content. It supports multiple conditions with and/or conditions. It is recommended to configure less than three conditions in actual configuration.
For details, see Validation Configuration Description.
Archive Configuration
Exception Archive
By default, it is disabled. After enabling, you can archive abnormal data to a file or table. After quality verification, you can download or parse the archived abnormal data.
Archive Mode supports Only Archive Exception Fields and Archive Full Record.
Only Archive Exception Fields: Only deduplicate and archive the current monitoring fields, suitable for situations where a single field can fully determine the exception data.
Archive Full Record: Archive the entire record where the exception data is located, suitable for situations where the complete record is necessary to locate the exception data. Note: Archiving the full record will significantly increase the amount of archived data. It is recommended to use only archive exception fields under normal circumstances.
Archive Location supports Default File Server and Archive Table For Exception Data. If an exception archive table has not been created, click Manage Exception Archive Table to create one. For details, see Add Exception Archive Table.
Default File Server: Refers to the system file server configured during Dataphin deployment. You can directly download exception data from the Validation Record-Validation Details page or directly access the default file server to obtain exception data. When using the default file server, a maximum of 100 exception data records can be archived per validation execution, suitable for scenarios with small data volume validation.
Archive Table For Exception Data: If you want to store more exception data or place exception data from different validation records together for subsequent comparison and analysis, it is recommended to specify an archive table yourself. Each quality rule can record up to 10,000 validation exception data per run. In addition to quickly downloading exception data from a single validation on the validation record page, you can also directly access the archive table and customize the table lifecycle for greater flexibility.
NoteDownload the summarized exception data generated by all rules involved in this run, with a download limit of 1,000 records. If you want to view more data, it is recommended to archive it to the specified exception archive table and then directly access the archive table to obtain it.
The exception archive table must meet specific format requirements. Otherwise, errors may occur when writing data, affecting usage. For details, see Add Exception Archive Table.
BusinessAttributeConfiguration
AttributeInformation
The specification for filling in business attributes depends on the configuration of quality rule attributes. For example, the field value type corresponding to the department in charge is an enumeration value (multiple choice), and the optional enumeration value range is the Big Data Department, Business Department, and Technical Department. Therefore, when creating a quality rule, the attribute value is a dropdown multiple-choice box, and the optional items are enumeration values (multiple choice), with the optional enumeration value range being the Big Data Department, Business Department, and Technical Department.
The field value type corresponding to the rule owner is custom input, with an attribute field length of 256. Therefore, when creating a quality rule, the attribute value can be entered with no more than 256 characters.
If the filling method of the attribute field is Range Interval, the configuration method is as follows:
Range Interval: Commonly used when the value range is continuous numerical or date. You can choose from four symbols: >, >=, <, <=. For more attribute configurations, see Create and manage quality rule attributes.
Scheduling Attribute Configuration
Scheduling Method
Supports selecting configured scheduling. If the scheduling method is not yet decided, you can configure it after creating the quality rule. To create a new one, see Create Scheduling.
Quality Score Configuration
Scoring Method
Supports two scoring methods: quality validation status and data qualification ratio.
Quality Validation Status: Scores based on the validation status of the most recent successful validation record of the current rule. 100 points for passing validation, 0 points for failing validation.
Data Qualification Ratio: Uses the proportion of normal data (i.e., normal rate) in the most recent successful validation record of the current rule as the score. For example, if the data format validity is 80%, the quality score is 80 points.
Different rule templates support different scoring methods. Only the Quality Validation Status scoring method is supported as follows:
Field group count check and field duplicate value count check in the uniqueness rule category.
Single table field statistical value consistency comparison and cross-source two table field statistical value consistency comparison in the consistency rule category.
Stability rule category.
Custom statistical indicator check in the custom SQL rule category.
Quality Score Weight
The quality score weight of the quality rule is used to calculate the quality score of the monitored object. Supports selecting integers between 1 and 10.
Click OK to finalize the rule configuration.
To compare the current configuration with the last saved one, click Preview SQL for an easier review of SQL changes.
NoteIf key information is incomplete, SQL preview is not available.
The left side shows the SQL preview of the last saved configuration, which will be empty if not previously configured. The right side displays the SQL preview of the current configuration.
Rule configuration list
The rule configuration list page allows you to view configured data table rule information and perform actions such as viewing, editing, trial running, running, and deleting.
Area | Description |
①Filter and Search Area | Supports quick search by object or rule name. Supports filtering by rule type, rule template, rule strength, trial run status, and effective status. Note If the quality rule attribute is configured with searchable and filterable business attributes and is enabled, you can search or filter based on that attribute. |
②List Area | Displays the object type/name, rule name/ID, test run status, effective status, rule type, rule template, rule strength, schedule type, and related knowledge base document information of the rule configuration list. Click the
|
③Operation Area | You can perform view, clone, edit, trial run, run, scheduling configuration, associate knowledge base document, quality score configuration, and delete operations.
|
④Batch Operation Area | You can perform batch trial run, run, scheduling configuration, enable, disable, modify business attributes, associate knowledge base document, quality score configuration, export rules, and delete operations.
|
Create scheduling
When setting up scheduling rules for configuration, you can swiftly establish them by leveraging up to 20 existing scheduling rules per table.
A single rule can have up to 10 schedules configured.
Deduplication occurs automatically when the scheduling configuration is fully consistent.
The validation range acts as a filter condition during quality validation, defining the scope of each validation. It also serves as the base unit for quality reports and downstream processes, with the quality report using the validation range as the smallest granularity.
On the Quality Rule Details page, select the Scan Configuration tab, then click Create Scheduling to open the Create Scheduling dialog box.
In the Create Scheduling dialog box, enter the required parameters.
Parameter
Description
Scheduling Name
Custom scheduling name, not exceeding 64 characters.
Scheduling Type
Supports Recurrency Triggered and Task Triggered.
Recurrency Triggered: According to the set scheduling time, supports regular and periodic quality checks on data, suitable for scenarios where data production time is relatively fixed.
Recurrence: Running quality rules will occupy certain computing resources. It is recommended to avoid concurrent execution of multiple quality rules at the same time to avoid affecting the normal operation of production tasks. The scheduling cycle includes five cycle types: Day, Week, Month, Hour, Minute.
Task Triggered: Execute the configured quality rule after or before the specified task runs successfully. Supports selecting task types such as engine SQL, offline pipeline, Python, Shell, Virtual, Datax, Spark_jar, Hive_MR, and database SQL nodes to trigger tasks. Suitable for situations where table modification tasks are fixed.
NoteFixed task triggering can only select production environment tasks. If the rule strength is configured with a strong rule, scheduling task validation failure may affect online tasks. Please operate with caution according to business needs.
Trigger Timing: Select the trigger timing for quality checks. Supports selecting Trigger After All Tasks Run Successfully, Trigger After Each Task Runs Successfully, and Trigger Before Each Task Runs.
Triggering Task: Supports selecting production task nodes that the current user has maintenance permissions for. You can search by node output name.
NoteWhen the trigger timing is selected as trigger after all tasks run successfully, it is recommended to select tasks with the same scheduling cycle to avoid rule delay execution and quality check result delay output due to different scheduling cycles.
Schedule Condition
Disabled by default. After enabling, before the formal scheduling of the quality rule, it will first determine whether the scheduling conditions are met. If the conditions are met, it will be formally scheduled. If not, this scheduling will be ignored.
Business Date/Executed On: If the scheduling type is selected as Recurrency Triggered (timed scheduling does not support execution date), Task Triggered, date configuration is supported. You can choose Regular Calendar or Custom Calendar. For how to customize the calendar, see Create Public Calendar.
If you choose Regular Calendar, the conditions can be selected as Month, Week, Date. For example, as shown below:
If you choose Custom Calendar, the conditions can be selected as Date Type, Tag. For example, as shown below:
Instance Type: If the scheduling type is selected as Task Triggered, instance type configuration is supported. You can choose Recurring Instance, Data Backfill Instance, One-time Instance. For example, as shown below:
NoteAt least one rule must be configured. To add a rule, click the +add Rule button.
A maximum of 10 scheduling conditions can be configured.
The relationship between scheduling conditions can be configured as and/or.
Validation Range Expression
It is an input dropdown selection box that supports directly entering the range to be validated, such as
ds='${yyyyMMdd}'
. You can also select the built-in validation range expression and then modify it to help you quickly configure it. For details on partition filter expressions, see Built-in Partition Expression Types.NoteIf there are multiple conditions for validation, you can use and or or to connect them, such as
province="Zhejiang" and ds<=${yyyyMMdd}.
If a filter condition is configured in the quality rule, the validation range expression and the filter condition are in an and relationship. When validating data, both conditions will be filtered together.
The validation range expression supports scanning the entire table.
Note: Scanning the entire table will consume significant resources, and some do not support scanning the entire table. It is recommended to configure a partition filter expression to avoid scanning the entire table.
Validation Range Budget
Default business date is today.
Click OK to finalize the scheduling configuration.
Scheduling configuration list
Once scheduling creation is complete, you can view, edit, clone, delete, and perform other operations in the scheduling configuration list.
Area | Description |
①Filter and Search Area | Supports quick search by scheduling name. Supports filtering by Recurrency Triggered and Task Triggered. |
②List Area | Displays the Scheduling Name, Schedule Type, Last Updated By, and Last Updated Time information of the rule configuration list. |
③Operation Area | You can perform edit, clone, and delete operations on the scheduling.
|
Alert configuration
Configure different alert methods for various rules to facilitate differentiated alerts, such as phone alerts for strong rule exceptions and text message alerts for weak rule exceptions. If multiple alert configurations are hit simultaneously, set the effective alert policy accordingly.
You can create up to 20 alert configurations for each monitored object.
On the Quality Rule Details page, click the Alert Configuration tab, then click Create Alert Configuration to open the Create Alert Configuration dialog box.
In the Create Alert Configuration dialog box, set the parameters.
Parameter
Description
Coverage
Supports selecting All Rules, All Strong Rules, All Weak Rules, and Custom.
NoteUnder a single monitored object, the three ranges of all rules, all strong rules, and all weak rules support configuring one alert each. Newly added rules will automatically match the corresponding alert based on rule strength. If you need to change one of the alert configurations, you can modify the existing configuration.
The custom range can select all configured rules under the current monitored object, not exceeding 200.
Alert Configuration Name
The alert configuration name is unique under a single monitored object, not exceeding 256 characters.
Alert Recipients
Configure alert recipients and alert methods. At least one alert recipient and alert method must be selected.
Alert Recipients: Supports selecting custom, shift schedule, and quality owner as alert recipients.
Supports configuring up to 5 custom alert recipients and up to 3 shift schedules.
Alert Methods: Supports selecting different receiving methods such as phone, email, text message, DingTalk, Lark, WeCom, and custom channels. This receiving method can be controlled through Configure Channel Settings.
Click OK to complete the alert configuration.
Alert configuration list
After completing the alert configuration, you can sort, edit, and delete entries in the alert configuration list.
OrdinalNumber | Description |
① Sorting Area | Supports configuring the alert effective policy when a quality rule meets multiple alert configurations:
|
② List Area | Displays the name, effective range, specific recipients for each alert type, and corresponding alert receiving methods of the alert configuration. Effective Range: Custom alerts support viewing the configured object name and rule name. If the rule is deleted, the object name cannot be viewed. It is recommended to update the alert configuration. |
③ Operation Area | You can perform edit and delete operations on the configured alerts.
|
Add exception archive table
The exception archive table is used for archiving exceptions that occur during quality rule validation.
On the Quality Rule Details page, click the Archive tab, then click + Add Exception Archive Table to open the Add Exception Archive Table dialog box.
In the Add Exception Archive Table dialog box, choose the method of addition and configure the parameters accordingly.
The Add Method offers two options: Create New Table or Select Existing Table. Special quality validation fields are included, and exception data is not written into the original data table.
For Create New Table, customize the table name within the project or section of the archive table. The default name is current_table_name_exception_data. After successful creation, a new table will be created in the same database or data source. The name supports letters,numbers, underscores (_), and dots (.), and must not exceed 128 characters.
If the monitored table is a physical table, the archive table will be created within the project of the monitored table.
If the monitored table is a logical dimension or fact table, the archive table will default to the project of the monitored table. Alternatively, you can specify a project within the section of the monitored table, such as projectA.table_name.
For a logical aggregate table, it is advisable to name the archive table under the same section's project. If not, it will default to a project within the section of the monitored table.
The archive table must include all fields and validation fields from the quality monitoring table. The script format is as follows:
create table current_table_name_exception_data (dataphin_quality_tenant_id varchar(64) comment 'Tenant ID' , dataphin_quality_rule_id varchar(64) comment 'Quality Rule ID', dataphin_quality_rule_name varchar(256) comment 'Quality Rule Name', dataphin_quality_column_name varchar(1024) comment 'Validation Field Name', dataphin_quality_watch_task_id varchar(128) comment 'Monitored Object Task ID', dataphin_quality_rule_task_id varchar(64) comment 'Rule Task ID', dataphin_quality_validate_time varchar(64) comment 'Quality Validation Time', dataphin_quality_archive_mode varchar(32) comment 'Exception Archive Mode, ONLY_ERROR_FIELD/FULL_RECORD', dataphin_quality_error_data string comment 'Exception Data', ljba_id bigint comment 'ljba_primary key', ljb_id bigint comment 'ljb_primary key', col_tinyint tinyint comment 'Field type is TINYINT and lowercase', col_tinyint_02 tinyint comment '2', col_smallint smallint comment 'Field type is SMALLINT and lowercase', col_smallint_02 smallint comment '4', col_int int comment 'Field type is INT and lowercase', col_int_02 int comment '6', col_bigint bigint comment 'Field type is BIGINT and lowercase', col_bigint_02 bigint comment '8', col_float float comment 'Field type is FLOAT and lowercase', col_float_02 float comment '10', col_double double comment 'Field type is DOUBLE and lowercase', col_double_02 double comment '11', col_decimal decimal(38,18) comment 'Field type is DECIMAL(38,18) and lowercase', col_decimal_02 decimal(38,18) comment '12', col_varchar varchar(500) comment 'Field type is VARCHAR(500) and lowercase', col_varchar_02 varchar(500) comment '13', col_char char(10) comment 'Field type is CHAR(10) and lowercase', col_char_02 char(10) comment '14', col_string string comment 'Field type is STRING and lowercase', col_string_02 string comment '15', col_date date comment 'Field type is DATE and lowercase', col_date_02 date comment '16', col_datetime datetime comment 'Field type is DATETIME and lowercase', col_datetime_02 datetime comment '17', col_timestmap timestamp comment 'Field type is TIMESTAMP and lowercase', col_timestmap_02 timestamp comment '18', col_boolean boolean comment 'Field type is BOOLEAN and lowercase', col_boolean_02 boolean comment '19', col_binary binary comment 'Field type is BINARY and lowercase', col_binary_02 binary comment '20', col_array array<int> comment 'Field type is ARRAY<int> and lowercase', col_array_02 array<string> comment '21', col_map map<string,string> comment 'Field type is MAP<string, string> and lowercase', col_map_02 map<string,int> comment '22', ds string comment 'Date partition, yyyyMMdd' ) partitioned by (dataphin_quality_validate_date string comment 'Validation Date (Partition Field)');
Select Existing Table: Choose from tables within the same project or data source. The archive table must encompass all fields and validation fields from the quality monitoring table. To view the table creation statement, click View Exception Archive Table DDL. The script format is as follows:
create table current_table_name_exception_data (dataphin_quality_tenant_id varchar(64) comment 'Tenant ID' , dataphin_quality_rule_id varchar(64) comment 'Quality Rule ID', dataphin_quality_rule_name varchar(256) comment 'Quality Rule Name', dataphin_quality_column_name varchar(1024) comment 'Validation Field Name', dataphin_quality_watch_task_id varchar(128) comment 'Monitored Object Task ID', dataphin_quality_rule_task_id varchar(64) comment 'Rule Task ID', dataphin_quality_validate_time varchar(64) comment 'Quality Validation Time', dataphin_quality_archive_mode varchar(32) comment 'Exception Archive Mode, ONLY_ERROR_FIELD/FULL_RECORD', dataphin_quality_error_data string comment 'Exception Data', ljba_id bigint comment 'ljba_primary key', ljb_id bigint comment 'ljb_primary key', col_tinyint tinyint comment 'Field type is TINYINT and lowercase', col_tinyint_02 tinyint comment '2', col_smallint smallint comment 'Field type is SMALLINT and lowercase', col_smallint_02 smallint comment '4', col_int int comment 'Field type is INT and lowercase', col_int_02 int comment '6', col_bigint bigint comment 'Field type is BIGINT and lowercase', col_bigint_02 bigint comment '8', col_float float comment 'Field type is FLOAT and lowercase', col_float_02 float comment '10', col_double double comment 'Field type is DOUBLE and lowercase', col_double_02 double comment '11', col_decimal decimal(38,18) comment 'Field type is DECIMAL(38,18) and lowercase', col_decimal_02 decimal(38,18) comment '12', col_varchar varchar(500) comment 'Field type is VARCHAR(500) and lowercase', col_varchar_02 varchar(500) comment '13', col_char char(10) comment 'Field type is CHAR(10) and lowercase', col_char_02 char(10) comment '14', col_string string comment 'Field type is STRING and lowercase', col_string_02 string comment '15', col_date date comment 'Field type is DATE and lowercase', col_date_02 date comment '16', col_datetime datetime comment 'Field type is DATETIME and lowercase', col_datetime_02 datetime comment '17', col_timestmap timestamp comment 'Field type is TIMESTAMP and lowercase', col_timestmap_02 timestamp comment '18', col_boolean boolean comment 'Field type is BOOLEAN and lowercase', col_boolean_02 boolean comment '19', col_binary binary comment 'Field type is BINARY and lowercase', col_binary_02 binary comment '20', col_array array<int> comment 'Field type is ARRAY<int> and lowercase', col_array_02 array<string> comment '21', col_map map<string,string> comment 'Field type is MAP<string, string> and lowercase', col_map_02 map<string,int> comment '22', ds string comment 'Date partition, yyyyMMdd' ) partitioned by (dataphin_quality_validate_date string comment 'Validation Date (Partition Field)');
Click OK to add the exception archive table.
Select Automatically Set As Effective Archive Table After Creation for convenience when creating future quality rules.
View exception archive table list
Upon successful addition, the first table is set as the current effective archive table by default. Click the exception archive table name to view its structure. Additionally, you can set other tables as the effective archive table or delete them.
Set As Effective Archive Table: Designating this table as the effective archive table means all quality rules created under this monitored object with a custom exception archive table will direct exception data to this specified table.
Delete: This action only removes the reference to the exception archive table, not the table itself. You can re-add it if necessary.
View quality report
Click Quality Report to access the Rule Validation Overview and Rule Validation Details for the current quality rule.
Filter validation details quickly by exception results, partition time, rule, or object name keywords.
In the operation column of the rule validation details list, click the
icon to view detailed information for the quality rule.
In the operation column, click the
icon to access the execution logs for the quality rules.
Set quality rule permission management
Click Permission Management to configure View Details, which allows specified members to view validation record details, quality rule details, and quality reports.
View Details: Choose between All Members or Only Members With Current Object Quality Management Permissions.
Click OK to finalize the permission management settings.
What to do next
After completing the quality rule configurations, you can review them on the global data table rule list page. For details, see View Monitored Object List.