Data Quality lets you detect changes in source data and dirty data produced by ETL jobs. It blocks problematic tasks and prevents dirty data from spreading downstream, which avoids unexpected results that can affect business operations and decisions, reduces the time needed to fix issues, and avoids rerunning tasks.
Billing
The cost of running data quality rules includes two parts:
DataWorks fees
Charged on a pay-as-you-go basis by the number of data quality rule instances. For more information, see Fees for resources.
Engine-specific fees
Data quality checks generate SQL statements that run on the engine, incurring engine fees. For details, see the billing documentation of each engine. For example, if you use MaxCompute in pay-as-you-go mode, data quality checks generate MaxCompute engine charges. These are billed by MaxCompute and do not appear on your DataWorks bill.
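To make this concrete, a simple row-count (completeness) check conceptually boils down to an aggregate query against the checked partition. The sketch below is a hypothetical illustration in Python; the actual SQL that Data Quality generates is engine-specific, and the table and partition names here are assumptions.

```python
# Hypothetical sketch of the kind of SQL a row-count completeness
# check might generate. The table name and partition filter are
# illustrative; the real generated SQL is engine-specific.

def build_row_count_check_sql(table: str, partition_filter: str) -> str:
    """Return an aggregate query whose result is compared against
    the rule's configured threshold."""
    return f"SELECT COUNT(*) AS row_cnt FROM {table} WHERE {partition_filter};"

print(build_row_count_check_sql("dwd_orders", "dt = '20240101'"))
# -> SELECT COUNT(*) AS row_cnt FROM dwd_orders WHERE dt = '20240101';
```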
Features
Data Quality supports quality checks on common data analytics engines, including MaxCompute, E-MapReduce, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and CDH.
You can configure rules covering completeness, accuracy, validity, consistency, uniqueness, and timeliness, and associate these data quality rules with scheduling nodes. When a task finishes running, the data quality checks are triggered immediately. You can set the strength of a rule to control whether a task should fail and exit, thereby preventing the spread of dirty data and effectively reducing the time and financial costs of data recovery.
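As a rough mental model of such a rule (not the DataWorks API; every name below is a hypothetical illustration), a rule pairs a monitored dimension with a strength that decides what happens on failure:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical rule model for illustration only; these names are
# not part of any DataWorks SDK or API.

class Dimension(Enum):
    COMPLETENESS = "completeness"
    ACCURACY = "accuracy"
    VALIDITY = "validity"
    CONSISTENCY = "consistency"
    UNIQUENESS = "uniqueness"
    TIMELINESS = "timeliness"

class Strength(Enum):
    STRONG = "strong"  # failed check: task fails and exits
    WEAK = "weak"      # failed check: alert only, task continues

@dataclass
class QualityRule:
    table: str
    dimension: Dimension
    strength: Strength

rule = QualityRule("dwd_orders", Dimension.COMPLETENESS, Strength.STRONG)
```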
The features of each Data Quality module are described below:
| Name | Description |
| --- | --- |
| Quality Dashboard | Displays key overview metrics for data quality in the current workspace, trends and distributions of data quality check statuses triggered after instances run, top tables and owners with quality issues, and rule coverage. This helps quality assurance managers quickly understand the overall data quality of the workspace and promptly address issues to improve data quality. |
| Quality Assets | Shows all configured quality rules. Data Quality also allows you to build a custom rule template library to centrally manage common custom monitoring rules, improving the efficiency of rule configuration. |
| Configure Rules | Data Quality supports configuring quality monitoring rules by table or by template. For more information, see Configure a monitoring rule for a single table and Configure a monitoring rule for multiple tables based on a template. |
| Quality O&M | Displays all quality monitors created in this workspace, along with the data quality check results produced when a quality monitoring task runs. After a quality monitoring task finishes, you can view the details on the Run History page. |
| Quality Analysis | Allows users to create report templates and freely add various metrics for rule configurations and rule runs. Reports are generated and sent periodically based on the configured statistical period, sending time, and subscription information. |
Usage notes
Supported regions for each engine are as follows:
| Engine | Supported regions |
| --- | --- |
| E-MapReduce | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), and US (Silicon Valley) |
| Hologres | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), US (Silicon Valley), and US (Virginia) |
| AnalyticDB for PostgreSQL | China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), and Japan (Tokyo) |
| AnalyticDB for MySQL | China (Shenzhen), Singapore, and US (Silicon Valley) |
| CDH | China (Shanghai), China (Beijing), China (Zhangjiakou), China (Hong Kong), and Germany (Frankfurt) |
Before configuring data quality rules for E-MapReduce, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and CDH, you must first collect metadata. For more information, see Collect metadata from an EMR data source.
After configuring data quality rules for tables in E-MapReduce, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, or CDH, run the scheduling node that generates the table data on a resource group that has an established network connection to the data source; otherwise, the data quality rule checks cannot be triggered properly.
Multiple data quality rules can be configured for a single table.
Scenarios
In offline data check scenarios, Data Quality uses the partition expression configured for a table to check the table partitions generated by a node each day. The data quality rule is associated with the scheduling node that produces the table data. When the task finishes running, the quality check is triggered (dry-run tasks do not trigger quality checks). You can set the strength of the rule to control whether the node fails and exits, thereby preventing the spread of dirty data. You can also configure alert settings to receive alert notifications and handle issues promptly.
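For illustration, a partition expression such as dt=$[yyyymmdd-1] (a common DataWorks-style scheduling pattern; treat the exact syntax as an assumption here) resolves to the concrete partition to check for a given run date. A minimal sketch:

```python
import re
from datetime import date, timedelta

# Minimal sketch of resolving a partition expression such as
# "dt=$[yyyymmdd-1]" to a concrete partition. The expression
# syntax handled here is an assumption for illustration.

def resolve_partition(expr: str, run_date: date) -> str:
    m = re.fullmatch(r"(\w+)=\$\[yyyymmdd([+-]\d+)?\]", expr)
    if not m:
        raise ValueError(f"unsupported expression: {expr}")
    column, offset = m.group(1), int(m.group(2) or 0)
    return f"{column}={(run_date + timedelta(days=offset)).strftime('%Y%m%d')}"

print(resolve_partition("dt=$[yyyymmdd-1]", date(2024, 1, 2)))
# -> dt=20240101 (the partition checked by the January 2 run)
```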
Configure rules
Create rules: Data Quality allows you to create data quality rules by table. You can also use predefined rule templates to quickly create data quality rules for multiple tables in batches. For more information, see Configure a monitoring rule for a single table and Configure a monitoring rule for multiple tables based on a template.
Subscribe to rules: After creating a rule, you can subscribe to it to receive alert notifications for data quality rule checks. Supported methods include Email, Email and SMS, DingTalk Chatbot, DingTalk Chatbot @ALL, Lark Group Chatbot, Enterprise WeChat Chatbot, and Custom Webhook.
Note: Only DataWorks Enterprise Edition supports the Custom Webhook method.
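A custom webhook integration is, in general, an HTTP POST of an alert payload to an endpoint you provide. The sketch below shows the general shape of such a notification in Python; the URL and payload fields are assumptions, not a documented DataWorks alert format.

```python
import json
from urllib import request

# Generic webhook notification sketch. The endpoint URL and the
# payload fields below are illustrative assumptions, not a
# documented DataWorks alert format.

def send_quality_alert(webhook_url: str, table: str, rule_name: str, status: str) -> int:
    body = json.dumps({"table": table, "rule": rule_name, "status": status}).encode("utf-8")
    req = request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status

# Example (hypothetical endpoint):
# send_quality_alert("https://example.com/hooks/dq", "dwd_orders",
#                    "row count check", "FAILED")
```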
Trigger data quality checks
In Operation Center, when a scheduling node associated with a table finishes running (that is, executing the node code logic), it triggers a data quality check, which generates a SQL statement that validates the data on the engine. DataWorks then determines, based on the strength of the data quality rule and the check result, whether the task should fail and exit. This blocks downstream nodes from running and prevents dirty data from spreading.
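This blocking behavior can be summarized as a small decision rule. The sketch below is a simplified illustration of how rule strength and a check result might combine, not DataWorks internals:

```python
# Simplified illustration of the blocking decision; not actual
# DataWorks internals. A failed strong rule stops the task and
# blocks downstream nodes; a failed weak rule only alerts.

def decide(strength: str, check_passed: bool) -> str:
    if check_passed:
        return "continue: downstream nodes run"
    if strength == "strong":
        return "fail and exit: downstream nodes are blocked"
    return "alert only: downstream nodes still run"

for strength in ("strong", "weak"):
    print(strength, "->", decide(strength, check_passed=False))
```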
View check results
You can view the data quality check results in the node's runtime log in Operation Center or on the Data Quality task query page.
View the node's runtime log in Operation Center
Check the instance status. If the status is Failed, the node code may have run successfully, but the output data failed a strong data quality rule check, which caused the task to exit and blocked downstream instances.
Open the DQC Log in the instance's Runtime Log to view the data quality check results. For more information, see View auto triggered instances.
View the Run History page
On the Run History page, you can search for the check details of a data quality monitoring task by table or node. For more information, see View the details of a monitor.