Data quality is the basis of effective and accurate data analysis. This topic describes the business scenario on which this data quality tutorial is based and the standards for assessing data quality.
To guarantee data quality, you must clarify the data consumption scenario and processing workflow.
This tutorial uses data that comes from HTTP access logs of a website as an example. Based on the logs, you can collect statistics on and present the number of page views (PVs) and unique visitors (UVs) of the website by region and terminal type, such as Android, iPad, iPhone, and PC.
To guarantee the data quality in the entire data processing workflow, you must monitor data at the operational data store (ODS), common data model (CDM), and application data store (ADS) layers of a data warehouse. For more information about the layers, see Divide a data warehouse into layers. This tutorial is based on the Build an online operation analysis platform tutorial. The ods_user_trace_log, dw_user_trace_log, and rpt_user_trace_log tables represent data at the ODS, CDM, and ADS layers, respectively. For more information, see Design workflows.
Standards for assessing data quality
You can assess data quality based on four dimensions: integrity, accuracy, consistency, and timeliness.
The integrity of data refers to whether data records are complete. Data is incomplete if a data record, that is, a table row, is missing or a field in a data record is null. This tutorial shows you how to create rules to monitor Tablestore data that a MaxCompute foreign table references and data at the CDM and ADS layers of a data warehouse. The rules allow you to check whether the number of table rows is greater than 0, whether the number of table rows fluctuates within an expected range, and whether null or duplicate field values exist.
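The integrity rules described above can be sketched as simple checks over a batch of table rows. The following is a minimal illustration, not the actual DataWorks rule configuration; the field names (`uid`, `region`), the sample rows, and the ±10% fluctuation threshold are assumptions chosen for the example.

```python
def check_integrity(rows, prev_row_count, max_fluctuation=0.1):
    """Return a dict of integrity-check results for a batch of table rows."""
    results = {}
    # Rule 1: the table must contain at least one row.
    results["non_empty"] = len(rows) > 0
    # Rule 2: the row count must fluctuate within an expected range
    # relative to the previous run (+/-10% by default, an assumed threshold).
    if prev_row_count > 0:
        change = abs(len(rows) - prev_row_count) / prev_row_count
        results["fluctuation_ok"] = change <= max_fluctuation
    # Rule 3: required fields must not contain null (None) values.
    results["no_nulls"] = all(
        row.get("uid") is not None and row.get("region") is not None
        for row in rows
    )
    # Rule 4: a field that must be unique must not contain duplicate values.
    uids = [row.get("uid") for row in rows]
    results["no_duplicates"] = len(uids) == len(set(uids))
    return results

rows = [
    {"uid": "u1", "region": "Zhejiang"},
    {"uid": "u2", "region": "Beijing"},
]
print(check_integrity(rows, prev_row_count=2))
```

In DataWorks, these checks are configured as monitoring rules on the tables rather than written by hand, but the logic they evaluate is the same.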
The accuracy of data refers to whether data records are correct. For example, if the number of UVs or PVs is less than 0, the data is incorrect.
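An accuracy rule like the one above amounts to a range check on the metric values. The sketch below is illustrative: the record layout is assumed, and the extra rule that UV must not exceed PV is an added sanity check (every unique visitor produces at least one page view), not a rule stated in the tutorial.

```python
def find_inaccurate(records):
    """Return records whose PV or UV values fall outside the valid range."""
    return [
        r for r in records
        # Negative counts are impossible; as an additional illustrative
        # rule, UV must not exceed PV.
        if r["pv"] < 0 or r["uv"] < 0 or r["uv"] > r["pv"]
    ]

records = [
    {"region": "Zhejiang", "pv": 120, "uv": 80},
    {"region": "Beijing", "pv": -3, "uv": 2},
]
print(find_inaccurate(records))  # only the Beijing record is flagged
```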
The consistency of data refers to whether the same set of data remains consistent across different workflows and nodes. For example, assume that the province field in a table contains both the values Zhejiang and ZJ for the same province. In this case, an SQL statement that contains the GROUP BY province clause returns two records for that province instead of one.
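The effect of such inconsistent values on aggregation can be demonstrated with a small sketch. The sample rows below are illustrative; the loop mimics what `SELECT province, SUM(pv) FROM t GROUP BY province;` would compute.

```python
from collections import Counter

# Two rows that denote the same province with inconsistent values.
rows = [
    {"province": "Zhejiang", "pv": 10},
    {"province": "ZJ", "pv": 5},
]

# Equivalent to grouping by province and summing pv: the two spellings
# are treated as distinct keys, so the total is split across two groups.
pv_by_province = Counter()
for row in rows:
    pv_by_province[row["province"]] += row["pv"]

print(len(pv_by_province))  # 2 groups instead of the expected 1
```

This is why consistency rules typically validate field values against a standard code table (for example, a canonical list of province names) before the data is aggregated.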
The timeliness of data refers to whether data can be generated in a timely manner at the ADS layer. To guarantee data timeliness, you must make sure that data is generated in a timely manner on each node in the data processing workflow. This tutorial shows you how to use the Monitor service of DataWorks to guarantee data timeliness on each node.