Monitor the data quality of the generated user profile data - DataWorks

This topic uses the ods_user_info_d_starrocks table as an example to demonstrate how to configure data quality rules. You will learn to set up a strong rule to verify that the table is not empty and a weak rule to check for primary key uniqueness. These rules are triggered during daily scheduled synchronization tasks to detect and prevent exceptions such as missing source data or duplicate primary keys in real time, ensuring the reliability of downstream computing.

Prerequisites

Before you begin, ensure that you have completed the data synchronization and processing steps.

The basic user information in the ApsaraDB RDS for MySQL table ods_user_info_d is synchronized to the ods_user_info_d_starrocks table created in an E-MapReduce (EMR) Serverless StarRocks instance by using Data Integration.
The website access logs of users in user_log.txt in Object Storage Service (OSS) are synchronized to the ods_raw_log_d_starrocks table created in an EMR Serverless StarRocks instance by using Data Integration.
The collected data is processed into basic user profile data in Data Studio.

Analysis of data quality monitoring requirements

In this example, Data Quality is used to promptly detect two kinds of problems in the user profile analysis case: changes to the source data, and dirty data generated when extract, transform, and load (ETL) operations are performed on the source data. The following table describes the monitoring requirements for the user profile analysis and processing procedure.

Table name	Detailed requirement
ods_raw_log_d_starrocks	Configure a strong rule to verify that the daily synchronized row count is greater than 0. This ensures that raw log data is successfully obtained each day and prevents downstream computing from being affected by missing data.
ods_user_info_d_starrocks	Configure a strong rule to verify that the daily synchronized row count is greater than 0 and a weak rule to check the uniqueness of the business primary key. These rules ensure that user information is successfully obtained each day, prevent duplicate primary keys, and maintain the accuracy of subsequent computations.
dwd_log_info_di_starrocks	Run the node without configuring a monitoring rule.
dws_user_info_all_di_starrocks	Run the node without configuring a monitoring rule.
ads_user_info_1d_starrocks	Configure a rule that monitors the daily fluctuation in the number of rows in the user information table. The rule tracks the fluctuation of daily unique visitors (UVs) so that you can detect changes in the application status early.
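
The fluctuation check described for ads_user_info_1d_starrocks can be pictured as a comparison of row counts between two consecutive days. The following Python snippet is a minimal sketch under that assumption. The function name and the sample row counts are hypothetical, and the exact formula that Data Quality uses is not stated in this topic.

```python
def fluctuation_rate(current_rows: int, previous_rows: int) -> float:
    """Relative change in row count from the previous day to the current day.

    Assumption: fluctuation is (current - previous) / previous. This is an
    illustrative definition, not the documented Data Quality formula.
    """
    if previous_rows == 0:
        # The relative change is undefined when there is no baseline.
        raise ValueError("previous row count is 0; fluctuation rate is undefined")
    return (current_rows - previous_rows) / previous_rows


# Hypothetical row counts for two consecutive daily partitions.
rate = fluctuation_rate(current_rows=1150, previous_rows=1000)
print(f"Daily row count fluctuation: {rate:.1%}")
```

A positive rate indicates growth in daily UVs, and a negative rate indicates a drop. In practice you would compare the rate against thresholds that fit your business.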

The following steps use the ods_user_info_d_starrocks table as an example to guide you through configuring monitoring rules for periodically generated table data.

Step 1: Go to the Configure by Table page

Go to the Data Quality page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Governance > Data Quality. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Quality.
Go to the Configure by Table page.
In the left-side navigation pane of the Data Quality page, choose Configure Rules > Configure by Table. On the Configure by Table page, find the target table.
- Database type: StarRocks.
- Table: ods_user_info_d_starrocks.
Find the desired table in the search results and click Rule Management in the Actions column. The Table Quality Details page of the table appears. The following sections describe the configurations of the table.

Step 2: Configure monitoring rules

This section uses the ods_user_info_d_starrocks table as an example to explain how to configure a rule that verifies a specified partition is not empty. This includes creating the rule, defining its trigger method, and setting a policy for handling exceptions.

Select a monitoring scope.
1. On the Monitor tab, click Create Monitor.
2. Set the Data Range parameter to dt=$[yyyymmdd-1].
  Note
  To monitor table data generated by a periodic schedule, ensure that the Data Range value corresponds to the partition generated for the table on the current day.
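
To see which partition the Data Range expression points to, it can help to resolve it by hand. The following Python snippet is a minimal sketch that assumes dt=$[yyyymmdd-1] resolves to the scheduling date minus one day, formatted as yyyymmdd. The function name and the sample scheduling time are hypothetical.

```python
from datetime import datetime, timedelta


def resolve_data_range(scheduling_time: datetime, offset_days: int = -1) -> str:
    """Return the partition filter that dt=$[yyyymmdd-1] is assumed to resolve to.

    Assumption: the expression means "scheduling date plus offset_days",
    formatted as yyyymmdd.
    """
    partition_day = scheduling_time + timedelta(days=offset_days)
    return f"dt={partition_day.strftime('%Y%m%d')}"


# Hypothetical scheduling time: a run scheduled at 00:30 on June 15, 2024.
print(resolve_data_range(datetime(2024, 6, 15, 0, 30)))  # dt=20240614
```

Under this assumption, a run scheduled on June 15 checks the partition that holds the data of June 14.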
Create monitoring rules.
This section demonstrates how to configure two rules for the ods_user_info_d_starrocks table: one that checks that the row count is greater than 0, and one that checks that the uid column contains no duplicate values. For more information about how to configure monitoring rules, see Configure a monitoring rule for a single table.
1. On the Create Monitor page, click Create Rule. The Create Rule panel appears.
2. On the System Template tab of the Create Rule panel, find the Table is not empty rule and click Use. On the right side of the panel, set the Degree of Importance parameter to Strong Rule.
  Note
  In this example, the rule is defined as a strong rule. This indicates that when the number of rows in the ods_user_info_d_starrocks table is 0, an alert is triggered and downstream nodes are blocked from running.
3. On the System Template tab of the Create Rule panel, find the Unique value. fixed value rule and click Use. On the right side of the panel, configure the following parameters.
  - Rule Scope: uid(STRING).
  - Monitoring Threshold: Set the expected number of duplicates to 0.
  - Degree of Importance: Weak Rule.
4. Click OK to save the configured monitoring rules.
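
The two rules configured above correspond to simple aggregate checks on one partition. The following Python snippet is a minimal sketch that reproduces the checks locally against an in-memory SQLite table. It is not how Data Quality runs the rules against StarRocks. The function name and the sample rows are hypothetical, and only the uid and dt columns are taken from this topic.

```python
import sqlite3


def check_rules(conn, table, partition, key_column):
    """Run a row-count check and a duplicate-key check on one partition.

    table and key_column are interpolated into the SQL text, so pass only
    trusted identifiers. Rows with a NULL key are counted as duplicates
    because COUNT(DISTINCT ...) ignores NULL values.
    """
    row_count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE dt = ?", (partition,)
    ).fetchone()[0]
    duplicates = conn.execute(
        f"SELECT COUNT(*) - COUNT(DISTINCT {key_column}) FROM {table} WHERE dt = ?",
        (partition,),
    ).fetchone()[0]
    return {
        "row_count": row_count,
        "duplicates": duplicates,
        "table_not_empty": row_count > 0,  # strong rule in this example
        "unique_key": duplicates == 0,     # weak rule in this example
    }


# Hypothetical sample data: uid u2 appears twice in the same partition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_user_info_d_starrocks (uid TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO ods_user_info_d_starrocks VALUES (?, ?)",
    [("u1", "20240614"), ("u2", "20240614"), ("u2", "20240614")],
)
result = check_rules(conn, "ods_user_info_d_starrocks", "20240614", "uid")
print(result)
```

With this sample data, the row-count check passes and the uniqueness check fails, because one duplicate uid is found.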
Specify the trigger method.
Set the trigger method to Triggered by Node Scheduling in Production Environment and select the ods_user_info_d_starrocks node that is created during data synchronization.
Specify the exception handling policy.
As needed, set the handling policy to either block the node from running or send an alert notification to the recipient.
After the configuration is complete, click Save.
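
The effect of rule importance on exception handling can be sketched as a small decision function. The following Python snippet is an illustration only. It assumes that a failed strong rule triggers an alert and blocks downstream nodes, as described for the strong rule in this example, and that a failed weak rule triggers an alert without blocking. The class and function names are hypothetical, and the actual behavior depends on the handling policy that you configure.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    name: str
    importance: str  # "strong" or "weak"
    passed: bool


def decide_actions(results):
    """Map rule results to handling actions under the stated assumptions."""
    failed = [r for r in results if not r.passed]
    return {
        # Any failed rule leads to an alert notification.
        "send_alert": bool(failed),
        # Only a failed strong rule blocks downstream nodes (assumption).
        "block_downstream": any(r.importance == "strong" for r in failed),
    }


# Hypothetical outcome: the table has rows, but uid contains duplicates.
actions = decide_actions([
    RuleResult("Table is not empty", "strong", passed=True),
    RuleResult("uid has no duplicates", "weak", passed=False),
])
print(actions)  # {'send_alert': True, 'block_downstream': False}
```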

Step 3: Test the monitor

After configuring the rules, perform a test run to verify that they work as expected. A test run helps ensure that your rule configurations are correct before you deploy them.

In the Monitor Perspective section of the Rule Management tab, select the created monitor. Then, click Test Run on the right side of the tab. The Test Run dialog box appears.
In the Test Run dialog box, configure the Scheduling Time parameter and click Test Run.
After the test run is complete, click View Details to check whether the data passes the validation checks.

Step 4: Subscribe to monitor alerts

After configuring the monitoring rules, subscribe to alerts to ensure you are notified of any issues. The following steps show how to configure notification methods and recipients.

In the Monitor Perspective section of the Rule Management tab, select the created monitor.
Click Alert Subscription on the right side of the tab.
In the Alert Subscription dialog box, configure the Notification Method and Recipient parameters, and click Save in the Actions column.
After the subscription configuration is complete, choose Quality O&M > Monitor in the left-side navigation pane. Then, select My Subscriptions on the Monitor page to view and modify the subscribed monitors.

What to do next

After the data is processed, you can use DataAnalysis to visualize the data. For more information, see Visualize data on a dashboard.