Use DataWorks Data Quality to detect data anomalies in EMR Serverless Spark tables during daily scheduled runs. This tutorial walks through configuring strong and weak monitoring rules for the ods_user_info_d_spark table, running a test, and subscribing to alert notifications.
Prerequisites
Before you begin, make sure you have:
Synchronized ods_user_info_d from ApsaraDB RDS for MySQL to ods_user_info_d_spark in an EMR Serverless Spark workspace via Data Integration
Synchronized user_log.txt from Object Storage Service (OSS) to ods_raw_log_d_spark in an EMR Serverless Spark workspace via Data Integration
Processed the collected data into basic user profile data in Data Studio
Monitoring requirements
In this tutorial, Data Quality monitors the ods_user_info_d_spark table for two types of anomalies that can occur during extract, transform, and load (ETL) operations: missing source data and duplicate business primary keys. The following table summarizes the monitoring requirements for all tables in the user profile analysis pipeline.
| Table | Monitoring requirement |
|---|---|
| ods_raw_log_d_spark | Strong rule: row count > 0 daily |
| ods_user_info_d_spark | Strong rule: row count > 0 daily; weak rule: business primary key (uid) unique daily |
| dwd_log_info_di_spark | No monitoring rule |
| dws_user_info_all_di_spark | No monitoring rule |
| ads_user_info_1d_spark | Rule: monitors row count fluctuation daily (tracks unique visitors) |
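The two checks on ods_user_info_d_spark come down to a row count test and a duplicate-key test. The following Python sketch illustrates that logic on in-memory rows. It is only an illustration: Data Quality runs these checks itself, and the function names and sample rows are hypothetical.

```python
from collections import Counter

def row_count_ok(rows):
    """Strong-rule condition: the partition must contain at least one row."""
    return len(rows) > 0

def duplicate_uids(rows, key="uid"):
    """Return the key values that appear more than once, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

# Hypothetical sample rows standing in for one daily partition.
sample = [{"uid": "0001"}, {"uid": "0002"}, {"uid": "0001"}]
print(row_count_ok(sample))    # True
print(duplicate_uids(sample))  # ['0001']
```

An empty partition fails the first check, and any non-empty list returned by the second check corresponds to a uniqueness violation.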
Rule types
Before configuring rules, understand how each rule type handles violations:
| Rule type | On violation | When to use |
|---|---|---|
| Strong rule | Triggers an alert and blocks all descendant nodes from running | Data that must be present for downstream jobs to produce valid results (e.g., row count > 0) |
| Weak rule | Triggers an alert only; downstream nodes continue running | Data quality issues worth tracking but not critical enough to halt the pipeline (e.g., duplicate primary keys) |
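The difference between the two rule types can be sketched as a small decision function: every failed rule raises an alert, but only a failed strong rule blocks descendant nodes. This is a simplified model of the behavior described in the table, not DataWorks code; RuleResult and evaluate are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class RuleResult:
    name: str
    strength: str  # "strong" or "weak"
    passed: bool

def evaluate(results):
    """Return (alerts, block_downstream) for a list of RuleResult objects."""
    failed = [r for r in results if not r.passed]
    alerts = [r.name for r in failed]                    # every violation alerts
    block = any(r.strength == "strong" for r in failed)  # only strong rules block
    return alerts, block

results = [
    RuleResult("row_count_gt_0", "strong", True),
    RuleResult("uid_unique", "weak", False),
]
print(evaluate(results))  # (['uid_unique'], False)
```

In this example the weak uniqueness rule fails, so an alert is raised but downstream nodes keep running.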
Step 1: Go to the Configure by Table page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Governance > Data Quality. On the page that appears, select the desired workspace and click Go to Data Quality.
In the left-side navigation pane of the Data Quality page, choose Configure Rules > Configure by Table.
On the Configure by Table page, filter by connection: in the Connection section, select E-MapReduce. Use the filter fields on the right to find the ods_user_info_d_spark table.
In the search results, click Rule Management in the Actions column for the ods_user_info_d_spark table.
Step 2: Configure monitoring rules
On the Monitor tab, click Create Monitor.
Set the Data Range parameter to dt=$[yyyymmdd-1] to target the previous day's partition.
The dt=$[yyyymmdd-1] expression resolves to the scheduling date minus one day, which is the partition that the daily sync node writes during the current day's scheduled run. This ensures the monitor checks the data that the sync node has just produced.
Click Create Rule. In the Create Rule panel, go to the System Template tab and configure two rules:
Rule 1: Row count check (strong rule)
Find the Table is not empty template and click Use. On the right side of the panel, set Degree of Importance to Strong Rule. When the row count in ods_user_info_d_spark is 0, this rule triggers an alert and blocks descendant nodes.
Rule 2: Primary key uniqueness check (weak rule)
Find the Unique value. fixed value template and click Use. Configure the following parameters:
| Parameter | Value |
|---|---|
| Rule Scope | uid(STRING) |
| Monitoring Threshold | Normal threshold; comparison operator =; value 0 |
| Degree of Importance | Weak rules |
A threshold value of 0 means the rule flags any duplicate uid values (non-zero duplicate count).
Click Determine to save both rules.
Set the Trigger Method to Triggered by Node Scheduling in Production Environment and select the ods_user_info_d_spark node created during data synchronization.
Set the handling policy to either block the node from running or send an alert notification to the recipient, based on your requirements.
Click Save.
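To make the Data Range expression concrete, the sketch below mimics how dt=$[yyyymmdd-1] maps a scheduling time to a partition value. It assumes the expression subtracts one day from the scheduling time and formats the result as yyyymmdd; resolve_partition is a hypothetical helper, not a DataWorks API.

```python
from datetime import datetime, timedelta

def resolve_partition(scheduling_time, offset_days=-1):
    """Shift the scheduling time by offset_days and format it as dt=yyyymmdd."""
    shifted = scheduling_time + timedelta(days=offset_days)
    return "dt=" + shifted.strftime("%Y%m%d")

# A run scheduled for 2024-03-01 00:30 checks the partition for 2024-02-29.
print(resolve_partition(datetime(2024, 3, 1, 0, 30)))  # dt=20240229
```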
Step 3: Run a test
Before relying on the monitor in production, verify the rule configurations with a test run.
In the Monitor Perspective section of the Rule Management tab, select the monitor you created.
Click Test Run on the right side of the tab.
In the Test Run dialog, set the Scheduling Time to a date when data exists in the dt=$[yyyymmdd-1] partition, then click Test Run.
After the run completes, click View Details to check the results.
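When choosing the Scheduling Time for a test run, it can help to work backward from a partition that is known to contain data. The sketch below assumes that dt=$[yyyymmdd-1] resolves to the scheduling date minus one day, so the scheduling date is the partition date plus one day; scheduling_date_for is a hypothetical helper.

```python
from datetime import datetime, timedelta

def scheduling_date_for(partition_value):
    """Given a yyyymmdd partition value, return the scheduling date (YYYY-MM-DD)
    for which $[yyyymmdd-1] would resolve to that partition."""
    day = datetime.strptime(partition_value, "%Y%m%d")
    return (day + timedelta(days=1)).strftime("%Y-%m-%d")

# To test against partition dt=20240229, schedule the run for the next day.
print(scheduling_date_for("20240229"))  # 2024-03-01
```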
Step 4: Subscribe to alerts
Configure an alert subscription so you receive notifications when the monitor detects a violation.
In the Monitor Perspective section of the Rule Management tab, select the monitor.
Click Alert Subscription.
In the Alert Subscription dialog, configure the Notification Method and Recipient, then click Save in the Actions column.
To view or modify your subscriptions, choose Quality O&M > Monitor in the left-side navigation pane, then select My Subscriptions.
What's next
After the data is processed, visualize it on a dashboard with DataAnalysis. For more information, see Visualize data on a dashboard.