Configure EMR Serverless Spark Data Quality Rules

Use DataWorks Data Quality to detect data anomalies in EMR Serverless Spark tables during daily scheduled runs. This tutorial walks through configuring strong and weak monitoring rules for the ods_user_info_d_spark table, running a test, and subscribing to alert notifications.

Prerequisites

Before you begin, make sure you have:

Synchronized ods_user_info_d from ApsaraDB RDS for MySQL to ods_user_info_d_spark in an EMR Serverless Spark workspace via Data Integration
Synchronized user_log.txt from Object Storage Service (OSS) to ods_raw_log_d_spark in an EMR Serverless Spark workspace via Data Integration
Processed the collected data into basic user profile data in Data Studio

Monitoring requirements

In this tutorial, Data Quality monitors the ods_user_info_d_spark table for two types of anomalies that can occur during extract, transform, and load (ETL) operations: missing source data and duplicate business primary keys. The following table summarizes the monitoring requirements for all tables in the user profile analysis pipeline.

Table	Monitoring requirement
`ods_raw_log_d_spark`	Strong rule: row count > 0 daily
`ods_user_info_d_spark`	Strong rule: row count > 0 daily; weak rule: business primary key (`uid`) unique daily
`dwd_log_info_di_spark`	No monitoring rule
`dws_user_info_all_di_spark`	No monitoring rule
`ads_user_info_1d_spark`	Rule: monitors row count fluctuation daily (tracks unique visitors)
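The two checks on ods_user_info_d_spark can be sketched as plain SQL aggregates. The snippet below is only an illustration of what the rules measure, not how Data Quality runs them: it uses an in-memory SQLite database as a stand-in for Spark, a hypothetical two-column schema (`uid`, `dt`), and a hypothetical helper named `run_checks`.

```python
import sqlite3


def run_checks(conn, table, key, partition):
    """Return (row_count, duplicate_key_count) for one dt partition."""
    # Row count check: the strong rule expects this to be greater than 0.
    row_count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE dt = ?", (partition,)
    ).fetchone()[0]
    # Uniqueness check: rows in excess of the distinct key values are duplicates.
    duplicate_count = conn.execute(
        f"SELECT COUNT(*) - COUNT(DISTINCT {key}) FROM {table} WHERE dt = ?",
        (partition,),
    ).fetchone()[0]
    return row_count, duplicate_count


# Hypothetical sample data: uid "u2" appears twice in the same partition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_user_info_d_spark (uid TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO ods_user_info_d_spark VALUES (?, ?)",
    [("u1", "20240101"), ("u2", "20240101"), ("u2", "20240101")],
)

rows, dups = run_checks(conn, "ods_user_info_d_spark", "uid", "20240101")
print(rows, dups)
```

In this sample the row count check would pass and the uniqueness check would be flagged, because one `uid` value is duplicated.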

Rule types

Before configuring rules, understand how each rule type handles violations:

Rule type	On violation	When to use
Strong rule	Triggers an alert and blocks all descendant nodes from running	Data that must be present for downstream jobs to produce valid results (e.g., row count > 0)
Weak rule	Triggers an alert only; downstream nodes continue running	Data quality issues worth tracking but not critical enough to halt the pipeline (e.g., duplicate primary keys)
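The difference between the two rule types can be modeled in a few lines. This is a minimal sketch of the behavior described in the table, assuming a hypothetical `RuleResult` record and `evaluate` function; it does not reflect how DataWorks implements rule handling internally.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    name: str      # hypothetical rule label
    strong: bool   # True for a strong rule, False for a weak rule
    passed: bool   # whether the check passed


def evaluate(results):
    """Return (alerts, block_descendants) for a list of rule results."""
    # Every violated rule raises an alert, regardless of its type.
    alerts = [r.name for r in results if not r.passed]
    # Only a violated strong rule blocks descendant nodes.
    block_descendants = any(r.strong and not r.passed for r in results)
    return alerts, block_descendants


# A violated weak rule alerts but lets downstream nodes continue.
weak_only = evaluate([
    RuleResult("row_count_gt_0", strong=True, passed=True),
    RuleResult("uid_unique", strong=False, passed=False),
])

# A violated strong rule alerts and blocks downstream nodes.
strong_failed = evaluate([
    RuleResult("row_count_gt_0", strong=True, passed=False),
    RuleResult("uid_unique", strong=False, passed=True),
])
```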

Step 1: Go to the Configure by Table page

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Governance > Data Quality. On the page that appears, select the desired workspace and click Go to Data Quality.
In the left-side navigation pane of the Data Quality page, choose Configure Rules > Configure by Table.
On the Configure by Table page, filter by connection: in the Connection section, select E-MapReduce. Use the filter fields on the right to find the ods_user_info_d_spark table.
In the search results, click Rule Management in the Actions column for the ods_user_info_d_spark table.

Step 2: Configure monitoring rules

On the Monitor tab, click Create Monitor.
Set the Data Range parameter to target the previous day's partition:
```
dt=$[yyyymmdd-1]
```
The dt=$[yyyymmdd-1] expression resolves to the date one day before the scheduling time, which is the partition that the daily sync node writes during that day's scheduled run. This ensures the monitor checks the data that the sync node has just produced.
Click Create Rule. In the Create Rule panel, go to the System Template tab and configure two rules:
Rule 1: Row count check (strong rule)
Find the Table is not empty template and click Use. On the right side of the panel, set Degree of Importance to Strong Rule. When the row count in ods_user_info_d_spark is 0, this rule triggers an alert and blocks descendant nodes.
Rule 2: Primary key uniqueness check (weak rule)
Find the Unique value. fixed value template and click Use. Configure the following parameters:
Parameter	Value
Rule Scope	`uid(STRING)`
Monitoring Threshold	Normal threshold; comparison operator `=`; value `0`
Degree of Importance	Weak rules
With a normal threshold of 0, the check passes only when no uid value is duplicated; any non-zero duplicate count is flagged.
Click Determine to save both rules.
Set the Trigger Method to Triggered by Node Scheduling in Production Environment and select the ods_user_info_d_spark node created during data synchronization.
Set the handling policy to either block the node from running or send an alert notification to the recipient, based on your requirements.
Click Save.
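To see which partition a given run will check, the Data Range expression can be resolved by hand. The sketch below assumes a hypothetical helper named `resolve_partition` and models only the day-offset form used in this tutorial (`yyyymmdd-1`), not the full DataWorks scheduling parameter syntax.

```python
from datetime import date, timedelta


def resolve_partition(scheduling_date: date, offset_days: int = -1) -> str:
    """Resolve dt=$[yyyymmdd-1] for a scheduling date (day offsets only)."""
    resolved = scheduling_date + timedelta(days=offset_days)
    # Format as yyyymmdd, matching the partition naming used in this tutorial.
    return f"dt={resolved:%Y%m%d}"


# A run scheduled for March 1, 2024 checks the partition for the previous day.
partition = resolve_partition(date(2024, 3, 1))
print(partition)  # dt=20240229
```

The same arithmetic applies when choosing a Scheduling Time for a test run: pick the day after the partition that contains data.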

Step 3: Run a test

Before relying on the monitor in production, verify the rule configurations with a test run.

In the Monitor Perspective section of the Rule Management tab, select the monitor you created.
Click Test Run on the right side of the tab.
In the Test Run dialog, set the Scheduling Time to a date when data exists in the dt=$[yyyymmdd-1] partition, then click Test Run.
After the run completes, click View Details to check the results.

Step 4: Subscribe to alerts

Configure an alert subscription so you receive notifications when the monitor detects a violation.

In the Monitor Perspective section of the Rule Management tab, select the monitor.
Click Alert Subscription.
In the Alert Subscription dialog, configure the Notification Method and Recipient, then click Save in the Actions column.
To view or modify your subscriptions, choose Quality O&M > Monitor in the left-side navigation pane, then select My Subscriptions.

What's next

After the data is processed, visualize it on a dashboard with DataAnalysis. For more information, see Visualize data on a dashboard.