Detect ETL Data Quality Issues in EMR Serverless Spark Pipelines

Data Quality monitors your scheduled pipeline tables in real time, catching issues like missing source data and duplicate primary keys before they corrupt downstream computations. This tutorial shows you how to configure two monitoring rules for the ods_user_info_d_spark table: a Strong Rule that blocks downstream nodes when the row count drops to zero, and a Weak Rule that flags duplicate business primary keys.

Prerequisites

Before you begin, ensure that you have:

Basic user information from ods_user_info_d (ApsaraDB RDS for MySQL) synchronized to ods_user_info_d_spark in an EMR Serverless Spark workspace via Data Integration
Website access logs from user_log.txt in Object Storage Service (OSS) synchronized to ods_raw_log_d_spark in an EMR Serverless Spark workspace via Data Integration
Collected data processed into basic user profile data in Data Studio

Monitoring requirements for this example

In this user profile analysis scenario, Data Quality detects two types of issues: missing source data from upstream pipelines and dirty data introduced during extract, transform, and load (ETL) operations. The following table shows the monitoring plan for each table in the pipeline.

Table	Monitoring rule
`ods_raw_log_d_spark`	Strong Rule: row count > 0 daily
`ods_user_info_d_spark`	Strong Rule: row count > 0 daily; Weak Rule: business primary key is unique daily
`dwd_log_info_di_spark`	No monitoring rule
`dws_user_info_all_di_spark`	No monitoring rule
`ads_user_info_1d_spark`	Rule monitoring daily row count fluctuation

The steps below configure monitoring rules for ods_user_info_d_spark.

Strong rules vs weak rules

Before configuring rules, understand how each rule type responds when a check fails:

Rule type	What happens on failure	When to use
Strong Rule	Triggers an alert and blocks all descendant nodes from running	Data is critical — missing or invalid data would corrupt downstream results
Weak Rule	Triggers an alert but allows downstream nodes to continue	Data issues are worth tracking but won't break downstream logic

In this example, the row count check is a Strong Rule because an empty ods_user_info_d_spark table means the daily sync failed and no downstream computation should proceed. The primary key uniqueness check is a Weak Rule because duplicate records are worth flagging, but they don't prevent downstream processing.
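The failure behavior in the table above can be summarized as a small decision function. This is only an illustrative sketch: the names `CheckOutcome` and `resolve_outcome` are hypothetical and are not part of any Data Quality API. It models just the alert-and-block semantics described in this section.

```python
from dataclasses import dataclass


@dataclass
class CheckOutcome:
    alert: bool             # whether an alert notification is sent
    block_downstream: bool  # whether descendant nodes are held back


def resolve_outcome(rule_type: str, passed: bool) -> CheckOutcome:
    """Map a rule type and a check result to the behavior described above."""
    if passed:
        # A passing check neither alerts nor blocks.
        return CheckOutcome(alert=False, block_downstream=False)
    if rule_type == "strong":
        # Strong Rule failure: alert and block all descendant nodes.
        return CheckOutcome(alert=True, block_downstream=True)
    if rule_type == "weak":
        # Weak Rule failure: alert, but let downstream nodes continue.
        return CheckOutcome(alert=True, block_downstream=False)
    raise ValueError(f"unknown rule type: {rule_type!r}")


print(resolve_outcome("strong", passed=False))
print(resolve_outcome("weak", passed=False))
```

Under this model, an empty ods_user_info_d_spark table (Strong Rule failure) stops the pipeline, while duplicate uid values (Weak Rule failure) only raise an alert.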

Step 1: Open the Configure by Table page

Log on to the DataWorks console. In the top navigation bar, select the region. In the left-side navigation pane, choose Data Governance > Data Quality. On the page that appears, select the workspace from the drop-down list and click Go to Data Quality.
In the left-side navigation pane, choose Configure Rules > Configure by Table.
On the Configure by Table page, set the Connection filter to E-MapReduce, then use the right-side filters to locate the ods_user_info_d_spark table.
In the search results, click Rule Management in the Actions column. The Table Quality Details page opens.

Step 2: Configure monitoring rules

Select a monitoring scope.
1. On the Monitor tab, click Create Monitor.
2. Set Data Range to dt=$[yyyymmdd-1].
Note: This expression resolves to the day before the scheduling date, which is the partition that the daily scheduled run writes, so the monitor checks the data produced by each run.
Add the row count rule.
1. On the Create Monitor page, click Create Rule. The Create Rule panel opens.
2. On the System Template tab, find the Table is not empty rule and click Use.
3. On the right side of the panel, set Degree of Importance to Strong Rule.
Note: When the row count in ods_user_info_d_spark is 0, the Strong Rule triggers an alert and blocks all descendant nodes from running.

Add the primary key uniqueness rule.

On the System Template tab, find the Unique value. fixed value rule and click Use.

Configure the following parameters on the right side of the panel:

Parameter	Value
Rule Scope	`uid(STRING)`
Monitoring Threshold	Normal threshold: comparison operator `=`, value `0`
Degree of Importance	Weak Rule

Click Determine to save the rules.
Set the trigger method. Set Trigger Method to Triggered by Node Scheduling in Production Environment and select the ods_user_info_d_spark node created during data synchronization.
Set the exception handling policy based on your requirements: block the node from running, or send an alert notification to the recipient.
Click Save.
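To make the two rules concrete, the following sketch shows one plausible way to express them as SQL checks. It is not how Data Quality computes its metrics internally. It uses an in-memory SQLite database as a stand-in for the Spark table, the sample rows are made up, and the helper `run_checks` is hypothetical. The sketch assumes the uniqueness rule with a normal threshold of `= 0` means "the number of duplicated uid values is 0".

```python
import sqlite3


def run_checks(conn, table, key_column, partition):
    """Return (row_count, duplicate_key_count) for one dt partition."""
    # Row count for the partition: the Strong Rule expects this to be > 0.
    row_count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE dt = ?", (partition,)
    ).fetchone()[0]
    # Number of key values that occur more than once: the Weak Rule expects 0.
    duplicate_keys = conn.execute(
        f"SELECT COUNT(*) FROM ("
        f"  SELECT {key_column} FROM {table} WHERE dt = ?"
        f"  GROUP BY {key_column} HAVING COUNT(*) > 1"
        f")",
        (partition,),
    ).fetchone()[0]
    return row_count, duplicate_keys


# Stand-in table with made-up sample rows; "u2" appears twice on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_user_info_d_spark (uid TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO ods_user_info_d_spark VALUES (?, ?)",
    [("u1", "20240101"), ("u2", "20240101"), ("u2", "20240101")],
)

row_count, duplicate_keys = run_checks(
    conn, "ods_user_info_d_spark", "uid", "20240101"
)
strong_rule_passed = row_count > 0      # "Table is not empty"
weak_rule_passed = duplicate_keys == 0  # uid is unique in the partition
print(row_count, duplicate_keys, strong_rule_passed, weak_rule_passed)
# 3 1 True False
```

With this sample data the row count check passes and the uniqueness check fails, so under the configuration above the monitor would raise an alert without blocking downstream nodes.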

Step 3: Run a test

After saving the monitor, run a test to verify that the rules work as expected before the first scheduled run.

In the Monitor Perspective section of the Rule Management tab, select the monitor you created.
Click Test Run on the right side of the tab. The Test Run dialog box opens.
Set the Scheduling Time parameter and click Test Run.
After the test completes, click View Details to check whether the data passes all rules.
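Because the monitor's Data Range is dt=$[yyyymmdd-1], the Scheduling Time chosen for the test run determines which partition is checked. The sketch below illustrates that mapping, assuming the expression means "the scheduling date minus one day, formatted as yyyymmdd". The function `resolve_partition` is a hypothetical helper for illustration and is not a DataWorks API.

```python
from datetime import datetime, timedelta


def resolve_partition(scheduling_time: datetime, offset_days: int = -1) -> str:
    """Resolve a dt=$[yyyymmdd-1]-style expression for a scheduling time."""
    # Shift the scheduling time by the day offset, then keep only the date.
    shifted = scheduling_time + timedelta(days=offset_days)
    return "dt=" + shifted.strftime("%Y%m%d")


# A test run scheduled for 2024-03-01 00:30 checks the 2024-02-29 partition.
print(resolve_partition(datetime(2024, 3, 1, 0, 30)))
# dt=20240229
```

If the test reports an empty table, first confirm that the partition for the resolved date actually exists before investigating the sync task.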

Step 4: Subscribe to alerts

Configure who receives notifications when a monitoring rule is triggered.

In the Monitor Perspective section of the Rule Management tab, select the monitor.
Click Alert Subscription on the right side of the tab.
In the Alert Subscription dialog box, set Notification Method and Recipient, then click Save in the Actions column.
To view and manage all subscriptions, choose Quality O&M > Monitor in the left-side navigation pane, then select My Subscriptions.

What's next

After the pipeline data is processed, use DataAnalysis to visualize results on a dashboard. For more information, see Visualize data on a dashboard.