Configure EMR-HOOK for Spark SQL to record data lineage and historical access information

EMR-HOOK is built into Spark 2 and Spark 3 in E-MapReduce (EMR) clusters. It captures SQL execution information, including data lineage and table access frequency, and reports it to Data Lake Formation (DLF) and DataWorks. This lets you track how data flows between tables and how often each table or partition is accessed, without modifying your Spark jobs.

Prerequisites

Before you begin, ensure that you have:

A DataLake or custom cluster with the Spark service selected. For more information, see Create a cluster.

Limitations

EMR-HOOK cannot collect SQL information from jobs running in a gateway deployed with EMR-CLI.
In versions earlier than EMR V5.16.0 or EMR V3.50.0, hive.exec.post.hooks (Hive) and spark.sql.queryExecutionListeners (Spark) settings cannot be synchronized to a gateway. EMR V5.16.0, EMR V3.50.0, and later versions support gateway synchronization and introduce the hive_aux_jars_path_gateway_only parameter, which lets you load a custom extension JAR exclusively on the gateway.

Usage notes

The default state of EMR-HOOK varies by cluster version. Check the following table before you proceed.

Cluster version	Default state	Action required
Earlier than EMR V5.14.0 or EMR V3.48.0	Enabled	No action needed; reconfigure only if you want to change behavior
EMR V5.14.0, EMR V3.48.0, or later	Disabled	Manually enable it as described in Enable EMR-HOOK for Spark
EMR V3.44 (custom cluster)	May be disabled	See FAQ for the manual enable steps
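
The version rule in the table can be sketched as a small Python helper, for example in a script that audits several clusters. This is an illustrative sketch, not part of EMR: the function names and the accepted label formats (such as `EMR V5.13.0` or `EMR-3.48.0`) are assumptions, and it does not model the custom-cluster exception for EMR V3.44 listed in the last row.

```python
import re

# Cutoffs copied from the table: from these versions on, EMR-HOOK is
# disabled by default. Each release line (5.x, 3.x) has its own cutoff.
_DISABLED_FROM = {5: (5, 14, 0), 3: (3, 48, 0)}


def parse_emr_version(label: str) -> tuple:
    """Turn a label such as 'EMR V5.13.0' or 'EMR-3.48.0' into (5, 13, 0)."""
    match = re.search(r"(\d+(?:\.\d+)+)", label)
    if match is None:
        raise ValueError(f"no version number found in {label!r}")
    parts = tuple(int(p) for p in match.group(1).split("."))
    # Pad to three components so 'V3.44' compares as (3, 44, 0).
    return parts + (0,) * (3 - len(parts))


def hook_enabled_by_default(label: str) -> bool:
    """Return True if the table lists EMR-HOOK as enabled by default.

    Hypothetical helper: custom clusters on EMR V3.44 may still have the
    hook disabled, which this function does not detect.
    """
    version = parse_emr_version(label)
    cutoff = _DISABLED_FROM.get(version[0])
    if cutoff is None:
        raise ValueError(f"unknown release line in {label!r}")
    return version < cutoff
```

A result of `False` means the manual enable procedure applies to that cluster.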

Enable EMR-HOOK for Spark

Step 1: Go to the Services tab

Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select a region and a resource group.
Find your cluster and click Services in the Actions column.

Step 2: Configure EMR-HOOK

On the Services tab, find the Spark 2 or Spark 3 service and click Configure.

On the Configure tab, set the following parameters on their respective subtabs.

Subtab	Parameter	Value
`spark-defaults.conf`	`spark.sql.queryExecutionListeners`	To enable: `com.aliyun.emr.meta.spark.listener.EMRQueryLogger`. To disable: leave blank. If left blank, EMR-HOOK stops working entirely, and the Data Overview tab for tables in the DLF console no longer displays File Visits within Last Day, File Visits within Last Seven Days, or File Visits within Last 30 Days.
`hive-site.xml`	`dlf.emrhook.webtracking`	`true` to enable access frequency reporting; `false` to disable.

Click Save. In the dialog box, set Execution Reason and click Save.
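
To confirm that the listener setting took effect, you could inspect the rendered `spark-defaults.conf` content on a cluster node. The following Python sketch assumes the file uses the usual `key value` or `key=value` line format and that the listener parameter holds a comma-separated list of class names. The function names are hypothetical, and the sketch checks only the Spark listener, not `dlf.emrhook.webtracking` in `hive-site.xml`.

```python
import re

LISTENER_KEY = "spark.sql.queryExecutionListeners"
EMR_HOOK_LISTENER = "com.aliyun.emr.meta.spark.listener.EMRQueryLogger"


def parse_spark_defaults(text: str) -> dict:
    """Parse 'key value' or 'key=value' lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split once, at the first '=' or run of whitespace after the key.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


def emr_hook_listener_enabled(text: str) -> bool:
    """Return True if the EMR-HOOK listener class is among the listeners."""
    value = parse_spark_defaults(text).get(LISTENER_KEY, "")
    listeners = [item.strip() for item in value.split(",")]
    return EMR_HOOK_LISTENER in listeners
```

Pass the text of the configuration file to `emr_hook_listener_enabled`. A blank or missing parameter yields `False`, which matches the disabled state described in the table.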

Step 3: Restart Spark

In the upper-right corner of the Configure tab, choose More > Restart.
In the dialog box, set Execution Reason and click OK.
In the Confirm dialog, click OK.

View results

After restarting Spark, EMR-HOOK begins collecting data. View the results in:

DLF console — table access frequency and data overview. See Data overview of data tables.
DataWorks console — data lineage. See View lineages.

FAQ

How do I enable EMR-HOOK on a custom cluster running EMR V3.44?

On the Configure tab of the Spark service, add the following JAR path to both class path parameters, then apply the changes as prompted.

Subtab	Parameter	Modification
`spark-defaults.conf`	`spark.driver.extraClassPath`	Append `/opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar`
`spark-defaults.conf`	`spark.executor.extraClassPath`	Append `/opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar`
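
Appending means keeping the existing value of each parameter and adding the JAR path as one more entry. The Python sketch below illustrates that edit, assuming the class path entries are separated by `:` as on Linux nodes. The function name is hypothetical and the helper only builds the new string; you still enter the result in the console.

```python
# JAR path taken from the table above.
EMR_HOOK_JAR = "/opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar"


def append_class_path(current: str, jar: str = EMR_HOOK_JAR) -> str:
    """Return `current` with `jar` appended, using ':' as the separator.

    Existing entries keep their order, empty entries are dropped, and the
    JAR is not added a second time if it is already present.
    """
    entries = [entry for entry in current.split(":") if entry]
    if jar not in entries:
        entries.append(jar)
    return ":".join(entries)
```

Apply the same function to the current values of `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, so that both parameters end with the same JAR path.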

What's next

To configure EMR-HOOK for Hive, see Use the Hive extension feature to record data lineage and historical access information.