EMR-HOOK is built into Spark 2 and Spark 3 in E-MapReduce (EMR) clusters. It captures SQL execution information — including data lineage and table access frequency — and reports it to Data Lake Formation (DLF) and DataWorks. This lets you track how data flows between tables and how often each table or partition is accessed, without modifying your Spark jobs.
Prerequisites
Before you begin, ensure that you have:
- A DataLake or custom cluster with the Spark service selected. For more information, see Create a cluster.
Limitations
- EMR-HOOK cannot collect SQL information from jobs running in a gateway deployed with EMR-CLI.
- In versions earlier than EMR V5.16.0 or EMR V3.50.0, the `hive.exec.post.hooks` (Hive) and `spark.sql.queryExecutionListeners` (Spark) settings cannot be synchronized to a gateway. EMR V5.16.0, EMR V3.50.0, and later versions support gateway synchronization and introduce the `hive_aux_jars_path_gateway_only` parameter, which lets you load a custom extension JAR exclusively on the gateway.
Usage notes
Whether EMR-HOOK is enabled by default depends on the cluster version. Check the following table before you proceed.
| Cluster version | Default state | Action required |
|---|---|---|
| Earlier than EMR V5.14.0 or EMR V3.48.0 | Enabled | No action needed; reconfigure only if you want to change behavior |
| EMR V5.14.0, EMR V3.48.0, or later | Disabled | Manually enable as described in the procedure below |
| EMR V3.44 (custom cluster) | May be disabled | See FAQ for the manual enable steps |
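The version rule in the table can be written as a small check. The following is a minimal sketch, not part of EMR: the function names and the accepted label formats (for example `EMR V5.14.0`) are assumptions made for illustration, and the thresholds are copied from the table above.

```python
import re

# Cutoffs from the table above; the 3.x and 5.x release lines each have their own.
DISABLED_FROM = {5: (5, 14, 0), 3: (3, 48, 0)}


def parse_emr_version(label):
    """Turn a label such as 'EMR V5.14.0' or 'EMR V3.44' into a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", label)
    if match is None:
        raise ValueError(f"no version number found in {label!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def default_hook_state(label):
    """Return 'enabled' or 'disabled' according to the version thresholds."""
    version = parse_emr_version(label)
    cutoff = DISABLED_FROM.get(version[0])
    if cutoff is None:
        raise ValueError(f"unsupported major version in {label!r}")
    return "disabled" if version >= cutoff else "enabled"
```

The sketch does not model the EMR V3.44 custom cluster row: for that case the function returns `enabled`, but the hook may still be disabled, so confirm the setting in the console.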
Enable EMR-HOOK for Spark
Step 1: Go to the Services tab
- Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
- In the top navigation bar, select a region and a resource group.
- Find your cluster and click Services in the Actions column.
Step 2: Configure EMR-HOOK
- On the Services tab, find the Spark 2 or Spark 3 service and click Configure.
- On the Configure tab, set the following parameters on their respective subtabs.

  | Subtab | Parameter | Value |
  |---|---|---|
  | spark-defaults.conf | `spark.sql.queryExecutionListeners` | To enable: `com.aliyun.emr.meta.spark.listener.EMRQueryLogger`. To disable: leave blank. If left blank, EMR-HOOK stops working entirely, and the Data Overview tab for tables in the DLF console no longer displays File Visits within Last Day, File Visits within Last Seven Days, or File Visits within Last 30 Days. |
  | hive-site.xml | `dlf.emrhook.webtracking` | `true` to enable access frequency reporting; `false` to disable. |

- Click Save. In the dialog box, set Execution Reason and click Save.
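To confirm that the listener setting took effect, you can inspect the text of `spark-defaults.conf`. The following is a minimal sketch that assumes you already have the file contents as a string; the helper names `parse_spark_defaults` and `emr_hook_enabled` are hypothetical, and the parser handles only simple `key value` and `key=value` lines.

```python
import re

LISTENER_KEY = "spark.sql.queryExecutionListeners"
EMR_HOOK_LISTENER = "com.aliyun.emr.meta.spark.listener.EMRQueryLogger"


def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text into a dict of key -> value."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split once on the first run of whitespace and/or '='.
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def emr_hook_enabled(text):
    """Return True if the EMR-HOOK listener is in the comma-separated listener list."""
    listeners = parse_spark_defaults(text).get(LISTENER_KEY, "")
    return EMR_HOOK_LISTENER in [name.strip() for name in listeners.split(",")]
```

A blank value for `spark.sql.queryExecutionListeners`, or a missing key, makes `emr_hook_enabled` return `False`, which matches the disabled state described in the table.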
Step 3: Restart Spark
- In the upper-right corner of the Configure tab, choose More > Restart.
- In the dialog box, set Execution Reason and click OK.
- In the Confirm dialog, click OK.
View results
After you restart Spark, EMR-HOOK begins collecting data. View the results in the following consoles:
- DLF console: table access frequency and data overview. See Data overview of data tables.
- DataWorks console: data lineage. See View lineages.
FAQ
How do I enable EMR-HOOK on a custom cluster running EMR V3.44?
On the Configure tab of the Spark service, add the following JAR path to both class path parameters, then apply the changes as prompted.
| Subtab | Parameter | Modification |
|---|---|---|
| spark-defaults.conf | `spark.driver.extraClassPath` | Append `/opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar` |
| spark-defaults.conf | `spark.executor.extraClassPath` | Append `/opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar` |
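Appending means adding the JAR to the end of the existing value, not replacing that value. The following is a minimal sketch of that operation, assuming the class path uses the Linux `:` separator; the helper name `append_to_classpath` is hypothetical.

```python
HOOK_JAR = "/opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar"


def append_to_classpath(current, jar=HOOK_JAR):
    """Append jar to a ':'-separated class path unless it is already listed."""
    entries = [entry for entry in current.split(":") if entry]
    if jar not in entries:
        entries.append(jar)
    return ":".join(entries)
```

Apply the same function to the current values of both `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, then paste the results into the console. Running it twice on the same value leaves the class path unchanged.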
What's next
To configure EMR-HOOK for Hive, see Use the Hive extension feature to record data lineage and historical access information.