E-MapReduce: Use the Spark SQL extension feature to record data lineage and historical access information

Last Updated: Mar 26, 2026

EMR-HOOK is built into Spark 2 and Spark 3 in E-MapReduce (EMR) clusters. It captures SQL execution information — including data lineage and table access frequency — and reports it to Data Lake Formation (DLF) and DataWorks. This lets you track how data flows between tables and how often each table or partition is accessed, without modifying your Spark jobs.

Prerequisites

Before you begin, ensure that you have:

  • A DataLake or custom cluster with the Spark service selected. For more information, see Create a cluster.

Limitations

  • EMR-HOOK cannot collect SQL information from jobs running in a gateway deployed with EMR-CLI.

  • In versions earlier than EMR V5.16.0 or EMR V3.50.0, hive.exec.post.hooks (Hive) and spark.sql.queryExecutionListeners (Spark) settings cannot be synchronized to a gateway. EMR V5.16.0, EMR V3.50.0, and later versions support gateway synchronization and introduce the hive_aux_jars_path_gateway_only parameter, which lets you load a custom extension JAR exclusively on the gateway.

Usage notes

The default state of EMR-HOOK varies by cluster version. Check the following table before you proceed.

Cluster version | Default state | Action required
Earlier than EMR V5.14.0 or EMR V3.48.0 | Enabled | No action needed; reconfigure only if you want to change the behavior
EMR V5.14.0, EMR V3.48.0, or later | Disabled | Enable manually as described in the procedure below
EMR V3.44 (custom cluster) | May be disabled | See the FAQ for the manual enable steps
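
The table above can be read as a simple version check. The following Python sketch is illustrative only: the function names, the `custom_cluster` flag, and the accepted version-string format (for example `EMR V5.14.0`) are assumptions made for this example, not part of any EMR tool or API.

```python
import re

# Thresholds taken from the table above: starting with these releases,
# EMR-HOOK is disabled by default in the corresponding release line.
DISABLED_FROM = {5: (5, 14, 0), 3: (3, 48, 0)}


def parse_version(text):
    """Extract (major, minor, patch) from strings such as 'EMR V5.14.0' or 'V3.44'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def emr_hook_default_state(version_text, custom_cluster=False):
    """Return 'enabled', 'disabled', or 'may be disabled' following the table."""
    version = parse_version(version_text)
    threshold = DISABLED_FROM.get(version[0])
    if threshold is None:
        raise ValueError(f"release line {version[0]}.x is not covered by the table")
    # Special case listed in the table: custom clusters on EMR V3.44.
    if custom_cluster and version[:2] == (3, 44):
        return "may be disabled"
    return "disabled" if version >= threshold else "enabled"
```

For example, `emr_hook_default_state("EMR V5.13.0")` returns `"enabled"`, while `emr_hook_default_state("EMR V5.14.0")` returns `"disabled"`. Treat the result as a hint only and confirm the actual setting on the Configure tab of the Spark service.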

Enable EMR-HOOK for Spark

Step 1: Go to the Services tab

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

  2. In the top navigation bar, select a region and a resource group.

  3. Find your cluster and click Services in the Actions column.

Step 2: Configure EMR-HOOK

  1. On the Services tab, find the Spark 2 or Spark 3 service and click Configure.

  2. On the Configure tab, set the following parameters on their respective subtabs.

    Subtab | Parameter | Value
    spark-defaults.conf | spark.sql.queryExecutionListeners | To enable: com.aliyun.emr.meta.spark.listener.EMRQueryLogger. To disable: leave blank. If left blank, EMR-HOOK stops working entirely, and the Data Overview tab for tables in the DLF console no longer displays File Visits within Last Day, File Visits within Last Seven Days, or File Visits within Last 30 Days.
    hive-site.xml | dlf.emrhook.webtracking | true to enable access frequency reporting; false to disable.

  3. Click Save. In the dialog box, set Execution Reason and click Save.
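
To double-check the listener setting outside the console, you could inspect the text of the effective spark-defaults.conf. The following Python sketch assumes you already have that text as a string (the file location on the cluster is not stated here). The helper names are hypothetical, and the parser handles only the common `key value` and `key=value` line forms. It does not check dlf.emrhook.webtracking in hive-site.xml.

```python
import re

LISTENER_KEY = "spark.sql.queryExecutionListeners"
EMR_HOOK_LISTENER = "com.aliyun.emr.meta.spark.listener.EMRQueryLogger"


def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text into a dict of key -> value."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Key, then optional whitespace and/or '=', then the rest as the value.
        match = re.match(r"([^\s=]+)\s*=?\s*(.*)", line)
        if match is None:
            continue  # malformed line, ignore it in this sketch
        settings[match.group(1)] = match.group(2).strip()
    return settings


def emr_hook_listener_enabled(text):
    """Return True if the EMR-HOOK listener class is among the configured listeners."""
    listeners = parse_spark_defaults(text).get(LISTENER_KEY, "")
    # The parameter holds a comma-separated list of listener class names.
    return EMR_HOOK_LISTENER in [name.strip() for name in listeners.split(",")]
```

A blank or missing spark.sql.queryExecutionListeners value makes `emr_hook_listener_enabled` return `False`, which matches the "leave blank to disable" behavior described in the table.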

Step 3: Restart Spark

  1. In the upper-right corner of the Configure tab, choose More > Restart.

  2. In the dialog box, set Execution Reason and click OK.

  3. In the Confirm dialog box, click OK.

View results

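After Spark restarts, EMR-HOOK begins collecting data and reporting it to DLF and DataWorks. For example, in the DLF console, the Data Overview tab for a table displays File Visits within Last Day, File Visits within Last Seven Days, and File Visits within Last 30 Days.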

FAQ

How do I enable EMR-HOOK on a custom cluster running EMR V3.44?

On the Configure tab of the Spark service, add the following JAR path to both class path parameters, then apply the changes as prompted.

Subtab | Parameter | Modification
spark-defaults.conf | spark.driver.extraClassPath | Append /opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar
spark-defaults.conf | spark.executor.extraClassPath | Append /opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar
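
Because both parameters may already hold other entries, "append" means adding the JAR to the end of the existing value, not replacing it. The following Python sketch shows one way to compute the new value. The helper name `append_classpath_entry` is hypothetical, and the sketch assumes the colon separator used for class paths on Linux.

```python
HOOK_JAR = "/opt/apps/EMRHOOK/emrhook-1.1.5/spark-hook-1.1.5-spark30.jar"


def append_classpath_entry(current_value, entry=HOOK_JAR, separator=":"):
    """Append entry to a class path string unless it is already present."""
    # Drop empty segments so a blank value or stray separators do not
    # produce leading, trailing, or doubled separators in the result.
    entries = [item for item in current_value.split(separator) if item]
    if entry not in entries:
        entries.append(entry)
    return separator.join(entries)


# Apply the same change to both parameters named in the table above.
existing = "/opt/a/*:/opt/b.jar"  # placeholder for the value currently configured
new_driver_classpath = append_classpath_entry(existing)
new_executor_classpath = append_classpath_entry(existing)
```

The check for an existing entry keeps the operation idempotent, so applying it twice does not add the JAR a second time.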

What's next

To configure EMR-HOOK for Hive, see Use the Hive extension feature to record data lineage and historical access information.