E-MapReduce: Use Hive extensions to record data lineage and access history

Last Updated: Mar 26, 2026

EMR-HOOK is integrated with Hive in E-MapReduce (EMR) clusters by default. It collects information about SQL jobs, specifically data lineage and the access frequency of tables and partitions, based on metadata managed in Data Lake Formation (DLF). After you enable EMR-HOOK, you can view lineage data in DataWorks and access frequency data in the DLF console.

Version compatibility

| EMR version | EMR-HOOK default state | Gateway parameter sync |
| --- | --- | --- |
| Earlier than V5.14.0 or V3.48.0 | Enabled | Not supported |
| V5.14.0, V3.48.0, or later | Disabled; must be enabled manually | Not supported |
| V5.16.0, V3.50.0, or later | Disabled; must be enabled manually | Supported; hive_aux_jars_path_gateway_only available |
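
The version rules in the table can be sketched as a small lookup. This is an illustrative helper, not part of EMR: the function names and the returned keys are assumptions, and only the V3.x and V5.x release lines named in the table are handled.

```python
import re
from typing import Tuple

# Thresholds copied from the compatibility table, keyed by major version.
MANUAL_ENABLE_FROM = {5: (5, 14, 0), 3: (3, 48, 0)}
GATEWAY_SYNC_FROM = {5: (5, 16, 0), 3: (3, 50, 0)}


def parse_emr_version(version: str) -> Tuple[int, int, int]:
    """Parse a string such as 'V5.16.0' into (major, minor, patch)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized EMR version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def emr_hook_support(version: str) -> dict:
    """Return the default state and gateway sync support for a version."""
    parsed = parse_emr_version(version)
    major = parsed[0]
    if major not in MANUAL_ENABLE_FROM:
        # The table only covers the V3.x and V5.x release lines.
        raise ValueError(f"no rule for EMR major version {major}")
    return {
        "enabled_by_default": parsed < MANUAL_ENABLE_FROM[major],
        "gateway_sync": parsed >= GATEWAY_SYNC_FROM[major],
    }
```

For example, `emr_hook_support("V5.14.0")` reports that EMR-HOOK is not enabled by default and that gateway synchronization is not supported.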

Prerequisites

Before you begin, ensure that you have:

  • A DataLake or custom cluster with the Hive service selected. For more information, see Create a cluster.

Limitations

  • EMR-HOOK cannot collect SQL job information from a gateway deployed using EMR-CLI.

  • On EMR versions earlier than V5.16.0 or V3.50.0, hive.exec.post.hooks (Hive) and spark.sql.queryExecutionListeners (Spark) settings cannot be synchronized to a gateway. On V5.16.0, V3.50.0, or later, synchronization is supported, and the hive_aux_jars_path_gateway_only parameter lets you load a custom JAR file exclusively on the gateway.

Enable EMR-HOOK for Hive

Step 1: Open the Hive configuration

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

  2. In the top navigation bar, select a region and a resource group.

  3. On the EMR on ECS page, find the cluster and click Services in the Actions column.

  4. On the Services tab, find the Hive service and click Configure.

Step 2: Set configuration parameters

On the Configure tab, update the following parameters. Parameters are organized by subtab.

hive-site.xml

| Parameter | Value |
| --- | --- |
| hive.exec.post.hooks | To enable EMR-HOOK, set this parameter to com.aliyun.emr.meta.hive.hook.LineageLoggerHook. To disable EMR-HOOK, leave this parameter empty. |
| dlf.emrhook.webtracking | true to enable access frequency reporting; false to disable it. |

hivemetastore-site.xml

| Parameter | Value |
| --- | --- |
| hive.metastore.event.listeners | To enable EMR-HOOK, set this parameter to com.aliyun.emr.meta.hive.listener.MetaStoreListener. To disable EMR-HOOK, leave this parameter empty. |
| hive.metastore.pre.event.listeners | To enable EMR-HOOK, set this parameter to com.aliyun.emr.meta.hive.listener.MetaStorePreAuditListener. To disable EMR-HOOK, leave this parameter empty. |
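
To double-check the values entered on both subtabs, a script along the following lines could scan the text of a Hive site file for the three class names. It is a minimal sketch: it assumes the file uses the standard Hadoop `<configuration>/<property>` layout and that each parameter may hold a comma-separated list of classes. The helper names are hypothetical, and how you obtain the file contents from the cluster is left open.

```python
import xml.etree.ElementTree as ET

# Expected class per parameter, taken from the tables in this step.
EXPECTED = {
    "hive-site.xml": {
        "hive.exec.post.hooks":
            "com.aliyun.emr.meta.hive.hook.LineageLoggerHook",
    },
    "hivemetastore-site.xml": {
        "hive.metastore.event.listeners":
            "com.aliyun.emr.meta.hive.listener.MetaStoreListener",
        "hive.metastore.pre.event.listeners":
            "com.aliyun.emr.meta.hive.listener.MetaStorePreAuditListener",
    },
}


def read_properties(site_xml: str) -> dict:
    """Return {name: value} for every <property> element in the XML text."""
    root = ET.fromstring(site_xml)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_settings(site_xml: str, expected: dict) -> list:
    """List the parameters whose value does not include the expected class."""
    props = read_properties(site_xml)
    missing = []
    for name, class_name in expected.items():
        configured = [v.strip() for v in props.get(name, "").split(",")]
        if class_name not in configured:
            missing.append(name)
    return missing


# Hypothetical file content used only to demonstrate the check.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.exec.post.hooks</name>
    <value>com.aliyun.emr.meta.hive.hook.LineageLoggerHook</value>
  </property>
</configuration>"""

print(missing_settings(SAMPLE_HIVE_SITE, EXPECTED["hive-site.xml"]))
```

An empty list means every expected class is present; any parameter name in the result still needs to be set on the corresponding subtab.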

Note

If EMR-HOOK is disabled, the Data Overview tab for a table in the DLF console no longer shows data in the File Visits within Last Day, File Visits within Last Seven Days, and File Visits within Last 30 Days columns.

Step 3: Save the configuration

  1. On the Configure tab, click Save.

  2. In the dialog box, set Execution Reason and click Save.

Step 4: Restart Hive

  1. In the upper-right corner of the Configure tab, choose More > Restart.

  2. In the dialog box, set Execution Reason and click OK.

  3. In the Confirm message, click OK.

View results

After Hive restarts, EMR-HOOK begins capturing data.

FAQ

How do I enable EMR-HOOK on a custom cluster running EMR V3.44?

On the Configure tab of the Hive service, add the JAR file path to hive_aux_jars_path on both subtabs, then apply the changes as prompted.

| Subtab | Parameter | Change |
| --- | --- | --- |
| hive-site.xml | hive_aux_jars_path | Append ,/opt/apps/EMRHOOK/emrhook-1.1.5/hive-hook-1.1.5-hive23.jar |
| hive-env.sh | hive_aux_jars_path | Append ,/opt/apps/EMRHOOK/emrhook-1.1.5/hive-hook-1.1.5-hive23.jar |
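
When preparing the new hive_aux_jars_path value by hand, it is easy to add the JAR twice or leave a stray comma. The sketch below shows one way to build the value; the function name is hypothetical and the existing value in the usage line is invented for illustration. Only the JAR path comes from the table above.

```python
# JAR path from the FAQ table.
HOOK_JAR = "/opt/apps/EMRHOOK/emrhook-1.1.5/hive-hook-1.1.5-hive23.jar"


def append_aux_jar(current: str, jar: str = HOOK_JAR) -> str:
    """Append jar to a comma-separated path list unless it is already there."""
    # Drop empty entries so stray commas do not accumulate.
    entries = [entry.strip() for entry in current.split(",") if entry.strip()]
    if jar not in entries:
        entries.append(jar)
    return ",".join(entries)


# Hypothetical existing value; the real one is shown on the Configure tab.
print(append_aux_jar("/opt/example/lib/custom-udf.jar"))
```

The helper is idempotent, so running it on a value that already contains the JAR returns the value unchanged. If the parameter is currently empty, it returns the JAR path without a leading comma.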
