EMR-HOOK is integrated with Hive in E-MapReduce (EMR) clusters by default. It captures two kinds of SQL job information, data lineage and table or partition access frequency, based on metadata managed in Data Lake Formation (DLF). After you enable EMR-HOOK, you can view lineage data in DataWorks and access frequency data in the DLF console.
Version compatibility
| EMR version | EMR-HOOK default state | Gateway parameter sync |
|---|---|---|
| Earlier than V5.14.0 or V3.48.0 | Enabled | Not supported |
| V5.14.0, V3.48.0, or later | Disabled — must enable manually | Not supported |
| V5.16.0, V3.50.0, or later | Disabled — must enable manually | Supported; hive_aux_jars_path_gateway_only available |
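The table above can be read as a small lookup. The following is a minimal sketch in Python, not part of any EMR SDK; the names `HookSupport`, `parse_version`, and `emr_hook_support` are hypothetical. It assumes that the major version number (3 or 5) selects which pair of thresholds applies.

```python
from typing import NamedTuple, Tuple


class HookSupport(NamedTuple):
    enabled_by_default: bool  # EMR-HOOK default state for the version
    gateway_sync: bool        # whether gateway parameter sync is supported


# Thresholds copied from the compatibility table, keyed by major version.
# Assumption: only the 3.x and 5.x release lines are covered.
_THRESHOLDS = {
    5: {"manual_enable": (5, 14, 0), "gateway_sync": (5, 16, 0)},
    3: {"manual_enable": (3, 48, 0), "gateway_sync": (3, 50, 0)},
}


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn 'V5.14.0' or 'V3.44' into a comparable (major, minor, patch) tuple."""
    numbers = [int(part) for part in text.strip().lstrip("Vv").split(".")]
    while len(numbers) < 3:
        numbers.append(0)  # treat a missing patch level as 0
    return (numbers[0], numbers[1], numbers[2])


def emr_hook_support(version: str) -> HookSupport:
    """Look up the default state and gateway sync support for an EMR version."""
    parsed = parse_version(version)
    limits = _THRESHOLDS.get(parsed[0])
    if limits is None:
        raise ValueError(f"No thresholds known for major version {parsed[0]}")
    return HookSupport(
        enabled_by_default=parsed < limits["manual_enable"],
        gateway_sync=parsed >= limits["gateway_sync"],
    )
```

For example, `emr_hook_support("V5.14.0")` reports that EMR-HOOK is disabled by default and that gateway parameter sync is not supported.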
Prerequisites
Before you begin, ensure that you have:
A DataLake or custom cluster with the Hive service selected. See Create a cluster.
Limitations
EMR-HOOK cannot collect SQL job information from a gateway deployed using EMR-CLI.
On EMR versions earlier than V5.16.0 or V3.50.0, hive.exec.post.hooks (Hive) and spark.sql.queryExecutionListeners (Spark) settings cannot be synchronized to a gateway. On V5.16.0, V3.50.0, or later, synchronization is supported, and the hive_aux_jars_path_gateway_only parameter lets you load a custom JAR file exclusively on the gateway.
Enable EMR-HOOK for Hive
Step 1: Open the Hive configuration
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select a region and a resource group.
On the EMR on ECS page, find the cluster and click Services in the Actions column.
On the Services tab, find the Hive service and click Configure.
Step 2: Set configuration parameters
On the Configure tab, update the following parameters. Parameters are organized by subtab.
hive-site.xml
| Parameter | Value |
|---|---|
| hive.exec.post.hooks | |
| dlf.emrhook.webtracking | true to enable access frequency reporting; false to disable |
hivemetastore-site.xml
| Parameter | Value |
|---|---|
| hive.metastore.event.listeners | |
| hive.metastore.pre.event.listeners | |
If EMR-HOOK is disabled, the Data Overview tab for a table in the DLF console no longer shows data in the File Visits within Last Day, File Visits within Last Seven Days, and File Visits within Last 30 Days columns.
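To double-check the settings outside the console, a copy of the rendered hive-site.xml can be inspected with a short script. The sketch below is an illustration only: it assumes the standard Hadoop-style `<configuration>/<property>` XML layout, the function names are hypothetical, and the hook class in the sample is a placeholder, not the real EMR-HOOK class name. It also treats an unset dlf.emrhook.webtracking as disabled, which is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Sample input. "com.example.PlaceholderHook" is a made-up value used only
# to make the sketch runnable; it is NOT the actual EMR-HOOK class name.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.exec.post.hooks</name>
    <value>com.example.PlaceholderHook</value>
  </property>
  <property>
    <name>dlf.emrhook.webtracking</name>
    <value>true</value>
  </property>
</configuration>"""


def read_hadoop_style_config(xml_text: str) -> Dict[str, str]:
    """Parse <property><name/><value/></property> entries into a dict."""
    root = ET.fromstring(xml_text)
    props: Dict[str, str] = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def emr_hook_status(props: Dict[str, str]) -> Dict[str, bool]:
    """Summarize the two hive-site.xml settings described in Step 2."""
    return {
        # Only checks that some post hook is configured, not which one.
        "post_hooks_set": bool(props.get("hive.exec.post.hooks")),
        # Assumption: a missing key is treated as reporting disabled.
        "access_frequency_reporting":
            props.get("dlf.emrhook.webtracking", "").lower() == "true",
    }
```

Calling `emr_hook_status(read_hadoop_style_config(SAMPLE_HIVE_SITE))` on the sample returns both flags as `True`. The same reader can be pointed at a copy of hivemetastore-site.xml to confirm that the two listener properties are present.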
Step 3: Save the configuration
On the Configure tab, click Save.
In the dialog box, set Execution Reason and click Save.
Step 4: Restart Hive
In the upper-right corner of the Configure tab, choose More > Restart.
In the dialog box, set Execution Reason and click OK.
In the Confirm message, click OK.
View results
After Hive restarts, EMR-HOOK begins capturing data.
Access frequency: In the DLF console, open a table and click Data Overview. See Data overview of data tables.
Data lineage: In the DataWorks console, open the lineage view. See View lineages.
FAQ
How do I enable EMR-HOOK on a custom cluster running EMR V3.44?
On the Configure tab of the Hive service, add the JAR file path to hive_aux_jars_path on both subtabs, then apply the changes as prompted.
| Subtab | Parameter | Change |
|---|---|---|
| hive-site.xml | hive_aux_jars_path | Append ,/opt/apps/EMRHOOK/emrhook-1.1.5/hive-hook-1.1.5-hive23.jar |
| hive-env.sh | hive_aux_jars_path | Append ,/opt/apps/EMRHOOK/emrhook-1.1.5/hive-hook-1.1.5-hive23.jar |
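Because hive_aux_jars_path is a comma-separated list, appending the same path twice is an easy mistake when the change is repeated on both subtabs. The helper below is a hedged sketch of the append step. The function name `append_aux_jar` is hypothetical, and the script only computes the new value; the value still has to be entered on the Configure tab.

```python
# JAR path taken from the table above.
EMR_HOOK_JAR = "/opt/apps/EMRHOOK/emrhook-1.1.5/hive-hook-1.1.5-hive23.jar"


def append_aux_jar(current: str, jar: str = EMR_HOOK_JAR) -> str:
    """Return the hive_aux_jars_path value with `jar` appended exactly once."""
    # Split on commas and drop empty entries left by stray separators.
    entries = [entry.strip() for entry in current.split(",") if entry.strip()]
    if jar not in entries:
        entries.append(jar)
    return ",".join(entries)
```

For an existing value such as `/opt/libs/custom.jar` (an example path, not one from EMR), the result is the original value followed by a comma and the EMR-HOOK JAR path. Running the helper again on its own output leaves the value unchanged.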
What's next
To capture lineage and access history for Spark jobs, see Use the Spark SQL extension feature to record data lineage and historical access information.