Analyze Table Access Frequency via JindoTable - E-MapReduce

JindoTable tracks how often Hive tables and partitions are accessed, so you can identify hot and cold data, reduce storage costs, and improve cache hit rates.

How it works

JindoTable injects a post-execution listener into Apache Hive and Apache Spark. After each query completes, the listener records which tables and partitions were accessed and stores the data in the namespaces of the SmartData service on your cluster.

Component	Listener class	Configuration parameter
Hive	`com.aliyun.emr.table.hive.HivePostHook`	`hive.exec.post.hooks`
Spark	`com.aliyun.emr.table.spark.SparkSQLQueryListener`	`spark.sql.queryExecutionListeners`

Access record collection is enabled by default. To disable it, see Disable access record collection.
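
As a rough illustration of the table above, the following Python sketch checks whether a listener class is present in its configuration parameter. It assumes the parameter value is a comma-separated list of class names, which is how Hive and Spark conventionally treat these two parameters. The `LISTENERS` mapping, the `collection_enabled` helper, and the plain-dict `config` argument are hypothetical names used for illustration only; they are not part of JindoTable.

```python
# Parameter names and listener classes taken from the table above.
LISTENERS = {
    "hive": (
        "hive.exec.post.hooks",
        "com.aliyun.emr.table.hive.HivePostHook",
    ),
    "spark": (
        "spark.sql.queryExecutionListeners",
        "com.aliyun.emr.table.spark.SparkSQLQueryListener",
    ),
}


def collection_enabled(engine, config):
    """Return True if the engine's listener class appears in its parameter.

    `config` is a plain dict mapping parameter names to string values.
    This is an assumption for the sketch; how you read the values from
    your cluster is out of scope here.
    """
    parameter, listener = LISTENERS[engine]
    value = config.get(parameter, "")
    # Assumption: the value is a comma-separated list of class names.
    classes = [item.strip() for item in value.split(",")]
    return listener in classes
```

For example, `collection_enabled("spark", {})` returns `False` because the parameter is absent from the supplied dict.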

Query access frequency statistics

Run the following command to retrieve the top N most-accessed tables or partitions within a time window:

jindo table -accessStat -d <days> -n <topNums>

Parameters:

Parameter	Type	Description
`-d <days>`	Positive integer	Number of days to look back. Setting `-d 1` queries records from 00:00 local time on the current day to the current time.
`-n <topNums>`	Positive integer	Number of top results to return, ranked by access frequency.
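
The look-back window for `-d` can be sketched in Python as follows. Only the `-d 1` case is stated above (from 00:00 local time on the current day). For larger values, this sketch assumes that each additional day moves the start back by one calendar day; that is an assumption and is not confirmed here. The `window_start` helper is a hypothetical name.

```python
from datetime import datetime, time, timedelta


def window_start(now, days):
    """Return the assumed start of the look-back window as a naive local datetime.

    days=1 starts at 00:00 on the current day, as described above.
    For days > 1 the start is ASSUMED to be 00:00 local time,
    (days - 1) calendar days before the current day.
    """
    if not isinstance(days, int) or days < 1:
        raise ValueError("days must be a positive integer")
    first_day = now.date() - timedelta(days=days - 1)
    return datetime.combine(first_day, time.min)
```

Under that assumption, a query run at 15:30 on a given day with `-d 1` covers 15.5 hours rather than a full 24 hours.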

Example: Get the 20 most-accessed tables or partitions over the last seven days.

jindo table -accessStat -d 7 -n 20
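
If you want to call the command from a script, a minimal Python sketch might look like the following. It validates that both arguments are positive integers, as the parameter table requires, and builds the same command line as the example above. The helper names `build_access_stat_command` and `run_access_stat` are hypothetical. The output format of the command is not described in this topic, so the sketch returns the raw text without parsing it.

```python
import subprocess


def build_access_stat_command(days, top_n):
    """Build the argument list for `jindo table -accessStat`."""
    for name, value in (("days", days), ("top_n", top_n)):
        # Both -d and -n must be positive integers.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return ["jindo", "table", "-accessStat", "-d", str(days), "-n", str(top_n)]


def run_access_stat(days, top_n):
    """Run the command on a cluster node and return its raw output as text.

    Assumes the `jindo` executable is on the PATH of the current user.
    """
    result = subprocess.run(
        build_access_stat_command(days, top_n),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `build_access_stat_command(7, 20)` yields the argument list for the example above without executing anything, which is useful for logging or dry runs.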

For more information about JindoTable capabilities, see Use JindoTable.

Disable access record collection

To stop collecting access records, remove the listener class from the Hive or Spark service configuration, then restart the service.

Note: Perform the following steps separately for Hive and Spark, depending on the engines for which you want to stop collecting access records.

Step 1: Log on to the EMR console

Log on to the Alibaba Cloud EMR console.
In the top navigation bar, select the region where your cluster resides and select a resource group.
Click the Cluster Management tab.
Find your cluster and click Details in the Actions column.

Step 2: Remove the listener from service configuration

For Hive:

In the left-side navigation pane, choose Cluster Service > Hive.
Click the Configure tab.
In the Service Configuration section, click the hive-site tab.
In the Configuration Filter section, search for `hive.exec.post.hooks`.
Delete `com.aliyun.emr.table.hive.HivePostHook` from the parameter value.

For Spark:

In the left-side navigation pane, choose Cluster Service > Spark.
Click the Configure tab.
In the Service Configuration section, click the spark-defaults tab.
In the Configuration Filter section, search for `spark.sql.queryExecutionListeners`.
Delete `com.aliyun.emr.table.spark.SparkSQLQueryListener` from the parameter value.
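
When the parameter lists more than one class, only the JindoTable listener should be deleted and the other entries kept. The Python sketch below illustrates that edit on the parameter value alone. It assumes the value is a comma-separated list of class names, and `remove_listener` is a hypothetical helper. It does not modify any cluster configuration.

```python
def remove_listener(value, listener):
    """Return the parameter value with `listener` removed.

    Assumes `value` is a comma-separated list of class names.
    Other entries keep their original order; empty entries are dropped.
    """
    kept = []
    for item in value.split(","):
        name = item.strip()
        if name and name != listener:
            kept.append(name)
    return ",".join(kept)
```

If the listener is the only entry, the result is an empty string, which corresponds to clearing the parameter value in the console.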

Step 3: Save the configuration

In the upper-right corner of the Service Configuration section, click Save.
In the Confirm Changes dialog box, enter a description and turn on Auto-update Configuration.
Click OK.

Step 4: Restart the service

For Hive:

In the upper-right corner of the page, choose Actions > Restart HiveServer2.
In the Cluster Activities dialog box, set the required parameters.
Click OK, then click OK again in the confirmation message.

For Spark:

In the upper-right corner of the page, choose Actions > Restart ThriftServer.
In the Cluster Activities dialog box, set the required parameters.
Click OK, then click OK again in the confirmation message.