JindoTable tracks how often Hive tables and partitions are accessed, so you can identify hot and cold data, reduce storage costs, and improve cache hit rates.
How it works
JindoTable injects a post-execution listener into Apache Hive and Apache Spark. After each query completes, the listener records which tables and partitions were accessed and stores the data in the namespaces of the SmartData service on your cluster.
| Component | Listener class | Configuration parameter |
|---|---|---|
| Hive | com.aliyun.emr.table.hive.HivePostHook | hive.exec.post.hooks |
| Spark | com.aliyun.emr.table.spark.SparkSQLQueryListener | spark.sql.queryExecutionListeners |
Access record collection is enabled by default. To disable it, see Disable access record collection.
Query access frequency statistics
Run the following command to retrieve the top N most-accessed tables or partitions within a time window:
jindo table -accessStat -d <days> -n <topNums>
Parameters:
| Parameter | Type | Description |
|---|---|---|
| -d <days> | Positive integer | Number of days to look back. Setting -d 1 queries records from 00:00 local time on the current day to the current time. |
| -n <topNums> | Positive integer | Number of top results to return, ranked by access frequency. |
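To make the look-back window concrete, the following Python sketch computes the start and end of the window for a given -d value. The function name access_window is hypothetical and is not part of JindoTable. Only the -d 1 case is documented above; the sketch assumes that each additional day moves the start back by one calendar day.

```python
from datetime import datetime, time, timedelta


def access_window(days: int, now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the look-back window for a -d value.

    Hypothetical helper for illustration; not part of JindoTable.
    """
    if days < 1:
        raise ValueError("days must be a positive integer")
    # Documented case: -d 1 starts at 00:00 local time on the current day.
    midnight_today = datetime.combine(now.date(), time.min)
    # Assumption: every extra day moves the start back one calendar day.
    start = midnight_today - timedelta(days=days - 1)
    return start, now
```

Under this assumption, calling access_window(7, now) at 15:30 on 10 May returns a window that starts at 00:00 on 4 May and ends at 15:30 on 10 May.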
Example: Get the 20 most-accessed tables or partitions over the last seven days.
jindo table -accessStat -d 7 -n 20
For more information about JindoTable capabilities, see Use JindoTable.
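If you run the statistics command from a script, it can help to validate both parameters before invoking it. The following Python sketch builds the argument list for the command shown above. The function name build_access_stat_command is hypothetical; only the jindo table -accessStat command and its -d and -n options come from this topic.

```python
def build_access_stat_command(days: int, top_n: int) -> list[str]:
    """Build the argument list for `jindo table -accessStat`.

    Hypothetical helper for illustration; it only assembles the command
    and does not run it.
    """
    for name, value in (("days", days), ("top_n", top_n)):
        # Both parameters must be positive integers (bool is rejected too).
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return ["jindo", "table", "-accessStat", "-d", str(days), "-n", str(top_n)]
```

On a cluster node where the jindo command is available, the returned list could be passed to subprocess.run(cmd, check=True).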
Disable access record collection
To stop collecting access records, remove the listener class from the Hive or Spark service configuration, then restart the service.
Step 1: Log on to the EMR console
Log on to the Alibaba Cloud EMR console.
In the top navigation bar, select the region where your cluster resides and select a resource group.
Click the Cluster Management tab.
Find your cluster and click Details in the Actions column.
Step 2: Remove the listener from service configuration
For Hive:
In the left-side navigation pane, choose Cluster Service > Hive.
Click the Configure tab.
In the Service Configuration section, click the hive-site tab.
In the Configuration Filter section, search for hive.exec.post.hooks. Delete com.aliyun.emr.table.hive.HivePostHook from the parameter value.
For Spark:
In the left-side navigation pane, choose Cluster Service > Spark.
Click the Configure tab.
In the Service Configuration section, click the spark-defaults tab.
In the Configuration Filter section, search for spark.sql.queryExecutionListeners. Delete com.aliyun.emr.table.spark.SparkSQLQueryListener from the parameter value.
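Either parameter may list other classes besides the JindoTable listener, so delete only the JindoTable class and keep the rest of the value. The following Python sketch illustrates that edit, assuming the value is a comma-separated list of class names. The function name remove_listener and the class org.example.AuditHook are hypothetical.

```python
def remove_listener(value: str, listener: str) -> str:
    """Remove one class name from a comma-separated parameter value.

    Hypothetical helper for illustration; assumes the value is a
    comma-separated list of fully qualified class names.
    """
    entries = [entry.strip() for entry in value.split(",")]
    # Drop empty entries and the listener being removed; keep the order.
    kept = [entry for entry in entries if entry and entry != listener]
    return ",".join(kept)


# Example with a hypothetical second hook configured next to the JindoTable hook.
current = "org.example.AuditHook,com.aliyun.emr.table.hive.HivePostHook"
updated = remove_listener(current, "com.aliyun.emr.table.hive.HivePostHook")
```

In this example, updated contains only org.example.AuditHook. If the JindoTable class is the only entry, the result is an empty value.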
Step 3: Save the configuration
In the upper-right corner of the Service Configuration section, click Save.
In the Confirm Changes dialog box, enter a description and turn on Auto-update Configuration.
Click OK.
Step 4: Restart the service
For Hive:
In the upper-right corner of the page, choose Actions > Restart HiveServer2.
In the Cluster Activities dialog box, set the required parameters.
Click OK, then click OK again in the confirmation message.
For Spark:
In the upper-right corner of the page, choose Actions > Restart ThriftServer.
In the Cluster Activities dialog box, set the required parameters.
Click OK, then click OK again in the confirmation message.