You can use JindoTable to collect access frequency statistics on tables and partitions, and separate cold and hot data based on these statistics. This helps you reduce storage costs and improve cache usage efficiency.

Prerequisites

An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.

Background information

JindoTable can be used to collect access frequency statistics on Hive tables. The collected data is saved in the namespaces of the SmartData service of a cluster.

In SmartData 3.2.X and later, JindoTable can be used to collect access frequency statistics on Spark, Hive, and Presto tables. This feature is enabled for Spark and Presto by default. If you want to disable the feature for Spark or Presto, perform the operations described in Disable the collection of access frequency statistics. The feature is disabled for Hive by default. If you want to enable the feature for Hive, perform the operations described in Enable the collection of access frequency statistics for Hive.

Query access frequency statistics

You can run a command to query access frequency statistics.
  • Syntax
    jindo table -accessStat <-d [days]> <-n [topNums]>

    Set days and topNums to positive integers. For example, if days is set to 1 and topNums is not specified, the access frequency statistics of all the tables or partitions that are visited on the current day (from 00:00 to the current time) are queried.

  • Description

    This command is used to query the first N tables or partitions that are visited the most frequently within a specific period of time, and the number of times each of them is visited.

  • For example, you can run the following command to query the first 20 tables or partitions that have been visited the most frequently within the last seven days, and the number of times each of them has been visited:
    jindo table -accessStat -d 7 -n 20

For more information about how to use JindoTable, see Use JindoTable.

Enable the collection of access frequency statistics for Hive

  1. Log on to the Alibaba Cloud EMR console.
  2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
  3. Click the Cluster Management tab.
  4. On the Cluster Management page, find your cluster and click Details in the Actions column.
  5. Modify the setting of a Hive parameter.
    1. In the left-side navigation pane, choose Cluster Service > Hive.
    2. On the Hive service page, click the Configure tab.
    3. Search for the hive.exec.post.hooks parameter and add com.aliyun.emr.table.hive.HivePostHook to the end of the existing parameter value. Separate the hooks with a comma (,).
  6. Save the configurations.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify Description and turn on Auto-update Configuration.
    3. Click OK.
  7. Restart the related service.
    1. In the upper-right corner of the Hive service page, choose Actions > Restart HiveServer2.
    2. In the Cluster Activities dialog box, specify Description.
    3. Click OK.
    4. In the Confirm message, click OK.

Disable the collection of access frequency statistics

  1. Log on to the Alibaba Cloud EMR console.
  2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
  3. Click the Cluster Management tab.
  4. On the Cluster Management page, find your cluster and click Details in the Actions column.
  5. Modify the setting of the related parameter.
    • Hive:
      1. In the left-side navigation pane, choose Cluster Service > Hive.
      2. On the Hive service page, click the Configure tab.
      3. Search for the hive.exec.post.hooks parameter and delete com.aliyun.emr.table.hive.HivePostHook from the parameter value. hive-site
    • Spark:
      1. In the left-side navigation pane, choose Cluster Service > Spark.
      2. On the Spark service page, click the Configure tab.
      3. Search for the spark.sql.queryExecutionListeners parameter and delete com.aliyun.emr.table.spark.SparkSQLQueryListener from the parameter value. spark_default
    • Presto:
      1. In the left-side navigation pane, choose Cluster Service > Presto.
      2. On the Presto service page, click the Configure tab.
      3. Search for the event-listener.name parameter and delete the parameter value.
  6. Save the configurations.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify Description and turn on Auto-update Configuration.
    3. Click OK.
  7. Restart the related service.
    • Hive:
      1. In the upper-right corner of the Hive service page, choose Actions > Restart HiveServer2.
      2. In the Cluster Activities dialog box, specify Description.
      3. Click OK.
      4. In the Confirm message, click OK.
    • Spark:
      1. In the upper-right corner of the Spark service page, choose Actions > Restart ThriftServer.
      2. In the Cluster Activities dialog box, specify Description.
      3. Click OK.
      4. In the Confirm message, click OK.
    • Presto:
      1. In the upper-right corner of the Presto service page, choose Actions > Restart All Components.
      2. In the Cluster Activities dialog box, specify Description.
      3. Click OK.
      4. In the Confirm message, click OK.