Use JindoTable to collect access frequency statistics on tables and partitions, then separate cold and hot data based on those statistics to reduce storage costs and improve cache efficiency.
Prerequisites
Before you begin, ensure that you have:
-
An E-MapReduce (EMR) cluster. For more information, see Create a cluster.
How it works
JindoTable tracks how often each table and partition is accessed and stores the statistics in the namespaces of the SmartData service on your cluster.
SmartData 3.2.X and later supports statistics collection for Spark, Hive, and Presto. The following table summarizes the default state and the configuration parameter for each engine:
| Engine | Default state | Configuration parameter |
|---|---|---|
| Spark | Enabled | spark.sql.queryExecutionListeners |
| Presto | Enabled | event-listener.name |
| Hive | Disabled | hive.exec.post.hooks |
To enable statistics collection for Hive, see Enable statistics collection for Hive. To disable it for Spark or Presto, see Disable statistics collection.
Query access frequency statistics
Use the jindo table -accessStat command to answer questions such as:
-
Which tables or partitions are accessed most this week?
-
Which partitions are good candidates for cold storage migration?
Syntax
jindo table -accessStat <-d [days]> <-n [topNums]>
Both days and topNums must be positive integers. If you omit -n, the command returns all tables and partitions accessed in the specified period.
Parameters
| Parameter | Description | Example |
|---|---|---|
-d |
Number of days to look back. A value of 1 covers from 00:00 on the current day to the current time. |
7 |
-n |
Number of top results to return, ranked by access frequency. | 20 |
Example
Run the following command to get the top 20 tables or partitions by access frequency over the last seven days:
jindo table -accessStat -d 7 -n 20
For more information about JindoTable, see Use JindoTable.
Enable statistics collection for Hive
-
Log on to the Alibaba Cloud EMR console.
-
In the top navigation bar, select the region where your cluster resides and select a resource group.
-
Click the Cluster Management tab.
-
On the Cluster Management page, find your cluster and click Details in the Actions column.
-
In the left-side navigation pane, choose Cluster Service > Hive.
-
On the Hive service page, click the Configure tab.
-
Search for the
hive.exec.post.hooksparameter and appendcom.aliyun.emr.table.hive.HivePostHookto the existing value, separated by a comma (,). -
Save the configuration:
-
In the upper-right corner of the Service Configuration section, click Save.
-
In the Confirm Changes dialog box, enter a description and turn on Auto-update Configuration.
-
Click OK.
-
-
Restart HiveServer2:
-
In the upper-right corner of the Hive service page, choose Actions > Restart HiveServer2.
-
In the Cluster Activities dialog box, enter a description and click OK.
-
In the confirmation message, click OK.
-
Disable statistics collection
Follow the steps below to disable statistics collection for the relevant engine.
-
Log on to the Alibaba Cloud EMR console.
-
In the top navigation bar, select the region where your cluster resides and select a resource group.
-
Click the Cluster Management tab.
-
On the Cluster Management page, find your cluster and click Details in the Actions column.
-
Navigate to the service configuration page for your engine and remove the relevant parameter value:
-
Hive: Choose Cluster Service > Hive > Configure tab. Search for
hive.exec.post.hooksand removecom.aliyun.emr.table.hive.HivePostHookfrom the value.
-
Spark: Choose Cluster Service > Spark > Configure tab. Search for
spark.sql.queryExecutionListenersand removecom.aliyun.emr.table.spark.SparkSQLQueryListenerfrom the value.
-
Presto: Choose Cluster Service > Presto > Configure tab. Search for
event-listener.nameand clear the parameter value.
-
-
Save the configuration:
-
In the upper-right corner of the Service Configuration section, click Save.
-
In the Confirm Changes dialog box, enter a description and turn on Auto-update Configuration.
-
Click OK.
-
-
Restart the service for your engine:
-
Hive: Choose Actions > Restart HiveServer2. In the Cluster Activities dialog box, enter a description, click OK, then click OK in the confirmation message.
-
Spark: Choose Actions > Restart ThriftServer. In the Cluster Activities dialog box, enter a description, click OK, then click OK in the confirmation message.
-
Presto: Choose Actions > Restart All Components. In the Cluster Activities dialog box, enter a description, click OK, then click OK in the confirmation message.
-
What's next
-
Use the access frequency statistics from
jindo table -accessStatto identify infrequently accessed tables and partitions, then move them to cold storage to reduce costs. -
For a full reference of JindoTable commands and tiered storage operations, see Use JindoTable.