JindoTable provides a native Optimized Row Columnar (ORC) reader that accelerates queries on ORC files. Query acceleration is disabled by default. Enabling it can improve the performance of Spark or Presto when reading ORC files.
Prerequisites
ORC files are stored in JindoFS or Object Storage Service (OSS).
Query acceleration is not supported for files stored in HDFS.
Improve the performance of Spark when reading data
Enable query acceleration.
Note: Acceleration applies only when Spark reads ORC files through the DataFrame or Spark SQL API.
Configure the global parameter.
For more information, see Configure the global parameter of Spark.
Configure job-level parameters.
You can add Spark startup parameters when you run Spark Shell or Spark SQL jobs.
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension
For more information about the configuration of jobs, see Configure a Spark Shell job or Configure a Spark SQL job.
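As a rough sketch of a job-level launch, the snippet below passes the extension classes listed above to Spark SQL for a single job. The table name orc_table is hypothetical, and the check for spark-sql on the PATH is only there so that the snippet degrades gracefully on a machine without Spark.

```shell
#!/usr/bin/env bash
# Extension classes from the startup parameter above.
# In --conf, a single "=" separates the key from the value.
SPARK_ACCEL_CONF="spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension"

# Hypothetical ORC-backed table; replace it with one of your own.
QUERY="SELECT COUNT(*) FROM orc_table"

if command -v spark-sql >/dev/null 2>&1; then
  # Run one statement with the extension enabled for this job only.
  spark-sql --conf "${SPARK_ACCEL_CONF}" -e "${QUERY}"
else
  echo "spark-sql not found; run this on a cluster node where Spark is installed."
fi
```

The same --conf argument can be passed to spark-shell to start an interactive session with acceleration enabled.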
Check whether query acceleration is enabled.
Open the web UI of Spark History Server.
On the SQL tab, view the executed query.
If JindoDataSourceV2Scan appears, query acceleration is enabled. Otherwise, check the configurations in Step 1.
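The same check can be approximated from the command line by searching a query plan for the scan node named above. This is a minimal sketch. It assumes that the plan text printed by EXPLAIN uses the same JindoDataSourceV2Scan operator name as the web UI, which the steps above do not confirm. The helpers has_jindo_scan and report_acceleration are hypothetical, as is the table orc_table in the usage comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the plan text on stdin mentions the JindoTable scan node.
has_jindo_scan() {
  grep -q "JindoDataSourceV2Scan"
}

# Hypothetical helper: reads plan text on stdin and prints a one-line verdict.
report_acceleration() {
  if has_jindo_scan; then
    echo "query acceleration is enabled"
  else
    echo "JindoDataSourceV2Scan not found; check the spark.sql.extensions setting"
  fi
}

# Possible usage on a cluster node (assumption: EXPLAIN shows the scan node by name):
#   spark-sql -e "EXPLAIN SELECT * FROM orc_table" | report_acceleration
```

The Spark History Server check described above remains the documented method. The script is only a convenience for quick, repeated checks.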
Improve the performance of Presto when reading data
Presto has a built-in catalog named hive-acc. Use the hive-acc catalog to enable query acceleration.
Example:
presto --server https://emr-header-1.cluster-xxx:7778/ --catalog hive-acc --schema default
In this example, emr-header-1.cluster-xxx is the hostname of the emr-header-1 node.
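For scripted use, a query can also be sent without opening an interactive session. The sketch below reuses the server address, catalog, and schema from the example above and adds the --execute option of the Presto CLI. The table name orc_table is hypothetical, and cluster-xxx remains a placeholder for your own cluster.

```shell
#!/usr/bin/env bash
# Values taken from the example above; cluster-xxx is a placeholder.
PRESTO_SERVER="https://emr-header-1.cluster-xxx:7778/"
PRESTO_CATALOG="hive-acc"   # built-in catalog that enables query acceleration
PRESTO_SCHEMA="default"

# Hypothetical ORC-backed table; replace it with one of your own.
QUERY="SELECT COUNT(*) FROM orc_table"

if command -v presto >/dev/null 2>&1; then
  # Run a single statement through the accelerated catalog and exit.
  presto --server "${PRESTO_SERVER}" \
         --catalog "${PRESTO_CATALOG}" \
         --schema "${PRESTO_SCHEMA}" \
         --execute "${QUERY}"
else
  echo "presto CLI not found; run this on a cluster node where Presto is installed."
fi
```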
Configure the global parameter of Spark
Go to the Spark service page.
Log on to the Alibaba Cloud EMR console.
In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
Click the Cluster Management tab.
On the Cluster Management page, find your cluster and click Details in the Actions column.
In the left-side navigation pane, choose the Spark service.
On the Spark service page, click the Configure tab.
Find the spark.sql.extensions parameter and change its value to io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension.
Save the configurations.
Click Save in the upper-right corner of the Service Configuration section.
In the Confirm Changes dialog box, specify Description and click OK.
Restart ThriftServer.
In the upper-right corner, choose the action that restarts ThriftServer.
In the Cluster Activities dialog box, specify Description and click OK.
In the Confirm message, click OK.
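After the restart, one way to confirm that the global value is in effect is to ask a new Spark SQL session for it. This is a sketch under assumptions: it relies on the Spark SQL SET command printing the current value of a key, it assumes that a spark-sql session started on a cluster node picks up the cluster-wide configuration, and extensions_include_jindo is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given text lists the JindoTable extension class.
extensions_include_jindo() {
  case "$1" in
    *com.aliyun.emr.sql.JindoTableExtension*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v spark-sql >/dev/null 2>&1; then
  # SET with a key and no value prints the current setting for that key.
  current="$(spark-sql -e "SET spark.sql.extensions" 2>/dev/null)"
  if extensions_include_jindo "${current}"; then
    echo "JindoTableExtension is configured globally"
  else
    echo "JindoTableExtension is missing; review the spark.sql.extensions value"
  fi
else
  echo "spark-sql not found; run this on a cluster node where Spark is installed."
fi
```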