Accelerate Spark, Presto, and Hive queries on ORC and Parquet files using JindoTable

Prerequisites

Before you begin, ensure that you have:

An E-MapReduce (EMR) cluster running V3.35.0 or later, or V4.9.0 or later
ORC or Parquet files stored in JindoFS or OSS

For instructions on creating an EMR cluster, see Create a cluster.

Supported engines and formats

The JindoTable native engine accelerates queries for the following engine and file format combinations:

Engine	ORC	Parquet
Spark2	Supported	Supported
Presto	Supported	Unsupported
Hive2	Unsupported	Supported
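
For scripts that decide whether a job can use the native engine, the table above can be encoded as a small lookup. The following is a minimal sketch in Python; the names SUPPORT_MATRIX and is_supported are illustrative and are not part of any EMR or JindoTable API.

```python
# Support matrix transcribed from the table above.
# Keys are (engine, file format); values say whether acceleration is supported.
SUPPORT_MATRIX = {
    ("spark2", "orc"): True,
    ("spark2", "parquet"): True,
    ("presto", "orc"): True,
    ("presto", "parquet"): False,
    ("hive2", "orc"): False,
    ("hive2", "parquet"): True,
}


def is_supported(engine: str, file_format: str) -> bool:
    """Return True if the engine/format pair is listed as Supported."""
    key = (engine.strip().lower(), file_format.strip().lower())
    # Combinations that are not in the table are treated as unsupported.
    return SUPPORT_MATRIX.get(key, False)


print(is_supported("Presto", "ORC"))  # True
print(is_supported("Hive2", "ORC"))   # False
```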

Limitations

Data of the binary type is not supported.
Partitioned tables that store the values of partition key columns in the data files are not supported.
User-defined schemas passed through spark.read.schema(userDefinedSchema) are not supported, because such a schema may conflict with the schema stored in the file.
Data of the date type must be in the YYYY-MM-DD format and fall within the range 1400-01-01 to 9999-12-31.
Queries on columns whose names differ only in letter case cannot be accelerated. For example, if a table has both an ID column and an id column, queries on those columns cannot be accelerated.
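
Several of these limitations can be checked before a job is submitted. The following Python sketch is one possible pre-flight check, assuming the table schema is available as a mapping of column names to type names (for example, copied from a DESCRIBE statement). The helper names date_is_accelerable and schema_problems are hypothetical; they are not provided by JindoTable.

```python
import re
from datetime import date

# Limits taken from the date limitation listed above.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
MIN_DATE = date(1400, 1, 1)
MAX_DATE = date(9999, 12, 31)


def date_is_accelerable(value: str) -> bool:
    """Check the YYYY-MM-DD format and the 1400-01-01 to 9999-12-31 range."""
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        parsed = date.fromisoformat(value)
    except ValueError:  # for example, month 13 or February 30
        return False
    return MIN_DATE <= parsed <= MAX_DATE


def schema_problems(columns: dict[str, str]) -> list[str]:
    """Return reasons why a {column name: type name} schema cannot be accelerated."""
    problems = []
    seen = {}  # lowercased column name -> first spelling encountered
    for name, type_name in columns.items():
        if type_name.strip().lower() == "binary":
            problems.append(f"column {name!r} has the binary type")
        key = name.lower()
        if key in seen:
            problems.append(f"columns {seen[key]!r} and {name!r} differ only in case")
        else:
            seen[key] = name
    return problems


for problem in schema_problems({"ID": "bigint", "id": "bigint", "payload": "binary"}):
    print(problem)
```

An empty list from schema_problems means only that none of the checked limitations were detected. The sketch does not cover the partition-column or user-defined-schema limitations.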

Enable query acceleration for Spark

Note

Query acceleration uses off-heap memory. Add --conf spark.executor.memoryOverhead=4g to your Spark task to allocate enough memory for the native engine.

Configure global parameters

To apply query acceleration to all Spark jobs in a cluster, set the spark.sql.extensions parameter globally:

Log on to the Alibaba Cloud EMR console.
In the top navigation bar, select the region where your cluster resides and select a resource group.
Click the Cluster Management tab.
Find your cluster and click Details in the Actions column.
In the left-side navigation pane, choose Cluster Service > Spark.
Click the Configure tab.

Find the spark.sql.extensions parameter and set its value to:

io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension

Click Save in the upper-right corner of the Service Configuration section.
In the Confirm Changes dialog box, specify Description and click OK.
Choose Actions > Restart ThriftServer in the upper-right corner.
In the Cluster Activities dialog box, specify Description and click OK.
In the Confirm message, click OK.

Configure job-level parameters

To enable query acceleration for a single Spark Shell or Spark SQL job, add the following startup parameter when you submit the job:

spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension

For configuration details, see Configure a Spark Shell job and Configure a Spark SQL job.
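
When jobs are launched from a script rather than from the console, the two settings on this page can be passed as --conf flags. The Python sketch below only assembles the command line; it assumes that spark-sql is on the PATH of the node where it runs, and the build_command helper and the SELECT 1 query are placeholders for illustration.

```python
import shlex

# Settings taken from this page: the extension list and the
# off-heap memory allowance recommended in the note above.
JINDO_CONF = {
    "spark.sql.extensions": (
        "io.delta.sql.DeltaSparkSessionExtension,"
        "com.aliyun.emr.sql.JindoTableExtension"
    ),
    "spark.executor.memoryOverhead": "4g",
}


def build_command(launcher: str, extra_args: list[str]) -> list[str]:
    """Build an argv list that passes each setting as a --conf flag."""
    argv = [launcher]
    for key, value in JINDO_CONF.items():
        argv += ["--conf", f"{key}={value}"]
    return argv + extra_args


# "SELECT 1" is a placeholder query.
argv = build_command("spark-sql", ["-e", "SELECT 1"])
print(shlex.join(argv))

# To actually launch the job on a cluster node, hand argv to subprocess:
#   import subprocess
#   subprocess.run(argv, check=True)
```

Keeping the arguments as a list avoids shell quoting problems when the query contains spaces or quotes. shlex.join is used only to print a copy-and-paste form of the command.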

Verify that acceleration is active

Open the Spark History Server web UI.
On the SQL tab, open the query execution for your job. If JindoDataSourceV2Scan appears in the execution plan, query acceleration is active. If it does not appear, review the configuration steps above.

Enable query acceleration for Presto

Presto includes a built-in catalog named hive-acc. Connect to this catalog to enable query acceleration:

presto --server https://emr-header-1.cluster-xxx:7778/ --catalog hive-acc --schema default

Replace emr-header-1.cluster-xxx with the hostname of your emr-header-1 node.

Note

Complex data types (MAP, STRUCT, and ARRAY) are not supported when using this feature in Presto.

Enable query acceleration for Hive

Important

If stable job scheduling is a priority, keep this feature disabled in Hive.

EMR Hive V2.3.7 (EMR V3.35.0) includes a JindoTable plug-in that accelerates Parquet queries. Set the hive.jindotable.native.enabled parameter to enable it.

Option 1: Set the parameter in your Hive job

set hive.jindotable.native.enabled=true;

Option 2: Set the parameter on the Hive configuration page (Hive on MapReduce and Hive on Tez)

On the Hive configuration page, click the hive-site.xml tab.
Add the custom parameter hive.jindotable.native.enabled and set its value to true.
Save the configuration and restart Hive.

Note

Complex data types (MAP, STRUCT, and ARRAY) are not supported when using this feature in Hive.
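
Because both Presto and Hive exclude complex types, it can help to list the affected columns before switching a workload over. The following Python sketch assumes that column types are available as Hive-style type strings such as array<string> or map<string,int>; the complex_columns helper is hypothetical and only inspects the type text.

```python
import re

# Matches MAP, STRUCT, or ARRAY followed by "<" or "(", in any letter case.
COMPLEX_TYPE = re.compile(r"\b(?:map|struct|array)\s*[<(]", re.IGNORECASE)


def complex_columns(columns: dict[str, str]) -> list[str]:
    """Return the names of columns whose declared type is or contains a complex type."""
    return [
        name
        for name, type_name in columns.items()
        if COMPLEX_TYPE.search(type_name)
    ]


# Example schema; the column names are placeholders.
schema = {
    "id": "bigint",
    "tags": "array<string>",
    "attrs": "MAP<string,int>",
    "name": "string",
}
print(complex_columns(schema))  # ['tags', 'attrs']
```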