JindoTable's native engine accelerates queries on ORC and Parquet files stored in JindoFS or OSS — no changes to your existing SQL or DataFrame code required. This topic describes how to enable query acceleration for Spark, Hive, and Presto in E-MapReduce (EMR).
Prerequisites
Before you begin, make sure that:
- An EMR cluster is created. For information on creating a cluster, see Create a cluster.
- ORC or Parquet files are stored in JindoFS or Object Storage Service (OSS).
Limitations
The following limitations apply to all engines:
- Binary files are not supported.
- Partitioned tables whose partition key column values are stored in data files are not supported.
- EMR clusters of V5.X.X or a later version are not supported.
- spark.read.schema(userDefinedSchema) is not supported.
- DATE values must use the YYYY-MM-DD format and fall within the range 1400-01-01 to 9999-12-31.
- If a table has two columns with the same name but different letter cases (for example, NAME and name), queries on those columns cannot be accelerated.
Engine and file format support:
| Engine | ORC | Parquet |
|---|---|---|
| Spark2 | Supported | Supported |
| Spark3 | Supported | Supported |
| Presto | Supported | Supported |
| Hive2 | Not supported | Supported |
| Hive3 | Not supported | Supported |
Engine and file system support:
| Engine | OSS | JindoFS | HDFS (Hadoop Distributed File System) |
|---|---|---|---|
| Spark2 | Supported | Supported | Supported |
| Presto | Supported | Supported | Supported |
| Hive2 | Supported | Supported | Not supported |
| Hive3 | Supported | Supported | Not supported |
Enable query acceleration for Spark
Query acceleration uses off-heap memory. Add --conf spark.executor.memoryOverhead=4g to your Spark job to allocate additional memory for acceleration.
To read data with the native engine, use the DataFrame API or Spark SQL — no other APIs are supported.
You can enable query acceleration globally for all Spark jobs or at the individual job level.
Option 1: Global configuration (console)
- Log on to the Alibaba Cloud EMR console.
- In the top navigation bar, select the region of your cluster.
- Click the Cluster Management tab.
- Find your cluster and click Details in the Actions column.
- In the left-side navigation pane, choose Cluster Service > Spark.
- Click the Configure tab.
- Search for the spark.sql.extensions parameter and set its value to io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension.
- Click Save in the upper-right corner. In the Confirm Changes dialog box, enter a description and click OK.
- Choose Actions > Restart ThriftServer in the upper-right corner. In the Cluster Activities dialog box, enter a description and click OK. Then click OK in the confirmation message.
Option 2: Job-level configuration
Add the following flag when submitting a Spark Shell or Spark SQL job:
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension
For details, see Configure a Spark Shell job or Configure a Spark SQL job.
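As an illustration, the job-level flag and the off-heap memory setting mentioned earlier can be combined in a single submission. This is a minimal sketch; the SQL file path is a placeholder, not from the original documentation:

```shell
# Sketch: run a Spark SQL job with the JindoTable extension enabled and
# extra off-heap memory reserved for the native engine.
# /path/to/query.sql is a placeholder for your own script.
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension \
  --conf spark.executor.memoryOverhead=4g \
  -f /path/to/query.sql
```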
Verify that acceleration is enabled
- Open the Spark History Server web UI.
- Click the SQL tab and find your query.
If JindoDataSourceV2Scan appears in the query plan, acceleration is active. If it does not appear, verify the spark.sql.extensions configuration in the steps above.
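You can also inspect a query plan directly from the command line. The sketch below assumes a Parquet table named sales_parquet, which is a hypothetical name used only for illustration:

```shell
# Sketch: print the physical plan for a query on a Parquet table.
# If the extension is active, the plan should contain a
# JindoDataSourceV2Scan node; sales_parquet is a placeholder table name.
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension \
  -e "EXPLAIN SELECT COUNT(*) FROM sales_parquet;"
```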
Enable query acceleration for Presto
Presto uses off-heap memory for query acceleration and supports high query concurrency. Make sure the cluster has more than 10 GB of available memory before enabling this feature.
Complex data types — MAP, STRUCT, and ARRAY — are not supported when using query acceleration with Presto.
The Presto service includes a built-in catalog named hive-acc that uses the native engine. Connect to Presto using this catalog to enable acceleration:
presto --server emr-header-1:9090 --catalog hive-acc --schema default
No other configuration is required.
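For a non-interactive check, a query can be passed directly to the CLI through the hive-acc catalog. This is a minimal sketch; the table name sales_parquet is a placeholder:

```shell
# Sketch: execute a single query through the built-in hive-acc catalog so
# that it runs on the native engine. sales_parquet is a placeholder name.
presto --server emr-header-1:9090 --catalog hive-acc --schema default \
  --execute "SELECT COUNT(*) FROM sales_parquet"
```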
Enable query acceleration for Hive
If your jobs have high stability requirements, do not enable query acceleration for Hive.
Complex data types — MAP, STRUCT, and ARRAY — are not supported when using query acceleration with Hive.
You can enable query acceleration using the EMR console or the Hive CLI.
Option 1: EMR console
- On the Hive service page, click the Configure tab.
- Search for the hive.jindotable.native.enabled parameter and set its value to true.
- Save the configuration and restart the Hive service.
This method applies to both Hive on MapReduce and Hive on Tez jobs.
Option 2: Hive CLI
Run the following command in the CLI:
set hive.jindotable.native.enabled=true;
The query acceleration plug-in for Parquet files is included by default in EMR V3.35.0 and later. No additional installation is required.
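For a one-off job, the CLI setting above can be combined with a query in a single invocation. This is a minimal sketch; orders_parquet is a hypothetical table name:

```shell
# Sketch: enable the native engine for one Hive session and run a query
# on a Parquet table. orders_parquet is a placeholder name.
hive -e "
set hive.jindotable.native.enabled=true;
SELECT COUNT(*) FROM orders_parquet;
"
```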