JindoTable's native engine accelerates queries on ORC and Parquet files stored in JindoFS or OSS — no changes to your existing SQL or DataFrame code required. This topic describes how to enable query acceleration for Spark, Hive, and Presto in E-MapReduce (EMR).
Prerequisites
Before you begin, make sure that:
- An EMR cluster is created. For information on creating a cluster, see Create a cluster.
- ORC or Parquet files are stored in JindoFS or Object Storage Service (OSS).
Limitations
The following limitations apply to all engines:
- Binary files are not supported.
- Partitioned tables whose partition key column values are stored in data files are not supported.
- EMR clusters of V5.X.X or a later version are not supported.
- spark.read.schema(userDefinedSchema) is not supported.
- DATE values must use the YYYY-MM-DD format and fall within the range 1400-01-01 to 9999-12-31.
- If a table has two columns with the same name but different letter cases (for example, NAME and name), queries on those columns cannot be accelerated.
Engine and file format support:
| Engine | ORC | Parquet |
|---|---|---|
| Spark2 | Supported | Supported |
| Spark3 | Supported | Supported |
| Presto | Supported | Supported |
| Hive2 | Not supported | Supported |
| Hive3 | Not supported | Supported |
Engine and file system support:
| Engine | OSS | JindoFS | HDFS (Hadoop Distributed File System) |
|---|---|---|---|
| Spark2 | Supported | Supported | Supported |
| Presto | Supported | Supported | Supported |
| Hive2 | Supported | Supported | Not supported |
| Hive3 | Supported | Supported | Not supported |
Enable query acceleration for Spark
Query acceleration uses off-heap memory. Add --conf spark.executor.memoryOverhead=4g to your Spark job to allocate additional memory for acceleration.
To read data with the native engine, use the DataFrame API or Spark SQL — no other APIs are supported.
You can enable query acceleration globally for all Spark jobs or at the individual job level.
Option 1: Global configuration (console)
- Log on to the Alibaba Cloud EMR console.
- In the top navigation bar, select the region of your cluster.
- Click the Cluster Management tab.
- Find your cluster and click Details in the Actions column.
- In the left-side navigation pane, choose Cluster Service > Spark.
- Click the Configure tab.
- Search for the spark.sql.extensions parameter and set its value to io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension.
- Click Save in the upper-right corner. In the Confirm Changes dialog box, enter a description and click OK.
- Choose Actions > Restart ThriftServer in the upper-right corner. In the Cluster Activities dialog box, enter a description and click OK. Then click OK in the confirmation message.
Option 2: Job-level configuration
Add the following flag when submitting a Spark Shell or Spark SQL job:
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension
For details, see Configure a Spark Shell job or Configure a Spark SQL job.
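As an illustration, the job-level flag and the off-heap memory setting mentioned earlier can be combined in a single submission. This is a minimal sketch; the SQL file path is a placeholder, not from the original documentation:

```shell
# Sketch: run a Spark SQL job with the JindoTable extension enabled and
# extra off-heap memory reserved for the native engine.
# /path/to/query.sql is a placeholder for your own script.
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension \
  --conf spark.executor.memoryOverhead=4g \
  -f /path/to/query.sql
```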
Verify that acceleration is enabled
- Open the Spark History Server web UI.
- Click the SQL tab and find your query.
If JindoDataSourceV2Scan appears in the query plan, acceleration is active. If it does not appear, verify the spark.sql.extensions configuration in the steps above.
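You can also inspect a query plan directly from the command line. The sketch below assumes a Parquet table named sales_parquet, which is a hypothetical name used only for illustration:

```shell
# Sketch: print the physical plan for a query on a Parquet table.
# If the extension is active, the plan should contain a
# JindoDataSourceV2Scan node; sales_parquet is a placeholder table name.
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension \
  -e "EXPLAIN SELECT COUNT(*) FROM sales_parquet;"
```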
Enable query acceleration for Presto
Presto uses off-heap memory for query acceleration and supports high query concurrency. Make sure the cluster has more than 10 GB of available memory before enabling this feature.
Complex data types — MAP, STRUCT, and ARRAY — are not supported when using query acceleration with Presto.
The Presto service includes a built-in catalog named hive-acc that uses the native engine. Connect to Presto using this catalog to enable acceleration:
presto --server emr-header-1:9090 --catalog hive-acc --schema default
No other configuration is required.
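For a non-interactive check, a query can be passed directly to the CLI through the hive-acc catalog. This is a minimal sketch; the table name sales_parquet is a placeholder:

```shell
# Sketch: execute a single query through the built-in hive-acc catalog so
# that it runs on the native engine. sales_parquet is a placeholder name.
presto --server emr-header-1:9090 --catalog hive-acc --schema default \
  --execute "SELECT COUNT(*) FROM sales_parquet"
```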
Enable query acceleration for Hive
If your jobs have high stability requirements, do not enable query acceleration for Hive.
Complex data types — MAP, STRUCT, and ARRAY — are not supported when using query acceleration with Hive.
You can enable query acceleration using the EMR console or the Hive CLI.
Option 1: EMR console
- On the Hive service page, click the Configure tab.
- Search for the hive.jindotable.native.enabled parameter and set its value to true.
- Save the configuration and restart the Hive service.
This method applies to both Hive on MapReduce and Hive on Tez jobs.
Option 2: Hive CLI
Run the following command in the CLI:
set hive.jindotable.native.enabled=true;
The query acceleration plug-in for Parquet files is included by default in EMR V3.35.0 and later. No additional installation is required.
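For a one-off job, the CLI setting above can be combined with a query in a single invocation. This is a minimal sketch; orders_parquet is a hypothetical table name:

```shell
# Sketch: enable the native engine for one Hive session and run a query
# on a Parquet table. orders_parquet is a placeholder name.
hive -e "
set hive.jindotable.native.enabled=true;
SELECT COUNT(*) FROM orders_parquet;
"
```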