JindoTable provides a native engine that accelerates Spark, Hive, and Presto queries on files in the ORC or Parquet format. This topic describes how to enable query acceleration based on the native engine for each of these engines.

Prerequisites

An E-MapReduce (EMR) cluster is created, and a file in the ORC or Parquet format is stored in JindoFS or Object Storage Service (OSS). For more information about how to create a cluster, see Create a cluster.

Limits

  • Binary files are not supported.
  • Partitioned tables that store the values of partition key columns in the data files are not supported.
  • EMR clusters of V5.X.X or later are not supported.
  • spark.read.schema (userDefinedSchema) is not supported.
  • Data of the DATE type must be in the YYYY-MM-DD format and range from 1400-01-01 to 9999-12-31.
  • If a table contains two columns that have the same column name with different letter cases, such as ID and id, queries on the columns cannot be accelerated.
  • The following table lists the supported Spark, Hive, and Presto engines and the file formats supported by each engine.
    Engine    ORC              Parquet
    Spark2    Supported        Supported
    Spark3    Supported        Supported
    Presto    Supported        Supported
    Hive2     Not supported    Supported
    Hive3     Not supported    Supported
  • The following table lists the supported Spark, Hive, and Presto engines and the file systems supported by each engine.
    Engine    OSS          JindoFS      HDFS
    Spark2    Supported    Supported    Supported
    Presto    Supported    Supported    Supported
    Hive2     Supported    Supported    Not supported
    Hive3     Supported    Supported    Not supported
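
Several of the limits above can be checked before a query is submitted. The following Python code is a minimal sketch of such a pre-check and is not part of JindoTable. The function names and the FORMAT_SUPPORT dictionary are hypothetical helpers introduced for illustration; the date bounds and the support values are copied from the limits above.

```python
import re
from datetime import date

# Bounds copied from the DATE limit above.
MIN_DATE = date(1400, 1, 1)
MAX_DATE = date(9999, 12, 31)
_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)

# File format support copied from the first table above.
FORMAT_SUPPORT = {
    "Spark2": {"ORC": True, "Parquet": True},
    "Spark3": {"ORC": True, "Parquet": True},
    "Presto": {"ORC": True, "Parquet": True},
    "Hive2": {"ORC": False, "Parquet": True},
    "Hive3": {"ORC": False, "Parquet": True},
}


def date_in_supported_range(text):
    """Return True if text is a valid YYYY-MM-DD date within the supported range."""
    if not _DATE_RE.fullmatch(text):
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        value = date(year, month, day)
    except ValueError:  # for example, month 13 or February 30
        return False
    return MIN_DATE <= value <= MAX_DATE


def case_conflicting_columns(columns):
    """Group column names that differ only by letter case, such as ID and id."""
    groups = {}
    for name in columns:
        groups.setdefault(name.lower(), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


def format_supported(engine, file_format):
    """Look up the format table; unknown engines or formats count as unsupported."""
    return FORMAT_SUPPORT.get(engine, {}).get(file_format, False)
```

For example, case_conflicting_columns(["ID", "id", "name"]) returns the single group ["ID", "id"], which indicates that queries on those columns cannot be accelerated.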

Improve the performance of Spark

  1. Enable query acceleration for ORC or Parquet files in JindoTable.
    Note
    • Query acceleration consumes off-heap memory. We recommend that you add --conf spark.executor.memoryOverhead=4g to a Spark task to request additional memory for query acceleration.
    • To read data from an ORC or Parquet file in Spark with query acceleration, you must use the DataFrame API or Spark SQL.
    • Global configuration
      1. Go to the Cluster Overview page of your cluster.
        1. Log on to the Alibaba Cloud EMR console.
        2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
        3. Click the Cluster Management tab.
        4. On the Cluster Management page, find your cluster and click Details in the Actions column.
      2. Modify the related parameter.
        1. In the left-side navigation pane, choose Cluster Service > Spark.
        2. On the Spark service page, click the Configure tab.
        3. Search for the spark.sql.extensions parameter and change its value to io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension.
      3. Save the configuration.
        1. Click Save in the upper-right corner.
        2. In the Confirm Changes dialog box, specify Description and click OK.
      4. Restart ThriftServer.
        1. Choose Actions > Restart ThriftServer in the upper-right corner.
        2. In the Cluster Activities dialog box, specify Description and click OK.
        3. In the Confirm message, click OK.
    • Job-level configuration
      When you configure a Spark Shell or Spark SQL job, add the following option to the job command:
      --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension

      For more information about how to configure a Spark job, see Configure a Spark Shell job or Configure a Spark SQL job.

  2. Check whether query acceleration is enabled.
    1. Access the web UI of Spark History Server.
    2. On the SQL tab of Spark, view information about the related task.
      If JindoDataSourceV2Scan appears, query acceleration is enabled. Otherwise, check the configurations in Step 1.
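
The job-level configuration and the check in Step 2 can be sketched in Python as follows. This is an illustrative sketch, not an EMR or Spark API: the function names are hypothetical, and acceleration_enabled assumes that you have copied the plan text from the SQL tab into a string. The extension class name, the memoryOverhead recommendation, and the JindoDataSourceV2Scan node name come from the steps above.

```python
JINDO_EXTENSION = "com.aliyun.emr.sql.JindoTableExtension"
SCAN_NODE = "JindoDataSourceV2Scan"


def with_jindo_extension(current_value):
    """Append the JindoTable extension to a spark.sql.extensions value once."""
    extensions = [e.strip() for e in current_value.split(",") if e.strip()]
    if JINDO_EXTENSION not in extensions:
        extensions.append(JINDO_EXTENSION)
    return ",".join(extensions)


def acceleration_conf_args(current_extensions, memory_overhead="4g"):
    """Build the --conf options for a job-level configuration."""
    return [
        "--conf",
        "spark.sql.extensions=" + with_jindo_extension(current_extensions),
        "--conf",
        "spark.executor.memoryOverhead=" + memory_overhead,
    ]


def acceleration_enabled(plan_text):
    """Return True if the plan text mentions the JindoTable scan node."""
    return SCAN_NODE in plan_text


if __name__ == "__main__":
    # Print the options to append to a Spark Shell or Spark SQL command.
    args = acceleration_conf_args("io.delta.sql.DeltaSparkSessionExtension")
    print(" ".join(args))
```

Appending to the existing value instead of overwriting it keeps extensions that are already configured, such as the Delta extension, in place.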

Improve the performance of Presto

Notice: Presto runs queries at high concurrency, and query acceleration uses off-heap memory. To use the query acceleration feature, make sure that the available memory is greater than 10 GB.
By default, the Presto service includes a built-in catalog for the native engine named hive-acc. To enable query acceleration, connect to the hive-acc catalog.
Example:
presto --server https://emr-header-1.cluster-xxx:7778/ --catalog hive-acc --schema default
In this example, emr-header-1.cluster-xxx is the hostname of the emr-header-1 node.
Note: Query acceleration in Presto does not support complex data types such as MAP, STRUCT, and ARRAY.

Improve the performance of Hive

Notice: If you have high requirements for job stability, we recommend that you do not enable query acceleration.
You can use one of the following methods to improve the performance of Hive:
  • Use the EMR console
    On the Configure tab of the Hive service page, search for the hive.jindotable.native.enabled parameter and change its value to true. Then, save the configuration and restart the Hive service. This method is suitable for Hive on MapReduce and Hive on Tez jobs.
  • Use the CLI
    Set hive.jindotable.native.enabled to true in the CLI to enable query acceleration. In EMR V3.35.0 and later, the query acceleration plug-in for Parquet files is deployed in JindoTable by default, so you can set this parameter directly.
    set hive.jindotable.native.enabled=true;
Note: Query acceleration in Hive does not support complex data types such as MAP, STRUCT, and ARRAY.
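
Because complex data types are not supported in Hive or Presto, it can help to scan a table schema for them before you rely on query acceleration. The following Python code is a minimal sketch under stated assumptions: the schema is given as a dictionary that maps column names to Hive-style type strings, and the function name and the sample schema are hypothetical.

```python
# Type names taken from the notes in the Presto and Hive sections.
COMPLEX_TYPE_PREFIXES = ("map", "struct", "array")


def complex_type_columns(schema):
    """Return the names of columns declared as MAP, STRUCT, or ARRAY."""
    return [
        name
        for name, type_name in schema.items()
        if type_name.strip().lower().startswith(COMPLEX_TYPE_PREFIXES)
    ]


if __name__ == "__main__":
    # Hypothetical schema for illustration only.
    sample_schema = {
        "id": "bigint",
        "tags": "array<string>",
        "attrs": "map<string,string>",
    }
    print(complex_type_columns(sample_schema))
```

An empty result means that none of the listed complex types were found in the schema. The other limits in this topic still apply.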