E-MapReduce:Enable query acceleration for ORC files

Last Updated:Dec 21, 2023

JindoTable provides a native Optimized Row Columnar (ORC) reader that accelerates queries on ORC files. Query acceleration is disabled by default. Enabling it can improve the performance of Spark and Presto when they read ORC files.

Prerequisites

ORC files are stored in JindoFS or Object Storage Service (OSS).

Note

Query acceleration is not supported for ORC files stored in HDFS.

Improve the read performance of Spark

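  1. Enable query acceleration. To enable it for all Spark jobs, follow the steps in the "Configure the global parameter of Spark" section.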

    Note

    Acceleration takes effect only when Spark reads ORC files through the DataFrame API or the Spark SQL API.

  2. Check whether query acceleration is enabled.

    1. Access the web UI of Spark History Server.

    2. On the SQL tab of Spark, view the execution task.

      If JindoDataSourceV2Scan appears, query acceleration is enabled. Otherwise, check the configurations in Step 1.
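For a quick test, the extension can also be passed to a single session at launch instead of being set globally. The following is a minimal sketch, assuming the JindoTable libraries are already on the Spark classpath of the cluster. The helper name spark_sql_accelerated is hypothetical; the extensions value is the one listed for the spark.sql.extensions parameter in this document.

```shell
# Extensions value as listed for the spark.sql.extensions parameter in this document.
JINDO_EXTENSIONS="io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension"

# Hypothetical helper: starts spark-sql with the extensions set for this session only.
# spark.sql.extensions is read when the session is created, so it is passed at launch.
spark_sql_accelerated() {
  spark-sql --conf "spark.sql.extensions=${JINDO_EXTENSIONS}" "$@"
}
```

For example, `spark_sql_accelerated -e "SELECT COUNT(*) FROM default.orc_demo"` runs one query through the Spark SQL API (default.orc_demo is a placeholder table name). If acceleration took effect, the query should show JindoDataSourceV2Scan on the SQL tab as described in the steps above.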

Improve the read performance of Presto

Presto has a built-in catalog named hive-acc. Connect to Presto with the hive-acc catalog to enable query acceleration.

Example:

presto --server https://emr-header-1.cluster-xxx:7778/ --catalog hive-acc --schema default

Note

emr-header-1.cluster-xxx indicates the hostname of the emr-header-1 node.
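The command above opens an interactive session. As a sketch of non-interactive use, the same connection settings can be wrapped in a small helper and combined with the `--execute` option of the Presto CLI. The helper name presto_acc is hypothetical, and the hostname is left as the placeholder used above.

```shell
# Hostname placeholder kept as written in this document; replace it with the
# real hostname of the emr-header-1 node.
PRESTO_SERVER="https://emr-header-1.cluster-xxx:7778/"

# Hypothetical helper: runs the Presto CLI against the hive-acc catalog.
# Extra arguments (for example --execute "...") are passed through unchanged.
presto_acc() {
  presto --server "${PRESTO_SERVER}" --catalog hive-acc --schema default "$@"
}
```

For example, `presto_acc --execute "SELECT COUNT(*) FROM orc_demo"` runs one statement and exits (orc_demo is a placeholder table name).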

Configure the global parameter of Spark

  1. Go to the Spark service page.

    1. Log on to the Alibaba Cloud EMR console.

    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.

    3. Click the Cluster Management tab.

    4. On the Cluster Management page, find your cluster and click Details in the Actions column.

    5. In the left-side navigation pane, choose Cluster Service > Spark.

  2. On the Spark service page, click the Configure tab.

  3. Find the spark.sql.extensions parameter and change its value to io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension.

  4. Save the configurations.

    1. Click Save in the upper-right corner of the Service Configuration section.

    2. In the Confirm Changes dialog box, specify Description and click OK.

  5. Restart ThriftServer.

    1. Choose Actions > Restart ThriftServer in the upper-right corner.

    2. In the Cluster Activities dialog box, specify Description and click OK.

    3. In the Confirm message, click OK.
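
After ThriftServer restarts, one way to confirm that the new value is in effect is to ask a session for it. The following is a minimal sketch, assuming beeline is available on the node and can reach the Spark ThriftServer. The helper name show_spark_extensions is hypothetical, and the JDBC URL is not given in this document, so it is passed in as an argument.

```shell
# Hypothetical helper: prints the effective spark.sql.extensions value of a
# ThriftServer session.
# $1 is the JDBC URL of the Spark ThriftServer (not stated in this document).
show_spark_extensions() {
  beeline -u "$1" -e "SET spark.sql.extensions;"
}
```

Call it as `show_spark_extensions "jdbc:hive2://<host>:<port>"` with the host and port of your ThriftServer. If the change was applied, the returned value should include com.aliyun.emr.sql.JindoTableExtension.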