
Object Storage Service:Use JindoSDK with Impala to query data in the OSS-HDFS service

Last Updated:Aug 06, 2025

JindoSDK is an easy-to-use Object Storage Service (OSS) client developed for the Hadoop and Spark ecosystems. It provides a highly optimized Hadoop FileSystem implementation for OSS. JindoSDK offers better performance than Hadoop community OSS clients when used with Impala to query data in the OSS-HDFS service.

Prerequisites

Procedure

  1. Connect to an ECS instance. For more information, see Connect to an instance.

  2. Configure JindoSDK.

    1. Download the latest version of the JindoSDK JAR package. For more information, see the download links on GitHub.

    2. Decompress the JindoSDK JAR package.

      The following sample code shows how to decompress a package named jindosdk-x.x.x-linux.tar.gz. If you use a different version of JindoSDK, replace the package name with the actual name of your package.

      tar zxvf jindosdk-x.x.x-linux.tar.gz
      Note

      x.x.x represents the version number of the JindoSDK JAR package.

    3. Copy the decompressed JindoSDK JAR packages to the Impala classpath.

      cp jindosdk-x.x.x-linux/lib/*.jar $IMPALA_HOME/lib/
  3. Configure the implementation class for the OSS-HDFS service and the AccessKey pair.

    1. Configure the implementation class for the OSS-HDFS service in the Impala core-site.xml file.

      <configuration>
          <property>
              <name>fs.AbstractFileSystem.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOSS</value>
          </property>
      
          <property>
              <name>fs.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
          </property>
      </configuration>
    2. In the Impala core-site.xml file, configure the AccessKey ID and AccessKey secret for the bucket where the OSS-HDFS service is enabled.

      <configuration>
          <property>
              <name>fs.oss.accessKeyId</name>
              <value>LTAI********</value>
          </property>
      
          <property>
              <name>fs.oss.accessKeySecret</name>
              <value>KZo1********</value>
          </property>
      </configuration>
  4. Configure the endpoint of OSS-HDFS.

    You must specify the endpoint of OSS-HDFS to access OSS buckets through OSS-HDFS. We recommend that you include the endpoint in the access path in the oss://<Bucket>.<Endpoint>/<Object> format (example: oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt). JindoSDK then uses the endpoint in the access path to call the corresponding OSS-HDFS operation.

    You can also configure the endpoint of OSS-HDFS by using other methods. The endpoints that are configured by using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
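    As a quick sanity check on the recommended path format, the following shell snippet assembles an access path from placeholder values. The bucket name, region, and object name are assumptions for illustration; substitute your own values.

    ```shell
    # Assemble an OSS-HDFS access path in the oss://<Bucket>.<Endpoint>/<Object> format.
    # The bucket, endpoint, and object names below are placeholders.
    BUCKET=examplebucket
    ENDPOINT=cn-shanghai.oss-dls.aliyuncs.com
    OBJECT=exampleobject.txt
    OSS_PATH="oss://${BUCKET}.${ENDPOINT}/${OBJECT}"
    echo "$OSS_PATH"
    # Prints: oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt
    ```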

  5. Use Impala to query data in the OSS-HDFS service.

    1. Create a table.

      CREATE EXTERNAL TABLE customer_demographics (
       `cd_demo_sk` INT,
       `cd_gender` STRING,
       `cd_marital_status` STRING,
       `cd_education_status` STRING,
       `cd_purchase_estimate` INT,
       `cd_credit_rating` STRING,
       `cd_dep_count` INT,
       `cd_dep_employed_count` INT,
       `cd_dep_college_count` INT)
      STORED AS PARQUET
      LOCATION 'oss://bucket.endpoint/dir';
    2. Query data in the table.

      SELECT * FROM customer_demographics;
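
    The query above can also be run non-interactively from the command line. The following is a minimal sketch that saves the query to a file and runs it with impala-shell; it assumes impala-shell is installed on the host and can reach your impalad node, and the LIMIT clause is added only to keep the illustrative output small.

    ```shell
    # Write the query to a file; the table name matches the example above.
    cat > query.sql <<'EOF'
    SELECT * FROM customer_demographics LIMIT 10;
    EOF
    # Run it with impala-shell (commented out here because it requires a
    # reachable impalad service):
    # impala-shell -f query.sql
    ```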