Use Impala and JindoSDK to query data in the OSS-HDFS service

JindoSDK is an easy-to-use Object Storage Service (OSS) client developed for the Hadoop and Spark ecosystems. It provides a highly optimized Hadoop FileSystem implementation for OSS. JindoSDK offers better performance than Hadoop community OSS clients when used with Impala to query data in the OSS-HDFS service.

Prerequisites

By default, an Alibaba Cloud account has permissions to access the OSS-HDFS service from a non-EMR cluster and perform common operations. If you want to use a Resource Access Management (RAM) user to access the OSS-HDFS service, the RAM user must be granted the required permissions. For more information, see Grant a RAM user the permissions to access the OSS-HDFS service from a non-EMR cluster.
An ECS instance has been purchased to use as the deployment environment. For more information, see Purchase an ECS instance.
A Hadoop environment has been created. For more information, see Create a Hadoop runtime environment.
The OSS-HDFS service is enabled for a bucket, and you have been granted permissions to access the service. For more information, see Enable the OSS-HDFS service.

Procedure

Connect to an ECS instance. For more information, see Connect to an instance.
Configure JindoSDK.
1. Download the latest version of the JindoSDK JAR package. For more information, see the download links on GitHub.
2. Decompress the JindoSDK JAR package.
  The following sample code shows how to decompress a package named jindosdk-x.x.x-linux.tar.gz. If you use a different version of JindoSDK, replace the package name with the actual name of your package.
```
tar zxvf jindosdk-x.x.x-linux.tar.gz
```
  Note
  x.x.x represents the version number of the JindoSDK JAR package.
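3. Copy the JAR files from the decompressed JindoSDK package to the Impala classpath. The following command copies them to $HIVE_HOME/lib/. If your Impala classpath uses a different directory, copy the files to that directory instead.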
```
cp jindosdk-x.x.x-linux/lib/*.jar $HIVE_HOME/lib/
```
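After the copy, it can help to confirm that the JAR files actually landed in the target directory. The following is a minimal sketch, not part of the official procedure: the helper name list_jindo_jars is hypothetical, and the jindo*.jar file name pattern is an assumption about how the JAR files in your package are named.

```shell
# Hypothetical helper: print every JindoSDK JAR found in a directory.
# Assumption: the JAR file names start with "jindo".
# Returns 0 if at least one JAR is found, 1 otherwise.
list_jindo_jars() {
    dir="$1"
    found=1
    for jar in "$dir"/jindo*.jar; do
        # When nothing matches, the pattern stays literal; skip it.
        [ -e "$jar" ] || continue
        echo "$jar"
        found=0
    done
    return "$found"
}

# Example call (not executed here):
#   list_jindo_jars "$HIVE_HOME/lib" || echo "no JindoSDK JAR files found"
```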

Configure the implementation class for the OSS-HDFS service and the AccessKey pair.

Configure the implementation class for the OSS-HDFS service in the Impala core-site.xml file.

```
<configuration>
    <property>
        <name>fs.AbstractFileSystem.oss.impl</name>
        <value>com.aliyun.jindodata.oss.JindoOSS</value>
    </property>

    <property>
        <name>fs.oss.impl</name>
        <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
    </property>
</configuration>
```

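In the same Impala core-site.xml file, configure the AccessKey ID and AccessKey secret used to access the bucket for which the OSS-HDFS service is enabled. Add these properties inside the existing <configuration> element, because the file can contain only one.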

```
<configuration>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>LTAI********</value>
    </property>

    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>KZo1********</value>
    </property>
</configuration>
```
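A quick way to catch a missing property before you start Impala is to search core-site.xml for the four property names used above. The following is a rough sketch under stated assumptions: the helper name check_jindo_props is hypothetical, and the check is a plain text match on the <name> elements, not a full XML validation.

```shell
# Hypothetical helper: report which of the four properties from this section
# are absent from a core-site.xml file.
# This is a plain text search, so it does not detect malformed XML.
# Returns 0 if all four property names are present, 1 otherwise.
check_jindo_props() {
    conf_file="$1"
    missing=0
    for prop in fs.AbstractFileSystem.oss.impl fs.oss.impl \
                fs.oss.accessKeyId fs.oss.accessKeySecret; do
        # -F matches the string literally; -q suppresses output.
        if ! grep -qF "<name>$prop</name>" "$conf_file"; then
            echo "missing: $prop"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (not executed here; the path is a placeholder):
#   check_jindo_props /path/to/core-site.xml
```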

Configure the endpoint of OSS-HDFS.
To access a bucket through OSS-HDFS, you must specify the OSS-HDFS endpoint. We recommend that you include the endpoint in the access path, in the oss://<Bucket>.<Endpoint>/<Object> format (example: oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt). JindoSDK then calls the corresponding OSS-HDFS operation based on the endpoint in the access path.
You can also configure the endpoint of OSS-HDFS by using other methods. The endpoints that are configured by using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
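The recommended path format can be assembled mechanically from its three parts. The following sketch illustrates the oss://<Bucket>.<Endpoint>/<Object> layout. The function name build_oss_hdfs_path is hypothetical, and the values in the example call are the sample values from this section.

```shell
# Hypothetical helper: build an OSS-HDFS access path in the
# oss://<Bucket>.<Endpoint>/<Object> format.
build_oss_hdfs_path() {
    bucket="$1"
    endpoint="$2"
    object="$3"
    # Strip one leading slash from the object so the path has a single
    # separator after the endpoint.
    printf 'oss://%s.%s/%s\n' "$bucket" "$endpoint" "${object#/}"
}

# Example call using the sample values from this section:
build_oss_hdfs_path examplebucket cn-shanghai.oss-dls.aliyuncs.com exampleobject.txt
```

The resulting string can be used wherever an access path is expected, such as the LOCATION clause of a table definition.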

Use Impala to query data in the OSS-HDFS service.

Create a table.

```
CREATE EXTERNAL TABLE customer_demographics (
 `cd_demo_sk` INT,
 `cd_gender` STRING,
 `cd_marital_status` STRING,
 `cd_education_status` STRING,
 `cd_purchase_estimate` INT,
 `cd_credit_rating` STRING,
 `cd_dep_count` INT,
 `cd_dep_employed_count` INT,
 `cd_dep_college_count` INT)
STORED AS PARQUET
LOCATION 'oss://bucket.endpoint/dir';
```

Query data in the table.
```
select * from customer_demographics;
```