
Object Storage Service:Use JindoSDK with Spark to query data in OSS-HDFS

Last Updated:Aug 06, 2025

JindoSDK is an easy-to-use OSS client developed for the Hadoop and Spark ecosystems. It provides a highly optimized Hadoop FileSystem implementation for OSS. When you use JindoSDK with Spark to query data in OSS-HDFS, you get better performance than with the Hadoop community OSS client.

Prerequisites

  • An ECS instance is created.

  • OSS-HDFS is enabled for a bucket.

Procedure

  1. Connect to an ECS instance. For more information, see Connect to an instance.

  2. Configure JindoSDK.

    1. Download the latest version of the JindoSDK JAR package. For the download link, see GitHub.

    2. Decompress the JindoSDK JAR package.

      The following sample code shows how to decompress a package named jindosdk-x.x.x-linux.tar.gz. If you use a different version of JindoSDK, replace the package name with the actual name of your package.

      tar zxvf jindosdk-x.x.x-linux.tar.gz
      Note

      x.x.x indicates the version number of the JindoSDK JAR package.

    3. Copy the JindoSDK JAR package to the Spark classpath.

      cp jindosdk-x.x.x-linux/lib/*.jar $SPARK_HOME/jars/
  3. Configure the OSS-HDFS implementation class and the AccessKey pair.

    • Configure the settings in the core-site.xml file:

      1. Configure the implementation class for the OSS-HDFS service in the core-site.xml file of Spark.

        <configuration>
            <property>
                <name>fs.AbstractFileSystem.oss.impl</name>
                <value>com.aliyun.jindodata.oss.OSS</value>
            </property>
        
            <property>
                <name>fs.oss.impl</name>
                <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
            </property>
        </configuration>
      2. Configure the AccessKey ID and AccessKey secret for the bucket where OSS-HDFS is enabled in the core-site.xml file of Spark.

        <configuration>
            <property>
                <name>fs.oss.accessKeyId</name>
                <value>LTAI********</value>
            </property>
        
            <property>
                <name>fs.oss.accessKeySecret</name>
                <value>KZo1********</value>
            </property>
        </configuration>
    • Alternatively, you can configure the settings when you submit a task.

      The following sample code shows how to configure the OSS-HDFS implementation class and the AccessKey pair when you submit a Spark task:

      spark-submit \
        --conf spark.hadoop.fs.AbstractFileSystem.oss.impl=com.aliyun.jindodata.oss.OSS \
        --conf spark.hadoop.fs.oss.impl=com.aliyun.jindodata.oss.JindoOssFileSystem \
        --conf spark.hadoop.fs.oss.accessKeyId=LTAI******** \
        --conf spark.hadoop.fs.oss.accessKeySecret=KZo149BD9GLPNiDIEmdQ7d****
  4. Configure the endpoint of OSS-HDFS.

    You must specify the endpoint of OSS-HDFS when you use it to access buckets in Object Storage Service (OSS). We recommend that you specify the access path in the oss://<Bucket>.<Endpoint>/<Object> format, for example, oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt. After you configure the access path, JindoSDK calls the corresponding OSS-HDFS operation based on the endpoint in the access path.

    You can also configure the endpoint of OSS-HDFS by using other methods. The endpoints that are configured by using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
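    As a sketch, the recommended access path can be assembled from its three parts. The bucket, endpoint, and object names below are the placeholder values from the example above; replace them with your own.

```shell
# Placeholder values taken from the example above; substitute your own.
BUCKET=examplebucket
ENDPOINT=cn-shanghai.oss-dls.aliyuncs.com
OBJECT=exampleobject.txt

# Assemble the oss://<Bucket>.<Endpoint>/<Object> access path.
ACCESS_PATH="oss://${BUCKET}.${ENDPOINT}/${OBJECT}"
echo "$ACCESS_PATH"
# prints oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt
```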

  5. Use Spark to access OSS-HDFS.

    1. Create a table.

      create table test_oss (c1 string) location "oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/";
    2. Insert data into the table.

      insert into table test_oss values ("testdata");
    3. Query the table.

      select * from test_oss;
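      The three statements above can also be collected into a script and run non-interactively with spark-sql -f. This is a sketch: the bucket name and region endpoint are the placeholders from the examples above, and it assumes Spark is configured as described in the previous steps.

```shell
# Write the statements from the steps above into a script file.
# The bucket name and endpoint are placeholders; replace them with your own.
cat > query_oss.sql <<'EOF'
create table test_oss (c1 string) location 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/';
insert into table test_oss values ('testdata');
select * from test_oss;
EOF

# Run the script against your configured Spark installation (not executed here):
# spark-sql -f query_oss.sql
```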