When you build an offline data warehouse by using Hive, a traditional data warehouse based on HDFS may no longer meet your requirements in a cost-effective way as the data volume grows. In this case, you can use OSS-HDFS as the underlying storage of the Hive data warehouse and use JindoSDK to achieve better read and write performance.
Prerequisites
An ECS instance is purchased. This topic uses an Alibaba Cloud ECS instance as an example. For more information, see Purchase an ECS instance.
OSS-HDFS is enabled for a bucket and permissions are granted to the RAM role to access OSS-HDFS. For more information, see Enable OSS-HDFS and grant access permissions.
A Hive client is deployed.
Procedure
Connect to an ECS instance. For more information, see Connect to an instance.
Configure JindoSDK.
Download the latest version of the JindoSDK JAR package. For the download address, see GitHub.
Decompress the JindoSDK JAR package.
The following sample code provides an example of how to decompress a package named jindosdk-x.x.x-linux.tar.gz. If you use a different version of JindoSDK, replace the package name with the name of the corresponding JAR package.
tar zxvf jindosdk-x.x.x-linux.tar.gz
Note: x.x.x indicates the version number of the JindoSDK JAR package.
Copy the downloaded JindoSDK JAR package to the classpath of Hive.
cp jindosdk-x.x.x-linux/lib/*.jar $HIVE_HOME/lib/
Configure the OSS-HDFS implementation class and the AccessKey pair.
Configure the OSS-HDFS implementation class in the core-site.xml file of Hive.
<configuration>
    <property>
        <name>fs.AbstractFileSystem.oss.impl</name>
        <value>com.aliyun.jindodata.oss.JindoOSS</value>
    </property>
    <property>
        <name>fs.oss.impl</name>
        <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
    </property>
</configuration>
Configure the AccessKey ID and AccessKey secret of the bucket for which OSS-HDFS is enabled in the core-site.xml file of Hive.
<configuration>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>LTAI********</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>KZo1********</value>
    </property>
</configuration>
Configure the OSS-HDFS endpoint.
When you use OSS-HDFS to access an OSS bucket, you must configure an endpoint. The recommended access path format is oss://<Bucket>.<Endpoint>/<Object>, for example, oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt. After the configuration is complete, JindoSDK uses the endpoint in the access path to call the corresponding OSS-HDFS operation.
You can also configure the OSS-HDFS endpoint by using other methods. Endpoints that are configured by using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure an endpoint.
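For reference, one of those alternative methods is to set a global endpoint in the core-site.xml file of Hive. The following snippet is only a minimal sketch that assumes the fs.oss.endpoint property and the cn-shanghai OSS-HDFS endpoint; verify the property name and its priority against Appendix 1 before you use it.
<configuration>
    <!-- Hypothetical global endpoint configuration; see Appendix 1 for how it
         interacts with the endpoint specified in the access path. -->
    <property>
        <name>fs.oss.endpoint</name>
        <value>cn-shanghai.oss-dls.aliyuncs.com</value>
    </property>
</configuration>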
Important: After you complete the preceding configurations, you must restart Hive for the configurations to take effect.
Run the following command on the terminal to connect to Hive.
For more information about other connection methods, see Hive connection methods.
hive
Store data using OSS-HDFS.
When you create a database or a table, you can use one of the following two methods to specify an OSS-HDFS path to store the data of the database or table in OSS-HDFS.
Method 1: Specify the OSS-HDFS path in the command
Specify the OSS-HDFS path when you create a database
CREATE DATABASE db_on_oss1 LOCATION 'oss://bucket_name.endpoint_name/path/to/db1';
Specify the OSS-HDFS path when you create a table
CREATE TABLE db2.table_on_oss ( id INT, name STRING, age INT ) LOCATION 'oss://bucket_name.endpoint_name/path/to/db2/tablepath';
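As a quick check of Method 1, the following HiveQL statements are a minimal sketch that assumes the db2.table_on_oss table created above and Hive 0.14 or later for INSERT ... VALUES; they write one row and confirm that the table data resides in the OSS-HDFS location.
-- Write a sample row into the table whose data is stored in OSS-HDFS.
INSERT INTO TABLE db2.table_on_oss VALUES (1, 'test_user', 30);
-- Read the row back from OSS-HDFS.
SELECT id, name, age FROM db2.table_on_oss;
-- Show the table metadata; the Location field should point to the OSS-HDFS path.
DESCRIBE FORMATTED db2.table_on_oss;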
Method 2: Specify the OSS-HDFS path in the configuration file
You can set hive.metastore.warehouse.dir to an OSS-HDFS path in the hive-site.xml configuration file of Hive Metastore and then restart Hive Metastore. By default, the databases and tables that are created later are stored in the specified OSS-HDFS path.
The following sample code provides an example of the configuration:
<configuration>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>oss://bucket_name.endpoint_name/path/to/warehouse</value>
    </property>
</configuration>
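As a quick check of Method 2, the following HiveQL statements are a minimal sketch that assumes hive.metastore.warehouse.dir is set as shown above and uses a hypothetical database name db_on_oss2; the reported location should fall under the configured OSS-HDFS warehouse path.
-- Create a database without an explicit LOCATION clause.
CREATE DATABASE db_on_oss2;
-- The Location value should be under oss://bucket_name.endpoint_name/path/to/warehouse.
DESCRIBE DATABASE db_on_oss2;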
Add partitions to an existing table.
You can add partitions to an existing table to divide it into smaller storage units. When a query specifies partition conditions, only the partitions that meet the conditions are scanned, which avoids a full table scan and significantly improves query performance.
Command format
ALTER TABLE <table_name> ADD [IF NOT EXISTS] PARTITION <pt_spec> [PARTITION <pt_spec> PARTITION <pt_spec>...] LOCATION 'location';
The following table describes the parameters.
Parameter | Required or optional | Description
table_name | Required | The name of the table to which you want to add partitions.
IF NOT EXISTS | Optional | If you do not specify IF NOT EXISTS and a partition with the same name already exists, the operation fails and an error is returned.
pt_spec | Required | The partition to add. The format is (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...). In this format, partition_col indicates the partition field and partition_col_value indicates the partition value. Partition fields are case-insensitive, and partition values are case-sensitive.
location | Required | The OSS path in which the partition is stored.
Example
The following example shows how to add a partition to the sale_detail table to store the sales records of the China (Hangzhou) region in December 2021. The partition is stored in the specified OSS path.
ALTER TABLE sale_detail ADD IF NOT EXISTS PARTITION (sale_date='202112', region='hangzhou') LOCATION 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/path/2021/';
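To illustrate the partition pruning described above, the following HiveQL statements are a minimal sketch that assumes the sale_detail table and the partition added in the preceding example; the query filters on the partition columns so that only the matching partition is scanned.
-- List the partitions of the table, including the newly added one.
SHOW PARTITIONS sale_detail;
-- Filter on the partition columns so that only the matching partition is scanned.
SELECT * FROM sale_detail WHERE sale_date = '202112' AND region = 'hangzhou';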