How to use HiveJindoSDK to process data in OSS-HDFS - Object Storage Service

If you use Hive to build a traditional offline data warehouse that uses HDFS to store data, the data warehouse can be costly as the amount of data stored in the warehouse increases. In this case, you can use OSS-HDFS (JindoFS) as the underlying storage of the Hive data warehouse and use JindoSDK to obtain better read and write performance.

Prerequisites

An Elastic Compute Service (ECS) instance is created. For more information, see Create an instance.
OSS-HDFS is enabled for a bucket and permissions are granted to access OSS-HDFS. For more information, see Enable OSS-HDFS and grant access permissions.
A Hive client is deployed.

Procedure

Connect to the ECS instance. For more information, see Connect to an instance.
Configure JindoSDK.
1. Download the latest version of the JindoSDK JAR package. For more information, visit GitHub.
2. Decompress the JindoSDK JAR package.
  The following sample code provides an example on how to decompress a package named jindosdk-x.x.x-linux.tar.gz. If you use another version of JindoSDK, replace the package name with the name of the corresponding JAR package.
```
tar zxvf jindosdk-x.x.x-linux.tar.gz
```
  Note
  x.x.x indicates the version number of the JindoSDK JAR package.
3. Optional: If Kerberos-related and SASL-related dependencies are not included in your environment, install the following dependencies on all nodes on which JindoSDK is deployed.
  - Ubuntu or Debian
```
sudo apt-get install libkrb5-dev krb5-admin-server krb5-kdc krb5-user libsasl2-dev libsasl2-modules libsasl2-modules-gssapi-mit
```
  - Red Hat Enterprise Linux or CentOS
```
sudo yum install krb5-server krb5-workstation cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain
```
  - macOS
```
brew install krb5
```
4. Copy the JindoSDK JAR package to the classpath path of Hive.
```
cp jindosdk-x.x.x-linux/lib/*.jar  $HIVE_HOME/lib/
```

Configure the implementation class of OSS-HDFS and the AccessKey pair used to access a bucket

Configure the implementation class of OSS-HDFS in the core-site.xml file of Hive.

<configuration>
    <property>
        <name>fs.AbstractFileSystem.oss.impl</name>
        <value>com.aliyun.jindodata.oss.JindoOSS</value>
    </property>

    <property>
        <name>fs.oss.impl</name>
        <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
    </property>
</configuration>

In the core-site.xml file of Hive, specify the AccessKey ID and AccessKey secret that you want to use to access the bucket for which OSS-HDFS is enabled.

<configuration>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>LTAI********</value>
    </property>

    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>KZo1********</value>
    </property>
</configuration>

Configure the endpoint of OSS-HDFS.
You must specify the endpoint of OSS-HDFS if you want to use OSS-HDFS to access buckets in Object Storage Service (OSS). We recommend that you configure the path that is used to access OSS-HDFS in the oss://<Bucket>.<Endpoint>/<Object> format (example: oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt). After you configure the endpoint, JindoSDK accesses the corresponding OSS-HDFS operation based on the specified endpoint in the access path.
You can also configure the endpoint of OSS-HDFS by using other methods. The endpoints that are configured by using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
Important
After you complete the preceding configurations, you must restart Hive for the configurations to take effect.
Run the following command on the terminal to connect to Hive.
For other connection methods, see Connect to Hive.
```
hive
```
Use OSS-HDFS to store data.
When you create a database or a table, you can use one of the following methods to specify an OSS-HDFS path as the storage path of the database or table:
- Method 1: Specify the OSS-HDFS storage path in the command
  - Specify the OSS-HDFS storage path when you create a database:
```
CREATE DATABASE db_on_oss1 LOCATION 'oss://bucket_name.endpoint_name/path/to/db1';
```
  - Specify the OSS-HDFS storage path when you create a table:
```
CREATE TABLE db2.table_on_oss (
    id INT,
    name STRING,
    age INT
) LOCATION 'oss://bucket_name.endpoint_name/path/to/db2/tablepath';
```
- Method 2: Specify the OSS-HDFS storage path in the Hive configuration file
  You can set the value of hive.metastore.warehouse.dir to the OSS-HDFS storage path in the hive-site.xml configuration file of Hive Metastore and restart Hive Metastore. Databases and tables that are later created are stored in the specified OSS-HDFS storage path.
  The following sample code provides an example on how to set hive.metastore.warehouse.dir to the OSS-HDFS storage path:
```
<configuration>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>oss://bucket_name.endpoint_name/path/to/warehouse</value>
    </property>

</configuration>
```

Add partitions to an existing table.

You can add partitions to an existing table to store the data of the table in smaller units. You can specify query conditions based on partitions. This way, only partitions that meet the specified conditions are scanned and the query performance is improved.

Command syntax

ALTER TABLE <table_name> ADD [IF NOT EXISTS] PARTITION <pt_spec> [PARTITION <pt_spec> PARTITION <pt_spec>...] LOCATION 'location';

The following table describes the parameters in the preceding command.

Parameter	Required	Description
table_name	Yes	The name of the table to which you want to add partitions.
IF NOT EXISTS	No	The parameter that causes the error to be skipped if the table contains a partition with the same name. If the IF NOT EXISTS parameter is not specified and a partition whose name is the same as the partition that you want to add already exists, an operation failure occurs and an error is returned.
pt_spec	Yes	The partitions that you want to add. The value of this parameter is in the `(partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...)` format. In this format, `partition_col` is the names of partition key columns, and `partition_col_value` is their values. The names of partition key columns are case-insensitive and their values are case-sensitive.
location	Yes	The OSS path that is used to store data in a partition.

Examples
The following sample code provides an example on how to add a partition to a table named sale_detail to store the sale records in the China (Hangzhou) region in December 2021 and specify the OSS path that is used to store data in the partition:
```
ALTER TABLE sale_detail ADD IF NOT EXISTS PARTITION (sale_date='202112', region='hangzhou') LOCATION 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/path/2021/';
```