Object Storage Service: Use JindoSDK with Hive to process data stored in OSS-HDFS

Last Updated: Aug 06, 2025

When you build an offline data warehouse with Hive, a traditional HDFS-based data warehouse may become difficult to scale cost-effectively as the data volume grows. In this case, you can use OSS-HDFS as the underlying storage of the Hive data warehouse and use JindoSDK to achieve better read and write performance.

Prerequisites

  • An ECS instance is created, and Hive is deployed on the instance.
  • OSS-HDFS is enabled for a bucket, and an AccessKey pair that can be used to access the bucket is obtained.

Procedure

  1. Connect to an ECS instance. For more information, see Connect to an instance.

  2. Configure JindoSDK.

    1. Download the latest version of the JindoSDK package. For the download address, see GitHub.

    2. Decompress the JindoSDK package.

      The following sample code provides an example of how to decompress a package named jindosdk-x.x.x-linux.tar.gz. If you use a different version of JindoSDK, replace the package name with the actual package name.

      tar zxvf jindosdk-x.x.x-linux.tar.gz
      Note

      x.x.x indicates the version number of the JindoSDK package.

    3. Copy the JAR files in the decompressed JindoSDK package to the classpath of Hive.

      cp jindosdk-x.x.x-linux/lib/*.jar  $HIVE_HOME/lib/
  3. Configure the OSS-HDFS implementation class and the AccessKey pair.

    1. Configure the OSS-HDFS implementation class in the core-site.xml file of Hive.

      <configuration>
          <property>
              <name>fs.AbstractFileSystem.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOSS</value>
          </property>
      
          <property>
              <name>fs.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
          </property>
      </configuration>
    2. In the core-site.xml file of Hive, configure the AccessKey ID and AccessKey secret that are used to access the bucket for which OSS-HDFS is enabled.

      <configuration>
          <property>
              <name>fs.oss.accessKeyId</name>
              <value>LTAI********</value>
          </property>
      
          <property>
              <name>fs.oss.accessKeySecret</name>
              <value>KZo1********</value>
          </property>
      </configuration>
  4. Configure the OSS-HDFS endpoint.

    When you use OSS-HDFS, you must configure an endpoint. We recommend that you specify the endpoint in the access path in the oss://<Bucket>.<Endpoint>/<Object> format, for example, oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt. After the configuration is complete, JindoSDK accesses the corresponding OSS-HDFS operations based on the endpoint that is specified in the access path.

    You can also configure the OSS-HDFS endpoint using other methods. Endpoints that are configured using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure an endpoint.
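
    For reference, one of these alternative methods is to set a default endpoint in the core-site.xml file so that access paths only need to contain the bucket name. The following snippet is a minimal sketch; the fs.oss.endpoint key and the cn-shanghai value are assumptions that you should verify against the methods and priorities described in the appendix.

    <configuration>
        <!-- Assumed default-endpoint key: with this setting, a path such as
             oss://examplebucket/exampleobject.txt resolves to this endpoint
             instead of an endpoint embedded in the path. -->
        <property>
            <name>fs.oss.endpoint</name>
            <value>cn-shanghai.oss-dls.aliyuncs.com</value>
        </property>
    </configuration>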

    Important

    After you complete the preceding configurations, you must restart Hive for the configurations to take effect.

  5. Run the following command in the terminal to connect to Hive.

    For more information about other connection methods, see Hive connection methods.

    hive
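
    Alternatively, if a HiveServer2 instance is running, you can connect to it by using Beeline. The following command is a minimal sketch; it assumes that HiveServer2 listens on the default port 10000 on the local host, so replace the connection URL and user name with your own values.

    beeline -u jdbc:hive2://localhost:10000 -n <username>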
  6. Store data using OSS-HDFS.

    When you create a database or a table, you can use one of the following methods to specify an OSS-HDFS path as the storage location of the database or table. A sample verification query is provided after the two methods.

    • Method 1: Specify the OSS-HDFS path in the command

      • Specify the OSS-HDFS path when you create a database

        CREATE DATABASE db_on_oss1 LOCATION 'oss://bucket_name.endpoint_name/path/to/db1';
      • Specify the OSS-HDFS path when you create a table

        CREATE TABLE db2.table_on_oss (
            id INT,
            name STRING,
            age INT
        ) LOCATION 'oss://bucket_name.endpoint_name/path/to/db2/tablepath';
    • Method 2: Specify the OSS-HDFS path in the configuration file

      You can set hive.metastore.warehouse.dir to an OSS-HDFS path in the hive-site.xml configuration file of Hive Metastore and then restart Hive Metastore. By default, the databases and tables that are created later are stored in the specified OSS-HDFS path.

      The following sample code provides an example of the configuration:

      <configuration>
      
          <property>
              <name>hive.metastore.warehouse.dir</name>
              <value>oss://bucket_name.endpoint_name/path/to/warehouse</value>
          </property>
      
      </configuration>
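
    After the database or table is created, you can verify that data can be written to and read from OSS-HDFS. The following statements are a minimal sketch that reuses the db_on_oss1 database from Method 1; the table name and sample rows are hypothetical.

    -- Switch to the database whose location points to OSS-HDFS.
    USE db_on_oss1;
    -- The table inherits the database location, so its data files are stored in OSS-HDFS.
    CREATE TABLE IF NOT EXISTS sample_tbl (id INT, name STRING);
    -- Write a few sample rows to OSS-HDFS through JindoSDK.
    INSERT INTO TABLE sample_tbl VALUES (1, 'alice'), (2, 'bob');
    -- Read the rows back to confirm that reads succeed.
    SELECT * FROM sample_tbl;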
  7. Add partitions to an existing table.

    You can add partitions to an existing table to divide the table into smaller storage units. When a query filters on the partition columns, only the matching partitions are scanned instead of the full table, which significantly improves query performance (see the query sketch after the example below).

    • Command format

      ALTER TABLE <table_name> ADD [IF NOT EXISTS] PARTITION <pt_spec> LOCATION 'location' [PARTITION <pt_spec> LOCATION 'location' ...];

      The following list describes the parameters.

      • table_name: Required. The name of the table to which you want to add partitions.
      • IF NOT EXISTS: Optional. If you do not specify IF NOT EXISTS and a partition with the same name already exists, the operation fails and an error is returned.
      • pt_spec: Required. The partition to add. The format is (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...). In this format, partition_col indicates the partition field, and partition_col_value indicates the partition value. Partition fields are case-insensitive, and partition values are case-sensitive.
      • location: Required. The OSS path in which the partition is stored.

    • Example

      The following example shows how to add a partition to the sale_detail table to store the sales records of the China (Hangzhou) region in December 2021. The partition is stored in the specified OSS path.

      ALTER TABLE sale_detail ADD IF NOT EXISTS PARTITION (sale_date='202112', region='hangzhou') LOCATION 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/path/2021/';
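
      When you query the table, filtering on the partition columns lets Hive scan only the matching partitions instead of the full table. The following query is a minimal sketch against the sale_detail example above.

      -- Only the (sale_date='202112', region='hangzhou') partition is scanned.
      SELECT * FROM sale_detail WHERE sale_date = '202112' AND region = 'hangzhou';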