Apache Hive is a distributed SQL query engine widely used in the Hadoop ecosystem. Hive manages data in databases, tables, and partitions. You can specify the storage location to query data at the storage backend.

JindoFS configuration

For example, a namespace named emr-jfs is created with the following configuration:

  • jfs.namespaces=emr-jfs
  • jfs.namespaces.emr-jfs.uri=oss://oss-bucket/oss-dir
  • jfs.namespaces.emr-jfs.mode=block

Specify the storage location for data warehouses, databases, tables, or partitions

  • Specify the storage location for a data warehouse

    The hive-site configuration file contains the hive.metastore.warehouse.dir parameter. This parameter specifies the directory in which a Hive data warehouse stores data. For example, set this parameter to jfs://emr-jfs/user/hive/warehouse.

  • Specify the storage location for a database

    A Hive database has a storage location. The location is also used as the default storage location of tables in the database. The parameter that specifies the storage location is optional when you create a database. By default, the storage location of a database is the value of the hive.metastore.warehouse.dir parameter in the hive-site file added with the database name. You can run the following SQL statements to specify the storage location.

    • Set the storage location to a directory in JindoFileSystem (JindoFS) when you create a database:
      CREATE DATABASE database_name
      LOCATION
      'jfs://namespace/database_dir';

      For example, to create a Hive database named database_on_jindofs and set the storage location to jfs://emr-jfs/warehouse/database_on_jindofs, run the following statement:

      CREATE DATABASE database_on_jindofs
      LOCATION
      'jfs://emr-jfs/hive/warehouse/database_on_jindofs';
    • Modify the storage location of a database to a directory in JindoFS:
      1. Run the following SHOW CREATE statement to query the storage location of a database:
        SHOW CREATE DATABASE database_name;
      2. View the returned storage location. By default, the storage location of a database is that of its data warehouse added with the database name.
        CREATE DATABASE `database_name`
        LOCATION
        'hdfs://emr-jfs/user/hive/warehouse/database_name.db'
      3. Modify the storage location to a directory in JindoFS. This operation does not affect existing tables. If you do not specify the storage location for a new table, the modified location is used as the default storage location of the table.
        ALTER DATABASE database_name SET LOCATION jfs_path;
  • Specify the storage location for a table or partition

    Similar to the storage location of a database, that of a table or partition is also specified based on the upper-level storage location. Data in a non-partitioned table is stored in the storage location of the table. Data in a partitioned table is stored in the storage location of respective partitions. You can run the following SQL statements to specify the storage location.

    • Set the storage location to a directory in JindoFS when you create a database:
      CREATE DATABASE database_name
      LOCATION
      'jfs://namespace/database_dir';

      For example, to create a Hive database named database_on_jindofs and set the storage location to jfs://emr-jfs/warehouse/database_on_jindofs, run the following statement:

      CREATE DATABASE database_on_jindofs
      LOCATION
      'jfs://emr-jfs/hive/warehouse/database_on_jindofs';
    • Modify the storage location of a database to a directory in JindoFS:
      1. Run the following SHOW CREATE statement to query the storage location of a database:
        SHOW CREATE DATABASE database_name;
      2. View the returned storage location. By default, the storage location of a database is that of its data warehouse added with the database name.
        CREATE DATABASE `database_name`
        LOCATION
        'hdfs://emr-jfs/user/hive/warehouse/database_name.db'
      3. Modify the storage location to a directory in JindoFS. This operation does not affect existing tables. If you do not specify the storage location for a new table, the modified location is used as the default storage location of the table.
        ALTER DATABASE database_name SET LOCATION jfs_path;

Query data in the scratch directory

Hive stores temporary output files and job plans to the scratch directory. You can set the hive.exec.scratchdir parameter in the hive-site configuration file to a directory in JindoFS. You can also set the parameter by running the following commands:

bin/hive --hiveconf hive.exec.scratchdir=jfs://emr-jfs/scratch_dir

or

set hive.exec.scratchdir=jfs://emr-jfs/scratch_dir;