Apache Hive is a distributed SQL query engine that is widely used in the Hadoop ecosystem. Hive manages data in databases, tables, and partitions. You can specify the storage location to query data at the storage backend.
JindoFS configuration
For example, a namespace named emr-jfs is created with the following configuration:
- jfs.namespaces=emr-jfs
- jfs.namespaces.emr-jfs.oss.uri=oss://oss-bucket/oss-dir
- jfs.namespaces.emr-jfs.mode=block
Specify the storage location for data warehouses, databases, tables, or partitions
- Specify the storage location for a data warehouse
The hive-site configuration file contains the hive.metastore.warehouse.dir parameter. This parameter specifies the directory in which a Hive data warehouse stores data. For example, set this parameter to
jfs://emr-jfs/user/hive/warehouse
. - Specify the storage location for a database
A Hive database has a storage location. The location is also used as the default storage location of tables in the database. The parameter that specifies the storage location is optional when you create a database. By default, the storage location of a database is the value of the hive.metastore.warehouse.dir parameter in the hive-site file added with the database name. You can execute the following SQL statements to specify the storage location:
- Set the storage location to a directory in JindoFS when you create a database:
CREATE DATABASE database_name LOCATION 'jfs://namespace/database_dir';
For example, to create a Hive database named database_on_jindofs and set the storage location to
jfs://emr-jfs/warehouse/database_on_jindofs
, execute the following statement:CREATE DATABASE database_on_jindofs LOCATION 'jfs://emr-jfs/hive/warehouse/database_on_jindofs';
- Change the storage location of a database to a directory in JindoFS:
- Execute the following SHOW CREATE statement to query the storage location of a database:
SHOW CREATE DATABASE database_name;
- View the returned query result. By default, the storage location of a database is
that of its data warehouse added with the database name.
CREATE DATABASE `database_name` LOCATION 'hdfs://emr-jfs/user/hive/warehouse/database_name.db'
- Change the storage location to a directory in JindoFS. This operation does not affect
existing tables. If you do not specify the storage location for a new table, the changed
location is used as the default storage location of the table.
For example, to query data in a partition in the jfs_table_name table, execute the following statement:
ALTER DATABASE database_name SET LOCATION jfs_path;
- Execute the following SHOW CREATE statement to query the storage location of a database:
- Set the storage location to a directory in JindoFS when you create a database:
- Specify the storage location for a table or partition
Similar to the storage location of a database, the storage location of a table or partition is also specified based on the upper-level storage location. Data in a non-partitioned table is stored in the storage location of the table. Data in a partitioned table is stored in the storage location of respective partitions. You can execute the following SQL statements to specify the storage location:
- Set the storage location to a directory in JindoFS when you create a table:
CREATE [EXTERNAL] TABLE table_name [(col_name data_type,...)] LOCATION 'jfs://emr-jfs/database_dir/table_dir';
- Change the storage location of a table or partition to a directory in JindoFS:
- Execute the following DESCRIBE statement to query the storage location of a table
or partition:
DESCRIBE FORMATTED [PARTITION partition_spec] table_name;
- Change the storage location to a directory in JindoFS:
ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "jfs_path";
For example, to query data in a partition in the jfs_table_name table, execute the following statement:
DESCRIBE FORMATTED jfs_table_name PARTITION (partition_key1=123,partition_key2='xxxx');
- Execute the following DESCRIBE statement to query the storage location of a table
or partition:
- Set the storage location to a directory in JindoFS when you create a table:
Query data in a scratch directory
Hive stores temporary output files and job plans to a scratch directory. You can set the hive.exec.scratchdir parameter in the hive-site configuration file to a directory in JindoFS. You can also run one of the following commands to specify this parameter:
bin/hive --hiveconf hive.exec.scratchdir=jfs://emr-jfs/scratch_dir
and
set hive.exec.scratchdir=jfs://emr-jfs/scratch_dir;