A Hive catalog is an external catalog that StarRocks has supported since v2.3.
Background
A Hive catalog lets you:
Directly query Hive data without manually creating tables.
Process and model Hive data and import it into StarRocks using INSERT INTO, or using asynchronous materialized views in v3.1 and later.
Create or delete Hive databases and tables in StarRocks. You can also write data from StarRocks tables to Hive tables in Parquet (from v3.2), ORC, or TextFile (from v3.3) format using INSERT INTO.
To access data in Hive, the StarRocks cluster must be able to access the storage system and metadata service of the Hive cluster. StarRocks currently supports the following storage systems and metadata services:
Hadoop Distributed File System (HDFS) or Alibaba Cloud Object Storage Service (OSS).
Metadata services. Currently, supported metadata services include Data Lake Formation DLF 1.0 (Legacy) and Hive Metastore (HMS).
Limitations
StarRocks supports querying Hive data in Parquet, ORC, TextFile, Avro, RCFile, and SequenceFile file formats:
Parquet files support the SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION compression formats. Support for the LZO compression format was added in v3.1.5.
ORC files support ZLIB, SNAPPY, LZO, LZ4, ZSTD, and NO_COMPRESSION compression formats.
TextFile files support the LZO compression format from v3.1.5.
StarRocks does not support querying Hive data of the INTERVAL, BINARY, or UNION data types. Additionally, for Hive tables in TextFile format, StarRocks does not support the MAP and STRUCT data types.
StarRocks supports writing data to Hive in Parquet (v3.2 and later), ORC, and TextFile (v3.3 and later) file formats:
Parquet and ORC files support NO_COMPRESSION, SNAPPY, LZ4, ZSTD, and GZIP compression formats.
TextFile supports the NO_COMPRESSION format.
You can set the compression algorithm for writing to Hive tables using the connector_sink_compression_codec system variable.
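For instance, the codec can be switched at the session level before a write (a sketch; zstd is one of the values listed above):

```sql
-- Set the compression codec used by subsequent writes to Hive tables
-- in this session; 'zstd' is one of the supported values above.
SET connector_sink_compression_codec = 'zstd';
```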
Create a Hive catalog
Syntax
CREATE EXTERNAL CATALOG <catalog_name>
[COMMENT <comment>]
PROPERTIES
(
"type" = "hive",
GeneralParams,
MetastoreParams
)
Parameters
The parameter configuration varies depending on the metadata service used by Hive.
Use HMS
catalog_name: The name of the Hive catalog. This parameter is required. The name must meet the following requirements:
It must start with a letter and can contain only letters (a-z or A-Z), numbers (0-9), and underscores (_).
The total length cannot exceed 64 characters.
comment: The description of the Hive catalog. This parameter is optional.
type: The type of the data source. Set this to hive.
GeneralParams: A set of parameters for general settings. GeneralParams includes the following parameter.
Parameter | Required | Description |
enable_recursive_listing | No | Specifies whether StarRocks recursively reads data from files in a table or partition directory, including its subdirectories. Valid values: true (default): Recursively traverses the directory. false: Reads data only from files at the current level of the table or partition directory. |
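As a sketch of how a GeneralParams entry is passed, the following statement creates a catalog with recursive listing disabled (the catalog name and metastore URI are placeholders):

```sql
CREATE EXTERNAL CATALOG hive_catalog_flat
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    -- GeneralParams entry: read only the top level of each
    -- table or partition directory.
    "enable_recursive_listing" = "false"
);
```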
MetastoreParams: Parameters related to how StarRocks accesses the metadata of the Hive cluster.
Property | Description |
hive.metastore.type | The type of the metastore. The default value is empty, which means the Hive Metastore is used. |
hive.metastore.uris | The URI of the Hive Metastore. The format is thrift://<IP_address_of_Hive_Metastore>:<port>. The default port is 9083. |
Use DLF 1.0 (Legacy)
catalog_name: The name of the Hive catalog. This parameter is required. The name must meet the following requirements:
It must start with a letter and can contain only letters (a-z or A-Z), numbers (0-9), and underscores (_).
The total length cannot exceed 64 characters.
comment: The description of the Hive catalog. This parameter is optional.
type: The type of the data source. Set this to hive.
GeneralParams: A set of parameters for general settings. GeneralParams includes the following parameter.
Parameter | Required | Description |
enable_recursive_listing | No | Specifies whether StarRocks recursively reads data from files in a table or partition directory, including its subdirectories. Valid values: true (default): Recursively traverses the directory. false: Reads data only from files at the current level of the table or partition directory. |
MetastoreParams: Parameters related to how StarRocks accesses the metadata of the Hive cluster.
Property | Description |
hive.metastore.type | The type of metadata service used by Hive. Set this to dlf. |
dlf.catalog.id | The ID of an existing data catalog in DLF 1.0. This parameter takes effect only when hive.metastore.type is set to dlf. If dlf.catalog.id is not specified, the system uses the default DLF catalog. |
Examples
The following examples use HDFS for storage and create a Hive catalog named hive_catalog.
Use HMS
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES
(
"type" = "hive",
"hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083"
);
Use DLF 1.0 (Legacy)
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES
(
"type" = "hive",
"hive.metastore.type" = "dlf",
"dlf.catalog.id" = "sr_dlf"
);
View Hive catalogs
You can use SHOW CATALOGS to query all catalogs in the current StarRocks cluster.
SHOW CATALOGS;
You can also use SHOW CREATE CATALOG to query the creation statement of an external catalog. For example, run the following command to query the creation statement for the Hive catalog hive_catalog.
SHOW CREATE CATALOG hive_catalog;
Switch between Hive catalogs and databases
You can switch to a target Hive catalog and database in the following ways:
Method 1: First, specify the Hive catalog for the current session with SET CATALOG. Then, specify the database with USE.
-- Switch the catalog for the current session.
SET CATALOG <catalog_name>;
-- Specify the database for the current session.
USE <db_name>;
Method 2: Use USE to directly switch the session to a specific database under the target Hive catalog.
USE <catalog_name>.<db_name>;
Delete a Hive catalog
You can use DROP CATALOG to delete an external catalog.
For example, run the following command to delete hive_catalog.
DROP CATALOG hive_catalog;
Create a Hive database
Similar to the StarRocks internal catalog, if you have the CREATE DATABASE permission for a Hive catalog, you can create a database in that catalog using the CREATE DATABASE statement. This feature is supported from v3.2 onwards.
You can use GRANT and REVOKE to grant and revoke permissions for users and roles.
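For example, assuming the standard syntax for catalog-level privileges, the CREATE DATABASE permission might be granted as follows (the role name is illustrative):

```sql
-- Illustrative: grant the CREATE DATABASE privilege on the catalog
-- hive_catalog to the role hive_db_creator.
GRANT CREATE DATABASE ON CATALOG hive_catalog TO ROLE hive_db_creator;
```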
Syntax
Switch to the target Hive catalog and run the following statement to create a Hive database.
CREATE DATABASE <database_name>
[PROPERTIES ("location" = "<prefix>://<path_to_database>/<database_name.db>")]
Parameters
If you do not specify location when creating the database, the system uses the default path <warehouse_location>/<database_name.db>.
The location parameter specifies the file path for the database. It supports HDFS and Object Storage Service (OSS):
For HDFS, set the prefix to hdfs.
For Alibaba Cloud OSS or OSS-HDFS, set the prefix to oss.
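For instance, a database can be pinned to an explicit HDFS path (the host, port, and path below are placeholders; without the property, the default warehouse path is used):

```sql
CREATE DATABASE hive_db
PROPERTIES ("location" = "hdfs://<name_node_host>:<port>/user/hive/warehouse/hive_db.db");
```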
Delete a Hive database
Similar to StarRocks internal databases, if you have the DROP permission for a Hive database, you can delete it using DROP DATABASE. This feature is supported from v3.2 onwards. You can only delete empty databases.
You can use GRANT and REVOKE to grant and revoke permissions for users and roles.
The delete database operation does not delete the corresponding file path on HDFS or Object Storage Service (OSS).
Switch to the target Hive catalog and run the following statement to delete a Hive database.
DROP DATABASE <database_name>;
Create a Hive table
Similar to StarRocks internal databases, if you have the CREATE TABLE permission for a Hive database, you can create a Managed Table in that database using CREATE TABLE, CREATE TABLE AS SELECT (CTAS), or CREATE TABLE LIKE.
This feature is supported from v3.2 onwards, which initially only supported creating Hive tables in Parquet format. From v3.3, this feature also supports creating Hive tables in ORC and TextFile formats.
Switch to the target Hive catalog and database, and then run the following statement to create a Hive Managed Table.
Syntax
CREATE TABLE [IF NOT EXISTS] [database.]table_name
(column_definition1[, column_definition2, ...
partition_column_definition1,partition_column_definition2...])
[partition_desc]
[PROPERTIES ("key" = "value", ...)]
[AS SELECT query]
[LIKE [database.]<source_table_name>]
Parameters
column_definition
The syntax for column_definition is as follows.
col_name col_type [COMMENT 'comment']
The parameters are described in the following table.
Parameter | Description |
col_name | Column name. |
col_type | Column data type. Currently, the following data types are supported: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, DATE, DATETIME, CHAR, VARCHAR[(length)], ARRAY, MAP, and STRUCT. LARGEINT, HLL, and BITMAP types are not supported. |
Note: All non-partition key columns default to NULL. This is equivalent to specifying DEFAULT "NULL" in the CREATE TABLE statement. In contrast, partition key columns must be declared last and cannot be NULL.
partition_desc
The syntax for partition_desc is as follows.
PARTITION BY (par_col1[, par_col2...])
Currently, StarRocks only supports identity transforms, which means a partition is created for each unique partition value.
Note: Partition key columns must be declared last. They support data types other than FLOAT, DOUBLE, DECIMAL, and DATETIME, and do not support NULL values. Additionally, the order of partition key columns declared in partition_desc must be consistent with the order of the columns defined in column_definition.
PROPERTIES
You can declare properties for the Hive table in PROPERTIES using the "key" = "value" format. The following table lists some common properties.
Property | Description |
location | The file path for the Managed Table. When using HMS as the metadata service, you do not need to specify the location parameter. |
file_format | The file format of the Managed Table. Currently, Parquet, ORC, and TextFile formats are supported. ORC and TextFile formats are supported from v3.3. Valid values: parquet, orc, textfile. Default: parquet. |
compression_codec | The compression format of the Managed Table. This property is deprecated from v3.2.3. Since then, the compression algorithm for writing to Hive tables is controlled by the connector_sink_compression_codec session variable. |
Examples
The following CREATE TABLE statements use the default Parquet format as an example.
Create a non-partitioned table unpartition_tbl with two columns: id and score.
CREATE TABLE unpartition_tbl
(
    id int,
    score double
);
Create a partitioned table partition_tbl_1 with three columns: action, id, and dt. Define id and dt as partition key columns.
CREATE TABLE partition_tbl_1
(
    action varchar(20),
    id int NOT NULL,
    dt date NOT NULL
)
PARTITION BY (id,dt);
Query the source table partition_tbl_1 and create a partitioned table partition_tbl_2 based on the query result. Define id and dt as the partition key columns for partition_tbl_2.
CREATE TABLE partition_tbl_2
PARTITION BY (id, dt)
AS SELECT * FROM partition_tbl_1;
View Hive table schema
You can view the schema of a Hive table in the following ways:
View the table schema
DESC[RIBE] <catalog_name>.<database_name>.<table_name>;
View the table schema and file location from the CREATE command
SHOW CREATE TABLE <catalog_name>.<database_name>.<table_name>;
View Hive table data
Method 1: Directly query table data
SELECT * FROM <catalog_name>.<database_name>.<table_name>;
Method 2: View and query step by step
View the databases in the Hive cluster that belong to the specified catalog.
SHOW DATABASES FROM <catalog_name>;
Switch to the target database.
USE <catalog_name>.<db_name>;
Query the data of the target table.
SELECT count(*) FROM <table_name> LIMIT 10;
Import Hive data
Assume you have an OLAP table named olap_tbl. You can transform the data in this table and import it into StarRocks as follows.
INSERT INTO default_catalog.olap_db.olap_tbl SELECT * FROM hive_table;
Grant permissions on Hive tables and views
You can use GRANT to grant a role the query permission on all tables and views within a Hive catalog. The command syntax is as follows.
GRANT SELECT ON ALL TABLES IN ALL DATABASES TO ROLE <role_name>;
For example, run the following commands to create a role hive_role_table, switch to the Hive catalog hive_catalog, and then grant the query permission on all tables and views in hive_catalog to hive_role_table.
-- Create the role hive_role_table.
CREATE ROLE hive_role_table;
-- Switch to the data catalog hive_catalog.
SET CATALOG hive_catalog;
-- Grant the query permission on all tables and views in hive_catalog to hive_role_table.
GRANT SELECT ON ALL TABLES IN ALL DATABASES TO ROLE hive_role_table;
Insert data into a Hive table
Similar to StarRocks internal tables, if you have the INSERT permission for a Hive table (Managed Table or External Table), you can use INSERT to write data from a StarRocks table into that Hive table.
This feature is supported from v3.2 onwards, which initially only supported writing to Hive tables in Parquet format. From v3.3, this feature also supports writing to Hive tables in ORC and TextFile formats.
Note that the feature to write data to External Tables is disabled by default. You must enable it using the ENABLE_WRITE_HIVE_EXTERNAL_TABLE system variable.
You can use GRANT and REVOKE to grant and revoke permissions for users and roles.
You can use the connector_sink_compression_codec session variable to specify the compression algorithm for writing to Hive tables.
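Putting the two variables above together, a session that writes to a Hive External Table might be prepared as follows (a sketch):

```sql
-- Allow INSERT into Hive External Tables for this session
-- (disabled by default).
SET enable_write_hive_external_table = true;
-- Optionally pick the compression codec used for the write.
SET connector_sink_compression_codec = 'snappy';
```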
Switch to the target Hive catalog and database, and then use the following syntax to write data from a StarRocks table into a Hive table.
Syntax
INSERT {INTO | OVERWRITE} <table_name>
[ (column_name [, ...]) ]
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
-- Write data to a specific partition.
INSERT {INTO | OVERWRITE} <table_name>
PARTITION (par_col1=<value> [, par_col2=<value>...])
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
Partition key columns cannot be NULL. Therefore, you must ensure that partition key columns have values during data import.
Parameters
Parameter | Description |
INTO | Appends data to the target table. |
OVERWRITE | Overwrites data in the target table. |
column_name | The target columns for the import. You can specify one or more columns. When specifying multiple columns, separate them with commas (,). |
expression | An expression used to assign a value to the corresponding column. |
DEFAULT | Assigns the default value to the corresponding column. |
query | A query statement. The result of the query is imported into the target table. The query statement supports any SQL query syntax supported by StarRocks. |
PARTITION | The target partition for the import. You must specify all partition key columns of the target table. The order of the specified partition key columns can be different from the order defined when the table was created. When specifying a partition, you cannot specify the target columns for import by column name (column_name). |
Examples
The following INSERT statements use the default Parquet format as an example.
Insert the following three rows of data into the table partition_tbl_1.
INSERT INTO partition_tbl_1
VALUES
    ("buy", 1, "2023-09-01"),
    ("sell", 2, "2023-09-02"),
    ("buy", 3, "2023-09-03");
Insert the result of a SELECT query that includes a simple calculation into the table partition_tbl_1, specifying the column order.
INSERT INTO partition_tbl_1 (id, action, dt) SELECT 1+1, 'buy', '2023-09-03';
Insert the result of a SELECT query that reads data from the table partition_tbl_1 itself.
INSERT INTO partition_tbl_1 SELECT 'buy', 1, date_add(dt, INTERVAL 2 DAY) FROM partition_tbl_1 WHERE id=1;
Insert the result of a SELECT query into the partition where dt='2023-09-01' and id=1 in the table partition_tbl_2.
Method 1
INSERT INTO partition_tbl_2 SELECT 'order', 1, '2023-09-01';
Method 2
INSERT INTO partition_tbl_2 partition(dt='2023-09-01',id=1) SELECT 'order';
Overwrite all values in the action column with close for the partition where dt='2023-09-01' and id=1 in the table partition_tbl_1.
Method 1
INSERT OVERWRITE partition_tbl_1 SELECT 'close', 1, '2023-09-01';
Method 2
INSERT OVERWRITE partition_tbl_1 partition(dt='2023-09-01',id=1) SELECT 'close';
Delete a Hive table
Similar to StarRocks internal tables, if you have the DROP permission for a Hive table, you can delete it using DROP TABLE. This feature is supported from v3.2 onwards. Note that currently, you can only delete Hive Managed Tables.
You can use GRANT and REVOKE to grant and revoke permissions for users and roles.
To delete a table, you must specify the FORCE keyword in the DROP TABLE statement. This operation deletes the table data on HDFS or Object Storage Service but does not delete the corresponding file path. Run this operation with caution.
Switch to the target Hive catalog and database, and then run the following statement to delete a Hive table.
DROP TABLE <table_name> FORCE;
Manually or automatically update metadata cache
By default, StarRocks caches Hive metadata and automatically updates the cache in asynchronous mode to improve query performance. If you make schema changes or other updates to a Hive table, you can manually update its metadata using the following command to ensure that StarRocks can promptly generate efficient query plans.
REFRESH EXTERNAL TABLE <table_name> [PARTITION ('partition_name', ...)]
Manual metadata updates are suitable for the following situations:
Data files within an existing partition have changed, for example, after you run an INSERT OVERWRITE ... PARTITION ... command.
The schema of the Hive table has changed.
A Hive table was dropped and then a new Hive table with the same name was created.
When creating a Hive catalog, you specified "enable_cache_list_names" = "true" in PROPERTIES. In this case, you must refresh metadata manually before you can query partitions that were newly added on the Hive side.
Since v2.5.5, StarRocks has supported periodic refreshing of the Hive metadata cache. When this feature is enabled, the Hive metadata cache is automatically refreshed every 10 minutes by default. In most cases, you do not need to perform manual updates. Manual updates are only necessary when you add new partitions and need to query their data immediately.
REFRESH EXTERNAL TABLE only updates the tables and partitions that are already cached in the FE.
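For example, after a partition is added on the Hive side, the cached metadata for the table, or for a single partition of it, can be refreshed manually. The catalog, database, table, and partition names below are illustrative; the partition name uses Hive's key=value path form:

```sql
-- Refresh all cached metadata of one table.
REFRESH EXTERNAL TABLE hive_catalog.hive_db.partition_tbl_1;
-- Refresh only one cached partition.
REFRESH EXTERNAL TABLE hive_catalog.hive_db.partition_tbl_1
PARTITION ('id=1/dt=2023-09-01');
```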
References
For more information about Hive, see Overview.