A Hive catalog is an external catalog that StarRocks has supported since v2.3.
Background
A Hive catalog lets you:
Directly query Hive data without manually creating tables.
Process and model Hive data and import it into StarRocks using INSERT INTO, or using asynchronous materialized views in v3.1 and later.
Create or delete Hive databases and tables in StarRocks. You can also write data from StarRocks tables to Hive tables in Parquet (from v3.2), ORC, or TextFile (from v3.3) format using INSERT INTO.
To access data in Hive, the StarRocks cluster must be able to access the storage system and metadata service of the Hive cluster. StarRocks currently supports the following storage systems and metadata services:
Hadoop Distributed File System (HDFS) or Alibaba Cloud Object Storage Service (OSS).
Metadata services. Currently, supported metadata services include Data Lake Formation DLF 1.0 (Legacy) and Hive Metastore (HMS).
Limitations
StarRocks supports querying Hive data in Parquet, ORC, TextFile, Avro, RCFile, and SequenceFile file formats:
Parquet files support the SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION compression formats. Support for the LZO compression format was added in v3.1.5.
ORC files support ZLIB, SNAPPY, LZO, LZ4, ZSTD, and NO_COMPRESSION compression formats.
TextFile files support the LZO compression format from v3.1.5.
StarRocks does not support querying Hive data of the INTERVAL, BINARY, or UNION data types. Additionally, for Hive tables in TextFile format, StarRocks does not support the MAP and STRUCT data types.
StarRocks supports writing data to Hive in Parquet (v3.2 and later), ORC, and TextFile (v3.3 and later) file formats:
Parquet and ORC files support NO_COMPRESSION, SNAPPY, LZ4, ZSTD, and GZIP compression formats.
TextFile supports the NO_COMPRESSION format.
You can set the compression algorithm for writing to Hive tables using the connector_sink_compression_codec system variable.
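For instance, the codec can be switched at the session level before a write (a sketch; zstd is one of the values listed above):

```sql
-- Set the compression codec used by subsequent writes to Hive tables
-- in this session; 'zstd' is one of the supported values above.
SET connector_sink_compression_codec = 'zstd';
```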
Create a Hive catalog
Syntax
CREATE EXTERNAL CATALOG <catalog_name>
[COMMENT <comment>]
PROPERTIES
(
"type" = "hive",
GeneralParams,
MetastoreParams
)
Parameters
The parameter configuration varies depending on the metadata service used by Hive.
Use HMS
catalog_name: The name of the Hive catalog. This parameter is required. The name must meet the following requirements:
It must start with a letter and can contain only letters (a-z or A-Z), numbers (0-9), and underscores (_).
The total length cannot exceed 64 characters.
comment: The description of the Hive catalog. This parameter is optional.
type: The type of the data source. Set this to hive.
GeneralParams: A set of parameters for general settings. GeneralParams includes the following parameter.
Parameter | Required | Description |
enable_recursive_listing | No | Specifies whether StarRocks recursively reads data from files in a table or partition directory, including its subdirectories. Valid values: true (default): Recursively traverses the directory. false: Reads data only from files at the current level of the table or partition directory. |
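As a sketch of how a GeneralParams entry is passed, the following statement creates a catalog with recursive listing disabled (the catalog name and metastore URI are placeholders):

```sql
CREATE EXTERNAL CATALOG hive_catalog_flat
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    -- GeneralParams entry: read only the top level of each
    -- table or partition directory.
    "enable_recursive_listing" = "false"
);
```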
MetastoreParams: Parameters related to how StarRocks accesses the metadata of the Hive cluster.
Property | Description |
hive.metastore.type | The type of the metastore. The default value is empty, which means the Hive Metastore is used. |
hive.metastore.uris | The URI of the Hive Metastore. The format is thrift://<IP_address_of_Hive_Metastore>:<port>. The default port is 9083. |
Use DLF 1.0 (Legacy)
catalog_name: The name of the Hive catalog. This parameter is required. The name must meet the following requirements:
It must start with a letter and can contain only letters (a-z or A-Z), numbers (0-9), and underscores (_).
The total length cannot exceed 64 characters.
comment: The description of the Hive catalog. This parameter is optional.
type: The type of the data source. Set this to hive.
GeneralParams: A set of parameters for general settings. GeneralParams includes the following parameter.
Parameter | Required | Description |
enable_recursive_listing | No | Specifies whether StarRocks recursively reads data from files in a table or partition directory, including its subdirectories. Valid values: true (default): Recursively traverses the directory. false: Reads data only from files at the current level of the table or partition directory. |
MetastoreParams: Parameters related to how StarRocks accesses the metadata of the Hive cluster.
Property | Description |
hive.metastore.type | The type of metadata service used by Hive. Set this to dlf. |
dlf.catalog.id | The ID of an existing data catalog in DLF 1.0. This parameter takes effect only when hive.metastore.type is set to dlf. If dlf.catalog.id is not specified, the system uses the default DLF catalog. |
Examples
The following examples use HDFS for storage and create a Hive catalog named hive_catalog.
Use HMS
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES
(
"type" = "hive",
"hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083"
);
Use DLF 1.0 (Legacy)
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES
(
"type" = "hive",
"hive.metastore.type" = "dlf",
"dlf.catalog.id" = "sr_dlf"
);
View Hive catalogs
You can use SHOW CATALOGS to query all catalogs in the current StarRocks cluster.
SHOW CATALOGS;
You can also use SHOW CREATE CATALOG to query the creation statement of an external catalog. For example, run the following command to query the creation statement for the Hive catalog hive_catalog.
SHOW CREATE CATALOG hive_catalog;
Switch between Hive catalogs and databases
You can switch to a target Hive catalog and database in the following ways:
Method 1: First, specify the Hive catalog for the current session with SET CATALOG. Then, specify the database with USE.
-- Switch the catalog for the current session.
SET CATALOG <catalog_name>;
-- Specify the database for the current session.
USE <db_name>;
Method 2: Use USE to directly switch the session to a specific database under the target Hive catalog.
USE <catalog_name>.<db_name>;
Delete a Hive catalog
You can use DROP CATALOG to delete an external catalog.
For example, run the following command to delete hive_catalog.
DROP CATALOG hive_catalog;
Create a Hive database
Similar to the StarRocks internal catalog, if you have the CREATE DATABASE permission for a Hive catalog, you can create a database in that catalog using the CREATE DATABASE statement. This feature is supported from v3.2 onwards.
You can use GRANT and REVOKE to grant and revoke permissions for users and roles.
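For example, assuming the standard syntax for catalog-level privileges, the CREATE DATABASE permission might be granted as follows (the role name is illustrative):

```sql
-- Illustrative: grant the CREATE DATABASE privilege on the catalog
-- hive_catalog to the role hive_db_creator.
GRANT CREATE DATABASE ON CATALOG hive_catalog TO ROLE hive_db_creator;
```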
Syntax
Switch to the target Hive catalog and run the following statement to create a Hive database.
CREATE DATABASE <database_name>
[PROPERTIES ("location" = "<prefix>://<path_to_database>/<database_name.db>")]
Parameters
If you do not specify location when creating the database, the system uses the default path <warehouse_location>/<database_name.db>.
The location parameter specifies the file path for the database. It supports HDFS and Object Storage Service (OSS):
For HDFS, set the prefix to hdfs.
For Alibaba Cloud OSS or OSS-HDFS, set the prefix to oss.
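For instance, a database can be pinned to an explicit HDFS path (the host, port, and path below are placeholders; without the property, the default warehouse path is used):

```sql
CREATE DATABASE hive_db
PROPERTIES ("location" = "hdfs://<name_node_host>:<port>/user/hive/warehouse/hive_db.db");
```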
Delete a Hive database
Similar to StarRocks internal databases, if you have the DROP permission for a Hive database, you can delete it using DROP DATABASE. This feature is supported from v3.2 onwards. You can only delete empty databases.
You can use GRANT and REVOKE to grant and revoke permissions for users and roles.
The delete database operation does not delete the corresponding file path on HDFS or Object Storage Service (OSS).
Switch to the target Hive catalog and run the following statement to delete a Hive database.
DROP DATABASE <database_name>;
Create a Hive table
Similar to StarRocks internal databases, if you have the CREATE TABLE permission for a Hive database, you can create a Managed Table in that database using CREATE TABLE, CREATE TABLE AS SELECT (CTAS), or CREATE TABLE LIKE.
This feature is supported from v3.2 onwards, which initially only supported creating Hive tables in Parquet format. From v3.3, this feature also supports creating Hive tables in ORC and TextFile formats.
Switch to the target Hive catalog and database, and then run the following statement to create a Hive Managed Table.
Syntax
CREATE TABLE [IF NOT EXISTS] [database.]table_name
(column_definition1[, column_definition2, ...
partition_column_definition1,partition_column_definition2...])
[partition_desc]
[PROPERTIES ("key" = "value", ...)]
[AS SELECT query]
[LIKE [database.]<source_table_name>]
Parameters
column_definition
The syntax for column_definition is as follows.
col_name col_type [COMMENT 'comment']
The parameters are described in the following table.
Parameter | Description |
col_name | Column name. |
col_type | Column data type. Currently, the following data types are supported: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, DATE, DATETIME, CHAR, VARCHAR[(length)], ARRAY, MAP, and STRUCT. LARGEINT, HLL, and BITMAP types are not supported. |
Note: All non-partition key columns default to NULL. This is equivalent to specifying DEFAULT "NULL" in the CREATE TABLE statement. In contrast, partition key columns must be declared last and cannot be NULL.
partition_desc
The syntax for partition_desc is as follows.
PARTITION BY (par_col1[, par_col2...])
Currently, StarRocks only supports identity transforms, which means a partition is created for each unique partition value.
Note: Partition key columns must be declared last. They support data types other than FLOAT, DOUBLE, DECIMAL, and DATETIME, and do not support NULL values. Additionally, the order of partition key columns declared in partition_desc must be consistent with the order of the columns defined in column_definition.
PROPERTIES
You can declare properties for the Hive table in PROPERTIES using the "key" = "value" format. The following table lists some common properties.
Property | Description |
location | The file path for the Managed Table. When using HMS as the metadata service, you do not need to specify the location parameter. |
file_format | The file format of the Managed Table. Currently, Parquet, ORC, and TextFile formats are supported. ORC and TextFile formats are supported from v3.3. Valid values: parquet, orc, textfile. Default: parquet. |
compression_codec | The compression format of the Managed Table. This property is deprecated from v3.2.3. Since then, the compression algorithm for writing to Hive tables is controlled by the connector_sink_compression_codec session variable. |
Examples
The following CREATE TABLE statements use the default Parquet format as an example.
Create a non-partitioned table unpartition_tbl with two columns: id and score.
CREATE TABLE unpartition_tbl
(
    id int,
    score double
);
Create a partitioned table partition_tbl_1 with three columns: action, id, and dt. Define id and dt as partition key columns.
CREATE TABLE partition_tbl_1
(
    action varchar(20),
    id int NOT NULL,
    dt date NOT NULL
)
PARTITION BY (id,dt);
Query the source table partition_tbl_1 and create a partitioned table partition_tbl_2 based on the query result. Define id and dt as the partition key columns for partition_tbl_2.
CREATE TABLE partition_tbl_2
PARTITION BY (id, dt)
AS SELECT * FROM partition_tbl_1;
View Hive table schema
You can view the schema of a Hive table in the following ways:
View the table schema
DESC[RIBE] <catalog_name>.<database_name>.<table_name>;
View the table schema and file location from the CREATE command
SHOW CREATE TABLE <catalog_name>.<database_name>.<table_name>;
View Hive table data
Method 1: Directly query table data
SELECT * FROM <catalog_name>.<database_name>.<table_name>;
Method 2: View and query step by step
View the databases in the Hive cluster that belong to the specified catalog.
SHOW DATABASES FROM <catalog_name>;
Switch to the target database.
USE <catalog_name>.<db_name>;
Query the data of the target table.
SELECT count(*) FROM <table_name> LIMIT 10;
Import Hive data
Assume you have an OLAP table named olap_tbl. You can transform the data in this table and import it into StarRocks as follows.
INSERT INTO default_catalog.olap_db.olap_tbl SELECT * FROM hive_table;
Grant permissions on Hive tables and views
You can use GRANT to grant a role the query permission on all tables and views within a Hive catalog. The command syntax is as follows.
GRANT SELECT ON ALL TABLES IN ALL DATABASES TO ROLE <role_name>;
For example, run the following commands to create a role hive_role_table, switch to the Hive catalog hive_catalog, and then grant the query permission on all tables and views in hive_catalog to hive_role_table.
-- Create the role hive_role_table.
CREATE ROLE hive_role_table;
-- Switch to the data catalog hive_catalog.
SET CATALOG hive_catalog;
-- Grant the query permission on all tables and views in hive_catalog to hive_role_table.
GRANT SELECT ON ALL TABLES IN ALL DATABASES TO ROLE hive_role_table;
Insert data into a Hive table
Similar to StarRocks internal tables, if you have the INSERT permission for a Hive table (Managed Table or External Table), you can use INSERT to write data from a StarRocks table into that Hive table.
This feature is supported from v3.2 onwards, which initially only supported writing to Hive tables in Parquet format. From v3.3, this feature also supports writing to Hive tables in ORC and TextFile formats.
Note that the feature to write data to External Tables is disabled by default. You must enable it using the ENABLE_WRITE_HIVE_EXTERNAL_TABLE system variable.
You can use GRANT and REVOKE to grant and revoke permissions for users and roles.
You can use the connector_sink_compression_codec session variable to specify the compression algorithm for writing to Hive tables.
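Putting the two variables above together, a session that writes to a Hive External Table might be prepared as follows (a sketch):

```sql
-- Allow INSERT into Hive External Tables for this session
-- (disabled by default).
SET enable_write_hive_external_table = true;
-- Optionally pick the compression codec used for the write.
SET connector_sink_compression_codec = 'snappy';
```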
Switch to the target Hive catalog and database, and then use the following syntax to write data from a StarRocks table into a Hive table.
Syntax
INSERT {INTO | OVERWRITE} <table_name>
[ (column_name [, ...]) ]
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
-- Write data to a specific partition.
INSERT {INTO | OVERWRITE} <table_name>
PARTITION (par_col1=<value> [, par_col2=<value>...])
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
Partition key columns cannot be NULL. Therefore, you must ensure that partition key columns have values during data import.
Parameters
Parameter | Description |
INTO | Appends data to the target table. |
OVERWRITE | Overwrites data in the target table. |
column_name | The target columns for the import. You can specify one or more columns. When specifying multiple columns, separate them with commas (,). |
expression | An expression used to assign a value to the corresponding column. |
DEFAULT | Assigns the default value to the corresponding column. |
query | A query statement. The result of the query is imported into the target table. The query statement supports any SQL query syntax supported by StarRocks. |
PARTITION | The target partition for the import. You must specify all partition key columns of the target table. The order of the specified partition key columns can be different from the order defined when the table was created. When specifying a partition, you cannot specify the target columns for import by column name (column_name). |
Examples
The following INSERT statements use the default Parquet format as an example.
Insert the following three rows of data into the table partition_tbl_1.
INSERT INTO partition_tbl_1
VALUES
    ("buy", 1, "2023-09-01"),
    ("sell", 2, "2023-09-02"),
    ("buy", 3, "2023-09-03");
Insert the result of a SELECT query that includes a simple calculation into the table partition_tbl_1, specifying the column order.
INSERT INTO partition_tbl_1 (id, action, dt) SELECT 1+1, 'buy', '2023-09-03';
Insert the result of a SELECT query that reads data from the table partition_tbl_1 itself.
INSERT INTO partition_tbl_1 SELECT 'buy', 1, date_add(dt, INTERVAL 2 DAY) FROM partition_tbl_1 WHERE id=1;
Insert the result of a SELECT query into the partition where dt='2023-09-01' and id=1 in the table partition_tbl_2.
Method 1
INSERT INTO partition_tbl_2 SELECT 'order', 1, '2023-09-01';
Method 2
INSERT INTO partition_tbl_2 partition(dt='2023-09-01',id=1) SELECT 'order';
Overwrite all values in the action column with close for the partition where dt='2023-09-01' and id=1 in the table partition_tbl_1.
Method 1
INSERT OVERWRITE partition_tbl_1 SELECT 'close', 1, '2023-09-01';
Method 2
INSERT OVERWRITE partition_tbl_1 partition(dt='2023-09-01',id=1) SELECT 'close';
Delete a Hive table
Similar to StarRocks internal tables, if you have the DROP permission for a Hive table, you can delete it using DROP TABLE. This feature is supported from v3.2 onwards. Note that currently, you can only delete Hive Managed Tables.
You can use GRANT and REVOKE to grant and revoke permissions for users and roles.
To delete a table, you must specify the FORCE keyword in the DROP TABLE statement. This operation deletes the table data on HDFS or Object Storage Service but does not delete the corresponding file path. Run this operation with caution.
Switch to the target Hive catalog and database, and then run the following statement to delete a Hive table.
DROP TABLE <table_name> FORCE;
Manually or automatically update metadata cache
By default, StarRocks caches Hive metadata and automatically updates the cache in asynchronous mode to improve query performance. If you make schema changes or other updates to a Hive table, you can manually update its metadata using the following command to ensure that StarRocks can promptly generate efficient query plans.
REFRESH EXTERNAL TABLE <table_name> [PARTITION ('partition_name', ...)]
Manual metadata updates are suitable for the following situations:
Data files within an existing partition have changed, for example, after you run an INSERT OVERWRITE ... PARTITION ... command.
The schema of the Hive table has changed.
A Hive table was dropped and then a new Hive table with the same name was created.
When creating a Hive catalog, you specified "enable_cache_list_names" = "true" in PROPERTIES. In this case, you must refresh metadata manually before you can query partitions that were newly added on the Hive side.
Since v2.5.5, StarRocks has supported periodic refreshing of the Hive metadata cache. When this feature is enabled, the Hive metadata cache is automatically refreshed every 10 minutes by default. In most cases, you do not need to perform manual updates. Manual updates are only necessary when you add new partitions and need to query their data immediately.
REFRESH EXTERNAL TABLE only updates the tables and partitions that are already cached in the FE.
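For example, after a partition is added on the Hive side, the cached metadata for the table, or for a single partition of it, can be refreshed manually. The catalog, database, table, and partition names below are illustrative; the partition name uses Hive's key=value path form:

```sql
-- Refresh all cached metadata of one table.
REFRESH EXTERNAL TABLE hive_catalog.hive_db.partition_tbl_1;
-- Refresh only one cached partition.
REFRESH EXTERNAL TABLE hive_catalog.hive_db.partition_tbl_1
PARTITION ('id=1/dt=2023-09-01');
```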
References
For more information about Hive, see Overview.