Manage the entire lifecycle of Hive catalogs - Realtime Compute for Apache Flink

A Hive catalog connects Realtime Compute for Apache Flink to your Hive metastore or Alibaba Cloud Data Lake Formation (DLF), so you can query and write Hive tables without declaring DDL statements for each table. Tables registered in a Hive catalog are available as source tables, result tables, or dimension tables in both streaming and batch deployments.

This topic covers:

Prerequisites
Limitations
Configure Hive metadata
Create a Hive catalog
Use a Hive catalog
View a Hive catalog
Drop a Hive catalog

Prerequisites

Before you begin, make sure that you have:

If you use a Hive metastore:

The Hive metastore service is running. To start it, run hive --service metastore. To verify it is listening on the default port (9083), run netstat -ln | grep 9083. If you configured a different port in hive-site.xml, replace 9083 with that port.
A whitelist configured for the Hive metastore service, with the CIDR blocks of Realtime Compute for Apache Flink added. For steps, see Configure a whitelist and Add a security group rule.

If you use Alibaba Cloud DLF:

DLF activated in your Alibaba Cloud account.

Limitations

Limitation	Details
Supported metastore types	Self-managed Hive metastores
Supported Hive versions	2.0.0–2.3.9 and 3.1.0–3.1.3
Legacy Hive versions (1.X, 2.1.X, 2.2.X)	Not supported in Apache Flink 1.16 or later; requires Ververica Runtime (VVR) 6.X
Non-Hive tables in DLF-backed catalogs	Requires VVR 8.0.6 or later
Writing to OSS-HDFS via Hive catalog	Requires VVR 8.0.6 or later

Configure Hive metadata

Before creating a Hive catalog, connect your Hadoop cluster to the virtual private cloud (VPC) where Realtime Compute for Apache Flink runs, then upload your configuration files to Object Storage Service (OSS).

Step 1: Connect your Hadoop cluster to the VPC

Use Alibaba Cloud DNS PrivateZone to resolve Hadoop hostnames within the VPC. For setup instructions, see Resolver.

Step 2: Configure the metastore in hive-site.xml

Choose the metastore type that matches your setup.

Hive metastore

Verify that hive.metastore.uris in hive-site.xml points to the internal or public IP address of your Hive host:

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://xx.yy.zz.mm:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>

If you set hive.metastore.uris to a hostname instead of an IP address, configure Alibaba Cloud DNS PrivateZone to resolve it. Without DNS resolution, Ververica Platform (VVP) returns an UnknownHostException when connecting to the metastore. See Add a DNS record to a private zone.

Alibaba Cloud DLF

Important

If hive-site.xml contains a dlf.catalog.akMode entry, delete it before proceeding. Leaving it in place prevents the Hive catalog from accessing DLF.

Add the following properties to hive-site.xml:

<property>
  <name>hive.imetastoreclient.factory.class</name>
  <value>com.aliyun.datalake.metastore.hive2.DlfMetaStoreClientFactory</value>
</property>
<property>
  <name>dlf.catalog.uid</name>
  <value>${YOUR_DLF_CATALOG_UID}</value>
</property>
<property>
  <name>dlf.catalog.endpoint</name>
  <value>${YOUR_DLF_ENDPOINT}</value>
</property>
<property>
  <name>dlf.catalog.region</name>
  <value>${YOUR_DLF_CATALOG_REGION}</value>
</property>
<property>
  <name>dlf.catalog.accessKeyId</name>
  <value>${YOUR_ACCESS_KEY_ID}</value>
</property>
<property>
  <name>dlf.catalog.accessKeySecret</name>
  <value>${YOUR_ACCESS_KEY_SECRET}</value>
</property>

Parameter	Required	Description
`dlf.catalog.uid`	Yes	Your Alibaba Cloud account ID. Find it on the Security Settings page.
`dlf.catalog.endpoint`	Yes	DLF service endpoint. Use the VPC endpoint for your region, for example `dlf-vpc.cn-hangzhou.aliyuncs.com`. For all endpoints, see Supported regions and endpoints. For cross-VPC access, see How does Realtime Compute for Apache Flink access a service across VPCs?
`dlf.catalog.region`	Yes	Region ID where DLF is activated. Must match the region of `dlf.catalog.endpoint`. See Supported regions and endpoints.
`dlf.catalog.accessKeyId`	Yes	AccessKey ID of your Alibaba Cloud account. See Obtain an AccessKey pair.
`dlf.catalog.accessKeySecret`	Yes	AccessKey secret of your Alibaba Cloud account. See Obtain an AccessKey pair.

Step 3: Configure table storage in hive-site.xml

Hive catalogs support two storage backends for table data: OSS and OSS-HDFS.

OSS

Add the following properties to hive-site.xml:

<property>
  <name>fs.oss.impl.disable.cache</name>
  <value>true</value>
</property>
<property>
  <name>fs.oss.impl</name>
  <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>${YOUR_OSS_WAREHOUSE_DIR}</value>
</property>
<property>
  <name>fs.oss.endpoint</name>
  <value>${YOUR_OSS_ENDPOINT}</value>
</property>
<property>
  <name>fs.oss.accessKeyId</name>
  <value>${YOUR_ACCESS_KEY_ID}</value>
</property>
<property>
  <name>fs.oss.accessKeySecret</name>
  <value>${YOUR_ACCESS_KEY_SECRET}</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>oss://${YOUR_OSS_BUCKET_DOMIN}</value>
</property>

Parameter	Required	Description
`hive.metastore.warehouse.dir`	Yes	Directory where table data is stored.
`fs.oss.endpoint`	Yes	OSS endpoint. See Regions and endpoints.
`fs.oss.accessKeyId`	Yes	AccessKey ID. See Obtain an AccessKey pair.
`fs.oss.accessKeySecret`	Yes	AccessKey secret. See Obtain an AccessKey pair.
`fs.defaultFS`	Yes	Default file system for table data. Set to the HDFS endpoint of the target bucket, for example `oss://oss-hdfs-bucket.cn-hangzhou.oss-dls.aliyuncs.com/`.

OSS-HDFS

Add the following properties to hive-site.xml:

Parameter	Required	Description
`hive.metastore.warehouse.dir`	Yes	Directory where table data is stored.
`fs.oss.endpoint`	Yes	OSS endpoint. See Regions and endpoints.
`fs.oss.accessKeyId`	Yes	AccessKey ID. See Obtain an AccessKey pair.
`fs.oss.accessKeySecret`	Yes	AccessKey secret. See Obtain an AccessKey pair.
`fs.defaultFS`	Yes	Default file system for table data. Set to the HDFS endpoint of the target bucket, for example `oss://oss-hdfs-bucket.cn-hangzhou.oss-dls.aliyuncs.com/`.

<property>
  <name>fs.jindo.impl</name>
  <value>com.aliyun.jindodata.jindo.JindoFileSystem</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>${YOUR_OSS_WAREHOUSE_DIR}</value>
</property>
<property>
  <name>fs.oss.endpoint</name>
  <value>${YOUR_OSS_ENDPOINT}</value>
</property>
<property>
  <name>fs.oss.accessKeyId</name>
  <value>${YOUR_ACCESS_KEY_ID}</value>
</property>
<property>
  <name>fs.oss.accessKeySecret</name>
  <value>${YOUR_ACCESS_KEY_SECRET}</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>oss://${YOUR_OSS_HDFS_BUCKET_DOMIN}</value>
</property>

(Optional) If you plan to read Parquet-format Hive tables stored in OSS-HDFS, add the following configuration to Realtime Compute for Apache Flink:
```
fs.oss.jindo.accessKeyId: ${YOUR_ACCESS_KEY_ID}
fs.oss.jindo.accessKeySecret: ${YOUR_ACCESS_KEY_SECRET}
fs.oss.jindo.endpoint: ${YOUR_JINODO_ENDPOINT}
fs.oss.jindo.buckets: ${YOUR_JINDO_BUCKETS}
```
For parameter details, see Write data to OSS-HDFS.

If your data is already stored in Realtime Compute for Apache Flink, skip Steps 2–3 and go directly to Create a Hive catalog.

Step 4: Upload configuration files to OSS

Log on to the OSS console and click Buckets in the left-side navigation pane.
Click the name of the bucket used by your Realtime Compute for Apache Flink workspace.

Create the folder ${hms} at the path oss://${bucket}/artifacts/namespaces/${ns}/. For instructions, see Create directories.

Realtime Compute for Apache Flink automatically creates the /artifacts/namespaces/${ns}/ directory when a workspace is provisioned. If the directory does not appear in the OSS console, upload any file on the Artifacts page in the development console to trigger its creation.

Variable	Description
`${bucket}`	The bucket used by your Realtime Compute for Apache Flink workspace
`${ns}`	The name of the workspace for which you are creating the Hive catalog
`${hms}`	The folder name. Use the same name as the Hive catalog you will create

Inside oss://${bucket}/artifacts/namespaces/${ns}/${hms}/, create two subdirectories: After creating both directories, go to Files > Projects in the OSS console to verify the structure and copy the OSS URLs.
- hive-conf-dir/ — stores hive-site.xml
- hadoop-conf-dir/ — stores Hadoop configuration files
Upload hive-site.xml to hive-conf-dir/. See Upload objects.
Upload the following files to hadoop-conf-dir/. See Upload objects.
- hive-site.xml
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- Any other required files, such as compressed packages used by Hive deployments

Create a Hive catalog

After configuring Hive metadata, create a Hive catalog from the console or by running an SQL statement. The console method is recommended.

Important

After creating a Hive catalog, you cannot modify its configuration. To change any parameter, drop the catalog and create it again.

Create a Hive catalog from the console

Log on to the Realtime Compute for Apache Flink console. Find the workspace and click Console in the Actions column.
Click Catalogs.
On the Catalog List page, click Create Catalog. In the dialog box, go to the Built-in Catalog tab, select Hive in the Choose Catalog Type step, and click Next.

In the Configure Catalog step, set the following parameters:

Parameter	Required	Description
`catalog name`	Yes	Name of the Hive catalog
`hive-version`	Yes	Version of the Hive metastore. Supported: 2.0.0–2.3.9, 3.1.0–3.1.3. Map your installed version: 2.0.X or 2.1.X → `2.2.0`; 2.2.X → `2.2.0`; 2.3.X → `2.3.6`; 3.1.X → `3.1.2`
`default-database`	Yes	Name of the default database
`hive-conf-dir`	Yes	OSS path to the directory containing `hive-site.xml`, created in Step 4. For fully managed storage, upload the file as prompted
`hadoop-conf-dir`	Yes	OSS path to the directory containing Hadoop configuration files, created in Step 4. For fully managed storage, upload the files as prompted
`hive-kerberos`	No	Enable Kerberos authentication and associate a registered Kerberized Hive cluster and a Kerberos principal. See Register a Kerberized Hive cluster

Click Confirm.
In the Catalogs pane on the left side of the Catalog List page, verify the new catalog appears.

Create a Hive catalog by running an SQL statement

On the Scripts page, run the following statement:

Parameter	Required	Description
`${HMS Name}`	Yes	Name of the Hive catalog
`type`	Yes	Connector type. Set to `hive`
`default-database`	Yes	Name of the default database
`hive-version`	Yes	Hive metastore version. Supported: 2.0.0–2.3.9, 3.1.0–3.1.3. Map your installed version: 2.0.X or 2.1.X → `2.2.0`; 2.2.X → `2.2.0`; 2.3.X → `2.3.6`; 3.1.X → `3.1.2`
`hive-conf-dir`	Yes	OSS path to the directory containing `hive-site.xml`. See Configure Hive metadata
`hadoop-conf-dir`	Yes	OSS path to the directory containing Hadoop configuration files. See Configure Hive metadata

CREATE CATALOG ${HMS Name} WITH (
    'type' = 'hive',
    'default-database' = 'default',
    'hive-version' = '<hive-version>',
    'hive-conf-dir' = '<hive-conf-dir>',
    'hadoop-conf-dir' = '<hadoop-conf-dir>'
);

Select the CREATE CATALOG statement and click Run.

After the catalog is created, tables in it are available in drafts as result tables or dimension tables without DDL declarations. Table names use the format ${hive-catalog-name}.${hive-db-name}.${hive-table-name}.

Use a Hive catalog

All examples below use a catalog named flinkexporthive and a database named flinkhive.

Create a Hive table

UI method

From the console:

Log on to the Realtime Compute for Apache Flink console, find the workspace, and click Console in the Actions column.
Click Catalogs.
On the Catalog List page, find the catalog and click View in the Actions column.
Find the target database and click View in the Actions column.
Click Create Table.
On the Built-in tab of the Create Table dialog box, select a table type from the Connection Type drop-down list, select a connector type, and click Next.

Enter the table creation statement and configure the parameters:

CREATE TABLE `${catalog_name}`.`${db_name}`.`${table_name}` (
  id INT,
  name STRING
) WITH (
  'connector' = 'hive'
);

Click OK.

SQL command method

By running an SQL statement:

On the Scripts page, run the CREATE TABLE statement, then select it and click Run.

Example — create flink_hive_test in the flinkhive database under flinkexporthive:

-- Create a table named flink_hive_test in the flinkhive database under the flinkexporthive catalog.
CREATE TABLE `flinkexporthive`.`flinkhive`.`flink_hive_test` (
  id INT,
  name STRING
) WITH (
  'connector' = 'hive'
);

Modify a Hive table

On the Scripts page, run the ALTER TABLE statement:

-- Add a column to the Hive table.
ALTER TABLE `${catalog_name}`.`${db_name}`.`${table_name}`
ADD column type-column;

-- Drop a column from the Hive table.
ALTER TABLE `${catalog_name}`.`${db_name}`.`${table_name}`
DROP column;

Example — add and then remove the color column from flink_hive_test:

-- Add the color field to the Hive table.
ALTER TABLE `flinkexporthive`.`flinkhive`.`flink_hive_test`
ADD color STRING;

-- Drop the color field from the Hive table.
ALTER TABLE `flinkexporthive`.`flinkhive`.`flink_hive_test`
DROP color;

Read data from a Hive table

INSERT INTO ${other_sink_table}
SELECT ...
FROM `${catalog_name}`.`${db_name}`.`${table_name}`;

Write data to a Hive table

INSERT INTO `${catalog_name}`.`${db_name}`.`${table_name}`
SELECT ...
FROM ${other_source_table};

Drop a Hive table

UI method

From the console:

Log on to the Realtime Compute for Apache Flink console, find the workspace, and click Console in the Actions column.
Click Catalogs.
In the Catalogs pane, navigate to the target table under the target database and catalog.
On the table details page, click Delete in the Actions column.
In the confirmation dialog, click OK.

SQL command method

By running an SQL statement:

On the Scripts page, run:

-- Drop the Hive table.
DROP TABLE `${catalog_name}`.`${db_name}`.`${table_name}`;

Example:

-- Drop the Hive table.
DROP TABLE `flinkexporthive`.`flinkhive`.`flink_hive_test`;

View a Hive catalog

Log on to the Realtime Compute for Apache Flink console, find the workspace, and click Console in the Actions column.
Click Catalogs.
On the Catalog List page, check the Name and Type columns for the catalog.

To view the databases and tables in the catalog, click View in the Actions column.

Drop a Hive catalog

Warning

Dropping a Hive catalog does not affect running deployments. However, it affects unpublished drafts and deployments that need to be suspended and resumed. Drop a catalog only when necessary.

Drop a Hive catalog from the console

Log on to the Realtime Compute for Apache Flink console, find the workspace, and click Console in the Actions column.
Click Catalogs.
On the Catalog List page, find the catalog and click Delete in the Actions column.
In the confirmation dialog, click Delete.
Verify the catalog no longer appears in the Catalogs pane.

Drop a Hive catalog by running an SQL statement

On the Scripts page, run:
```
DROP CATALOG ${HMS Name};
```
${HMS Name} is the catalog name shown in the development console of Realtime Compute for Apache Flink.
Right-click the statement and select Run from the shortcut menu.
Verify the catalog no longer appears in the Catalogs pane.