This topic describes how to configure a Hive metastore in fully managed Flink.

Background information

You can store the Hive metastore configuration file and Hadoop dependencies in a directory that you specify in the Object Storage Service (OSS) console. Then, you can configure a Hive metastore in the console of fully managed Flink. After the Hive metastore is configured, you can execute DML statements that reference Hive tables directly, because the metadata of the tables is obtained from the Hive metastore. You do not need to execute DDL statements to declare the table information. Tables in a Hive metastore can be used as source tables or result tables for streaming jobs and batch jobs.

Precautions

Before you configure a Hive metastore, take note of the following points:
  • Self-managed Hive metastores are supported.
  • Hive metastores are supported in Ververica Platform (VVP) 2.3.0 or later.
    • VVP 2.3.0 supports only Hive metastores of 2.3.6.
    • VVP versions later than 2.3.0 support Hive metastores of 2.2.0 to 2.3.7.
  • Hive metastores do not support Kerberos authentication.
    Note If Kerberos authentication is not enabled, the default username that is used to access Hive on VVP is vvp and the default username that is used to access Hive in Flink clusters is flink. Therefore, you must ensure that users vvp and flink have the permissions to access Hive metadata and Hive table data on file systems, such as Hadoop Distributed File System (HDFS).
  • A fully managed Flink instance supports only one Hive metastore. You cannot configure multiple Hive metastores for multiple projects.
  • Hive metastores are read-only. Therefore, you cannot create physical tables in a Hive metastore in the console of fully managed Flink.

Configure a Hive metastore

  1. Establish a connection between a Hadoop cluster and a fully managed Flink cluster in a virtual private cloud (VPC).
    You can use Alibaba Cloud DNS PrivateZone to connect a Hadoop cluster and a fully managed Flink cluster in a VPC. For more information, see Resolver. After the connection is established, the fully managed Flink cluster can access the Hadoop cluster by using the configuration file of the Hadoop cluster.
  2. In the OSS console, create two folders in an OSS bucket to store the Hive configuration file and Hadoop dependencies.
    1. Log on to the OSS console.
    2. In the left-side navigation pane, click Buckets.
    3. Click the name of the bucket in which you want to create folders.
    4. In the left-side navigation pane, click Files.
    5. Create folders to store the Hive configuration file and Hadoop dependencies.
      • Path used to store the Hive configuration file: oss://${bucket}/artifacts/namespaces/${ns}/${hms}/hive-conf-dir/
      • Path used to store Hadoop dependencies: oss://${bucket}/artifacts/namespaces/${ns}/${hms}/hadoop-conf-dir/
      Parameters:
      • ${bucket}: the name of the bucket that is used by the fully managed Flink cluster.
      • ${ns}: the name of the fully managed Flink project for which you want to configure a Hive metastore.
      • ${hms}: the Hive metastore name that you want to display in the console of fully managed Flink.
    6. Store the Hive configuration file hive-site.xml in the hive-conf-dir folder.
      Before you upload the hive-site.xml file, check whether the hive.metastore.uris parameter in the hive-site.xml file meets the following requirement:
      <property>
          <name>hive.metastore.uris</name>
          <value>thrift://xx.yy.zz.mm:9083</value>
          <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
      </property>
      xx.yy.zz.mm indicates the internal or public IP address of Hive. If you set hive.metastore.uris to a hostname, the value cannot be resolved and an UnknownHostException error is returned when VVP remotely accesses Hive.
    7. Store the following configuration files in the hadoop-conf-dir folder:
      • hive-site.xml
      • core-site.xml
      • hdfs-site.xml
      • mapred-site.xml
      • Other required files, such as the compressed packages used by Hive jobs
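The hive.metastore.uris check described above can be automated before the upload. The following is a minimal sketch, not part of the product; the function name and file path are illustrative:

```python
import ipaddress
import xml.etree.ElementTree as ET

def check_metastore_uris(hive_site_path):
    """Parse hive-site.xml and verify that hive.metastore.uris uses an
    IP address rather than a hostname (a hostname causes an
    UnknownHostException when VVP remotely accesses Hive)."""
    root = ET.parse(hive_site_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name") == "hive.metastore.uris":
            value = prop.findtext("value")  # e.g. thrift://xx.yy.zz.mm:9083
            host = value.split("//", 1)[1].rsplit(":", 1)[0]
            try:
                ipaddress.ip_address(host)  # raises ValueError for hostnames
            except ValueError:
                raise ValueError(
                    f"hive.metastore.uris uses hostname '{host}'; "
                    "use the internal or public IP address of Hive")
            return value
    raise ValueError("hive.metastore.uris is not set in hive-site.xml")
```

Running the check on a hive-site.xml whose URI contains a hostname raises an error, so the problem is caught before the file reaches OSS.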
  3. Configure a Hive metastore in the console of fully managed Flink.
    1. Log on to the Realtime Compute for Apache Flink console.
    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
    3. In the left-side navigation pane, click Draft Editor.
    4. In the upper-left corner of the Draft Editor page, click New. In the New Draft dialog box, select STREAM/SQL from the Type drop-down list.
    5. In the script editor, enter a statement to create a Hive metastore.
      CREATE CATALOG ${HMS Name} WITH (
          'type' = 'hive',
          'default-database' = 'default',
          'hive-version' = '<hive-version>',
          'hive-conf-dir' = '<hive-conf-dir>',
          'hadoop-conf-dir' = '<hadoop-conf-dir>'
      );
      Parameters:
      • ${HMS Name}: the name of the Hive metastore.
      • type: the type of the connector. Set the value to hive.
      • default-database: the name of the default database.
      • hive-version: the version of the Hive metastore.
        Note Fully managed Flink is compatible with Hive metastores of 2.2.0 to 2.3.7. Configure the hive-version parameter based on the following requirements:
        • If the version of the Hive metastore ranges from 2.0.0 to 2.2.0, set hive-version to 2.2.0.
        • If the version of the Hive metastore ranges from 2.3.0 to 2.3.7, set hive-version to 2.3.6.
      • hive-conf-dir: the directory in which the Hive configuration file is stored.
      • hadoop-conf-dir: the directory in which the Hadoop dependencies are stored.
    6. Click Execute.
      After the Hive metastore is configured, you can reference tables from the Hive metastore as result tables and dimension tables in jobs. You do not need to execute DDL statements for these tables. Table names in the Hive metastore are in the format of ${hive-catalog-name}.${hive-db-name}.${hive-table-name}.

      If you want to delete the Hive metastore, follow the instructions provided in Delete a Hive metastore.
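For example, assuming a Hive metastore named hive-catalog that contains a database mydb with a table users (all three names are hypothetical placeholders), a job can read the table by its fully qualified name:

```sql
-- Read from a Hive table as a source; no CREATE TABLE statement is needed.
-- `hive-catalog`, `mydb`, and `users` are placeholder names.
SELECT id, name
FROM `hive-catalog`.`mydb`.`users`;
```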

View Hive metadata

  1. Log on to the Realtime Compute for Apache Flink console.
  2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
  3. In the left-side navigation pane, click Draft Editor.
  4. Click the Schemas tab.
  5. In the top navigation bar, select the Hive metastore that you want to manage from the vvp/default drop-down list.
  6. Click Tables to view the tables and fields in different databases.

Delete a Hive metastore

  1. Log on to the Realtime Compute for Apache Flink console.
  2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
  3. In the left-side navigation pane, click Draft Editor.
  4. In the upper-left corner of the Draft Editor page, click New. In the New Draft dialog box, select STREAM/SQL from the Type drop-down list.
  5. In the script editor, enter the following command:
    DROP CATALOG ${HMS Name};
    In the preceding command, ${HMS Name} indicates the name of the Hive metastore that you want to delete, as displayed in the console of fully managed Flink.
    Notice The delete operation does not affect jobs that are running. However, it affects jobs that have not been published and jobs that are suspended and then resumed. Proceed with caution.
  6. Click Execute.