This topic describes how to configure, use, and delete a Hive metastore in the console of fully managed Flink. This topic also describes how to view Hive metadata in the console of fully managed Flink.

Background information

You can store the Hive metastore configuration file and Hadoop dependencies in a directory that you specify in the Object Storage Service (OSS) console. Then, you can configure a Hive metastore in the console of fully managed Flink. After you configure the Hive metastore, you can reference Hive tables directly in DML statements and obtain their metadata without executing DDL statements to declare the table information. Tables in a Hive metastore can be used as source tables or result tables for streaming jobs and batch jobs.

Prerequisites

  • The Hive metastore service is activated.
    Commands related to the Hive metastore service:
    • hive --service metastore: starts the Hive metastore service.
    • netstat -ln | grep 9083: checks whether the Hive metastore service is listening.

      9083 is the default port number of the Hive metastore service. If you specify a different port number in the hive-site.xml file, you must replace 9083 in the preceding command with the port number that you specify in the hive-site.xml file.
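Instead of hard-coding 9083, you can read the port from the hive.metastore.uris value. A minimal sketch, with a hypothetical hive-site.xml fragment inlined for illustration (in practice, read your actual file, for example /etc/hive/conf/hive-site.xml; that path is an assumption):

```shell
# Hypothetical hive-site.xml fragment; replace with your real file contents.
conf='<property>
  <name>hive.metastore.uris</name>
  <value>thrift://10.0.0.5:9083</value>
</property>'

# Extract the port from the thrift URI; fall back to 9083 if none is found.
port=$(printf '%s' "$conf" | grep -o 'thrift://[^<]*' | sed 's/.*://')
echo "${port:-9083}"
```

You can then run netstat -ln | grep ":${port}" to check whether the service is listening on that port.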

  • A whitelist is configured for the Hive metastore service and fully managed Flink is added to this whitelist.

    For more information about how to obtain the CIDR blocks of fully managed Flink, see Configure an allowlist. For more information about how to configure a whitelist for the Hive metastore service, see Add a security group rule.

  • Read permissions on the directory that is specified by the hive.metastore.warehouse.dir parameter are granted to the Ververica Platform (VVP) and Flink users. You can view the value of this parameter in the hive-site.xml file.
  • Read permissions on the data directories of each external table are granted to the VVP and Flink users if an external table exists in the Hive metastore catalog.

    You can run the show create table ${tableName} command to view the data directory of an external table. The data directory is specified in the LOCATION clause.
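For reference, a sketch of what this looks like; the table name and HDFS path below are hypothetical placeholders:

```sql
-- Hypothetical example: the output of SHOW CREATE TABLE for an external
-- table includes a LOCATION clause such as the following.
SHOW CREATE TABLE orders_ext;
-- ...
-- LOCATION
--   'hdfs://namenode:8020/user/hive/warehouse/orders_ext'
```

Grant the vvp and flink users read permissions on the directory shown in the LOCATION clause.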

Precautions

Before you configure a Hive metastore, take note of the following points:
  • Self-managed Hive metastores are supported.
  • Hive metastores are supported in VVP 2.3.0 and later.
    • VVP 2.3.0 supports only the Hive metastore 2.3.6 version.
    • VVP versions later than 2.3.0 support Hive metastore versions 2.2.0 to 2.3.7.
  • Hive metastores do not support Kerberos authentication.
    Note If Kerberos authentication is not enabled, the default username that is used to access Hive on VVP is vvp and the default username that is used to access Hive in Flink clusters is flink. Therefore, you must ensure that users vvp and flink have the permissions to access Hive metadata and Hive table data in file systems, such as Hadoop Distributed File System (HDFS).
  • You can configure only one Hive metastore for each fully managed Flink cluster. You cannot configure multiple Hive metastores for multiple projects in a cluster.
  • Hive metastores are read-only. Therefore, you cannot create physical tables in a Hive metastore in the console of fully managed Flink.

Configure a Hive metastore

  1. Establish a connection between a Hadoop cluster and a fully managed Flink cluster in a virtual private cloud (VPC).
    You can use Alibaba Cloud DNS PrivateZone to connect a Hadoop cluster and a fully managed Flink cluster in a VPC. For more information, see Resolver. After the connection is established, the fully managed Flink cluster can access the Hadoop cluster by using the configuration file of the Hadoop cluster.
  2. In the OSS console, create two directories in an OSS bucket and upload the Hive configuration file and Hadoop dependencies to these directories.
    1. Log on to the OSS console.
    2. In the left-side navigation pane, click Buckets.
    3. Click the name of the bucket in which you want to create a directory.
    4. In the left-side navigation pane, click Files.
    5. Create a directory named ${hms} in the oss://${bucket}/artifacts/namespaces/${ns}/ path.
      For more information about how to create a directory in the OSS console, see Create directories. The following list describes the variables in the path in which you want to create a directory:
      • ${bucket}: The name of the bucket that is used by the fully managed Flink cluster.
      • ${ns}: The name of the fully managed Flink project for which you want to configure a Hive metastore.
      • ${hms}: The name of the Hive metastore that you want to display in the console of fully managed Flink.
      Note After you activate the fully managed Flink service, the system automatically creates the /artifacts/namespaces/${ns}/ directory in the specified bucket to store data, such as JAR packages. If you do not find the directory in the OSS console, you must manually upload a file to create the directory on the Resources page in the console of fully managed Flink.
    6. Create a directory named hive-conf-dir and a directory named hadoop-conf-dir in the oss://${bucket}/artifacts/namespaces/${ns}/${hms} path.
      The following examples describe the files stored in the hive-conf-dir and hadoop-conf-dir directories:
      • oss://${bucket}/artifacts/namespaces/${ns}/${hms}/hive-conf-dir/ is used to store the Hive configuration file named hive-site.xml.
      • oss://${bucket}/artifacts/namespaces/${ns}/${hms}/hadoop-conf-dir/ is used to store the Hadoop configuration files, such as core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml.

      For more information about how to create a directory in the OSS console, see Create directories. After the directory is created, you can click Artifacts in the left-side navigation pane, view the new directory and file on the Artifacts page, and then copy the OSS URL.
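Once the configuration files from the following substeps are uploaded, the resulting layout under the metastore path looks like this:

```
oss://${bucket}/artifacts/namespaces/${ns}/${hms}/
├── hive-conf-dir/
│   └── hive-site.xml
└── hadoop-conf-dir/
    ├── hive-site.xml
    ├── core-site.xml
    ├── hdfs-site.xml
    ├── yarn-site.xml
    └── mapred-site.xml
```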

    7. Upload the Hive configuration file hive-site.xml to the hive-conf-dir directory. For more information about how to upload a file, see Upload objects.
      Before you upload the hive-site.xml file, check whether the hive.metastore.uris parameter in the hive-site.xml file meets the following requirements:
      <property>
          <name>hive.metastore.uris</name>
          <value>thrift://xx.yy.zz.mm:9083</value>
          <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
      </property>
      xx.yy.zz.mm indicates the internal or public IP address of Hive.
      Note If you set the hive.metastore.uris parameter to the hostname of Hive, you must configure the Alibaba Cloud DNS service to resolve the hostname. Otherwise, the hostname fails to be resolved and the error message UnknownHostException is returned when VVP remotely accesses Hive. For more information about how to configure the Alibaba Cloud DNS service, see Add a DNS record to a private zone.
    8. Upload the following configuration files to the hadoop-conf-dir directory. For more information about how to upload a file, see Upload objects.
      • hive-site.xml
      • core-site.xml
      • hdfs-site.xml
      • mapred-site.xml
      • Other required files, such as the compressed packages used by Hive jobs
  3. Configure a Hive metastore in the console of fully managed Flink.
    1. Log on to the Realtime Compute for Apache Flink console.
    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
    3. In the left-side navigation pane, click Draft Editor.
    4. In the upper-left corner of the Draft Editor page, click New. In the New Draft dialog box, select STREAM / SQL from the Type drop-down list.
    5. In the script editor, enter a statement to create a Hive metastore.
      CREATE CATALOG ${HMS Name} WITH (
          'type' = 'hive',
          'default-database' = 'default',
          'hive-version' = '<hive-version>',
          'hive-conf-dir' = '<hive-conf-dir>',
          'hadoop-conf-dir' = '<hadoop-conf-dir>'
      );
      The following list describes the parameters in the statement:
      • ${HMS Name}: The name of the Hive metastore.
      • type: The type of the catalog. Set the value to hive.
      • default-database: The name of the default database.
      • hive-version: The version of the Hive metastore.
        Note Fully managed Flink is compatible with Hive metastores of versions 2.2.0 to 2.3.7. Configure the hive-version parameter based on the following requirements:
        • If the version of the Hive metastore ranges from 2.0.0 to 2.2.0, set hive-version to 2.2.0.
        • If the version of the Hive metastore ranges from 2.3.0 to 2.3.7, set hive-version to 2.3.6.
      • hive-conf-dir: The directory in which the Hive configuration file is stored: oss://${bucket}/artifacts/namespaces/${ns}/${hms}/hive-conf-dir/
      • hadoop-conf-dir: The directory in which the Hadoop dependencies are stored: oss://${bucket}/artifacts/namespaces/${ns}/${hms}/hadoop-conf-dir/
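As a concrete sketch, a filled-in version of the statement; the catalog name, bucket name, and project name below are hypothetical placeholders, not values from your environment:

```sql
-- All names below are assumptions for illustration only.
CREATE CATALOG my_hive_catalog WITH (
    'type' = 'hive',
    'default-database' = 'default',
    'hive-version' = '2.3.6',
    'hive-conf-dir' = 'oss://my-bucket/artifacts/namespaces/my-project/my_hive_catalog/hive-conf-dir/',
    'hadoop-conf-dir' = 'oss://my-bucket/artifacts/namespaces/my-project/my_hive_catalog/hadoop-conf-dir/'
);
```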
    6. Click Execute.
      After the Hive metastore is configured, you can reference tables from the Hive metastore as source tables, dimension tables, or result tables in jobs. You do not need to declare DDL statements for these tables. Table names in the Hive metastore are in the ${hive-catalog-name}.${hive-db-name}.${hive-table-name} format.

      If you want to delete the Hive metastore, follow the instructions provided in Delete a Hive metastore.

    7. On the left side of the Draft Editor page, click the Schemas tab.
    8. Click the Refresh icon to refresh the page and view the Hive catalog that you created.

View Hive metadata

  1. Log on to the Realtime Compute for Apache Flink console.
  2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
  3. In the left-side navigation pane, click Draft Editor.
  4. On the left side of the Draft Editor page, click the Schemas tab.
  5. In the top navigation bar, select the Hive metastore that you want to manage from the vvp / default drop-down list.
  6. Click Tables to view the tables and fields in different databases.

Use a Hive metastore

  • Read data from a Hive table.
    INSERT INTO ${other_sink_table}
    SELECT ...
    FROM `${catalog_name}`.`${db_name}`.`${table_name}`;
  • Insert the result data into a Hive table.
    INSERT INTO `${catalog_name}`.`${db_name}`.`${table_name}`
    SELECT ...
    FROM ${other_source_table};
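Putting the two patterns together, a hypothetical end-to-end statement that writes streaming results into a Hive table; all catalog, table, and column names below are assumptions for illustration:

```sql
-- Hypothetical: read from a source table declared elsewhere and write
-- aggregated results into a table in the Hive catalog.
INSERT INTO `my_hive_catalog`.`default`.`order_stats`
SELECT item_id, COUNT(*) AS order_cnt
FROM kafka_orders
GROUP BY item_id;
```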

Delete a Hive metastore

  1. Log on to the Realtime Compute for Apache Flink console.
  2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
  3. In the left-side navigation pane, click Draft Editor.
  4. In the upper-left corner of the Draft Editor page, click New. In the New Draft dialog box, select STREAM / SQL from the Type drop-down list.
  5. In the script editor, enter the following command:
    DROP CATALOG ${HMS Name};
    In the preceding statement, ${HMS Name} indicates the name of the Hive metastore that you want to delete. The name is displayed in the console of fully managed Flink.
    Notice The delete operation does not affect jobs that are running. However, it affects jobs that are not yet published and jobs that must be suspended and then resumed. Proceed with caution.
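For example, assuming a catalog created with the hypothetical name used earlier:

```sql
-- Drops only the catalog registration in Flink; the underlying Hive
-- metadata and table data are not deleted.
DROP CATALOG my_hive_catalog;
```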
  6. Click Execute.
  7. On the left side of the Draft Editor page, click the Schemas tab.
  8. Click the Refresh icon to refresh the page and check whether the Hive catalog is deleted.