A data catalog is the top-level metadata entity in Data Lake Formation (DLF) or Hive Metastore (HMS) and can contain multiple databases. In EMR Serverless Spark, you can view the databases and tables in an attached data catalog and add existing data catalogs. This feature is useful for scenarios that require metadata isolation.
Interactive jobs submitted through Livy or Kyuubi can access only the default catalog (Default Catalog). Concurrent access to multiple types of data catalogs is not supported.
Add a data catalog
Go to the Data Catalog page.
Log on to the EMR console.
In the navigation pane on the left, choose .
On the Spark page, click the name of the target workspace.
On the EMR Serverless Spark page, click Catalog in the navigation pane on the left.
Note: The Data Catalog page displays the databases and tables in the DLF data catalog that was selected when the cluster was created.
Click Add Catalog.
In the Add Catalog dialog box, configure the following parameters, and then click Add.
DLF Catalog: A metadata management service used to manage and query metadata stored in a data lake. You can select an existing DLF data catalog or create a new one to quickly access the metadata in your data lake.
To create a new DLF data catalog, click Create Catalog. You are then redirected to the Data Lake Formation console. For more information, see Metadata Management.
Note: To use a DLF data catalog, you must use one of the following engine versions: esr-4.3.0 or later, esr-3.3.0 or later, or esr-2.7.0 or later.
External Hive Metastore: An independent metadata service that is typically used to manage Hive table metadata. You can configure this service to integrate metadata from an external Hive Metastore into your current environment.
To use this method, ensure that Serverless Spark can connect to the VPC where the external Hive Metastore service is located.
Parameter
Description
Network Connection
The network connection between your environment and the VPC of the external Hive Metastore.
Select the name of a created network connection from the drop-down list. For more information, see Step 1: Add a network connection.
Metastore Service Address
The service address of the external Hive Metastore. The format is thrift://<metastore-host>:<port>, where:
<metastore-host>: The hostname or IP address of the Hive Metastore service.
<port>: The port number of the Hive Metastore service. The default is 9083.
Kerberos authentication
If Kerberos authentication is enabled for your external Hive Metastore, specify the keytab file path and the principal name.
Kerberos Keytab File: The path of the Kerberos keytab file.
Kerberos principal: The name of the principal in the keytab file. This principal is used for identity verification with the Kerberos service.
Note: Use the klist -kt <keytab_file> command to view the principal name in the target keytab file.
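The Metastore Service Address format described above can be checked locally before you enter it in the dialog box. The following is a minimal sketch, not part of the product: the function name parse_metastore_address and the sample addresses are hypothetical, and the fallback to port 9083 only mirrors the default stated above. It checks the format only and does not test whether Serverless Spark can reach the service.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Default Hive Metastore port, as stated in the parameter description.
DEFAULT_METASTORE_PORT = 9083


def parse_metastore_address(address: str) -> Tuple[str, int]:
    """Split thrift://<metastore-host>:<port> into (host, port).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(address.strip())
    if parts.scheme != "thrift":
        raise ValueError(f"expected a thrift:// address, got {address!r}")
    if not parts.hostname:
        raise ValueError(f"missing host in {address!r}")
    # Accessing parts.port raises ValueError if the port is not a valid number.
    port = parts.port if parts.port is not None else DEFAULT_METASTORE_PORT
    return parts.hostname, port


if __name__ == "__main__":
    # Sample values are placeholders, not real endpoints.
    print(parse_metastore_address("thrift://10.0.0.12:9083"))
    print(parse_metastore_address("thrift://metastore.example.internal"))
```

Note that urlsplit returns the hostname in lowercase, so compare the result case-insensitively if your hostnames use mixed case.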
View databases and tables
On the Catalog page, click a data catalog ID.
The page displays information about all databases in the data catalog.
In the Actions column of the target database, click Tables.
The page displays information about all tables in the database.
In the Actions column of the target table, click Columns.
The page displays the table and column information for the selected table.
References
For more information about how to add an external Metastore service, see Connect to an external Hive Metastore Service.