Initialize the metadata warehouse using Amazon EMR as the metadata warehouse engine - Dataphin

The Dataphin metadata warehouse (metadata warehouse for short) is a data warehouse that uniformly manages Dataphin internal business metadata and corresponding compute engine metadata. It exists in a Dataphin project within the Dataphin metadata tenant (OPS tenant) and consists of a series of periodic data integration nodes, SQL script nodes, and Shell nodes. Metadata warehouse initialization is the process of configuring the compute engine type for the Dataphin system and initializing metadata. This topic describes how to initialize the metadata warehouse using Amazon EMR as the compute engine.

Limits

Only accounts with the metadata tenant super administrator or system administrator role can initialize the system.

Important

Keep the account and password of the metadata tenant super administrator or system administrator secure. Additionally, exercise caution when performing operations after logging on to the system with the metadata tenant super administrator account.

Procedure

In the top menu bar of the Dataphin homepage, select Management Hub > System Settings.
In the navigation pane on the left, select System O&M > Metadata Warehouse Settings.
On the Metadata Warehouse Settings configuration wizard page, click Start.
In the Select Initialization Engine Type step, select the Amazon EMR engine type.
Important
If the metadata warehouse has already been initialized, the engine type used in the last successful initialization is selected by default. Switching to an incompatible compute engine makes the administration features unavailable.
Click Next.

On the Parameter Checking page, configure the following parameters.

Parameter	Description
Primary Node Public DNS	The public DNS used to obtain the private DNS of the VPC. Both Hive and Spark connect using the private DNS. The format is `ec2-{public_ip}.{region}.compute.amazonaws.com`.
Key File (.pem)	The key pair for accessing the primary node EC2 (the key pair set when creating the EMR cluster).
core-site.xml, yarn-site.xml, hive-site.xml, hdfs-site.xml	Upload these cluster configuration files yourself, or click Get Cluster Configuration to download them from the primary node. To use Get Cluster Configuration, you must first enter the primary node public DNS and upload the key file.
Cluster Storage	Currently, only HDFS can be selected.
Metadata Retrieval Method	Select HMS or Amazon Glue. HMS is selected by default. If you select Amazon Glue, you must also configure Glue Region Code, Glue AccessKey ID, and Glue AccessKey Secret. Glue Region Code: Enter the region code for Amazon Glue, such as ap-northeast-3, us-east-1, or us-west-1. Glue AccessKey ID and Glue AccessKey Secret: Enter the AccessKey ID and AccessKey Secret used to access Amazon Glue.
Engine Type	Select Hive or Spark. If you select Hive, enter the Hive JDBC URL. If you select Spark, enter the Spark JDBC URL. Hive JDBC URL: Enter the JDBC connection address for Hive, or click Automatically Retrieve to obtain the address. To use Automatically Retrieve, you must first enter the public DNS of the primary node and upload the key file. The Hive JDBC URL format is `jdbc:hive2://host1:port1,host2:port2/`. The database name is not required. Spark JDBC URL: Enter the JDBC connection address for Spark. The format is `jdbc:hive2://host1:port1/` or `jdbc:kyuubi://host1:port1/`. The database name is not required.
Username	The username for Hive or Spark. This username is set as the `username` for the JDBC connection.
Database	Enter the database name for the Amazon EMR compute engine.
Metadata Production Project	Enter the name of the metadata warehouse project in Dataphin. This project is used for metadata production and processing.
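
Before you enter values in the wizard, it can help to check that the primary node public DNS and the JDBC URLs follow the formats described in the table. The following Python snippet is a minimal sketch of such a check and is not part of Dataphin. The function names `is_public_dns` and `parse_jdbc_url`, the regular expressions, and the sample host names and ports are assumptions made for illustration. The sketch also assumes that the `{public_ip}` part of the DNS name is written as four dash-separated octets.

```python
import re

# Documented format: ec2-{public_ip}.{region}.compute.amazonaws.com
# Assumption: the public IP appears as four dash-separated octets.
PUBLIC_DNS_RE = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})"
    r"\.([a-z0-9-]+)\.compute\.amazonaws\.com$"
)

# Documented formats: jdbc:hive2://host1:port1,host2:port2/ and
# jdbc:kyuubi://host1:port1/ (the database name is optional).
JDBC_URL_RE = re.compile(
    r"^jdbc:(?P<scheme>hive2|kyuubi)://"
    r"(?P<hosts>[^/,:]+:\d+(?:,[^/,:]+:\d+)*)"
    r"/(?P<db>[^;?]*)"
)


def is_public_dns(name: str) -> bool:
    """Return True if name matches the documented public DNS format."""
    match = PUBLIC_DNS_RE.match(name)
    if match is None:
        return False
    # Each of the four captured octets must be a valid IPv4 octet.
    return all(0 <= int(octet) <= 255 for octet in match.groups()[:4])


def parse_jdbc_url(url: str) -> dict:
    """Split a Hive or Spark JDBC URL into scheme, endpoints, and database."""
    match = JDBC_URL_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognized JDBC URL format: {url!r}")
    endpoints = []
    for pair in match.group("hosts").split(","):
        host, port = pair.rsplit(":", 1)
        endpoints.append((host, int(port)))
    return {
        "scheme": match.group("scheme"),
        "endpoints": endpoints,
        "database": match.group("db"),
    }


if __name__ == "__main__":
    # Sample values only; replace them with the values of your own cluster.
    print(is_public_dns("ec2-203-0-113-25.ap-northeast-3.compute.amazonaws.com"))
    print(parse_jdbc_url("jdbc:hive2://host1:10000,host2:10000/"))
```

A URL that omits the colon after the scheme, such as `jdbc:hive2//host1:10000/`, causes `parse_jdbc_url` to raise `ValueError`. The check covers only the documented formats; the Test Connection step remains the authoritative validation.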

Click Test Connection. After the connection test passes, click Next.
On the initialization page, click Start.
Note
System initialization takes approximately 15 minutes.
After the page indicates successful execution, click Finish to complete the configuration.

What to do next

After the system metadata is initialized, you must set the compute engine for the Dataphin instance. If the metadata warehouse engine is Amazon EMR, you can set the business tenant engine to any engine type except MaxCompute. For more information, see Compute settings.