The Metadata Center, operated within the metadata warehouse tenant, executes all metadata acquisition tasks. To utilize the Metadata Center feature, initialize the Metadata Center settings in the metadata warehouse tenant and define the compute source information for metadata acquisition task execution. This topic guides you through the setup process for the Metadata Center.
Limits
The compute engine selected for the Metadata Center must match the engine type specified in the metadata warehouse.
The Metadata Center feature is compatible with several compute engines, including MaxCompute, E-MapReduce5.x Hadoop, E-MapReduce3.x Hadoop, CDH5.x Hadoop, CDH6.x Hadoop, Cloudera Data Platform 7.x, Huawei FusionInsight 8.x Hadoop, and AsiaInfo DP5.3 Hadoop.
Once the Metadata Center is initialized, reinitialization is not possible.
Permission description
Only the super administrator or system administrator of the metadata warehouse tenant can perform the Metadata Center initialization configuration.
Glossary
Metadata: Information about data, encompassing technical, business, and management aspects. It details data attributes, origins, formats, and relationships to aid in data retrieval, utilization, and maintenance.
Metadata Center: A system dedicated to extracting, processing, storing, and managing metadata from various business systems, supporting data governance and improving data organization, retrieval, and analysis within the organization.
Metadata center initialization configuration
Sign in to the metadata warehouse tenant using the super administrator or system administrator account.
Navigate to Management Center > System Settings from the top menu bar on the Dataphin home page.
In the left-side navigation pane, click System Operations And Maintenance, then select Metadata Center Settings to go to the Metadata Center Init Configuration page.
Choose the compute source type for Metadata Center initialization based on the compute engine configured in the metadata warehouse, with support for MaxCompute and Hadoop engines.
MaxCompute
Parameter
Description
Compute Source Type
Select the MaxCompute compute engine.
Endpoint
Configure the endpoint for the MaxCompute region where the Dataphin instance is located. For details on MaxCompute endpoints across different regions and network types, refer to MaxCompute Endpoints.
Project Name
This refers to the name of the MaxCompute project, not the DataWorks workspace name.
To view the specific MaxCompute project name, log on to the MaxCompute console, switch the region in the upper left corner, and navigate to the project management tab.
AccessKey ID, Access Key Secret
Enter the AccessKey ID and AccessKey Secret for the account with access to the MaxCompute project.
The AccessKey ID and AccessKey Secret can be obtained from the User Information Management page.
To maintain a normal connection between the Dataphin project space and the MaxCompute project, it is recommended to use the AccessKey of the MaxCompute project administrator.
To ensure uninterrupted metadata acquisition, avoid modifying the AccessKey of the MaxCompute project.
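Before entering the MaxCompute parameters above, it can help to sanity-check them locally. The sketch below validates the shape of the endpoint and project name with only the standard library; the function name and the example endpoint are illustrative, not part of Dataphin or MaxCompute.

```python
from urllib.parse import urlparse

def check_maxcompute_settings(endpoint: str, project: str) -> list:
    """Return a list of problems found in the settings (empty if none).

    Illustrative pre-flight check only; the endpoint used below is a
    made-up example of the documented http(s)://host/api shape.
    """
    problems = []
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https"):
        problems.append("endpoint must start with http:// or https://")
    if not parsed.netloc:
        problems.append("endpoint is missing a hostname")
    if not project:
        problems.append("project name must not be empty")
    return problems

# A well-formed endpoint/project pair yields no problems.
print(check_maxcompute_settings(
    "http://service.example-region.maxcompute.example.com/api", "my_project"))
# prints [] (no problems found)
```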
Hadoop
Compute Source Type:
HDFS Cluster Storage: Supports the E-MapReduce5.x Hadoop, E-MapReduce3.x Hadoop, CDH5.x Hadoop, CDH6.x Hadoop, Cloudera Data Platform 7.x, Huawei FusionInsight 8.x Hadoop, and AsiaInfo DP5.3 Hadoop compute engines.
OSS-HDFS Cluster Storage: Supports only the E-MapReduce5.x Hadoop compute engine.
Cluster Configuration
HDFS Cluster Storage
Parameter
Description
NameNode
The NameNode manages the file system namespace and client access privileges in HDFS.
Click Add.
In the Add NameNode dialog box, enter the NameNode's hostname and port numbers, then click OK.
After you fill in the information, the corresponding entry, such as
host=hostname,webUiPort=50070,ipcPort=8020, is automatically generated.
Configuration File
Upload cluster configuration files to set cluster parameters. The system supports uploading core-site.xml, hdfs-site.xml, and other configuration files.
To use the HMS method for metadata retrieval, you must upload hdfs-site.xml, hive-site.xml, and core-site.xml. For the FusionInsight 8.X and E-MapReduce5.x Hadoop compute engines, the hivemetastore-site.xml file is also required.
History Log
Set the log path for the cluster, such as
tmp/hadoop-yarn/staging/history/done.
Authentication Type
Supports No Authentication and Kerberos authentication methods. Kerberos, a symmetric key-based identity authentication protocol, is commonly used for cluster component authentication and enhances security when enabled.
If Kerberos authentication is enabled, configure the following parameters:
Kerberos Configuration Method
KDC Server: Enter the KDC's unified service address to facilitate Kerberos authentication.
Krb5 File Configuration: Upload the Krb5 file required for Kerberos authentication.
HDFS Configuration
HDFS Keytab File: Upload the HDFS Keytab file.
HDFS Principal: Enter the Principal name for Kerberos authentication, such as
XXXX/hadoopclient@xxx.xxx.
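The NameNode entry that is auto-generated for the table above follows the host=…,webUiPort=…,ipcPort=… shape. A minimal sketch of composing that string (the hostname is illustrative; the ports are the defaults from the documented example):

```python
def namenode_entry(host: str, web_ui_port: int = 50070, ipc_port: int = 8020) -> str:
    """Compose a NameNode entry in the documented host=...,webUiPort=...,ipcPort=... format."""
    return f"host={host},webUiPort={web_ui_port},ipcPort={ipc_port}"

print(namenode_entry("nn01.example.com"))
# host=nn01.example.com,webUiPort=50070,ipcPort=8020
```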
OSS-HDFS Cluster Storage
Parameter
Description
Cluster Storage
Determine the cluster storage type using the following methods:
Before Cluster Creation: The cluster storage type can be viewed on the E-MapReduce5.x Hadoop cluster creation page.
After Cluster Creation: The cluster storage type can be found on the details page of the E-MapReduce5.x Hadoop cluster.
Cluster Storage Root Directory
Enter the root directory for the cluster storage, which can be obtained from the E-MapReduce5.x Hadoop cluster information.
Important: If the entered path includes an endpoint, Dataphin will default to using that endpoint. If not, the Bucket-level endpoint configured in core-site.xml will be used. If the Bucket-level endpoint is not configured, the global endpoint in core-site.xml will be used. For more details, see Alibaba Cloud OSS-HDFS Service (JindoFS Service) Endpoint Configuration.
Configuration File
Upload cluster configuration files to set cluster parameters. The system supports uploading core-site.xml, hive-site.xml, and other configuration files. To use the HMS method for metadata retrieval, the hive-site.xml, core-site.xml, and hivemetastore-site.xml files must be uploaded.
History Log
Set the log path for the cluster, for example,
tmp/hadoop-yarn/staging/history/done.
AccessKey ID, AccessKey Secret
Enter the AccessKey ID and AccessKey Secret for accessing the OSS cluster. For information about AccessKey, refer to View AccessKey.
Important: The AccessKey configuration here takes precedence over the AccessKey set in core-site.xml.
Authentication Type
Supports No Authentication and Kerberos authentication methods. Kerberos, a symmetric key-based identity authentication protocol, is commonly used for cluster component authentication and enhances security when enabled. If Kerberos authentication is chosen, the Krb5 file for Kerberos authentication must be uploaded.
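The endpoint precedence described in the Important note above is a simple fallback chain, sketched below. The function and argument names are illustrative, not Dataphin internals; the logic just restates the documented order.

```python
from typing import Optional

def resolve_endpoint(path_endpoint: Optional[str],
                     bucket_endpoint: Optional[str],
                     global_endpoint: Optional[str]) -> Optional[str]:
    """Pick the endpoint in the documented order: the endpoint embedded in the
    entered root directory wins, then the Bucket-level endpoint from
    core-site.xml, then the global endpoint from core-site.xml."""
    for candidate in (path_endpoint, bucket_endpoint, global_endpoint):
        if candidate:
            return candidate
    return None

# No endpoint in the entered path, so the Bucket-level endpoint is used.
print(resolve_endpoint(None, "oss-bucket.example.com", "oss-global.example.com"))
# oss-bucket.example.com
```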
Hive Configuration
Parameter
Description
JDBC URL
Provide the JDBC URL for Hive connectivity.
Authentication Type
For clusters without authentication, Hive supports No Authentication and LDAP as authentication methods.
For clusters with Kerberos authentication, Hive supports No Authentication, LDAP, and Kerberos.
Note: When the compute engine is E-MapReduce3.x, E-MapReduce5.x, Cloudera Data Platform 7.x, AsiaInfo DP5.3, or Huawei FusionInsight 8.X, the authentication method can be configured.
Username, Password
Enter the username and password for Hive access.
No Authentication: Only the username is required.
LDAP Authentication: Both the username and password are required.
Kerberos Authentication: No credentials are necessary.
Hive Keytab File
This parameter is required when Kerberos authentication is enabled. Upload the keytab file, which can be obtained from the Hive Server.
Hive Principal
Configure this parameter once Kerberos authentication is enabled.
Enter the Principal name that corresponds to the Hive Keytab File used for Kerberos authentication. For instance,
XXXX/hadoopclient@xxx.xxx.
Execution Engine
Choose the appropriate execution engine based on the compute engine in use. Supported execution engines vary by compute engine as follows:
E-MapReduce 3.X: Supports MapReduce and Spark.
E-MapReduce 5.X: Supports MapReduce and Tez.
CDH 5.X: Supports MapReduce.
CDH 6.X: Supports MapReduce, Spark, and Tez.
FusionInsight 8.X: Supports MapReduce.
AsiaInfo DP 5.3 Hadoop: Supports MapReduce.
Cloudera Data Platform 7.x: Supports Tez.
Note: After setting the execution engine, the compute settings, compute source, tasks, and other elements of the metadata warehouse tenant will use the specified Hive execution engine. Reinitialization will reset these elements to the newly set execution engine.
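The engine-support matrix above can be expressed as a lookup table for validating a chosen execution engine before submitting the form. This is a sketch; the mapping simply restates the list above, and the helper name is illustrative.

```python
# Supported Hive execution engines per compute engine, as listed above.
SUPPORTED_ENGINES = {
    "E-MapReduce 3.X": {"MapReduce", "Spark"},
    "E-MapReduce 5.X": {"MapReduce", "Tez"},
    "CDH 5.X": {"MapReduce"},
    "CDH 6.X": {"MapReduce", "Spark", "Tez"},
    "FusionInsight 8.X": {"MapReduce"},
    "AsiaInfo DP 5.3 Hadoop": {"MapReduce"},
    "Cloudera Data Platform 7.x": {"Tez"},
}

def engine_supported(compute_engine: str, execution_engine: str) -> bool:
    """Check whether the execution engine is valid for the compute engine."""
    return execution_engine in SUPPORTED_ENGINES.get(compute_engine, set())

print(engine_supported("CDH 6.X", "Tez"))   # True
print(engine_supported("CDH 5.X", "Tez"))   # False
```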
Metadata Retrieval Method
Metadata can be retrieved using either the Metadatabase or HMS (Hive Metastore Service) method. The configuration details for each method are as follows:
Metadatabase Retrieval Method
Parameter
Description
Database Type
The Hive metadatabase currently supports only MySQL as the database type.
Compatible MySQL versions include: MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
JDBC URL
Enter the JDBC URL for the target database. For instance, the format for the connection address is
jdbc:mysql://host:port/dbname.
Username, Password
Provide the username and password for the target database.
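A small sketch of assembling a JDBC URL in the documented jdbc:mysql://host:port/dbname shape (the host and database name below are made-up placeholders):

```python
def mysql_jdbc_url(host: str, port: int, dbname: str) -> str:
    """Build a JDBC URL in the documented jdbc:mysql://host:port/dbname format."""
    return f"jdbc:mysql://{host}:{port}/{dbname}"

print(mysql_jdbc_url("meta-db.example.internal", 3306, "hivemeta"))
# jdbc:mysql://meta-db.example.internal:3306/hivemeta
```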
HMS Retrieval Method
For metadata retrieval using the HMS method, after enabling Kerberos, you must upload the Keytab File and specify the Principal.
Parameter
Description
Keytab File
Upload the Keytab file necessary for the Kerberos authentication of the Hive metastore.
Principal
Enter the Principal for the Kerberos authentication of the Hive metastore.
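The Principal values used throughout this topic (for example, XXXX/hadoopclient@xxx.xxx) follow the Kerberos primary/instance@realm convention. A sketch of a sanity check for that shape (the regex is a deliberate simplification: real Kerberos also allows principals without an instance part, which this check rejects):

```python
import re

# Matches the documented primary/instance@realm shape, e.g. XXXX/hadoopclient@xxx.xxx.
PRINCIPAL_RE = re.compile(r"^[^/@\s]+/[^/@\s]+@[^/@\s]+$")

def looks_like_principal(principal: str) -> bool:
    """Loose check that a string has the primary/instance@realm shape."""
    return PRINCIPAL_RE.match(principal) is not None

print(looks_like_principal("hive/hadoopclient@EXAMPLE.COM"))  # True
print(looks_like_principal("not-a-principal"))                # False
```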
After entering the required information, click Connection Test to verify connectivity with Dataphin.
Once the connection test is successful, click Confirm And Start Initialization to check permissions and the metadata warehouse initialization configuration.
Permissions: Confirm that the user performing this operation holds either super administrator or system administrator roles within the metadata warehouse tenant.
Metadata Warehouse Initialization Configuration: Ensure the metadata warehouse initialization has been configured successfully.
After successful verification, the initialization process begins, creating compute sources, projects, data sources, and initializing DDL statements. Once complete, the Metadata Center initialization settings are finalized.
References
Upon completing the Metadata Center initialization settings, you can begin collecting metadata from databases into Dataphin for analysis and management. For more information, see Create and manage metadata acquisition tasks.