Create and manage Hadoop clusters

Limits

Hadoop cluster management is available only when the compute engine is CDH 5.x, CDH 6.x, Cloudera Data Platform 7.x, E-MapReduce 3.x, E-MapReduce 5.x, AsiaInfo DP 5.3, or Huawei FusionInsight 8.x.

Permissions

Super administrators, system administrators, and users with the Hadoop Cluster - Manage global role can create and manage Hadoop clusters. These users can also assign cluster administrators and specify which users can reference a cluster when creating a Hadoop compute source.
Cluster administrators can manage the clusters they are responsible for.
Users with the Compute Source Management - Create global role can reference Hadoop clusters they have permission to use when creating a Hadoop compute source.

Create a Hadoop cluster

In the top menu bar of the Dataphin home page, choose Planning > Compute Source.
On the Compute Source page, click Manage Hadoop Clusters.
In the Manage Hadoop Clusters dialog box, click + Create Hadoop Cluster.

On the Create Hadoop Cluster page, configure the following parameters.

Basic information

Parameter	Description
Cluster Name	Enter a name for the cluster. The name can contain only letters, digits, underscores (_), and hyphens (-). It can be up to 128 characters long.
Cluster Administrator	Select one or more members under the current tenant to be the cluster administrators for this cluster. Cluster administrators can manage the current cluster, including editing, viewing historical versions, and deleting it.
Description (Optional)	Enter a brief description of the cluster. The description can be up to 128 characters long.
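
The naming and length rules in the table above can be checked before a cluster definition is submitted. The following is a minimal Python sketch, not part of Dataphin. The function names are hypothetical, and treating "letters" as ASCII letters is an assumption.

```python
import re

# Rule from the table above: letters, digits, underscores, and hyphens only,
# up to 128 characters. Restricting "letters" to ASCII is an assumption.
CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,128}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the Cluster Name rule."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None


def is_valid_description(text: str) -> bool:
    """The description is optional and limited to 128 characters."""
    return len(text) <= 128
```

For example, `is_valid_cluster_name("emr_prod-01")` passes, while a name that contains a space or exceeds 128 characters does not.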

Cluster security control
Usable by: Specify which users can reference the cluster's configuration when creating a compute source. You can select Roles with "Create Compute Source" permission or Specified users.
- Roles with "Create Compute Source" permission: Selected by default.
- Specified users: You can select one or more personal accounts and user groups.

Cluster configuration

Parameter	Description
Cluster Storage	Select HDFS or OSS-HDFS. If you select HDFS, you can add a NameNode in the NameNode configuration item. If you select OSS-HDFS, you must also configure the cluster storage root directory, AccessKey ID, and AccessKey secret. Note This parameter is configurable only when the compute engine is E-MapReduce 5.x. For all other compute engines, the default value is HDFS.
NameNode	Click + Add to open the Add NameNode dialog box and configure the parameters. You can add multiple NameNodes. Each entry consists of the hostname or IP address of a NameNode in the HDFS cluster and its ports. Configuration example: NameNode: 192.168.xx.xx, Web UI Port: 50070, IPC Port: 8020. You must specify at least one of Web UI Port and IPC Port. After configuration, the NameNode is displayed as `host=192.168.xx.xx,webUiPort=50070,ipcPort=8020`. Note: This parameter is configurable when Cluster Storage is set to HDFS.
Cluster Storage Root Directory	You can obtain this from the EMR cluster's basic information. The format is `oss://<Bucket>.<Endpoint>/`. Note This parameter is configurable only when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.
AccessKey ID, AccessKey Secret	Enter the AccessKey ID and AccessKey secret for OSS access. Note The configuration here has a higher priority than the AccessKey configured in core-site.xml. This parameter is configurable only when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.
core-site.xml, hdfs-site.xml, hive-site.xml (Optional), hivemetastore-site.xml (Optional), yarn-site.xml (Optional), Other Configuration Files (Optional)	Upload the core-site.xml, hdfs-site.xml, hive-site.xml, hivemetastore-site.xml, yarn-site.xml, and other configuration files for the current compute engine. Note: When the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS, you do not need to upload the hdfs-site.xml file. When Hive Metadata Retrieval Method is set to HMS, you must upload the hive-site.xml file. When the compute engine is E-MapReduce 5.x or Huawei FusionInsight 8.x and Hive Metadata Retrieval Method is set to HMS, you must also upload the hivemetastore-site.xml file.
Task Execution Machine	Configure the connection address of the machine where the MapReduce or Spark JAR is executed. The format is `hostname:port` or `ip:port`. The port is optional and defaults to 22.
Execution Username, Password	The username and password to log on to the task execution machine for operations such as MapReduce task execution and HDFS read/write. Ensure this user has task submission permissions.
Authentication Type	Supports No Authentication and Kerberos authentication. Kerberos is an identity authentication protocol based on symmetric key technology. It can provide identity authentication for other services and supports single sign-on (SSO), which means a client can access multiple services, such as HBase and HDFS, after being authenticated. When Authentication Type is set to Kerberos, you must also select a Kerberos Configuration Method. Krb5 Authentication File: Upload a krb5 file for Kerberos authentication. KDC Server Address: The address of the KDC server, which assists with Kerberos authentication. You can configure multiple KDC server addresses, separated by semicolons (;). Note When the compute engine is E-MapReduce 5.x, only the krb5 file configuration method is supported.
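
The string formats described in the table above (the NameNode entry, the task execution machine address, and the semicolon-separated KDC list) can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the shape of the NameNode string when only one port is given is an assumption extrapolated from the two-port example.

```python
from typing import List, Optional, Tuple


def format_namenode(host: str,
                    web_ui_port: Optional[int] = None,
                    ipc_port: Optional[int] = None) -> str:
    """Build the displayed NameNode string; at least one port is required."""
    if web_ui_port is None and ipc_port is None:
        raise ValueError("specify at least one of Web UI Port or IPC Port")
    parts = [f"host={host}"]
    if web_ui_port is not None:
        parts.append(f"webUiPort={web_ui_port}")
    if ipc_port is not None:
        parts.append(f"ipcPort={ipc_port}")
    return ",".join(parts)


def parse_execution_machine(address: str) -> Tuple[str, int]:
    """Parse `hostname:port` or `ip:port`; the port defaults to 22.

    IPv6 literals are not handled in this sketch.
    """
    host, sep, port = address.partition(":")
    if not host:
        raise ValueError("host is required")
    return host, int(port) if sep else 22


def split_kdc_servers(value: str) -> List[str]:
    """Split a semicolon-separated list of KDC server addresses."""
    return [item.strip() for item in value.split(";") if item.strip()]
```

With the values from the configuration example, `format_namenode("192.168.xx.xx", 50070, 8020)` yields the same string shown in the NameNode row.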

HDFS information configuration

Parameter	Description
Authentication Type	Supports No Authentication and Kerberos. If the Hadoop cluster uses Kerberos authentication, you must enable HDFS Kerberos, upload a Keytab File, and configure a Principal. Keytab File: Upload the keytab file. You can get the keytab file from the HDFS server. Principal: Enter the Kerberos authentication username corresponding to the HDFS Keytab File. Note: This parameter is not required when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.
HDFS User (Optional)	Specify the username for file uploads. If left blank, the execution username is used by default. Note: This parameter is configurable only when Authentication Type is set to No Authentication.

Hive compute engine configuration

Parameter	Description
JDBC URL	You can configure one of the following three types of connection addresses: (1) The connection address of the Hive server, in the format `jdbc:hive2://{connection_address}:{port}/{database_name}`. (2) The connection address of ZooKeeper, for example, `jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2`. (3) The connection address with Kerberos enabled, in the format `jdbc:hive2://{connection_address}:{port}/{database_name};principal=hive/_HOST@xx.com`.
Authentication Type	Supports No Authentication, LDAP, and Kerberos authentication. No Authentication: Requires a username for the Hive service. LDAP: Requires a username and password for the Hive service. Kerberos: If the Hadoop cluster uses Kerberos authentication, you must enable Hive Kerberos, upload a Keytab File, and configure a Principal. Keytab File: Upload the keytab file. You can get the keytab file from the Hive server. Principal: Enter the Kerberos authentication username corresponding to the Hive Keytab File.
Username	Enter the username for the Hive service. Note This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.
Password	Enter the password for the Hive service user. Note This parameter is configurable only when Authentication Type is set to LDAP.
Execution Engine	Default: Tasks under projects that are attached to this compute source, including logical table tasks, use this execution engine by default. Custom: Select another compute engine type.
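
The three JDBC URL shapes listed for the Hive compute engine can be assembled as in the following Python sketch. The helper names are hypothetical, and the sketch uses the `jdbc:hive2://` scheme that HiveServer2 expects for all three forms.

```python
from typing import Sequence


def hive_direct_url(host: str, port: int, database: str) -> str:
    """Direct connection to a Hive server."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def hive_zookeeper_url(quorum: Sequence[str],
                       namespace: str = "hiveserver2") -> str:
    """Connection through ZooKeeper service discovery.

    `quorum` is a list of `host:port` entries for the ZooKeeper nodes.
    """
    nodes = ",".join(quorum)
    return (f"jdbc:hive2://{nodes}/;serviceDiscoveryMode=zooKeeper;"
            f"zooKeeperNamespace={namespace}")


def hive_kerberos_url(host: str, port: int, database: str,
                      principal: str) -> str:
    """Direct connection with the Kerberos principal appended."""
    return f"{hive_direct_url(host, port, database)};principal={principal}"
```

Calling `hive_zookeeper_url(["zk01:2181", "zk02:2181", "zk03:2181"])` reproduces the ZooKeeper example from the JDBC URL row.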

Hive metadata configuration

Metadata Retrieval Method: You can select one of the following methods: Metadata Database, HMS, or DLF. The required configuration varies based on the selected method.

Important

The DLF method is available only when the compute engine is E-MapReduce 5.x.
If you use the DLF method to retrieve metadata, you must first upload the hive-site.xml configuration file.

Metadata Retrieval Method	Parameter	Description
Metadata Database	Database Type	Select the database type that matches the metadata database used in the cluster. Dataphin supports MySQL. Supported MySQL versions include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
	JDBC URL	Enter the JDBC connection address of the destination database. For MySQL, the format is `jdbc:mysql://{connection_address}[,failoverhost...]:{port}/{database_name}[?propertyName1][=propertyValue1][&propertyName2][=propertyValue2]...`.
	Username, Password	Enter the username and password to log on to the metadata database.
HMS	Authentication Type	The HMS retrieval method supports No Authentication, LDAP, and Kerberos. For Kerberos authentication, you must upload a Keytab File and configure a Principal.
DLF	Endpoint	Enter the DLF endpoint for the region where your cluster is located. To obtain the endpoint, see DLF Regions and Endpoints.
DLF	AccessKey ID, AccessKey Secret	Enter the AccessKey ID and AccessKey secret of the account where the cluster is located. You can obtain the AccessKey ID and AccessKey secret on the User Information Management page.
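
The dependency between the metadata retrieval method, the compute engine, and the required settings can be summarized in a small lookup. The Python sketch below collects only the requirements stated on this page. The function name and the engine strings are illustrative assumptions, not Dataphin identifiers.

```python
from typing import List


def required_settings(method: str, engine: str) -> List[str]:
    """List the settings and files needed for a metadata retrieval method."""
    if method == "Metadata Database":
        return ["Database Type", "JDBC URL", "Username", "Password"]
    if method == "HMS":
        required = ["Authentication Type", "hive-site.xml"]
        # hivemetastore-site.xml is needed only on these two engines.
        if engine in ("E-MapReduce 5.x", "Huawei FusionInsight 8.x"):
            required.append("hivemetastore-site.xml")
        return required
    if method == "DLF":
        # DLF is available only on E-MapReduce 5.x.
        if engine != "E-MapReduce 5.x":
            raise ValueError("DLF requires the E-MapReduce 5.x compute engine")
        return ["hive-site.xml", "Endpoint", "AccessKey ID", "AccessKey Secret"]
    raise ValueError(f"unknown metadata retrieval method: {method}")
```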

Spark JAR service configuration

Important

For performance reasons, connectivity and validity tests are not performed when you modify the Spark execution machine or local client configuration. After you make changes, go to the developer module and run a test program to check whether the Spark service is available.

Parameter	Description
Spark Execution Machine	If Spark is deployed on the Hadoop cluster, you can enable Spark JAR tasks. Note: To use the Spark client on the task execution machine specified in the Cluster configuration section, you must first deploy and configure the client on that machine (for example, set environment variables, create an execution user, and grant permissions to the user). Only one execution machine is supported, so high availability and load balancing cannot be achieved. After a Spark job is submitted, you cannot view its log or stop it from Dataphin.
Execution Username, Password	Enter the username and password to log on to the compute execution machine. Note Confirm that the user has Spark-submit permissions.
Spark Local Client	You can enable or disable the Spark local client. If it is enabled and referenced by a task, you cannot disable it. To run Spark programs using the Spark local client, you must upload the yarn-site.xml configuration file and ensure that the port connection between Dataphin and YARN is normal. Click + Add Client to open the Add Client dialog box. Enter a Client Name and upload a Client File. Client Name: Can contain only letters, digits, underscores (_), hyphens (-), and periods (.). It can be up to 32 characters long. The client name must be unique (case-sensitive) within the same Hadoop cluster. Client File: Upload the client file. Only .tgz and .zip formats are supported. Note You can download the corresponding version of the Spark client from https://spark.apache.org/downloads.html. A custom client must have the same directory structure as the community edition, include the Hadoop client, and be uploaded as a complete compressed package (.tgz or .zip). Dataphin uses the uploaded client to submit jobs through the scheduling cluster, which enables full lifecycle management of jobs. After the client is uploaded, you can click the icon in the client list to edit the corresponding client. If you upload a new client file, it will overwrite the existing one. Click the icon to delete the corresponding client. Note If an uploaded client is referenced by a task (including tasks in draft status), you cannot edit the client name or delete the client.
Authentication Type	Supports No Authentication or Kerberos authentication. If the Hadoop cluster uses Kerberos authentication, you must enable Spark Kerberos, upload a Keytab File, and configure a Principal. Keytab File: Upload the keytab file. You can get the keytab file from the Spark server. Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.
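
The client name and file rules for the Spark local client can be checked before upload. The following Python sketch is illustrative: the function name is hypothetical, "letters" is assumed to mean ASCII letters, and matching the file extension case-insensitively is also an assumption.

```python
import re
from typing import Iterable, List

# Letters, digits, underscores, hyphens, and periods; up to 32 characters.
CLIENT_NAME_RE = re.compile(r"[A-Za-z0-9_.-]{1,32}")
ALLOWED_EXTENSIONS = (".tgz", ".zip")


def validate_client(name: str, filename: str,
                    existing_names: Iterable[str]) -> List[str]:
    """Return a list of rule violations; an empty list means the input is valid."""
    errors = []
    if CLIENT_NAME_RE.fullmatch(name) is None:
        errors.append("invalid client name")
    # Uniqueness is case-sensitive within the same Hadoop cluster.
    if name in set(existing_names):
        errors.append("client name already exists in this cluster")
    if not filename.lower().endswith(ALLOWED_EXTENSIONS):
        errors.append("client file must be .tgz or .zip")
    return errors
```

Because uniqueness is case-sensitive, `spark` and `Spark` count as different client names in this sketch.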

Spark SQL service configuration

Parameter	Description
Spark SQL Task	If Spark is deployed on the Hadoop cluster, you can enable Spark SQL tasks.
Spark Version	Currently, only 3.x is supported.
Service Type	Select the server type for Spark JDBC access.
JDBC URL	Enter the JDBC connection URL, for example, `jdbc:hive2://host1:port1/` or `jdbc:kyuubi://host1:port1/`. You do not need to enter a database name.
Authentication Type	Supports No Authentication, LDAP, and Kerberos authentication. No Authentication: Requires a username for the Spark service. LDAP: Requires a username and password for the Spark service. Kerberos: If the Hadoop cluster uses Kerberos authentication, you must enable Spark Kerberos, upload a Keytab File, and configure a Principal. Keytab File: Upload the keytab file. You can get the keytab file from the Spark server. Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File. Note For No Authentication and LDAP methods, the specified user must have task execution permissions to ensure that tasks run correctly.
Username	Enter the username for the Spark service. Note This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.
Password	Enter the password for the Spark service user. Note This parameter is configurable only when Authentication Type is set to LDAP.
SQL Task Queue Settings	Different service types use different SQL task queues. Details are as follows: Spark Thrift Server: Task queues cannot be set. Kyuubi: Uses the priority queue settings from the HDFS connection information. This takes effect only when Kyuubi uses YARN for resource scheduling. Production tasks use the Connection shared level. Livy: Uses the priority queue settings from the HDFS connection information. This takes effect only when Livy uses YARN for resource scheduling. Both ad hoc queries and production tasks are executed using a new connection. MapReduce (MRS): Uses the priority queue settings from the HDFS connection information.
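
The queue behavior per service type described in the SQL Task Queue Settings row can be expressed as a small decision function. This Python sketch encodes only what the row states. The function name and the `scheduled_by_yarn` flag are hypothetical.

```python
def uses_priority_queue(service_type: str, scheduled_by_yarn: bool = True) -> bool:
    """Return True if the priority queue settings from the HDFS connection
    information apply to SQL tasks for the given service type."""
    if service_type == "Spark Thrift Server":
        # Task queues cannot be set for Spark Thrift Server.
        return False
    if service_type in ("Kyuubi", "Livy"):
        # Takes effect only when the service uses YARN for resource scheduling.
        return scheduled_by_yarn
    if service_type == "MapReduce (MRS)":
        return True
    raise ValueError(f"unknown service type: {service_type}")
```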

Impala task configuration

Parameter	Description
Impala Task	If Impala is deployed on the Hadoop cluster, you can enable Impala tasks.
JDBC URL	Enter the JDBC connection URL for Impala, for example, `jdbc:impala://host:port/`. You do not need to enter a schema.
Authentication Type	Supports No Authentication, LDAP, and Kerberos authentication. No Authentication: Requires an Impala username. LDAP: Requires an Impala username and password. Kerberos: Requires uploading a Keytab File and configuring a Principal.
Username	Enter the Impala username. Note This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.
Password	Enter the Impala user's password. Note This parameter is configurable only when Authentication Type is set to LDAP.
Developer Task Request Pool	Enter the name of the Impala request pool for developer tasks.
Auto Triggered Task Request Pool	Enter the name of the Impala request pool for auto triggered tasks.
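
The Impala JDBC URL and the credentials required by each authentication type can be sketched as follows. The same credential pattern appears in the Hive and Spark SQL sections. The Python function names below are hypothetical and are not part of any Dataphin or Impala API.

```python
from typing import List


def impala_jdbc_url(host: str, port: int) -> str:
    """Build the Impala JDBC URL; no schema is appended."""
    return f"jdbc:impala://{host}:{port}/"


def required_credentials(auth_type: str) -> List[str]:
    """List the fields that must be supplied for an authentication type."""
    if auth_type == "No Authentication":
        return ["Username"]
    if auth_type == "LDAP":
        return ["Username", "Password"]
    if auth_type == "Kerberos":
        return ["Keytab File", "Principal"]
    raise ValueError(f"unsupported authentication type: {auth_type}")
```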

Click Test Connection. The system automatically tests the connection to each service.
If the connection test is successful, you can save the configuration. If the test fails, the Connection Test Failed dialog box appears. This dialog box shows the services that failed the test and their error details.
After the connection test is successful, click Save to create the Hadoop cluster.

Manage Hadoop clusters

In the top menu bar of the Dataphin home page, choose Planning > Compute Source.
On the Compute Source page, click Manage Hadoop Clusters.
In the Manage Hadoop Clusters dialog box, view the list of Hadoop clusters. The list shows information such as Cluster Name, Cluster Administrator, Associated Compute Sources, Creation Information, and Modification Information.
- Associated Compute Sources: Displays the total number of associated compute sources. Click the icon to view the list of associated compute sources. Click a compute source name to go to its details page.
- Creation Information: Displays the user who created the cluster and the creation time.
- Modification Information: Displays the user who last edited the cluster and the modification time.
Note
Compute tasks can run on only one cluster. Therefore, you cannot join data from different Hadoop clusters.
(Optional) You can enter a cluster name in the search box to perform a fuzzy search.

In the Actions column of the Hadoop cluster list, you can perform management operations on a cluster. The supported operations are as follows.

Operation	Description
View	Click the icon in the Actions column of the target cluster to view details about the current version of the cluster. Users with the Hadoop Cluster - Manage permission can download the cluster configuration file.
Edit	Click the icon in the Actions column of the target cluster to open the Edit Hadoop Cluster page, where you can modify the existing configuration. If the Spark JAR, Spark SQL, or Impala service is enabled for the cluster and is also enabled on a compute source associated with the cluster, you cannot disable that service. If you modified only Basic Information and Cluster Security Control, you can save without testing the connection. If you made any other changes, you must test the connection again and, after the test succeeds, click Save. In the dialog box that appears, enter a Change Description and click OK.
Clone	Click the icon in the Actions column of the target cluster. The system automatically clones all data of the cluster and opens the Create Hadoop Cluster page. You can then modify the existing configuration.
Version History	Click the icon in the Actions column of the target cluster and select History. A dialog box opens and displays information about each version of the current cluster, such as the version name, modifier, and change description. In this dialog box, you can perform View, Compare, and Rollback operations. View: Click the icon in the Actions column of the target version. You are redirected to the View Hadoop Cluster page, which shows the details of that version. Users with the Hadoop Cluster - Manage permission can download the cluster configuration file. Compare: Click the icon in the Actions column of the target version to go to the version comparison page. By default, the current version of the Hadoop cluster is compared with the target version. You can select other versions from the filter drop-down list. Rollback: Click the icon in the Actions column of the target version, and in the dialog box that appears, click OK. The system then automatically tests the connection by using the cluster information of that version. If the connection test fails, the rollback is terminated, and a dialog box shows the services that failed the test. If the test succeeds, the rollback proceeds. If the rollback itself fails, a message shows the failure and the reason.
Delete	Click the icon in the Actions column of the target cluster, select Delete, and in the dialog box that appears, click OK. Note: You can delete a Hadoop cluster only if it has no associated compute sources. A deleted cluster cannot be restored.
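
Two of the rules in the table above are simple conditions: whether an edit requires a new connection test, and whether a cluster can be deleted. The Python sketch below restates them under the assumption that changes are tracked per configuration section. The function names and section labels are illustrative.

```python
from typing import Iterable

# Sections that can be changed without re-testing the connection.
NO_TEST_SECTIONS = frozenset({"Basic information", "Cluster security control"})


def needs_connection_test(changed_sections: Iterable[str]) -> bool:
    """A connection test is required if any other section was modified."""
    return not set(changed_sections) <= NO_TEST_SECTIONS


def can_delete(associated_compute_sources: int) -> bool:
    """A cluster can be deleted only if it has no associated compute sources."""
    return associated_compute_sources == 0
```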