Hadoop clusters centralize storage and compute configurations so that multiple compute sources can share them. You can create, edit, clone, and delete Hadoop clusters on the Compute Source page.
Limits
Hadoop cluster management is supported only for the following compute engines: CDH5.x, CDH6.x, Cloudera Data Platform 7.x, E-MapReduce 3.x, E-MapReduce 5.x, AsiaInfo DP5.3, and Huawei FusionInsight 8.x.
Permissions
-
Super administrators, system administrators, and users with the Hadoop Cluster - Manage global role can create and manage Hadoop clusters. They can also specify which users can reference the cluster when creating a Hadoop compute source and assign cluster administrators.
-
Cluster administrators can manage the clusters they are responsible for.
-
Users with the Compute Source Management - Create global role can reference permitted Hadoop clusters when creating a Hadoop compute source.
Create a Hadoop cluster
-
In the top menu bar of the Dataphin home page, choose Planning > Compute Source.
-
On the Compute Source page, click Manage Hadoop Clusters.
-
In the Manage Hadoop Clusters dialog box, click + Create Hadoop Cluster.
-
On the Create Hadoop Cluster page, configure the following parameters.
-
Basic information
Parameter
Description
Cluster Name
The cluster name. It can contain only letters, digits, underscores (_), and hyphens (-), and can be up to 128 characters long.
Cluster Administrator
Select one or more tenant members as cluster administrators. Cluster administrators can manage the cluster, including editing, viewing historical versions, and deleting it.
Description (Optional)
A brief description of the cluster, up to 128 characters.
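The naming rules in this table can be checked before you fill in the form. The following is a minimal sketch and not part of Dataphin: the function names are hypothetical, and it assumes that "letters" means ASCII letters, which this page does not state.

```python
import re

# Assumption: "letters" means ASCII letters; the page does not say whether
# other scripts are accepted.
_CLUSTER_NAME = re.compile(r"[A-Za-z0-9_-]{1,128}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name uses only letters, digits, '_' and '-' (1 to 128 chars)."""
    return _CLUSTER_NAME.fullmatch(name) is not None


def is_valid_description(text: str) -> bool:
    """The description is optional, so an empty string passes; the limit is 128 chars."""
    return len(text) <= 128
```

For example, is_valid_cluster_name("emr5_prod-01") passes, while a name that contains a space does not.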
-
Cluster security control
Usable by: Specify which users can reference this cluster's configuration when creating a compute source. Options: Roles with "Create Compute Source" permission or Specified users.
-
Roles with "Create Compute Source" permission: Selected by default.
-
Specified users: You can select one or more personal accounts and user groups.
-
-
Cluster configuration
Parameter
Description
Cluster Storage
Select HDFS or OSS-HDFS as the storage type.
-
If you select HDFS, you can add NameNodes in the NameNode configuration section.
-
If you select OSS-HDFS, you must also configure the cluster storage root directory, AccessKey ID, and AccessKey secret.
Note: This parameter is configurable only when the compute engine is E-MapReduce 5.x. For all other compute engines, the default value is HDFS.
NameNode
Click + Add to add a NameNode. You can add multiple NameNodes.
A NameNode is identified by its hostname or IP address and port in the HDFS cluster. Example:
-
NameNode: 192.168.xx.xx
-
Web UI Port: 50070
-
IPC Port: 8020
Configure at least one of Web UI Port and IPC Port. After configuration, the NameNode is displayed as
host=192.168.xx.xx,webUiPort=50070,ipcPort=8020.
Note: This parameter is configurable when Cluster Storage is set to HDFS.
Cluster Storage Root Directory
Obtain this value from the EMR cluster's basic information. The format is
oss://<Bucket>.<Endpoint>/.
Note: This parameter is configurable only when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.
AccessKey ID, AccessKey Secret
The AccessKey ID and AccessKey secret for OSS access.
Note:
-
The configuration here has a higher priority than the AccessKey configured in core-site.xml.
-
This parameter is configurable only when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.
core-site.xml
Upload the core-site.xml, hdfs-site.xml, hive-site.xml, hivemetastore-site.xml, yarn-site.xml, and other configuration files required by the compute engine.
Note:
-
When the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS, you do not need to upload the hdfs-site.xml file.
-
When Hive Metadata Retrieval Method is set to HMS, you must upload the hive-site.xml file.
-
When the compute engine is E-MapReduce 5.x or Huawei FusionInsight 8.x and Hive Metadata Retrieval Method is set to HMS, you must upload the hivemetastore-site.xml file.
hdfs-site.xml
hive-site.xml (Optional)
hivemetastore-site.xml (Optional)
yarn-site.xml (Optional)
Other Configuration Files (Optional)
Task Execution Machine
The connection address of the machine that runs MapReduce or Spark JAR tasks. Format:
hostname:port or ip:port. The port is optional and defaults to 22.
Execution Username, Password
The username and password to log on to the task execution machine for MapReduce task execution and HDFS read/write. Ensure this user has task submission permissions.
Authentication Type
Supports No Authentication and Kerberos authentication.
Kerberos is a symmetric-key identity authentication protocol that supports single sign-on (SSO), allowing a client to access multiple services such as HBase and HDFS after a single authentication.
When Authentication Type is set to Kerberos, you must also select a Kerberos Configuration Method.
-
Krb5 Authentication File: Upload a krb5 file for Kerberos authentication.
-
KDC Server Address: The address of the KDC server, which assists with Kerberos authentication. You can configure multiple KDC server addresses, separated by semicolons (;).
Note: When the compute engine is E-MapReduce 5.x, only the krb5 file configuration method is supported.
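The NameNode and Task Execution Machine formats described in this table can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, and leaving an unset port out of the NameNode string is an assumption, because this page shows only the case where both ports are set. IPv6 addresses are not handled.

```python
from typing import Optional, Tuple


def format_namenode(host: str, web_ui_port: Optional[int] = None,
                    ipc_port: Optional[int] = None) -> str:
    """Render a NameNode entry as host=...,webUiPort=...,ipcPort=..."""
    if web_ui_port is None and ipc_port is None:
        raise ValueError("configure at least one of Web UI Port and IPC Port")
    parts = [f"host={host}"]
    # Assumption: a port that is not configured is simply left out.
    if web_ui_port is not None:
        parts.append(f"webUiPort={web_ui_port}")
    if ipc_port is not None:
        parts.append(f"ipcPort={ipc_port}")
    return ",".join(parts)


def parse_execution_machine(address: str) -> Tuple[str, int]:
    """Split 'hostname:port' or 'ip:port'; the port defaults to 22 when omitted."""
    host, sep, port = address.partition(":")
    if not host:
        raise ValueError("missing host")
    return host, int(port) if sep else 22
```

With these helpers, parse_execution_machine("gateway01") falls back to port 22, which matches the default stated for Task Execution Machine.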
-
-
HDFS information configuration
Parameter
Description
Authentication Type
Supports No Authentication and Kerberos.
If the Hadoop cluster uses Kerberos authentication, you must enable HDFS Kerberos, upload a Keytab File, and configure a Principal.
-
Keytab File: Upload the keytab file. You can get the keytab file from the HDFS server.
-
Principal: Enter the Kerberos authentication username corresponding to the HDFS Keytab File.
Note: This parameter is not required when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.
HDFS User (Optional)
Specify the username for file uploads. If left blank, the execution username is used by default.
Note: This parameter is configurable only when Authentication Type is set to No Authentication.
-
-
Hive compute engine configuration
Parameter
Description
JDBC URL
You can configure one of the following three types of connection addresses:
-
The connection address of the Hive server, in the format
jdbc:hive2://{connection_address}:{port}/{database_name}.
-
The connection address of ZooKeeper. For example,
jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.
-
The connection address with Kerberos enabled, in the format
jdbc:hive2://{connection_address}:{port}/{database_name};principal=hive/_HOST@xx.com.
Authentication Type
Supports No Authentication, LDAP, and Kerberos authentication.
-
No Authentication: Requires a username for the Hive service.
-
LDAP: Requires a username and password for the Hive service.
-
Kerberos: If the Hadoop cluster uses Kerberos authentication, you must enable Hive Kerberos, upload a Keytab File, and configure a Principal.
-
Keytab File: Upload the keytab file. You can get the keytab file from the Hive server.
-
Principal: Enter the Kerberos authentication username corresponding to the Hive Keytab File.
-
Username
Enter the username for the Hive service.
Note: This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.
Password
Enter the password for the Hive service user.
Note: This parameter is configurable only when Authentication Type is set to LDAP.
Execution Engine
Default: Tasks under projects that are attached to this compute source, including logical table tasks, use this execution engine by default.
Custom: Select another compute engine type.
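The three JDBC URL forms listed for the Hive compute engine can be assembled as follows. This is a sketch under assumptions: the function names and the example host, port, and realm values are hypothetical, and the jdbc:hive2:// scheme is used throughout because that is the scheme HiveServer2 expects.

```python
from typing import Sequence


def hive_direct_url(host: str, port: int, database: str) -> str:
    """Direct connection to a single HiveServer2 instance."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def hive_zookeeper_url(quorum: Sequence[str], namespace: str = "hiveserver2") -> str:
    """Service discovery through a ZooKeeper quorum given as 'host:port' entries."""
    return (f"jdbc:hive2://{','.join(quorum)}/"
            f";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace={namespace}")


def hive_kerberos_url(host: str, port: int, database: str, realm: str) -> str:
    """Direct connection with the Hive service principal appended."""
    return f"{hive_direct_url(host, port, database)};principal=hive/_HOST@{realm}"
```

Calling hive_zookeeper_url(["zk01:2181", "zk02:2181", "zk03:2181"]) reproduces the ZooKeeper example given in the JDBC URL row.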
-
-
Hive metadata configuration
Metadata Retrieval Method: You can select one of the following methods: Metadata Database, HMS, or DLF. The required configuration varies based on the selected method.
Important:
-
The DLF method is available only when the compute engine is E-MapReduce 5.x.
-
If you use the DLF method to retrieve metadata, you must first upload the hive-site.xml configuration file.
Metadata Retrieval Method
Parameter
Description
Metadata Database
Database Type
Select the database type that matches the metadatabase used in the cluster. Dataphin supports MySQL.
Supported MySQL versions include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
JDBC URL
The JDBC connection address of the destination database. Example:
MySQL: The format is
jdbc:mysql://{connection_address}[,failoverhost...]:{port}/{database_name}[?propertyName1][=propertyValue1][&propertyName2][=propertyValue2]....
Username, Password
The username and password for the metadatabase.
HMS
Authentication Type
The HMS retrieval method supports No Authentication, LDAP, and Kerberos. For Kerberos authentication, you must upload a Keytab File and configure a Principal.
DLF
Endpoint
Enter the DLF endpoint for the region where your cluster is located. To obtain the endpoint, see DLF Regions and Endpoints.
AccessKey ID, AccessKey Secret
Enter the AccessKey ID and AccessKey secret of the account where the cluster is located.
You can obtain the AccessKey ID and AccessKey secret on the User Information Management page.
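For the Metadata Database method, the MySQL JDBC URL pattern can be built from its parts. The helper below is only a sketch: the function name, host names, and property names in the example are hypothetical, and it places a single port after the host list, as the pattern on this page does.

```python
from typing import Mapping, Optional, Sequence
from urllib.parse import urlencode


def mysql_jdbc_url(host: str, database: str, port: Optional[int] = None,
                   failover_hosts: Sequence[str] = (),
                   properties: Optional[Mapping[str, str]] = None) -> str:
    """Build jdbc:mysql://host[,failoverhost...][:port]/database[?k=v&...]."""
    url = "jdbc:mysql://" + ",".join([host, *failover_hosts])
    if port is not None:
        url += f":{port}"
    url += f"/{database}"
    if properties:
        # urlencode joins the pairs with '&' and escapes reserved characters.
        url += "?" + urlencode(properties)
    return url
```

For example, mysql_jdbc_url("meta01", "hive_meta", port=3306) yields jdbc:mysql://meta01:3306/hive_meta.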
-
-
Spark JAR service configuration
Important: Connectivity and validity tests are not performed when you modify the Spark execution machine or local client configuration. After making changes, run a test program in the developer module to verify that the Spark service is available.
Parameter
Description
Spark Execution Machine
If Spark is deployed on the Hadoop cluster, you can enable Spark JAR tasks.
Note: When using the Spark client on the task execution machine specified in Cluster configuration, you must first deploy and configure the client on that machine (for example, set environment variables, create an execution user, and grant permissions). Only one execution machine is supported, so high availability and load balancing are not available. After a Spark job is submitted, you cannot view its log or stop it from Dataphin.
Execution Username, Password
The username and password for the compute execution machine.
Note: Confirm that the user has spark-submit permissions.
Spark Local Client
Enable or disable the Spark local client. If it is enabled and referenced by a task, you cannot disable it. To run Spark programs using the Spark local client, you must upload the yarn-site.xml configuration file and ensure that Dataphin can connect to YARN.
Click + Add Client to open the Add Client dialog box. Enter a Client Name and upload a Client File.
-
Client Name: Can contain only letters, digits, underscores (_), hyphens (-), and periods (.). It can be up to 32 characters long.
The client name must be unique (case-sensitive) within the same Hadoop cluster.
-
Client File: Upload the client file. Only .tgz and .zip formats are supported.
Note: You can download the corresponding Spark client version from https://spark.apache.org/downloads.html. A custom client must have the same directory structure as the community edition, include the Hadoop client, and be uploaded as a complete compressed package (.tgz or .zip). Dataphin uses the uploaded client to submit jobs through the scheduling cluster for full lifecycle management.
After the client is uploaded, you can click the
icon in the client list to edit the corresponding client. If you upload a new client file, it will overwrite the existing one. Click the
icon to delete the corresponding client.
Note: If an uploaded client is referenced by a task (including draft tasks), you cannot edit the client name or delete the client.
Authentication Type
Supports No Authentication or Kerberos authentication.
If the Hadoop cluster uses Kerberos authentication, you must enable Spark Kerberos, upload a Keytab File, and configure a Principal.
-
Keytab File: Upload the keytab file. You can get the keytab file from the Spark server.
-
Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.
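The Spark Local Client rules for Client Name and Client File can be checked with a short helper before you upload a package. This is a sketch, not Dataphin's own validation: the function name is hypothetical, "letters" is assumed to mean ASCII letters, and the file extension is assumed to be lowercase.

```python
import re
from typing import Collection, List

# Letters, digits, '_', '-' and '.', up to 32 characters (ASCII assumed).
_CLIENT_NAME = re.compile(r"[A-Za-z0-9_.-]{1,32}")
_ALLOWED_SUFFIXES = (".tgz", ".zip")


def check_client(name: str, file_name: str,
                 existing_names: Collection[str]) -> List[str]:
    """Return a list of problems; an empty list means the client looks acceptable."""
    problems = []
    if _CLIENT_NAME.fullmatch(name) is None:
        problems.append("invalid client name")
    # Uniqueness is case-sensitive, so 'Spark3' and 'spark3' are different names.
    if name in existing_names:
        problems.append("client name already exists in this cluster")
    if not file_name.endswith(_ALLOWED_SUFFIXES):
        problems.append("client file must be .tgz or .zip")
    return problems
```

Note that a package named with a .tar.gz extension is reported as a problem, because only .tgz and .zip are listed as supported.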
-
-
Spark SQL service configuration
Parameter
Description
Spark SQL Task
If Spark is deployed on the Hadoop cluster, you can enable Spark SQL tasks.
Spark Version
Currently, only 3.x is supported.
Service Type
Select the server type for Spark JDBC access.
JDBC URL
Enter the JDBC connection URL, for example,
jdbc:hive2://host1:port1/ or jdbc:kyuubi://host1:port1/. You do not need to enter a database name.
Authentication Type
Supports No Authentication, LDAP, and Kerberos authentication.
-
No Authentication: Requires a username for the Spark service.
-
LDAP: Requires a username and password for the Spark service.
-
Kerberos: If the Hadoop cluster uses Kerberos authentication, you must enable Spark Kerberos, upload a Keytab File, and configure a Principal.
-
Keytab File: Upload the keytab file. You can get the keytab file from the Spark server.
-
Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.
-
Note: For the No Authentication and LDAP methods, the specified user must have task execution permissions to ensure that tasks run correctly.
Username
Enter the username for the Spark service.
Note: This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.
Password
Enter the password for the Spark service user.
Note: This parameter is configurable only when Authentication Type is set to LDAP.
SQL Task Queue Settings
Different service types use different SQL task queues. Details are as follows:
-
Spark Thrift Server: Task queues cannot be set.
-
Kyuubi: Uses the priority queue settings from the HDFS connection information. This takes effect only when Kyuubi uses YARN for resource scheduling. Production tasks use the Connection shared level.
-
Livy: Uses the priority queue settings from the HDFS connection information. This takes effect only when Livy uses YARN for resource scheduling. Both ad hoc queries and production tasks are executed using a new connection.
-
MapReduce (MRS): Uses the priority queue settings from the HDFS connection information.
-
-
Impala task configuration
Parameter
Description
Impala Task
If Impala is deployed on the Hadoop cluster, you can enable Impala tasks.
JDBC URL
Enter the JDBC connection URL for Impala, for example,
jdbc:impala://host:port/. You do not need to enter a schema.
Authentication Type
Supports No Authentication, LDAP, and Kerberos authentication.
-
No Authentication: Requires an Impala username.
-
LDAP: Requires an Impala username and password.
-
Kerberos: Requires uploading a Keytab File and configuring a Principal.
Username
Enter the Impala username.
Note: This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.
Password
Enter the Impala user's password.
Note: This parameter is configurable only when Authentication Type is set to LDAP.
Developer Task Request Pool
Enter the name of the Impala request pool for developer tasks.
Auto Triggered Task Request Pool
Enter the name of the Impala request pool for auto triggered tasks.
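The Hive, Spark SQL, and Impala sections all pair each Authentication Type with the same required inputs. The lookup below summarizes that pairing as a sketch; the field keys (username, password, keytab_file, principal) and the function name are hypothetical labels, not Dataphin parameter names.

```python
from typing import List, Mapping

# Inputs that each Authentication Type requires, as described on this page.
REQUIRED_FIELDS = {
    "No Authentication": ("username",),
    "LDAP": ("username", "password"),
    "Kerberos": ("keytab_file", "principal"),
}


def missing_fields(auth_type: str, settings: Mapping[str, str]) -> List[str]:
    """Return the required fields that are absent or empty for auth_type."""
    if auth_type not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported authentication type: {auth_type}")
    return [name for name in REQUIRED_FIELDS[auth_type] if not settings.get(name)]
```

For example, missing_fields("LDAP", {"username": "impala"}) reports that the password is still missing.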
-
-
-
Click Test Connection. The system automatically tests the connection to each service.
If the connection test succeeds, you can save the configuration. If it fails, the Connection Test Failed dialog box appears, showing the failed services and their error details.
-
After the connection test succeeds, click Save to create the Hadoop cluster.
Manage Hadoop clusters
-
In the top menu bar of the Dataphin home page, choose Planning > Compute Source.
-
On the Compute Source page, click Manage Hadoop Clusters.
-
In the Manage Hadoop Clusters dialog box, view the Hadoop cluster list. The list displays the Cluster Name, Cluster Administrator, Associated Compute Sources, Creation Information, and Modification Information.
-
Associated Compute Sources: The total number of associated compute sources. Click the
icon to view the list. Click a compute source name to go to its details page.
-
Creation Information: The user who created the cluster and the creation time.
-
Modification Information: The user who last edited the cluster and the modification time.
Note: Compute tasks can run on only one cluster, so you cannot join data across different Hadoop clusters.
-
-
(Optional) You can enter a cluster name in the search box to perform a fuzzy search.
-
In the Actions column of the Hadoop cluster list, you can perform the following management operations on a cluster.
Operation
Description
View
Click the
icon in the Actions column to view the current version of the cluster. Users with the Hadoop Cluster - Manage permission can download the cluster configuration file.
Edit
Click the
icon in the Actions column to open the Edit Hadoop Cluster page, where you can modify the existing configuration. If the Spark JAR, Spark SQL, or Impala service is enabled on the cluster and the same service is already enabled on an associated compute source, you cannot disable that service.
If you modified only Basic Information and Cluster Security Control, you can save without testing the connection. If you made any other change, you must test the connection again and click Save after the test succeeds. In the dialog box that appears, enter a Change Description and click OK.
Clone
Click the
icon in the Actions column. The system clones all cluster data and opens the Create Hadoop Cluster page, where you can modify the configuration.
Version History
Click the
icon in the Actions column and select History. The dialog box displays each version of the cluster, including the version name, modifier, and change description. You can perform View, Compare, and Rollback operations.
-
View: Click the
icon in the Actions column to view the cluster version details. Users with the Hadoop Cluster - Manage permission can download the cluster configuration file.
-
Compare: Click the
icon in the Actions column to open the version comparison page. You can select different versions from the filter drop-down list. By default, the current version is compared with the selected version.
-
Rollback: Click the
icon in the Actions column of the target version, and in the dialog box that appears, click OK.
After you click OK, the system tests the connection using that version's cluster configuration. If the connection test fails, the rollback is terminated and the failed services are shown. If the test succeeds, the rollback proceeds; if the rollback itself fails, a failure message with the reason is displayed.
Delete
Note:
-
You can delete a Hadoop cluster only if it has no associated compute sources.
-
A deleted cluster cannot be restored.
Click the
icon in the Actions column of the target cluster, select Delete, and in the dialog box that appears, click OK.
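The rule for the Edit operation, where a connection test is required unless only Basic Information and Cluster Security Control changed, can be expressed as a small predicate. This is an illustrative sketch: the function name is hypothetical, and the section labels are simply the section titles used on this page.

```python
from typing import Iterable

# Sections that can be saved without running the connection test again.
NO_TEST_SECTIONS = frozenset({"Basic information", "Cluster security control"})


def needs_connection_test(changed_sections: Iterable[str]) -> bool:
    """Return True if any changed section falls outside the no-test sections."""
    return any(section not in NO_TEST_SECTIONS for section in changed_sections)
```

For example, changing the cluster administrator alone returns False, while also changing the Hive JDBC URL returns True.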