Limits
Hadoop cluster management is supported only when the compute engine is CDH 5.x, CDH 6.x, Cloudera Data Platform 7.x, E-MapReduce 3.x, E-MapReduce 5.x, AsiaInfo DP 5.3, or Huawei FusionInsight 8.x.
Permission description
Super administrators, system administrators, and custom global roles with Hadoop cluster - Management permission can create and manage Hadoop clusters. They can also designate cluster administrators and set user permissions for referencing the cluster during Hadoop compute source creation.
Cluster administrators have management authority over their assigned clusters.
Users with the global role Compute source management - Create can reference Hadoop clusters they have access to when creating Hadoop compute sources.
Create a Hadoop cluster
On the Dataphin home page, navigate to Planning > Compute Source via the top menu bar.
On the Compute Source page, click Manage Hadoop Cluster.
In the Manage Hadoop Cluster dialog box, click + Create New Hadoop Cluster.
On the Create Hadoop Cluster page, configure the following parameters:
Basic Information
Parameter
Description
Cluster Name
Enter the name of the current cluster. Only Chinese characters, letters, digits, underscores (_), and hyphens (-) are supported, and the length cannot exceed 128 characters.
Cluster Administrator
Select one or more members under the current tenant as cluster administrators. Cluster administrators can manage the cluster, including editing it, viewing its historical versions, and deleting it.
Description (Optional)
Enter a brief description of the current cluster, with a length not exceeding 128 characters.
Cluster Security Control
Usable Members: Define which users are authorized to use the cluster's configuration details when creating a compute source. You can choose either Roles with "Create Compute Source" Permission or Specified Users.
Roles with "Create Compute Source" Permission is selected by default.
Specified Users: Allows selecting individual accounts or user groups.
Cluster Configuration
Parameter
Description
Cluster Storage
Select HDFS or OSS-HDFS.
When selecting HDFS, you can add a NameNode in the NameNode configuration item.
When selecting OSS-HDFS, you also need to configure the cluster storage root directory, AccessKey ID, and AccessKey Secret.
Note: This item can be configured only when the compute engine is E-MapReduce 5.x. Other compute engines default to HDFS.
NameNode
Click + Add to configure the related parameters in the Add NameNode dialog box. Multiple NameNodes can be added.
NameNode refers to the hostname or IP address and ports of a NameNode node in the HDFS cluster. Configuration example:
NameNode: 192.168.xx.xx
Web UI Port: 50070
IPC Port: 8020
At least one of Web UI Port and IPC Port must be filled in. After configuration, the NameNode entry is host=192.168.xx.xx,webUiPort=50070,ipcPort=8020 (see the connectivity sketch at the end of this table).
Note: This item can be configured only when the cluster storage is HDFS.
Cluster Storage Root Directory
You can obtain it from the basic information of the EMR cluster. The format is oss://<Bucket>.<Endpoint>/.
Note: This option is supported only when the compute engine is E-MapReduce 5.x and the cluster storage is OSS-HDFS.
AccessKey ID, AccessKey Secret
Enter the AccessKey ID and AccessKey Secret for OSS access.
Note: The configuration entered here takes precedence over the AccessKey configured in core-site.xml.
This option is supported only when the compute engine is E-MapReduce 5.x and the cluster storage is OSS-HDFS.
core-site.xml
Upload the core-site.xml, hdfs-site.xml, hive-site.xml, hivemetastore-site.xml, yarn-site.xml, and other configuration files for the current compute engine.
Note: The hdfs-site.xml file does not need to be uploaded only when the compute engine is E-MapReduce 5.x and the cluster storage is OSS-HDFS.
When the Hive metadata retrieval method is HMS, you must upload the hive-site.xml file.
When the compute engine is E-MapReduce 5.x or Huawei FusionInsight 8.x and the Hive metadata retrieval method is HMS, you must upload the hivemetastore-site.xml file.
hdfs-site.xml
hive-site.xml (Optional)
hivemetastore-site.xml (Optional)
yarn-site.xml (Optional)
Other Configuration Files (Optional)
Task Execution Machine
Configure the connection address of the execution machine for MapReduce or Spark Jar tasks. The format is hostname:port or ip:port; the port defaults to 22 and is optional.
Execution Username, Password
Enter the username and password for logging in to the task execution machine, used to run MR tasks and to read and write HDFS. Make sure the user has permission to submit tasks.
Authentication Type
Supports No Authentication and Kerberos authentication methods.
Kerberos is an identity authentication protocol based on symmetric key technology that can provide identity authentication functions for other services and supports SSO (Single Sign-On), allowing access to multiple services such as HBase and HDFS after client authentication.
When the authentication method is Kerberos, you also need to select the Kerberos Configuration Method:
Krb5 Authentication File: Upload the krb5 file for Kerberos authentication.
KDC Server Address: The KDC server address used to complete Kerberos authentication. Multiple KDC server addresses can be configured, separated by semicolons (;).
Note: When the compute engine type is E-MapReduce 5.x, only the Krb5 Authentication File configuration method is supported.
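To illustrate how the NameNode entry and the uploaded configuration files map to standard HDFS client settings, here is a minimal connectivity sketch in Java. It reuses the example values from the NameNode row (192.168.xx.xx, IPC port 8020), assumes the Hadoop client library is on the classpath, and is not part of Dataphin itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Values mirror the NameNode example above; replace them with your cluster's actual values.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.xx.xx:8020"); // NameNode host and IPC port
        // If core-site.xml and hdfs-site.xml are on the classpath, new Configuration() loads them
        // automatically and the explicit fs.defaultFS setting above is unnecessary.
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("HDFS root reachable: " + fs.exists(new Path("/")));
        }
    }
}
```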
HDFS Information Configuration
Parameter
Description
Authentication Type
Supports No Authentication and Kerberos.
If the Hadoop cluster has Kerberos authentication enabled, you need to enable HDFS Kerberos, upload the Keytab File, and configure the Principal (see the keytab login sketch at the end of this table).
Keytab File: Upload the keytab file, which can be obtained from the HDFS server.
Principal: Enter the Kerberos authentication username corresponding to the HDFS Keytab File.
Note: This configuration is not required only when the compute engine is E-MapReduce 5.x and the cluster storage is OSS-HDFS.
HDFS User (Optional)
Specify the username for file uploads. If not filled, the default is the execution username.
Note: This item can be configured only when the authentication method is No Authentication.
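The Keytab File and Principal pair configured in this section can be sanity-checked outside of Dataphin with the standard Hadoop security API. The sketch below is a minimal example; the principal name and keytab path are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical principal and keytab path; use the Principal and Keytab File from this section.
        String principal = "hdfs_user@EXAMPLE.COM";
        String keytabPath = "/path/to/hdfs_user.keytab";

        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // usually already present in core-site.xml
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(principal, keytabPath);
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}
```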
Hive Compute Engine Configuration
Parameter
Description
JDBC URL
Supports the following three types of connection addresses (see the connection sketch at the end of this table):
Connection address of the Hive Server. Format: jdbc:hive2://{connection address}:{port}/{database name}.
Connection address of ZooKeeper. For example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.
Connection address with Kerberos enabled. Format: jdbc:hive2://{connection address}:{port}/{database name};principal=hive/_HOST@xx.com.
Authentication Type
Supports No Authentication, LDAP, and Kerberos authentication methods.
No Authentication: The no authentication method requires entering the username of the Hive service.
LDAP: The LDAP authentication method requires entering the username and password of the Hive service.
Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Hive Kerberos and upload the Keytab File authentication file and configure the Principal.
Keytab File: Upload the keytab file, which can be obtained from the Hive Server.
Principal: Enter the Kerberos authentication username corresponding to the Hive Keytab File.
Username
Enter the username of the Hive service.
Note: This item can be configured only when the authentication method is No Authentication or LDAP. The specified user must have task execution permissions so that tasks can run normally.
Password
Enter the password of the Hive service user.
Note: This item can be configured only when the authentication method is LDAP.
Execution Engine
Default: Tasks (including logical table tasks) under the project bound to this compute source use this execution engine by default.
Custom: Select other types of compute engines.
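As a reference for the JDBC URL formats listed above, here is a minimal HiveServer2 connectivity sketch. It assumes the Apache Hive JDBC driver (hive-jdbc) is on the classpath; the host, port, database, and credentials are hypothetical. For Kerberos, append the ;principal=... suffix shown above to the URL instead of passing a username and password.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint, database, and credentials (No Authentication / LDAP style).
        String url = "jdbc:hive2://192.168.xx.xx:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive_user", "hive_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // list databases visible to the configured user
            }
        }
    }
}
```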
Hive Metadata Configuration
Metadata Retrieval Method: Metadata Database, HMS, and DLF are supported as metadata retrieval methods. Each method requires its own configuration details.
Important: The DLF retrieval method can be selected only when the compute engine is E-MapReduce 5.x.
When retrieving metadata using the DLF method, first upload the hive-site.xml configuration file.
Metadata Retrieval Method
Parameter
Description
Metadata Database
Database Type
Select the database type according to the metadata database used in the cluster. Dataphin supports MySQL.
Supported MySQL versions are MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
JDBC URL
Enter the JDBC connection address of the target database (see the connection sketch at the end of this table). For example:
MySQL format: jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1][=propertyValue1][&propertyName2][=propertyValue2]...
Username, Password
Enter the username and password for logging in to the metadata database.
HMS
Authentication Type
The HMS retrieval method supports No Authentication, LDAP, and Kerberos authentication methods. The Kerberos authentication method requires uploading the Keytab File and configuring the Principal.
DLF
Endpoint
Enter the Endpoint of the cluster in the region of the DLF data center. For more information, see DLF Region and Endpoint Comparison Table.
AccessKey ID, AccessKey Secret
Enter the AccessKey ID and AccessKey Secret of the account where the cluster is located.
You can obtain the account's AccessKey ID and AccessKey Secret on the User Information Management page.
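For the Metadata Database retrieval method, the JDBC URL and credentials configured above can be verified with a plain JDBC connection. The sketch below assumes a MySQL metastore database with MySQL Connector/J on the classpath; the host, port, database name, and credentials are hypothetical. TBLS is a standard table in the Hive metastore schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MetastoreDbCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for the Hive metastore database described above.
        String url = "jdbc:mysql://192.168.xx.xx:3306/hivemeta?useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "meta_user", "meta_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM TBLS")) { // count of registered tables
            if (rs.next()) {
                System.out.println("Tables registered in the metastore: " + rs.getLong(1));
            }
        }
    }
}
```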
Spark Jar Service Configuration
Important: For performance reasons, changes to the Spark execution machine or local client configuration do not trigger connectivity tests. After making changes, run a test program in the development module to verify that the Spark service is available.
Parameter
Description
Spark Execution Machine
If the Hadoop cluster has Spark deployed, the Spark execution machine can be enabled.
Note: When using the Spark client on the task execution machine configured in Cluster Configuration, you must deploy the client and complete the related setup (for example, setting environment variables, creating the execution user, and granting it permissions) on the execution machine in advance. Only one execution machine is supported, so high availability and load balancing are not available. After a Spark job is submitted, Dataphin cannot view its logs or terminate it.
Execution Username, Password
Enter the username and password for logging into the compute execution machine.
Note: Ensure that the user has spark-submit permissions.
Spark Local Client
Supports enabling or disabling the Spark local client. Once enabled, the client cannot be disabled if any tasks reference it. To run Spark programs using the Spark local client, you must upload the yarn-site.xml configuration file and ensure that the connection between Dataphin and YARN is normal.
Click +add Client and enter the Client Name and upload the Client File in the Add Client dialog box.
Client Name: Only letters, numbers, underscores (_), hyphens (-), and half-width periods (.) are supported, with a length not exceeding 32 characters.
The client name is unique within the same Hadoop cluster (case-sensitive).
Client File: Upload the client file. The file format only supports .tgz and .zip.
Note: You can download the corresponding version of the Spark client from https://spark.apache.org/downloads.html. A self-built client must have the same directory structure as the community edition, include the Hadoop client, and be uploaded as a complete compressed package (.tgz or .zip). Dataphin submits jobs through the scheduling cluster using the uploaded client, enabling full lifecycle management of jobs (see the submission sketch at the end of this table).
After the client is uploaded, you can click the edit icon in the client list to modify the corresponding client. If a new client file is uploaded, it overwrites the existing file. Click the delete icon to delete the corresponding client.
Note: If the uploaded client is referenced by tasks (including draft tasks), editing the client name and deleting the client are not supported.
Authentication Type
Supports No Authentication or Kerberos authentication methods.
If the Hadoop cluster has Kerberos authentication, you need to enable Spark Kerberos and upload the Keytab File authentication file and configure the Principal.
Keytab File: Upload the keytab file, which can be obtained from the Spark Server.
Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.
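The sketch below illustrates, outside of Dataphin, what a Spark Jar submission through a local Spark client typically looks like, using Spark's SparkLauncher API (spark-launcher on the classpath). The Spark home, job jar, and main class are hypothetical placeholders, and Dataphin's own submission mechanism may differ.

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SparkJarSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths; the Spark home should point at an unpacked client like the one described above.
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")                  // root directory of the Spark client
                .setMaster("yarn")
                .setDeployMode("cluster")
                .setAppResource("/tmp/jobs/example-job.jar") // hypothetical job jar
                .setMainClass("com.example.ExampleJob")      // hypothetical main class
                .startApplication();
        while (!handle.getState().isFinal()) {
            Thread.sleep(5000);                              // poll until the YARN application reaches a final state
        }
        System.out.println("Final state: " + handle.getState());
    }
}
```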
Spark SQL Service Configuration
Parameter
Description
Spark SQL Tasks
If the Hadoop cluster has Spark deployed, it supports enabling Spark SQL Tasks.
Spark Version
Currently, only 3.x is supported.
Service Type
Select the target server type for Spark JDBC access.
JDBC URL
Enter the JDBC connection address, for example: jdbc:hive2://host1:port1/ or jdbc:kyuubi://host1:port1/. There is no need to enter the database name (see the connectivity sketch at the end of this table).
Authentication Type
Supports No Authentication, LDAP, and Kerberos authentication methods.
No Authentication: The no authentication method requires entering the username of the Spark service.
LDAP: The LDAP authentication method requires entering the username and password of the Spark service.
Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Spark Kerberos and upload the Keytab File authentication file and configure the Principal.
Keytab File: Upload the keytab file, which can be obtained from the Spark Server.
Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.
Note: The user entered for the No Authentication and LDAP methods must have task execution permissions so that tasks can run normally.
Username
Enter the username of the Spark service.
Note: This item can be configured only when the authentication method is No Authentication or LDAP. The specified user must have task execution permissions so that tasks can run normally.
Password
Enter the password of the Spark service user.
Note: This item can be configured only when the authentication method is LDAP.
SQL Task Queue Settings
Different service types use different SQL task queues. Details are as follows:
Spark Thrift Server: Does not support setting task queues.
Kyuubi: Uses the priority queue settings of the HDFS connection information. It only takes effect when Kyuubi uses Yarn as resource scheduling. Production tasks use the Connection sharing level.
Livy: Uses the priority queue settings of the HDFS connection information. It only takes effect when Livy uses Yarn as resource scheduling. Ad hoc queries and production tasks use a new Connection for execution.
MapReduce (MRS): Uses the priority queue settings of the HDFS connection information.
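To verify a Spark SQL endpoint such as those described above, a plain JDBC connection can be used. The sketch below assumes a Spark Thrift Server endpoint, which speaks the HiveServer2 protocol, with the Hive JDBC driver on the classpath; the host, port, and credentials are hypothetical. A Kyuubi endpoint would instead use the Kyuubi JDBC driver and a jdbc:kyuubi:// URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkSqlJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical Spark Thrift Server endpoint and credentials (No Authentication / LDAP style).
        String url = "jdbc:hive2://192.168.xx.xx:10001/";
        try (Connection conn = DriverManager.getConnection(url, "spark_user", "spark_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) { // trivial query to confirm the service responds
            if (rs.next()) {
                System.out.println("Spark SQL service reachable: " + rs.getInt(1));
            }
        }
    }
}
```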
Impala Task Configuration
Parameter
Description
Impala Tasks
If the Hadoop cluster has Impala deployed, it supports enabling Impala tasks.
JDBC URL
Enter the JDBC connection address for Impala, for example: jdbc:impala://host:port/. There is no need to enter the schema (see the connectivity sketch at the end of this table).
Authentication Type
Supports No Authentication, LDAP, and Kerberos authentication methods.
No Authentication: The no authentication method requires entering the Impala username.
LDAP: The LDAP authentication method requires entering the username and password of Impala.
Kerberos: The Kerberos authentication method requires uploading the Keytab File authentication file and configuring the Principal.
Username
Enter the Impala username.
Note: This item can be configured only when the authentication method is No Authentication or LDAP. The specified user must have task execution permissions so that tasks can run normally.
Password
Enter the password of the Impala user.
Note: This item can be configured only when the authentication method is LDAP.
Development Task Request Pool
Enter the name of the Impala request pool used for development tasks.
Periodic Task Request Pool
Enter the name of the Impala request pool used for periodic tasks.
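The Impala JDBC URL described above can be checked with a minimal JDBC connection. The sketch below assumes the Cloudera Impala JDBC driver (which provides the jdbc:impala:// scheme) is on the classpath; the host, port, and credentials are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical Impala endpoint and credentials (No Authentication / LDAP style).
        String url = "jdbc:impala://192.168.xx.xx:21050/";
        try (Connection conn = DriverManager.getConnection(url, "impala_user", "impala_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) { // trivial query to confirm the service responds
            if (rs.next()) {
                System.out.println("Impala reachable: " + rs.getInt(1));
            }
        }
    }
}
```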
Click Test Connection to initiate automatic service connection tests.
If the connection test is successful, you can save the settings. A failed test will prompt a Test Connection Failed dialog box, displaying the services that failed and their error details.
Once the connection test is successful, click Save to finalize the Hadoop cluster creation.
Manage Hadoop clusters
On the Dataphin home page, navigate to Planning > Compute Source via the top menu bar.
On the Compute Source page, click Manage Hadoop Clusters.
In the Manage Hadoop Clusters dialog box, you can view the list of Hadoop clusters, which includes details such as cluster name, cluster administrator, associated compute source, and creation and modification information.
Associated Compute Source: Shows the total number of associated compute sources. Click the icon to view the list, and click a compute source name to go to the compute source page.
Creation Information: Records the creator and creation time of the cluster.
Modification Information: Records the user and time of the last edit to the cluster.
Note: Compute tasks can run only within a single cluster. Data cannot be joined across different Hadoop clusters.
(Optional) Use the search box to perform a fuzzy search by entering the cluster name.
In the Actions column of the cluster list, you can manage the selected cluster with the available operations.
Operation
Description
View
Click the view icon in the Actions column of the target cluster to view detailed information about the current version of the cluster. Users with Hadoop Cluster - Management permission can download the cluster configuration file.
Edit
Click the edit icon in the Actions column of the target cluster to open the Edit Hadoop Cluster page, where you can modify the existing configuration. If Spark Jar tasks, Spark SQL tasks, or Impala tasks are enabled and a compute source associated with the cluster has already enabled the corresponding service, you cannot disable that service.
After editing, if you only modified the cluster's basic information and cluster security control settings, testing the connection is not required and you can save directly. For any other modifications, you still need to test the connection. After a successful connection test, click Save, fill in the Change Description in the dialog box that appears, and click OK.
Clone
Click the clone icon in the Actions column of the target cluster. The system automatically clones all data of the current cluster and opens the Create Hadoop Cluster page, where you can modify the configuration based on the existing settings.
History
Click the more icon in the Actions column of the target cluster and select History. The dialog box displays the version information of the current cluster, including the version name, modifier, and change description. You can view, compare, and roll back historical versions.
View: Click the view icon in the Actions column of the target version to go to the View Hadoop Cluster page and see detailed information about that version of the cluster. Users with Hadoop Cluster - Management permission can download the cluster configuration file.
Compare: Click the compare icon in the Actions column of the target version to go to the version comparison page. On the comparison page, you can select different versions from the filter drop-down list. By default, the current version of the Hadoop cluster is compared with the target version.
Roll Back: Click the roll back icon in the Actions column of the target version, then click OK in the dialog box that appears.
After you click OK, the system automatically tests the connection for the cluster information of that version. If the test passes, the rollback proceeds normally; if the rollback fails, the system displays a rollback failure prompt where you can view the specific reason. If the connection test fails, the rollback ends, and you can view the services that failed the connection test in the dialog box that appears.
Delete
Note: You can delete the current cluster only when it has no associated compute sources.
Once the cluster is deleted, it cannot be restored.
In the Actions column of the target cluster, click the more icon and choose Delete. In the confirmation dialog box that appears, click Confirm.