
Dataphin: Create and manage Hadoop clusters

Last Updated: Jan 19, 2026

Limits

Hadoop cluster management is available only when the compute engine is CDH 5.x, CDH 6.x, Cloudera Data Platform 7.x, E-MapReduce 3.x, E-MapReduce 5.x, AsiaInfo DP 5.3, or Huawei FusionInsight 8.x.

Permissions

  • Super administrators, system administrators, and users with the Hadoop Cluster - Manage global role can create and manage Hadoop clusters. These users can also specify which users can reference the cluster when creating a Hadoop compute source and assign cluster administrators.

  • Cluster administrators can manage the clusters they are responsible for.

  • Users with the Compute Source Management - Create global role can reference Hadoop clusters they have permission to use when creating a Hadoop compute source.

Create a Hadoop cluster

  1. In the top menu bar of the Dataphin home page, choose Planning > Compute Source.

  2. On the Compute Source page, click Manage Hadoop Clusters.

  3. In the Manage Hadoop Clusters dialog box, click + Create Hadoop Cluster.

  4. On the Create Hadoop Cluster page, configure the following parameters.

    • Basic information

      Parameter

      Description

      Cluster Name

      Enter a name for the cluster. The name can contain only letters, digits, underscores (_), and hyphens (-). It can be up to 128 characters long.

      Cluster Administrator

      Select one or more members under the current tenant to be the cluster administrators for this cluster. Cluster administrators can manage the current cluster, including editing, viewing historical versions, and deleting it.

      Description (Optional)

      Enter a brief description of the cluster. The description can be up to 128 characters long.

    • Cluster security control

      Usable by: Specify which users can reference the cluster's configuration when creating a compute source. You can select Roles with "Create Compute Source" permission or Specified users.

      • Roles with "Create Compute Source" permission: Selected by default.

      • Specified users: You can select one or more personal accounts and user groups.

    • Cluster configuration

      Parameter

      Description

      Cluster Storage

      Select HDFS or OSS-HDFS.

      • If you select HDFS, you can add a NameNode in the NameNode configuration item.

      • If you select OSS-HDFS, you must also configure the cluster storage root directory, AccessKey ID, and AccessKey secret.

      Note

      This parameter is configurable only when the compute engine is E-MapReduce 5.x. For all other compute engines, the default value is HDFS.

      NameNode

      Click + Add to open the Add NameNode dialog box and configure the parameters. You can add multiple NameNodes.

      A NameNode entry specifies the hostname or IP address and ports of a NameNode in the HDFS cluster. Configuration example:

      • NameNode: 192.168.xx.xx

      • Web UI Port: 50070

      • IPC Port: 8020

      Select at least one port: Web UI Port or IPC Port. After configuration, the NameNode is host=192.168.xx.xx,webUiPort=50070,ipcPort=8020.

      Note

      This parameter is configurable when Cluster Storage is set to HDFS.

      Cluster Storage Root Directory

      You can obtain this from the EMR cluster's basic information. The format is oss://<Bucket>.<Endpoint>/.

      Note

      This parameter is configurable only when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.

      AccessKey ID, AccessKey Secret

      Enter the AccessKey ID and AccessKey secret for OSS access.

      Note
      • The configuration here has a higher priority than the AccessKey configured in core-site.xml.

      • This parameter is configurable only when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.

      core-site.xml, hdfs-site.xml, hive-site.xml (Optional), hivemetastore-site.xml (Optional), yarn-site.xml (Optional), Other Configuration Files (Optional)

      Upload the core-site.xml, hdfs-site.xml, hive-site.xml, hivemetastore-site.xml, yarn-site.xml, and other configuration files for the current compute engine.

      Note
      • When the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS, you do not need to upload the hdfs-site.xml file.

      • When Hive Metadata Retrieval Method is set to HMS, you must upload the hive-site.xml file.

      • When the compute engine is E-MapReduce 5.x or Huawei FusionInsight 8.x and Hive Metadata Retrieval Method is set to HMS, you must upload the hivemetastore-site.xml file.

      Task Execution Machine

      Configure the connection address of the machine on which MapReduce or Spark JAR tasks are executed. The format is hostname:port or ip:port. The port is optional and defaults to 22.

      Execution Username, Password

      The username and password to log on to the task execution machine for operations such as MapReduce task execution and HDFS read/write. Ensure this user has task submission permissions.

      Authentication Type

      Supports No Authentication and Kerberos authentication.

      Kerberos is an identity authentication protocol based on symmetric key technology. It provides identity authentication for other services and supports single sign-on (SSO): after a single authentication, a client can access multiple services, such as HBase and HDFS.

      When Authentication Type is set to Kerberos, you must also select a Kerberos Configuration Method.

      • Krb5 Authentication File: Upload a krb5 file for Kerberos authentication.

      • KDC Server Address: The address of the KDC server, which assists with Kerberos authentication. You can configure multiple KDC server addresses, separated by semicolons (;).

      Note

      When the compute engine is E-MapReduce 5.x, only the krb5 file configuration method is supported.
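
      If you want to verify reachability before running Dataphin's connection test, you can check the NameNode ports and the task execution machine from any host that can reach the cluster. The following Python sketch is not part of Dataphin; the host names, ports, and credentials are placeholders, and paramiko is an assumed third-party SSH library (pip install paramiko).

      # Reachability pre-check for the NameNode and the task execution machine (placeholders only).
      import socket
      import paramiko  # assumed SSH library, not required by Dataphin

      def port_open(host, port, timeout=5):
          # Return True if a TCP connection to host:port succeeds within the timeout.
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return True
          except OSError:
              return False

      # Web UI and IPC ports entered in the NameNode configuration above.
      print("NameNode Web UI port:", port_open("192.168.xx.xx", 50070))
      print("NameNode IPC port:", port_open("192.168.xx.xx", 8020))

      # Task execution machine: confirm the execution user can log on over SSH (default port 22).
      ssh = paramiko.SSHClient()
      ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
      ssh.connect("emr-header-1.example.com", port=22, username="dataphin", password="***")
      print("Task execution machine: SSH logon succeeded")
      ssh.close()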

    • HDFS information configuration

      Parameter

      Description

      Authentication Type

      Supports No Authentication and Kerberos.

      If the Hadoop cluster uses Kerberos authentication, you must enable HDFS Kerberos, upload a Keytab File, and configure a Principal.

      • Keytab File: Upload the keytab file. You can get the keytab file from the HDFS server.

      • Principal: Enter the Kerberos authentication username corresponding to the HDFS Keytab File.

      Note

      This parameter is not required when the compute engine is E-MapReduce 5.x and Cluster Storage is set to OSS-HDFS.

      HDFS User (Optional)

      Specify the username for file uploads. If left blank, the execution username is used by default.

      Note

      This parameter is configurable only when Authentication Type is set to No Authentication.
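
      If Authentication Type is set to No Authentication, you can quickly confirm that the configured HDFS user can list the root directory by calling the WebHDFS REST interface on the NameNode Web UI port. This is a sketch outside Dataphin and assumes WebHDFS is enabled; the host, port, and user name are placeholders, and requests is an assumed HTTP library (pip install requests).

      # WebHDFS sanity check for the HDFS user (assumes WebHDFS is enabled on the Web UI port).
      import requests

      namenode = "192.168.xx.xx"  # NameNode host from the cluster configuration
      url = f"http://{namenode}:50070/webhdfs/v1/?op=LISTSTATUS&user.name=dataphin"
      resp = requests.get(url, timeout=10)
      resp.raise_for_status()
      # Print the top-level entries visible to the HDFS user.
      for entry in resp.json()["FileStatuses"]["FileStatus"]:
          print(entry["type"], entry["pathSuffix"])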

    • Hive compute engine configuration

      Parameter

      Description

      JDBC URL

      You can configure one of the following three types of connection addresses:

      • The connection address of the Hive server, in the format jdbc:hive2://{connection_address}:{port}/{database_name}.

      • The connection address of ZooKeeper. For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • The connection address with Kerberos enabled, in the format jdbc:hive2://{connection_address}:{port}/{database_name};principal=hive/_HOST@xx.com.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication.

      • No Authentication: Requires a username for the Hive service.

      • LDAP: Requires a username and password for the Hive service.

      • Kerberos: If the Hadoop cluster uses Kerberos authentication, you must enable Hive Kerberos, upload a Keytab File, and configure a Principal.

        • Keytab File: Upload the keytab file. You can get the keytab file from the Hive server.

        • Principal: Enter the Kerberos authentication username corresponding to the Hive Keytab File.

      Username

      Enter the username for the Hive service.

      Note

      This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.

      Password

      Enter the password for the Hive service user.

      Note

      This parameter is configurable only when Authentication Type is set to LDAP.

      Execution Engine

      • Default: Tasks under projects that are attached to this compute source, including logical table tasks, use the cluster's default execution engine.

      • Custom: Select another execution engine type.
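
      The JDBC URL above points at HiveServer2 (directly, through ZooKeeper, or with Kerberos). Before entering it in Dataphin, you can roughly verify the host, port, and account by opening a Thrift session against the same HiveServer2 endpoint, for example with PyHive. This sketch is an assumption rather than part of Dataphin, covers only the No Authentication and LDAP cases, and uses placeholder values throughout (pip install "pyhive[hive]").

      # HiveServer2 pre-check with PyHive (placeholders only).
      from pyhive import hive

      conn = hive.connect(
          host="hive-server.example.com",  # host from the JDBC URL
          port=10000,                      # port from the JDBC URL
          database="default",              # database name from the JDBC URL
          username="dataphin",             # Hive service username
          password="***",                  # required only for LDAP
          auth="LDAP",                     # use "NONE" for No Authentication and omit the password
      )
      cursor = conn.cursor()
      cursor.execute("SHOW DATABASES")
      print(cursor.fetchall())
      conn.close()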

    • Hive metadata configuration

      Metadata Retrieval Method: You can select one of the following methods: Metadata Database, HMS, or DLF. The required configuration varies based on the selected method.

      Important
      • The DLF method is available only when the compute engine is E-MapReduce 5.x.

      • If you use the DLF method to retrieve metadata, you must first upload the hive-site.xml configuration file.

      Metadata Retrieval Method

      Parameter

      Description

      Metadata Database

      Database Type

      Select the database type based on the metadata database used in the cluster. Dataphin supports MySQL.

      Supported MySQL versions include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.

      JDBC URL

      Enter the JDBC connection address of the destination database. For example:

      MySQL: The format is jdbc:mysql://{connection_address}[,failoverhost...]:{port}/{database_name}[?propertyName1][=propertyValue1][&propertyName2][=propertyValue2]....

      Username, Password

      Enter the username and password used to log on to the metadata database.

      HMS

      Authentication Type

      The HMS retrieval method supports No Authentication, LDAP, and Kerberos. For Kerberos authentication, you must upload a Keytab File and configure a Principal.

      DLF

      Endpoint

      Enter the DLF endpoint for the region where your cluster is located. To obtain the endpoint, see DLF Regions and Endpoints.

      AccessKey ID, AccessKey Secret

      Enter the AccessKey ID and AccessKey secret of the account where the cluster is located.

      You can obtain the AccessKey ID and AccessKey secret on the User Information Management page.
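
      If you select the Metadata Database method, the account you enter must be able to read the Hive metastore schema through the JDBC URL. A quick way to confirm this outside Dataphin is to connect to the MySQL metadata database and query the standard metastore tables DBS and TBLS. The sketch below assumes the PyMySQL client (pip install pymysql); the host, port, database name, and credentials are placeholders.

      # Hive metastore database pre-check (placeholders only).
      import pymysql

      conn = pymysql.connect(
          host="metadb.example.com",  # host from the JDBC URL
          port=3306,                  # port from the JDBC URL
          user="hive",
          password="***",
          database="hivemeta",        # database name from the JDBC URL
      )
      try:
          with conn.cursor() as cursor:
              # DBS and TBLS are standard Hive metastore schema tables.
              cursor.execute("SELECT NAME FROM DBS LIMIT 5")
              print("databases:", cursor.fetchall())
              cursor.execute("SELECT TBL_NAME FROM TBLS LIMIT 5")
              print("tables:", cursor.fetchall())
      finally:
          conn.close()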

    • Spark JAR service configuration

      Important

      For performance reasons, connectivity and validity tests are not performed when you modify the Spark execution machine or local client configuration. After you make changes, go to the developer module and run a test program to check whether the Spark service is available.

      Parameter

      Description

      Spark Execution Machine

      If Spark is deployed on the Hadoop cluster, you can enable Spark JAR tasks.

      Note

      If you use the Spark client on the task execution machine specified in the Cluster configuration, you must first deploy and configure the client on that machine (for example, set environment variables, create an execution user, and grant the required permissions to the user). Only one execution machine is supported, so high availability and load balancing cannot be achieved. In addition, after a Spark job is submitted, you cannot view its log or stop it from Dataphin.

      Execution Username, Password

      Enter the username and password to log on to the compute execution machine.

      Note

      Confirm that the user has Spark-submit permissions.

      Spark Local Client

      You can enable or disable the Spark local client. If it is enabled and referenced by a task, you cannot disable it. To run Spark programs using the Spark local client, you must upload the yarn-site.xml configuration file and ensure that the port connection between Dataphin and YARN is normal.

      Click + Add Client to open the Add Client dialog box. Enter a Client Name and upload a Client File.

      • Client Name: Can contain only letters, digits, underscores (_), hyphens (-), and periods (.). It can be up to 32 characters long.

        The client name must be unique (case-sensitive) within the same Hadoop cluster.

      • Client File: Upload the client file. Only .tgz and .zip formats are supported.

        Note

        You can download the corresponding version of the Spark client from https://spark.apache.org/downloads.html. A custom client must have the same directory structure as the community edition, include the Hadoop client, and be uploaded as a complete compressed package (.tgz or .zip). Dataphin uses the uploaded client to submit jobs through the scheduling cluster, which enables full lifecycle management of jobs.

      After the client is uploaded, you can click the edit icon in the client list to edit the corresponding client. If you upload a new client file, it overwrites the existing one. Click the delete icon to delete the corresponding client.

      Note

      If an uploaded client is referenced by a task (including tasks in draft status), you cannot edit the client name or delete the client.

      Authentication Type

      Supports No Authentication or Kerberos authentication.

      If the Hadoop cluster uses Kerberos authentication, you must enable Spark Kerberos, upload a Keytab File, and configure a Principal.

      • Keytab File: Upload the keytab file. You can get the keytab file from the Spark server.

      • Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.
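
      Because Dataphin does not test connectivity when you modify the Spark execution machine (see the Important note above), it can help to confirm by hand that the execution user can log on and run spark-submit before tasks depend on it. The following sketch is an assumption rather than a Dataphin feature; it uses the paramiko SSH library (pip install paramiko), and the host and credentials are placeholders.

      # Manual check of the Spark execution machine (placeholders only).
      import paramiko

      ssh = paramiko.SSHClient()
      ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
      ssh.connect("spark-exec.example.com", port=22, username="dataphin", password="***")
      # Confirm that the Spark client is on the PATH and the user is allowed to run spark-submit.
      stdin, stdout, stderr = ssh.exec_command("spark-submit --version")
      print(stdout.read().decode(), stderr.read().decode())
      ssh.close()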

    • Spark SQL service configuration

      Parameter

      Description

      Spark SQL Task

      If Spark is deployed on the Hadoop cluster, you can enable Spark SQL tasks.

      Spark Version

      Currently, only 3.x is supported.

      Service Type

      Select the server type for Spark JDBC access.

      JDBC URL

      Enter the JDBC connection address, for example, jdbc:hive2://host1:port1/ or jdbc:kyuubi://host1:port1/. You do not need to enter a database name.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication.

      • No Authentication: Requires a username for the Spark service.

      • LDAP: Requires a username and password for the Spark service.

      • Kerberos: If the Hadoop cluster uses Kerberos authentication, you must enable Spark Kerberos, upload a Keytab File, and configure a Principal.

        • Keytab File: Upload the keytab file. You can get the keytab file from the Spark server.

        • Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.

      Note

      For No Authentication and LDAP methods, the specified user must have task execution permissions to ensure that tasks run correctly.

      Username

      Enter the username for the Spark service.

      Note

      This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.

      Password

      Enter the password for the Spark service user.

      Note

      This parameter is configurable only when Authentication Type is set to LDAP.

      SQL Task Queue Settings

      Different service types use different SQL task queues. Details are as follows:

      • Spark Thrift Server: Task queues cannot be set.

      • Kyuubi: Uses the priority queue settings from the HDFS connection information. This takes effect only when Kyuubi uses YARN for resource scheduling. Production tasks run at the CONNECTION share level.

      • Livy: Uses the priority queue settings from the HDFS connection information. This takes effect only when Livy uses YARN for resource scheduling. Both ad hoc queries and production tasks are executed using a new connection.

      • MapReduce (MRS): Uses the priority queue settings from the HDFS connection information.
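
      Spark Thrift Server and Kyuubi both expose the HiveServer2 Thrift protocol, so the host and port from the JDBC URL, together with the configured username and password, can be pre-checked in the same way as the Hive service: open a session and run a trivial query. This sketch is an assumption outside Dataphin, covers only the No Authentication and LDAP cases, and uses placeholder values (pip install "pyhive[hive]").

      # Spark Thrift Server / Kyuubi pre-check with PyHive (placeholders only).
      from pyhive import hive

      conn = hive.connect(
          host="kyuubi.example.com",  # host from the Spark SQL JDBC URL
          port=10009,                 # Kyuubi default; Spark Thrift Server typically uses 10000
          username="dataphin",
          password="***",             # required only for LDAP
          auth="LDAP",                # use "NONE" for No Authentication and omit the password
      )
      cursor = conn.cursor()
      cursor.execute("SELECT 1")
      print(cursor.fetchall())        # expected: [(1,)]
      conn.close()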

    • Impala task configuration

      Parameter

      Description

      Impala Task

      If Impala is deployed on the Hadoop cluster, you can enable Impala tasks.

      JDBC URL

      Enter the JDBC connection address for Impala, for example, jdbc:impala://host:port/. You do not need to enter a schema.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication.

      • No Authentication: Requires an Impala username.

      • LDAP: Requires an Impala username and password.

      • Kerberos: Requires uploading a Keytab File and configuring a Principal.

      Username

      Enter the Impala username.

      Note

      This parameter is configurable when Authentication Type is set to No Authentication or LDAP. The specified user must have task execution permissions to ensure that tasks run correctly.

      Password

      Enter the Impala user's password.

      Note

      This parameter is configurable only when Authentication Type is set to LDAP.

      Developer Task Request Pool

      Enter the name of the Impala request pool for developer tasks.

      Auto Triggered Task Request Pool

      Enter the name of the Impala request pool for auto triggered tasks.
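
      To confirm that the Impala account and request pool names you enter here are valid, you can connect to the same impalad host and port outside Dataphin, set the request pool, and run a trivial query. The sketch below assumes the impyla client (pip install impyla), covers only the No Authentication and LDAP cases, and uses placeholder values.

      # Impala pre-check with impyla (placeholders only).
      from impala.dbapi import connect

      conn = connect(
          host="impala.example.com",  # host from the Impala JDBC URL
          port=21050,                 # HiveServer2-protocol port served by impalad
          user="dataphin",
          password="***",             # required only for LDAP
          auth_mechanism="LDAP",      # use "NOSASL" for No Authentication and omit the password
      )
      cursor = conn.cursor()
      # REQUEST_POOL routes this session's queries to the named admission control pool.
      cursor.execute("SET REQUEST_POOL=dev_pool")
      cursor.execute("SELECT 1")
      print(cursor.fetchall())        # expected: [(1,)]
      conn.close()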

  5. Click Test Connection. The system automatically tests the connection to each service.

    If the connection test is successful, you can save the configuration. If the test fails, the Connection Test Failed dialog box appears. This dialog box shows the services that failed the test and their error details.

  6. After the connection test is successful, click Save to create the Hadoop cluster.

Manage Hadoop clusters

  1. In the top menu bar of the Dataphin home page, choose Planning > Compute Source.

  2. On the Compute Source page, click Manage Hadoop Clusters.

  3. In the Manage Hadoop Clusters dialog box, view the list of Hadoop clusters. The list shows information such as Cluster Name, Cluster Administrator, Associated Compute Sources, Creation Information, and Modification Information.

    • Associated Compute Sources: Displays the total number of associated compute sources. Click the icon to view the list of associated compute sources. Click a compute source name to go to its details page.

    • Creation Information: Displays the user who created the cluster and the creation time.

    • Modification Information: Displays the user who last edited the cluster and the modification time.

    Note

    Compute tasks can run on only one cluster. Therefore, you cannot join data from different Hadoop clusters.

  4. (Optional) You can enter a cluster name in the search box to perform a fuzzy search.

  5. In the Actions column of the Hadoop cluster list, you can perform management operations on a cluster. The supported operations are as follows.

    Operation

    Description

    View

    Click the view icon in the Actions column of the target cluster to view details about the current version of the cluster. Users with the Hadoop Cluster - Manage permission can download the cluster configuration file.

    Edit

    Click the edit icon in the Actions column of the target cluster to open the Edit Hadoop Cluster page, where you can modify the existing configuration. If the Spark JAR, Spark SQL, or Impala task services are enabled in the cluster configuration and the corresponding services have also been enabled on a compute source associated with the cluster, you cannot disable those services.

    After editing, if you only modified the cluster's Basic Information and Cluster Security Control, you do not need to test the connection. You can save directly. If you made other changes, you must test the connection again. After the test is successful, click Save. In the dialog box that appears, enter a Change Description and click OK.

    Clone

    Click the clone icon in the Actions column of the target cluster. The system automatically clones all data of the cluster and opens the Create Hadoop Cluster page. You can then modify the existing configuration.

    Version History

    Click the icon in the Actions column of the target cluster and select History. A dialog box opens and displays information about each version of the current cluster, such as the version name, modifier, and change description. In this dialog box, you can perform View, Compare, and Rollback operations.

    • View: Click the view icon in the Actions column of the target version. You are redirected to the View Hadoop Cluster page to view the details of that cluster version. Users with the Hadoop Cluster - Manage permission can download the cluster configuration file.

    • Compare: Click the compare icon in the Actions column of the target version to go to the version comparison page. On this page, you can select different versions from the filter drop-down list. By default, the current version of the Hadoop cluster is compared with the target version.

    • Rollback: Click the rollback icon in the Actions column of the target version, and in the dialog box that appears, click OK.

      After you click OK, the system automatically tests the connection for the cluster configuration of that version. If the connection test fails, the rollback is terminated and a dialog box lists the services that failed the test. If the test succeeds, the rollback proceeds; if the rollback itself fails, a message indicates the failure and its reason.

    Delete

    Note
    • You can delete a Hadoop cluster only if it has no associated compute sources.

    • A deleted cluster cannot be restored.

    Click the icon in the Actions column of the target cluster, select Delete, and in the dialog box that appears, click OK.