Dataphin: Create and manage Hadoop clusters

Last Updated: Jan 21, 2025

Limits

Hadoop cluster management is supported only when the compute engine is CDH 5.x, CDH 6.x, Cloudera Data Platform 7.x, E-MapReduce 3.x, E-MapReduce 5.x, AsiaInfo DP 5.3, or Huawei FusionInsight 8.x.

Permission description

  • Super administrators, system administrators, and custom global roles with Hadoop cluster - Management permission can create and manage Hadoop clusters. They can also designate cluster administrators and set user permissions for referencing the cluster during Hadoop compute source creation.

  • Cluster administrators have management authority over their assigned clusters.

  • Users with the global role Compute source management - Create can reference Hadoop clusters they have access to when creating Hadoop compute sources.

Create a Hadoop cluster

  1. On the Dataphin home page, navigate to Planning > Compute Source via the top menu bar.

  2. On the Compute Source page, click Manage Hadoop Cluster.

  3. In the Manage Hadoop Cluster dialog box, click + Create New Hadoop Cluster.

  4. On the Create Hadoop Cluster page, configure the following parameters:

    • Basic Information

      Parameter

      Description

      Cluster Name

      Enter the name of the current cluster. Only Chinese characters, letters, digits, underscores (_), and hyphens (-) are supported, and the length cannot exceed 128 characters.

      Cluster Administrator

      Select one or more members of the current tenant as cluster administrators of the cluster. Cluster administrators can manage the cluster, including editing it, viewing its historical versions, and deleting it.

      Description (Optional)

      Enter a brief description of the current cluster, with a length not exceeding 128 characters.

    • Cluster Security Control

      Usable Members: Define which users are authorized to reference this cluster and access its configuration details when creating a compute source. You can choose either Roles with "Create Compute Source" Permission or Specified Users.

      • Roles with "Create Compute Source" Permission are selected by default.

      • Specified Users: Allows selection of individual personal accounts or user groups.

    • Cluster Configuration

      Parameter

      Description

      Cluster Storage

      Select HDFS or OSS-HDFS.

      • When selecting HDFS, you can add a NameNode in the NameNode configuration item.

      • When selecting OSS-HDFS, you also need to configure the cluster storage root directory, AccessKey ID, and AccessKey Secret.

      Note

      This item can only be configured when the compute engine is E-MapReduce 5.x. Other compute engines default to HDFS.

      NameNode

      Click + Add to configure the related parameters in the Add NameNode dialog box. Multiple NameNodes can be added.

      NameNode specifies the hostname or IP address and ports of the NameNode in the HDFS cluster. Configuration example:

      • NameNode: 192.168.xx.xx

      • Web UI Port: 50070

      • IPC Port: 8020

      At least one of the Web UI Port and IPC Port must be configured. With the example above, the resulting NameNode entry is host=192.168.xx.xx,webUiPort=50070,ipcPort=8020. (A minimal connectivity sketch appears at the end of this section.)

      Note

      This item can be configured when the cluster storage is HDFS.

      Cluster Storage Root Directory

      You can view the basic information of the EMR cluster to obtain it. The format is oss://<Bucket>.<Endpoint>/.

      Note

      This option is supported only when the compute engine is E-MapReduce 5.x and the cluster storage is set to OSS-HDFS.

      AccessKey ID, AccessKey Secret

      Enter the AccessKey ID and AccessKey Secret for OSS access.

      Note
      • The configuration entered here takes precedence over the AccessKey configured in core-site.xml.

      • This option is supported only when the compute engine is E-MapReduce 5.x and the cluster storage is selected as OSS-HDFS.

      core-site.xml

      hdfs-site.xml

      hive-site.xml (Optional)

      hivemetastore-site.xml (Optional)

      yarn-site.xml (Optional)

      Other Configuration Files (Optional)

      Upload the core-site.xml, hdfs-site.xml, hive-site.xml, hivemetastore-site.xml, yarn-site.xml, and other configuration files of the current compute engine.

      Note
      • The hdfs-site.xml file is not required only when the compute engine is E-MapReduce 5.x and the cluster storage is OSS-HDFS; otherwise, it must be uploaded.

      • When the Hive metadata retrieval method is HMS, you must upload the hive-site.xml file.

      • When the compute engine is E-MapReduce 5.x or Huawei FusionInsight 8.x and the Hive metadata retrieval method is HMS, you must also upload the hivemetastore-site.xml file.

      Task Execution Machine

      Configure the connection address of the execution machine for MapReduce or Spark Jar tasks. The format is hostname:port or ip:port. The port is optional and defaults to 22.

      Execution Username, Password

      Enter the username and password used to log on to the task execution machine for MapReduce task execution and HDFS reads and writes. Make sure the user has permission to submit tasks.

      Authentication Type

      Supports No Authentication and Kerberos authentication methods.

      Kerberos is an identity authentication protocol based on symmetric-key cryptography. It provides authentication for other services and supports single sign-on (SSO), so a client can access multiple services such as HBase and HDFS after a single authentication.

      When the authentication method is selected as Kerberos, you also need to select the Kerberos Configuration Method.

      • Krb5 Authentication File: Upload the Krb5 file for Kerberos authentication.

      • KDC Server Address: KDC server address to assist in completing Kerberos authentication. Supports configuring multiple KDC Server service addresses, separated by semicolons (;).

      Note

      When the compute engine type is E-MapReduce 5.x, only the Krb5 Authentication File configuration method is supported.
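
      To sanity-check the NameNode address configured above, the following minimal Java sketch opens an HDFS connection through the IPC port and lists the root directory. The host and port are the placeholder values from the NameNode example, not guaranteed settings; replace them with your own and make sure the Hadoop client libraries are on the classpath.

      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class NameNodeCheck {
          public static void main(String[] args) throws Exception {
              // Placeholder NameNode host and IPC port taken from the example above.
              URI nameNode = URI.create("hdfs://192.168.xx.xx:8020");
              Configuration conf = new Configuration();
              try (FileSystem fs = FileSystem.get(nameNode, conf)) {
                  // A successful call confirms that the IPC port is reachable.
                  System.out.println("HDFS root exists: " + fs.exists(new Path("/")));
              }
          }
      }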

    • HDFS Information Configuration

      Parameter

      Description

      Authentication Type

      Supports No Authentication and Kerberos.

      If Kerberos authentication is enabled for the Hadoop cluster, you need to enable Kerberos for HDFS, upload the Keytab File, and configure the Principal.

      • Keytab File: Upload the keytab file, which can be obtained from the HDFS Server.

      • Principal: Enter the Kerberos authentication username corresponding to the HDFS Keytab File.

      Note

      This configuration can be skipped only when the compute engine is E-MapReduce 5.x and the cluster storage is OSS-HDFS; otherwise, it is required.

      HDFS User (Optional)

      Specify the username used for file uploads. If left empty, the execution username is used by default.

      Note

      This item can only be configured when the authentication method is selected as No Authentication.
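
      When Kerberos is enabled for HDFS, the Keytab File and Principal configured above are used to authenticate before any HDFS access. As a hedged illustration, the following Java sketch performs a keytab-based login with the standard Hadoop UserGroupInformation API; the krb5.conf path, principal, and keytab path are placeholder assumptions.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.security.UserGroupInformation;

      public class KeytabLoginCheck {
          public static void main(String[] args) throws Exception {
              // Placeholder paths and principal; use the krb5 file, keytab, and principal configured above.
              System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
              Configuration conf = new Configuration();
              conf.set("hadoop.security.authentication", "kerberos");
              UserGroupInformation.setConfiguration(conf);
              UserGroupInformation.loginUserFromKeytab("hdfs/host@EXAMPLE.COM", "/path/to/hdfs.keytab");
              System.out.println("Kerberos login succeeded for: " + UserGroupInformation.getCurrentUser());
          }
      }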

    • Hive Compute Engine Configuration

      Parameter

      Description

      JDBC URL

      Supports configuring the following three types of connection addresses:

      • Connection address of the Hive Server, format: jdbc:hive2://{connection address}:{port}/{database name}.

      • Connection address of ZooKeeper. For example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • Connection address with Kerberos enabled, format: jdbc:hive2://{connection address}:{port}/{database name};principal=hive/_HOST@xx.com.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication methods.

      • No Authentication: The no authentication method requires entering the username of the Hive service.

      • LDAP: The LDAP authentication method requires entering the username and password of the Hive service.

      • Kerberos: If Kerberos authentication is enabled for the Hadoop cluster, you need to enable Kerberos for Hive, upload the Keytab File, and configure the Principal.

        • Keytab File: Upload the keytab file, which can be obtained from the Hive Server.

        • Principal: Enter the Kerberos authentication username corresponding to the Hive Keytab File.

      Username

      Enter the username of the Hive service.

      Note

      This item can only be configured when the authentication method is No Authentication or LDAP. The specified user must have task execution permissions so that tasks can run normally.

      Password

      Enter the password of the Hive service user.

      Note

      This item can only be configured when the authentication method is selected as LDAP.

      Execution Engine

      • Default: Tasks (including logical table tasks) under the project bound to this compute source use the default execution engine.

      • Custom: Select another type of compute engine.
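
      The JDBC URL, authentication type, and username configured above can be verified outside Dataphin with a short Java sketch that connects to HiveServer2 and lists databases. The URL, username, and password below are placeholder assumptions; for Kerberos, the principal-based URL form shown above would be used instead.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class HiveJdbcCheck {
          public static void main(String[] args) throws Exception {
              // Requires the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) on the classpath.
              Class.forName("org.apache.hive.jdbc.HiveDriver");
              String url = "jdbc:hive2://hive-server-host:10000/default"; // placeholder host, port, and database
              try (Connection conn = DriverManager.getConnection(url, "hive_user", "hive_password");
                   Statement stmt = conn.createStatement();
                   ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                  while (rs.next()) {
                      System.out.println(rs.getString(1));
                  }
              }
          }
      }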

    • Hive Metadata Configuration

      Metadata Retrieval Method: This supports Metadata Database, HMS, and DLF as the three methods for sourcing metadata. Each method requires specific configuration details.

      Important
      • The DLF retrieval method can be selected only when the compute engine is E-MapReduce 5.x.

      • When retrieving metadata using the DLF method, first upload the hive-site.xml configuration file.

      Metadata Retrieval Method

      Parameter

      Description

      Metadata Database

      Database Type

      Select the database type according to the metadata database used in the cluster. Dataphin supports MySQL.

      Supported versions of MySQL include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.

      JDBC URL

      Enter the JDBC connection address of the target database. For example:

      MySQL format: jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1=propertyValue1][&propertyName2=propertyValue2]...

      Username, Password

      Enter the username and password for logging in to the metadata database.

      HMS

      Authentication Type

      The HMS retrieval method supports No Authentication, LDAP, and Kerberos authentication methods. The Kerberos authentication method requires uploading the Keytab File and configuring the Principal.

      DLF

      Endpoint

      Enter the endpoint of the region where the DLF service is located. For more information, see the DLF Region and Endpoint Comparison Table.

      AccessKey ID, AccessKey Secret

      Enter the AccessKey ID and AccessKey Secret of the account where the cluster is located.

      You can obtain the account's AccessKey ID and AccessKey Secret on the User Information Management page.
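
      For the Metadata Database retrieval method described above, metadata is read directly from the Hive metastore database. As a hedged sketch of what such access looks like, the Java snippet below connects to a MySQL metastore with the JDBC URL format above and lists the databases recorded in the standard metastore DBS table; the URL, credentials, and database name are placeholder assumptions.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class MetastoreDbCheck {
          public static void main(String[] args) throws Exception {
              // Placeholder metastore database URL and credentials; requires the MySQL JDBC driver on the classpath.
              String url = "jdbc:mysql://metadb-host:3306/hive_meta?useSSL=false";
              try (Connection conn = DriverManager.getConnection(url, "meta_user", "meta_password");
                   Statement stmt = conn.createStatement();
                   // DBS is the table in which a standard Hive metastore schema stores database definitions.
                   ResultSet rs = stmt.executeQuery("SELECT NAME FROM DBS")) {
                  while (rs.next()) {
                      System.out.println(rs.getString("NAME"));
                  }
              }
          }
      }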

    • Spark Jar Service Configuration

      Important

      For performance reasons, changes to the Spark execution machine or local client configuration will not trigger connectivity tests. Please run a test program in the development module after changes to verify Spark service availability.

      Parameter

      Description

      Spark Execution Machine

      If the Hadoop cluster has Spark deployed, it supports enabling Spark Jar tasks.

      Note

      When the Spark client on the task execution machine configured in Cluster Configuration is used, you must deploy the client and complete the related settings (such as setting environment variables, creating execution users, and granting permissions to those users) on the execution machine in advance. Only one execution machine is supported, so high availability and load balancing are not available. After a Spark job is submitted, Dataphin cannot view its logs or terminate it.

      Execution Username, Password

      Enter the username and password for logging into the compute execution machine.

      Note

      Ensure that the user has Spark-submit permissions.

      Spark Local Client

      Supports enabling or disabling the Spark local client. After it is enabled, it cannot be disabled while tasks still reference it. To run Spark programs with the Spark local client, you must upload the yarn-site.xml configuration file and ensure that Dataphin can connect to YARN.

      Click + Add Client. In the Add Client dialog box, enter the Client Name and upload the Client File.

      • Client Name: Only letters, numbers, underscores (_), hyphens (-), and half-width periods (.) are supported, with a length not exceeding 32 characters.

        The client name is unique within the same Hadoop cluster (case-sensitive).

      • Client File: Upload the client file. The file format only supports .tgz and .zip.

        Note

        You can download the corresponding version of the Spark client from https://spark.apache.org/downloads.html. A self-built client must have the same directory structure as the community edition and include the Hadoop client. Upload the complete compressed package (.tgz or .zip format). Dataphin submits jobs through the scheduling cluster using the uploaded client, enabling full lifecycle management of jobs.

      After a client is uploaded, you can click the edit icon in the client list to edit the corresponding client. If a new client file is uploaded, it overwrites the existing file. Click the delete icon to delete the corresponding client.

      Note

      If the uploaded client is referenced by tasks (including draft tasks), editing the client name and deleting the client are not supported.

      Authentication Type

      Supports No Authentication or Kerberos authentication methods.

      If Kerberos authentication is enabled for the Hadoop cluster, you need to enable Kerberos for Spark, upload the Keytab File, and configure the Principal.

      • Keytab File: Upload the keytab file, which can be obtained from the Spark Server.

      • Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.
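
      On the Spark execution machine configured above, Spark Jar tasks are ultimately submitted through spark-submit (hence the spark-submit permission requirement). As a rough, purely illustrative sketch of such a submission, the Java snippet below uses Spark's standard SparkLauncher API; the Spark home, jar path, and main class are hypothetical placeholders, and this is not Dataphin's own implementation.

      import org.apache.spark.launcher.SparkLauncher;

      public class SparkJarSubmitSketch {
          public static void main(String[] args) throws Exception {
              // Placeholder paths and class name; spark-submit must be available under the given Spark home.
              Process sparkSubmit = new SparkLauncher()
                      .setSparkHome("/opt/spark")
                      .setMaster("yarn")
                      .setDeployMode("cluster")
                      .setAppResource("/path/to/your-job.jar")
                      .setMainClass("com.example.YourSparkJob")
                      .launch();
              int exitCode = sparkSubmit.waitFor();
              System.out.println("spark-submit exited with code " + exitCode);
          }
      }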

    • Spark SQL Service Configuration

      Parameter

      Description

      Spark SQL Tasks

      If the Hadoop cluster has Spark deployed, it supports enabling Spark SQL Tasks.

      Spark Version

      Currently, only 3.x is supported.

      Service Type

      Select the target server type for Spark JDBC access.

      JDBC URL

      Enter the JDBC connection address, for example, jdbc:hive2://host1:port1/ or jdbc:kyuubi://host1:port1/. The database name is not required.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication methods.

      • No Authentication: The no authentication method requires entering the username of the Spark service.

      • LDAP: The LDAP authentication method requires entering the username and password of the Spark service.

      • Kerberos: If Kerberos authentication is enabled for the Hadoop cluster, you need to enable Kerberos for Spark, upload the Keytab File, and configure the Principal.

        • Keytab File: Upload the keytab file, which can be obtained from the Spark Server.

        • Principal: Enter the Kerberos authentication username corresponding to the Spark Keytab File.

      Note

      The user entered for the No Authentication and LDAP methods must have task execution permissions so that tasks can run normally.

      Username

      Enter the username of the Spark service.

      Note

      This item can only be configured when the authentication method is No Authentication or LDAP. The specified user must have task execution permissions so that tasks can run normally.

      Password

      Enter the password of the Spark service user.

      Note

      This item can only be configured when the authentication method is selected as LDAP.

      SQL Task Queue Settings

      Different service types use different SQL task queues. Details are as follows:

      • Spark Thrift Server: Does not support setting task queues.

      • Kyuubi: Uses the priority queue settings of the HDFS connection information. It only takes effect when Kyuubi uses Yarn as resource scheduling. Production tasks use the Connection sharing level.

      • Livy: Uses the priority queue settings of the HDFS connection information. It only takes effect when Livy uses Yarn as resource scheduling. Ad hoc queries and production tasks use a new Connection for execution.

      • MapReduce (MRS): Uses the priority queue settings of the HDFS connection information.
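
      Spark Thrift Server and Kyuubi are both compatible with the HiveServer2 JDBC protocol, so the endpoint configured above can be verified with the Hive JDBC driver using a URL without a database name, as noted earlier. The host, port, and credentials below are placeholder assumptions.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class SparkSqlEndpointCheck {
          public static void main(String[] args) throws Exception {
              String url = "jdbc:hive2://spark-sql-host:10000/"; // placeholder; no database name in the URL
              try (Connection conn = DriverManager.getConnection(url, "spark_user", "spark_password");
                   Statement stmt = conn.createStatement();
                   ResultSet rs = stmt.executeQuery("SELECT 1")) {
                  rs.next();
                  System.out.println("Spark SQL endpoint responded: " + rs.getInt(1));
              }
          }
      }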

    • Impala Task Configuration

      Parameter

      Description

      Impala Tasks

      If the Hadoop cluster has Impala deployed, it supports enabling Impala tasks.

      JDBC URL

      Enter the JDBC connection address for Impala, for example, jdbc:impala://host:port/. The schema is not required.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication methods.

      • No Authentication: The no authentication method requires entering the Impala username.

      • LDAP: The LDAP authentication method requires entering the username and password of Impala.

      • Kerberos: The Kerberos authentication method requires uploading the Keytab File authentication file and configuring the Principal.

      Username

      Enter the Impala username.

      Note

      This item can only be configured when the authentication method is No Authentication or LDAP. The specified user must have task execution permissions so that tasks can run normally.

      Password

      Enter the password of the Impala user.

      Note

      This item can only be configured when the authentication method is selected as LDAP.

      Development Task Request Pool

      Enter the name of the Impala request pool used for development tasks.

      Periodic Task Request Pool

      Enter the name of the Impala request pool used for periodic tasks.
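
      As a hedged sketch of how the request pools above come into play, the Java snippet below connects to Impala over JDBC and selects a request pool for the session before running a query. The URL, credentials, and pool name are placeholder assumptions, and an Impala JDBC driver (the driver class name varies by driver package) must be on the classpath.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class ImpalaJdbcCheck {
          public static void main(String[] args) throws Exception {
              String url = "jdbc:impala://impala-host:21050/"; // placeholder host and default Impala JDBC port
              try (Connection conn = DriverManager.getConnection(url, "impala_user", "impala_password");
                   Statement stmt = conn.createStatement()) {
                  // REQUEST_POOL is the Impala query option that typically selects a request pool for the session.
                  stmt.execute("SET REQUEST_POOL=dev_pool");
                  try (ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                      while (rs.next()) {
                          System.out.println(rs.getString(1));
                      }
                  }
              }
          }
      }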

  5. Click Test Connection to initiate automatic service connection tests.

    If the connection test is successful, you can save the settings. A failed test will prompt a Test Connection Failed dialog box, displaying the services that failed and their error details.

  6. Once the connection test is successful, click Save to finalize the Hadoop cluster creation.

Manage Hadoop clusters

  1. On the Dataphin home page, navigate to Planning > Compute Source via the top menu bar.

  2. On the Compute Source page, click Manage Hadoop Clusters.

  3. In the Manage Hadoop Clusters dialog box, you can view the list of Hadoop clusters, which includes details such as cluster name, cluster administrator, associated compute source, and creation and modification information.

    • Associated Compute Source: Shows the total number of associated compute sources. Click the expand icon to view the list, and click a compute source name to go to its page.

    • Creation Information: Documents the user and time of creation.

    • Modification Information: Documents the username and time of the last edit for the cluster.

    Note

    Compute tasks can only run within a single cluster. Data cannot be joined across different Hadoop clusters.

  4. (Optional) Use the search box to perform a fuzzy search by entering the cluster name.

  5. In the Actions column of the cluster list, you can manage the selected cluster with the available operations.

    Operation

    Description

    View

    In the Actions column of the target cluster, click the view icon to view detailed information about the current version of the cluster. Users with Hadoop Cluster - Management permission can download the cluster configuration file.

    Edit

    In the Actions column of the target cluster, click the edit icon to open the Edit Hadoop Cluster page, where you can modify the existing configuration. If the service configuration for Spark Jar tasks, Spark SQL tasks, or Impala tasks is enabled and the compute sources associated with the cluster have already enabled the corresponding service, you cannot disable that service.

    After editing, if you only modified the cluster's basic information and cluster security control information, you can save directly without testing the connection. If you made other modifications, you must test the connection. After the connection test succeeds, click Save, fill in the Change Description in the dialog box that appears, and click OK.

    Clone

    In the Actions column of the target cluster, click the clone icon. The system automatically clones all data of the current cluster and opens the Create Hadoop Cluster page, where you can modify the configuration based on the existing settings.

    History

    In the Actions column of the target cluster, click the icon and select History. The dialog box displays the version information of the current cluster, including the version name, modifier, and change description. You can View, Compare, and Roll Back historical versions.

    • View: In the Actions column of the target version, click the view icon to go to the View Hadoop Cluster page and view detailed information about that version of the cluster. Users with Hadoop Cluster - Management permission can download the cluster configuration file.

    • Compare: In the Actions column of the target version, click the compare icon to go to the version comparison page. On the comparison page, you can select different versions from the filter drop-down list. By default, the current version of the Hadoop cluster is compared with the target version.

    • Roll Back: In the Actions column of the target version, click the roll back icon. In the dialog box that appears, click OK.

      After you click OK, the system automatically tests the connection for that version's cluster information. If the test passes, the rollback proceeds normally; if the rollback itself fails, a failure prompt appears in which you can view the specific reason. If the connection test fails, the rollback is aborted, and the dialog box that appears lists the services that failed the test.

    Delete

    Note
    • You can delete the current cluster only when there are no associated compute sources under the current Hadoop cluster.

    • Once the cluster is deleted, it cannot be restored.

    In the Actions column of the target cluster, click the icon and select Delete. In the confirmation dialog box that appears, click Confirm.