Dataphin: Create a Hadoop compute source

Last Updated: Sep 30, 2025

A Hadoop compute source connects a Dataphin project to a Hadoop project. It provides the compute resources that a Dataphin project requires to process offline computing tasks. If the Dataphin compute engine is set to Hadoop, the project must have a Hadoop compute source to support features such as standard modeling, ad hoc queries, Hive tasks, and general scripts. This topic describes how to create a Hadoop compute source.

Prerequisites

Before you start, make sure that the following requirements are met:

  • The Dataphin compute engine is set to Hadoop. For more information, see Set the compute engine to Hadoop.

  • The Hive user has the following permissions:

    • CREATEFUNCTION permission.

      Important

      The CREATEFUNCTION permission is required to register user-defined functions (UDFs) in Hive through Dataphin. Without this permission, you cannot create UDFs in Dataphin or use the Dataphin asset security features.

    • Read, write, and execute permissions for the directory where UDFs are stored in the Hadoop Distributed File System (HDFS).

      The default HDFS directory for UDFs is /tmp/dataphin. You can change this directory as needed. A sketch that checks this directory appears after the prerequisites list.

  • To enable Impala tasks for saved searches and data analysis, you must deploy Impala V2.5 or later on your Hadoop cluster.

  • If you use the E-MapReduce 5.x compute engine and need to use Hive foreign tables based on OSS for offline integration, you must complete the required configurations. For more information, see Use Hive foreign tables based on OSS for offline integration.
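
The UDF directory prerequisite can also be checked from client code. The following is a minimal Java sketch, not part of Dataphin, that uses the Hadoop FileSystem API to create the default /tmp/dataphin directory and set its permissions; the NameNode endpoint and the permission mode are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class UdfDirectoryCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode IPC endpoint; replace with the address of your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Default Dataphin UDF directory; change it if you use a custom path.
            Path udfDir = new Path("/tmp/dataphin");
            if (!fs.exists(udfDir)) {
                fs.mkdirs(udfDir);
            }
            // Example mode only: the Hive user needs read, write, and execute access.
            fs.setPermission(udfDir, new FsPermission((short) 0775));
            System.out.println("UDF directory ready: " + fs.getFileStatus(udfDir));
        }
    }
}
```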

Impala task limits

To enable Impala tasks for saved searches and data analysis, the following limits apply in Dataphin:

  • Only Impala V2.5 or later is supported.

  • Logical tables do not support the Impala engine. However, you can use Impala to query logical tables.

  • Impala data sources and compute sources in Dataphin use the Impala Java Database Connectivity (JDBC) client to connect to the Impala JDBC port, which is 21050 by default. The Hive JDBC port is not supported. If you want to create Impala tasks or data sources in Dataphin, contact your cluster provider to confirm that Impala JDBC connections are supported. A connection sketch appears after this list.

  • Hive cannot access Kudu tables. This leads to the following limits:

    • You cannot use Hive SQL to access Kudu tables. Attempting to do so causes the SQL statement to fail and returns the following error: FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hadoop.hive.kudu.KuduInputFormat.

    • You cannot use Kudu tables as source tables for modeling. If a source table is a Kudu table, the execution fails.

    • Asset security scan tasks use Impala SQL to scan Kudu tables. If Impala is not enabled for the project where the scan task is located, Kudu tables cannot be scanned.

    • When a quality rule is executed for a Kudu table, Impala SQL is used for verification. If Impala is not enabled, the verification task fails.

    • The label platform does not support using Kudu tables as offline view tables.

  • The storage usage of Kudu tables cannot be retrieved.

    • The storage usage of Kudu tables is not available in asset details.

    • The empty table administration feature in resource administration does not support Kudu tables.

    • Quality rules for table size and partition size do not support Kudu tables.
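
As referenced above, Impala connectivity goes through the Impala JDBC port. The following minimal Java sketch shows such a connection; the host and database are placeholders, it assumes an Impala JDBC driver is on the classpath, and the driver class name varies by driver version.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the Impala JDBC port (21050 by default), not the Hive JDBC port.
        // Host and database are placeholders.
        String url = "jdbc:impala://impala-host:21050/default";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println("Impala reachable, result: " + rs.getInt(1));
            }
        }
    }
}
```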

Spark SQL service limits

To enable the Spark SQL service, the following limits apply in Dataphin:

  • Only Spark V3.x is supported.

  • Spark Thrift Server, Kyuubi, or Livy services must be deployed and enabled on the Hadoop cluster.

  • Dataphin does not verify database permissions for Spark Call commands. Use this feature with caution.

  • The service configurations for Spark SQL must be the same in the development and production compute sources. If they are different, you cannot configure Spark resource settings for Spark SQL tasks.

Compute engines and supported service types

Different compute engines support different service types:

  • E-MapReduce 3.x: Spark Thrift Server and Kyuubi are supported. Livy and MapReduce (MRS) are not supported.

  • E-MapReduce 5.x: Spark Thrift Server and Kyuubi are supported. Livy and MapReduce (MRS) are not supported.

  • CDH 5.X and CDH 6.X: Kyuubi is supported. Spark Thrift Server, Livy, and MapReduce (MRS) are not supported.

  • Cloudera Data Platform: Kyuubi and Livy are supported. Spark Thrift Server and MapReduce (MRS) are not supported.

  • FusionInsight 8.X: MapReduce (MRS) is supported. Spark Thrift Server, Kyuubi, and Livy are not supported.

  • AsiaInfo DP 5.3: Spark Thrift Server and Kyuubi are supported. Livy and MapReduce (MRS) are not supported.

Procedure

  1. In the top menu bar of the Dataphin homepage, choose Planning > Compute Source.

  2. On the Compute Source page, click + Add Compute Source, and then choose Hadoop Compute Source.

  3. On the Create Compute Source page, set the following parameters.

    You can configure the compute source by selecting Reference specified cluster or Configure separately. The available configuration items vary depending on the method you select.

    Reference specified cluster configuration

    • Basic information of the compute source

      Parameter

      Description

      Compute Source Type

      The default value is Hadoop.

      Compute Source Name

      Observe the following naming conventions:

      • The name can contain only Chinese characters, letters, digits, underscores (_), and hyphens (-).

      • The name can be up to 64 characters in length.

      Configuration Method

      Select Reference Specified Cluster.

      Data Lake Table Format

      This feature is disabled by default. After you enable it, you can select a data lake table format.

      • If the compute engine is Cloudera Data Platform 7.x, the Hudi table format is supported.

      • If the compute engine is E-MapReduce 5.x, the Iceberg and Paimon table formats are supported.

      Note

      This parameter is available only if the compute engine is Cloudera Data Platform 7.x or E-MapReduce 5.x.

      Compute Source Description

      A brief description of the compute source. The description can be up to 128 characters in length.

    • Queue information configuration

      Parameter

      Description

      Production Task Queue

      Enter the YARN resource queue. This queue is used to run manual and auto triggered tasks in the production environment.

      Other Task Queue

      Enter the YARN resource queue. This queue is used for other tasks, such as ad hoc queries, data previews, and JDBC driver access.

      Priority Task Queue

      You can select Use Production Task Default Queue or Custom.

      If you select Custom, you must enter the YARN resource queues that correspond to the highest, high, medium, low, and lowest priorities.

    • Hive compute engine configuration

      Parameter

      Description

      Connection Information

      You can select Reference Cluster Configuration or Configure Separately.

      JDBC URL

      You can configure one of the following types of endpoints:

      • The endpoint of the Hive server. Format: jdbc:hive2://{endpoint}:{port}/{database_name}.

      • The endpoint of ZooKeeper. Example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • The endpoint with Kerberos authentication enabled. Format: jdbc:hive2://{endpoint}:{port}/{database_name};principal=hive/_HOST@xx.com.

      Note

      You can modify the JDBC URL if you select Configure separately for Connection information. If you select Reference cluster configuration, the JDBC URL is view-only.

      Database

      Note

      This parameter is available only if you select Reference cluster configuration for Connection information.

      Enter the database name. The name cannot contain periods (.) and can be up to 256 characters in length.

      Authentication Type

      Note

      This parameter is available only if you select Configure separately for Connection information.

      The supported authentication methods are No Authentication, LDAP, and Kerberos.

      • No authentication: Enter the username for the Hive service.

      • LDAP: Enter the username and password for the Hive service.

        Note

        For the No authentication and LDAP methods, make sure that the specified user has the permissions to execute tasks.

      • Kerberos: If the Hadoop cluster uses Kerberos authentication, enable Hive Kerberos, upload the keytab file, and configure the principal.

        • Keytab File: Upload the keytab file. You can obtain this file from the Hive server.

        • Principal: Enter the Kerberos authentication username that corresponds to the Hive keytab file.

      Execution Engine

      • Default: Nodes, including logical table tasks, in the project that is attached to this compute source use this execution engine by default.

      • Custom: Select another type of execution engine.
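
      The JDBC URL formats listed above map to a standard Hive JDBC connection. As a rough illustration only, the following Java sketch uses the Kerberos-style URL; it assumes the Apache Hive JDBC driver is on the classpath and that Kerberos credentials are already available to the JVM (for example through kinit or a UserGroupInformation login). Hostnames, ports, database, and realm are placeholders.

      ```java
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class HiveJdbcCheck {
          public static void main(String[] args) throws Exception {
              // Kerberos-style Hive JDBC URL; all values are placeholders.
              String url = "jdbc:hive2://hive-host:10000/dataphin_db;principal=hive/_HOST@EXAMPLE.COM";

              // Assumes valid Kerberos credentials are already available to the JVM.
              try (Connection conn = DriverManager.getConnection(url);
                   Statement stmt = conn.createStatement();
                   ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                  while (rs.next()) {
                      System.out.println(rs.getString(1));
                  }
              }
          }
      }
      ```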

    • Spark Jar service configuration

      Note

      If you select Reference cluster configuration and the referenced cluster does not have the Spark local client enabled, you cannot configure the Spark Jar service.

      Parameter

      Description

      Spark Execution Machine

      If Spark is deployed on the Hadoop cluster, you can enable Spark Jar Tasks.

      Spark Local Client

      If the referenced cluster has the Spark local client enabled, this option is enabled by default.

      After you disable it, the project that corresponds to the current compute source cannot use the Spark local client. If the project contains nodes, including draft nodes, that use the Spark local client, you cannot disable this option.

    • Spark SQL service configuration

      Note

      If you select Reference cluster configuration and the referenced cluster does not have the Spark SQL service enabled, you cannot configure the Spark SQL service.

      Parameter

      Description

      Spark SQL task

      If Spark is deployed on the Hadoop cluster, you can enable Spark SQL Tasks.

      Note

      If you select Paimon for Data lake table format, you cannot disable Spark SQL tasks.

      Connection information

      You can select Reference Cluster Configuration or Configure Separately.

      Spark version

      Only Spark V3.x is supported.

      Service type

      Select the type of the destination server for Spark JDBC access. Different compute engines support different service types. For more information, see Compute engines and supported service types.

      JDBC URL

      The JDBC URL of Spark. The database in the URL must be the same as the database specified in the Hive JDBC URL.

      Note

      You can modify the JDBC URL if you select Configure separately for Connection information. If you select Reference cluster configuration, the JDBC URL is view-only.

      Database

      Note

      This parameter is available only if you select Reference cluster configuration for Connection information.

      Enter the database name. The name cannot contain periods (.) and can be up to 256 characters in length.

      Authentication method

      Note

      This parameter is available only if you select Configure separately for Connection information.

      The supported authentication methods are No Authentication, LDAP, and Kerberos.

      • No authentication: Enter the username for the Spark service.

      • LDAP: Enter the username and password for the Spark service.

        Note

        For the No authentication and LDAP methods, make sure that the specified user has the permissions to execute tasks.

      • Kerberos: If the Hadoop cluster uses Kerberos authentication, enable Spark Kerberos, upload the keytab file, and configure the principal.

        • Keytab File: Upload the keytab file. You can obtain this file from the Spark server.

        • Principal: Enter the Kerberos authentication username that corresponds to the Spark keytab file.

      SQL task queue settings

      Different service types use different SQL task queues. Details are as follows:

      • Spark Thrift Server: You cannot set a task queue.

      • Kyuubi: Uses the priority queue settings from the HDFS information configuration. This takes effect only when Kyuubi uses YARN for resource scheduling. Production tasks use the connection sharing level.

      • Livy: Uses the priority queue settings from the HDFS information configuration. This takes effect only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks are executed using a new connection.

      • MapReduce (MRS): Uses the priority queue settings from the HDFS information configuration.
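
      Because Spark Thrift Server exposes the HiveServer2 protocol, the Spark SQL connection described above can be exercised with the Hive JDBC driver. The following is a minimal sketch assuming an LDAP-authenticated Spark Thrift Server; the host, port, database, and credentials are placeholders, and the database must match the one in the Hive JDBC URL.

      ```java
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class SparkSqlJdbcCheck {
          public static void main(String[] args) throws Exception {
              // Spark Thrift Server speaks the HiveServer2 protocol, so a hive2-style URL is used.
              String url = "jdbc:hive2://spark-thrift-host:10001/dataphin_db";

              // LDAP authentication: pass the Spark service username and password (placeholders).
              try (Connection conn = DriverManager.getConnection(url, "spark_user", "spark_password");
                   Statement stmt = conn.createStatement();
                   ResultSet rs = stmt.executeQuery("SELECT current_database()")) {
                  while (rs.next()) {
                      System.out.println("Connected to database: " + rs.getString(1));
                  }
              }
          }
      }
      ```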

    • Impala task configuration

      Note

      If you select Reference cluster configuration and the referenced cluster does not have Impala tasks enabled, you cannot configure the Impala task service.

      Parameter

      Description

      Impala Task

      If Impala is deployed on the Hadoop cluster, you can enable Impala tasks.

      Connection Information

      You can select Reference Cluster Configuration or Configure Separately.

      JDBC URL

      Enter the JDBC endpoint of Impala. Example: jdbc:impala://host:port/database. The database in the JDBC URL must be the same as the database in the Hive JDBC URL.

      Note

      You can modify the JDBC URL if you select Configure separately for Connection information. If you select Reference cluster configuration, the JDBC URL is view-only.

      Database

      Note

      This parameter is available only if you select Reference cluster configuration for Connection information.

      Enter the database name. The name cannot contain periods (.) and can be up to 256 characters in length.

      Authentication Type

      Note

      This parameter is available only if you select Configure separately for Connection information.

      The supported authentication methods are No Authentication, LDAP, and Kerberos.

      • No authentication: Enter the Impala username.

      • LDAP: Enter the username and password for Impala.

      • Kerberos: Upload the keytab file and configure the principal.

      Development Task Request Pool

      Enter the name of the Impala request pool for development tasks.

      Auto Triggered Task Request Pool

      Enter the name of the Impala request pool for auto triggered tasks.

      Priority Task Queue

      Supports Use Auto Triggered Task Default Queue and Custom.

      • When Dataphin schedules Impala SQL tasks, it sends the tasks to the corresponding queues for execution based on their priorities. The priorities are highest, high, medium, low, and lowest.

      • If you customize the priority task queue, logical table tasks that are scheduled to run daily use the medium-priority task queue by default. Logical table tasks that are scheduled to run yearly or monthly use the low-priority task queue by default.

    Configure separately

    • Basic information of the compute source

      Parameter

      Description

      Compute Source Type

      The default value is Hadoop.

      Compute Source Name

      Observe the following naming conventions:

      • The name can contain only Chinese characters, letters, digits, underscores (_), and hyphens (-).

      • The name can be up to 64 characters in length.

      Configuration Method

      Select Configure Separately.

      Data Lake Table Format

      This feature is disabled by default. After you enable it, you can select a data lake table format. Currently, only Hudi is supported.

      Note

      This parameter is available only if the compute engine is Cloudera Data Platform 7.x.

      Compute Source Description

      A brief description of the compute source. The description can be up to 128 characters in length.

    • Basic information of the cluster

      Note

      You can configure the basic information of the cluster only if you select Configure separately.

      Parameter

      Description

      Cluster Storage

      This parameter is set to the value configured in compute settings and cannot be changed. This parameter is not available for clusters that do not use OSS-HDFS storage.

      NameNode

      Click + Add. In the Add NameNode dialog box, configure the parameters. You can add multiple NameNodes.

      The NameNode is the hostname or IP address and port of the NameNode in the HDFS cluster. Example:

      • NameNode: 192.168.xx.xx

      • Web UI Port: 50070

      • IPC Port: 8020

      You must configure at least one of the Web UI port and the IPC port. After configuration, the NameNode value is host=192.168.xx.xx,webUiPort=50070,ipcPort=8020.

      Note

      This parameter is available only if you set Cluster storage to HDFS.

      Cluster Storage Root Directory

      This parameter is set to the value configured in compute settings and cannot be changed. This parameter is not available for clusters that do not use OSS-HDFS storage.

      AccessKey ID, AccessKey Secret

      If the cluster storage type is OSS-HDFS, enter the AccessKey ID and AccessKey secret that are used to access the OSS of the cluster. For more information about how to view an AccessKey pair, see View the AccessKey pair of a RAM user.

      Important

      The configuration that you enter here has a higher priority than the AccessKey pair configured in the core-site.xml file.

      core-site.xml

      Upload the core-site.xml configuration file of the Hadoop cluster.

      hdfs-site.xml

      Upload the hdfs-site.xml configuration file of HDFS in the Hadoop cluster.

      Note

      You cannot upload the hdfs-site.xml configuration file if the cluster storage type is OSS-HDFS.

      hive-site.xml

      Upload the hive-site.xml configuration file of Hive in the Hadoop cluster.

      yarn-site.xml

      Upload the yarn-site.xml configuration file of YARN in the Hadoop cluster.

      Other Configuration Files

      Upload the keytab file. You can run the ipa-getkeytab command on a NameNode in the HDFS cluster to obtain the file.

      Task Execution Machine

      Configure the endpoint of the machine that executes MapReduce or Spark Jar tasks. Format: hostname:port or ip:port. The default port is 22.

      Authentication Type

      The supported authentication methods are No authentication and Kerberos.

      Kerberos is an identity authentication protocol that is based on symmetric key technology. It provides identity authentication for other services and supports single sign-on (SSO). After a client is authenticated, it can access multiple services, such as HBase and HDFS.

      If the Hadoop cluster uses Kerberos authentication, enable cluster Kerberos and upload the krb5.conf file or configure the KDC server address.

      Important

      If the compute engine is E-MapReduce 5.x, only the Krb5 authentication file method is supported.

      • Krb5 authentication file: Upload the krb5.conf file for Kerberos authentication.

      • KDC server address: The address of the Key Distribution Center (KDC) server, which assists with Kerberos authentication.

      Note

      You can configure multiple KDC server addresses. Separate them with semicolons (;).
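
      The keytab, principal, and krb5.conf items above correspond to a standard Kerberos login against the cluster. As an illustration of what such a login looks like in client code, the following sketch uses Hadoop's UserGroupInformation API; the krb5.conf path, principal, and keytab path are placeholders, and this is not how Dataphin itself performs the login.

      ```java
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.security.UserGroupInformation;

      public class KerberosLoginCheck {
          public static void main(String[] args) throws Exception {
              // Point the JVM at the krb5.conf file that describes the KDC (placeholder path).
              System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");

              Configuration conf = new Configuration();
              conf.set("hadoop.security.authentication", "kerberos");
              UserGroupInformation.setConfiguration(conf);

              // Principal and keytab path are placeholders; use the values for your cluster.
              UserGroupInformation.loginUserFromKeytab(
                      "hive/host.example.com@EXAMPLE.COM", "/path/to/hive.keytab");

              System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
          }
      }
      ```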

    • HDFS information configuration

      Parameter

      Description

      Execution Username, Password

      The username and password to log on to the task execution machine. They are used to execute MapReduce tasks and read data from and write data to HDFS.

      Important

      Make sure that you have the permissions to submit MapReduce tasks.

      Authentication Type

      The supported methods are No Authentication and Kerberos.

      Note

      If the cluster storage is OSS-HDFS, you cannot configure an HDFS authentication method. The AccessKey pair in the core-site.xml file is used by default.

      If the Hadoop cluster uses Kerberos authentication, enable HDFS Kerberos, upload the keytab file, and configure the principal.

      • Keytab File: Upload the keytab file. You can obtain this file from the HDFS server.

      • Principal: Enter the Kerberos authentication username that corresponds to the HDFS keytab file.

      HDFS User

      The username for file uploads. If you leave this empty, the execution username is used by default. You can set this parameter when Kerberos is disabled.

      Production Task Default Queue

      Enter the YARN resource queue. This queue is used to run manual and auto triggered tasks in the production environment.

      Other Task Queue

      Enter the YARN resource queue. This queue is used for other tasks, such as ad hoc queries, data previews, and JDBC driver access.

      Task Priority Queue

      You can select Use Production Task Default Queue or Custom.

      • When Dataphin schedules Hive SQL tasks, it sends the tasks to the corresponding queues for execution based on their priorities. The priorities are highest, high, medium, low, and lowest.

      • If you set the Hive execution engine to Tez or Spark, you must configure different priority queues for the task priority settings to take effect.

        Note
        • Logical table tasks that are scheduled to run daily or hourly use the medium-priority task queue by default.

        • Logical table tasks that are scheduled to run yearly or monthly use the low-priority task queue by default.
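
      The queues configured above are ordinary YARN resource queues. For context only, the following sketch shows how a queue is typically selected for a Hive session over JDBC; the endpoint, credentials, queue name, and table names are placeholders, and Dataphin applies the configured queues automatically, so you do not need to do this yourself.

      ```java
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;

      public class QueueRoutingSketch {
          public static void main(String[] args) throws Exception {
              // Placeholder HiveServer2 endpoint and credentials.
              String url = "jdbc:hive2://hive-host:10000/dataphin_db";

              try (Connection conn = DriverManager.getConnection(url, "hive_user", "");
                   Statement stmt = conn.createStatement()) {
                  // Route subsequent statements in this session to a specific YARN queue.
                  // For the Tez execution engine, the equivalent property is tez.queue.name.
                  stmt.execute("SET mapreduce.job.queuename=root.production");
                  stmt.execute("INSERT OVERWRITE TABLE demo_target SELECT * FROM demo_source");
              }
          }
      }
      ```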

    • Hive compute engine configuration

      Parameter

      Description

      JDBC URL

      You can configure one of the following types of endpoints:

      • The endpoint of the Hive server. Format: jdbc:hive2://{endpoint}:{port}/{database_name}.

      • The endpoint of ZooKeeper. Example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • The endpoint with Kerberos authentication enabled. Format: jdbc:hive2://{endpoint}:{port}/{database_name};principal=hive/_HOST@xx.com.

      Authentication Type

      Note

      This parameter is available only if you select Configure separately for Connection information.

      The supported authentication methods are No Authentication, LDAP, and Kerberos.

      • No authentication: Enter the username for the Hive service.

      • LDAP: Enter the username and password for the Hive service.

        Note

        For the No authentication and LDAP methods, make sure that the specified user has the permissions to execute tasks.

      • Kerberos: If the Hadoop cluster uses Kerberos authentication, enable Hive Kerberos, upload the keytab file, and configure the principal.

        • Keytab File: Upload the keytab file. You can obtain this file from the Hive server.

        • Principal: Enter the Kerberos authentication username that corresponds to the Hive keytab file.

      Execution Engine

      • Default: Nodes, including logical table tasks, in the project that is attached to this compute source use this execution engine by default.

      • Custom: Select another type of execution engine.

    • Hive metadata configuration

      Metadata retrieval method: Three metadata retrieval methods are supported: Metadata Database, HMS, and DLF. Each method requires different configuration information.

      Important
      • The DLF retrieval method is supported only for clusters that use E-MapReduce 5.x Hadoop as the compute engine.

      • To use the DLF method to retrieve metadata, you must first upload the hive-site.xml configuration file.

      Metadata retrieval method

      Parameter

      Description

      Metadata Database

      Database Type

      Select a database based on the metadatabase type used in the cluster. Dataphin supports MySQL.

      The supported MySQL versions are MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.

      JDBC URL

      Enter the JDBC endpoint of the destination database. Example:

      MySQL: The format is jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1=propertyValue1[&propertyName2=propertyValue2]...].

      Username, Password

      Enter the username and password to log on to the metadatabase.

      HMS

      Authentication Type

      The HMS retrieval method supports No authentication, LDAP, and Kerberos. The Kerberos authentication method requires you to upload a keytab file and configure a principal.

      DLF

      Endpoint

      Enter the DLF endpoint of the region where the cluster resides. To obtain the endpoint, see DLF regions and endpoints.

      AccessKey ID, AccessKey Secret

      Enter the AccessKey ID and AccessKey secret of the account to which the cluster belongs.

      You can obtain the AccessKey ID and AccessKey secret of your account on the User Information Management page.
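
      For the Metadata Database method, the JDBC URL above is a standard MySQL connection string. The following minimal sketch only verifies that the metastore database is reachable; it assumes the MySQL Connector/J driver is on the classpath, and the host, database name, credentials, and the TBLS table query are illustrative.

      ```java
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class MetastoreDbCheck {
          public static void main(String[] args) throws Exception {
              // Placeholder endpoint and database name for the Hive metastore database.
              String url = "jdbc:mysql://metastore-host:3306/hive_metastore?useSSL=false";

              try (Connection conn = DriverManager.getConnection(url, "meta_user", "meta_password");
                   Statement stmt = conn.createStatement();
                   // TBLS is one of the tables in the standard Hive metastore schema.
                   ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM TBLS")) {
                  while (rs.next()) {
                      System.out.println("Tables registered in the metastore: " + rs.getLong(1));
                  }
              }
          }
      }
      ```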

    • Spark Jar service configuration

      Parameter

      Description

      Spark Execution Machine

      If Spark is deployed on the Hadoop cluster, you can enable Spark Jar Tasks.

      Execution Username, Password

      Enter the username and password to log on to the task execution machine.

      Important

      Make sure that the specified user has the permissions to submit MapReduce tasks.

      Authentication Type

      The supported authentication methods are No Authentication and Kerberos.

      If the Hadoop cluster uses Kerberos authentication, enable Spark Kerberos, upload the keytab file, and configure the principal.

      • Keytab File: Upload the keytab file. You can obtain this file from the Spark server.

      • Principal: Enter the Kerberos authentication username that corresponds to the Spark keytab file.

    • Spark SQL service configuration

      Parameter

      Description

      Spark SQL task

      If Spark is deployed on the Hadoop cluster, you can enable Spark SQL Tasks.

      Spark version

      Only Spark V3.x is supported.

      Service type

      Select the type of the destination server for Spark JDBC access. Different compute engines support different service types. For more information, see Compute engines and supported service types.

      JDBC URL

      The JDBC URL of Spark. The database in the URL must be the same as the database specified in the Hive JDBC URL.

      Authentication method

      The supported authentication methods are No Authentication, LDAP, and Kerberos.

      • No authentication: Enter the username for the Spark service.

      • LDAP: Enter the username and password for the Spark service.

        Note

        For the No authentication and LDAP methods, make sure that the specified user has the permissions to execute tasks.

      • Kerberos: If the Hadoop cluster uses Kerberos authentication, enable Spark Kerberos, upload the keytab file, and configure the principal.

        • Keytab File: Upload the keytab file. You can obtain this file from the Spark server.

        • Principal: Enter the Kerberos authentication username that corresponds to the Spark keytab file.

      SQL task queue settings

      Different service types use different SQL task queues. Details are as follows:

      • Spark Thrift Server: You cannot set a task queue.

      • Kyuubi: Uses the priority queue settings from the HDFS information configuration. This takes effect only when Kyuubi uses YARN for resource scheduling. Production tasks use the connection sharing level.

      • Livy: Uses the priority queue settings from the HDFS information configuration. This takes effect only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks are executed using a new connection.

      • MapReduce (MRS): Uses the priority queue settings from the HDFS information configuration.

    • Impala task configuration

      Parameter

      Description

      Impala Task

      If Impala is deployed on the Hadoop cluster, you can enable Impala tasks.

      JDBC URL

      Enter the JDBC endpoint of Impala. Example: jdbc:impala://host:port/database. The database in the JDBC URL must be the same as the database in the Hive JDBC URL.

      Note

      If you select Reference cluster configuration for the connection information, the JDBC URL is view-only.

      Authentication Type

      The supported authentication methods are No Authentication, LDAP, and Kerberos.

      • No authentication: Enter the Impala username.

      • LDAP: Enter the username and password for Impala.

      • Kerberos: Upload the keytab file and configure the principal.

      Development Task Request Pool

      Enter the name of the Impala request pool for development tasks.

      Auto Triggered Task Request Pool

      Enter the name of the Impala request pool for auto triggered tasks.

      Priority Task Queue

      Supports Use Auto Triggered Task Default Queue and Custom.

      • When Dataphin schedules Impala SQL tasks, it sends the tasks to the corresponding queues for execution based on their priorities. The priorities are highest, high, medium, low, and lowest.

      • If you customize the priority task queue, logical table tasks that are scheduled to run daily use the medium-priority task queue by default. Logical table tasks that are scheduled to run yearly or monthly use the low-priority task queue by default.

  4. Click Test Connection to test the connection to the compute source.

  5. After the connection test succeeds, click Submit.

What to do next

After you create a Hadoop compute source, you can attach it to a project. For more information, see Create a general-purpose project.