Dataphin: Create a Hadoop compute source

Last Updated: Feb 10, 2026

A Hadoop compute source binds a Dataphin project to a Hadoop cluster. It provides the compute resources needed to run offline computing tasks in Dataphin. If you set the Dataphin compute engine to Hadoop, only projects with a Hadoop compute source can use features such as standard modeling, ad hoc queries, Hive tasks, and generic scripts. This topic describes how to create a Hadoop compute source.

Prerequisites

Before you begin, make sure that the following requirements are met:

  • Set the Dataphin compute engine to Hadoop. For more information, see Set the compute engine to Hadoop.

  • Make sure the Hive user has the following permissions:

    • CREATEFUNCTION permission.

      Important

      You need this permission to register user-defined functions (UDFs) in Hive through Dataphin. Without this permission, you cannot create UDFs in Dataphin or use Dataphin's asset security features.

    • Read, write, and execute permissions on the HDFS directory where UDFs are stored.

      The default UDF directory in HDFS is /tmp/dataphin. You can change this directory. A verification sketch for these UDF permissions follows this list.

  • If you plan to run Impala tasks for fast queries and data analysis, you must first deploy Impala (version 2.5 or later) on your Hadoop cluster.

  • If you use E-MapReduce 5.x as the compute engine and want to use Hive external tables based on OSS for offline integration, you must first configure your environment. For more information, see Use Hive external tables based on OSS for offline integration.
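For reference, the following minimal Java sketch checks the CREATEFUNCTION permission from outside Dataphin. The endpoint hive-host:10000, the login user, the jar path under /tmp/dataphin, and the class com.example.udf.MyUpper are placeholders; the jar and class must already exist in your environment.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CheckUdfPermission {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://hive-host:10000/default";   // adjust to your HiveServer2 endpoint
            try (Connection conn = DriverManager.getConnection(url, "hive_user", "password");
                 Statement stmt = conn.createStatement()) {
                // Fails with an authorization error if the user lacks the CREATEFUNCTION permission.
                stmt.execute("CREATE FUNCTION tmp_check_upper AS 'com.example.udf.MyUpper' "
                        + "USING JAR 'hdfs:///tmp/dataphin/my-udf.jar'");
                stmt.execute("DROP FUNCTION IF EXISTS tmp_check_upper");
                System.out.println("CREATEFUNCTION permission looks OK");
            }
        }
    }

If the CREATE FUNCTION statement fails with a permission error, grant the Hive user the CREATEFUNCTION permission before you create the compute source.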

Impala task limits

If you enable Impala tasks for fast queries and data analysis, the following limits apply in Dataphin:

  • Only Impala version 2.5 or later is supported.

  • Logical tables do not support the Impala execution engine. However, you can query logical tables using Impala.

  • Dataphin connects to the Impala JDBC port (default: 21050) using the Impala JDBC client. Hive JDBC ports are not supported. Before you create an Impala task or data source in Dataphin, check with your cluster provider to confirm that Impala JDBC connections are supported. A connection sketch follows this list.

  • Because Hive cannot access Kudu tables, the following limits apply:

    • Hive SQL cannot access Kudu tables. If you attempt to access them, the SQL execution fails with the following error: FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hadoop.hive.kudu.KuduInputFormat.

    • You cannot use Kudu tables as source tables for modeling. Tasks that use Kudu source tables will fail.

    • Security scan tasks use Impala SQL to scan Kudu tables. If Impala is not enabled in the project where the scan task runs, Kudu table scanning is not supported.

    • Quality rule checks use Impala SQL for Kudu tables. If Impala is not enabled, the quality check fails.

    • The tag platform does not support Kudu tables as offline view tables.

  • Dataphin does not support retrieving the storage size of Kudu tables.

    • Storage size information for Kudu tables is not available in asset details.

    • The empty-table governance feature in resource administration does not support Kudu tables.

    • Kudu tables do not support quality rules for table size or table partition size.
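As a reference for the JDBC port limit above, the following minimal sketch connects to the Impala JDBC port with the Impala JDBC driver. The endpoint impala-host:21050 is a placeholder, and the driver class name varies by driver version.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CheckImpalaJdbc {
        public static void main(String[] args) throws Exception {
            Class.forName("com.cloudera.impala.jdbc.Driver");        // adjust to your driver version
            String url = "jdbc:impala://impala-host:21050/default";  // Impala JDBC port, not a Hive JDBC port
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT version()")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }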

Spark SQL service limits

If you enable the Spark SQL service, the following limits apply in Dataphin:

  • Only Spark version 3.x is supported.

  • You must deploy and start one of the following services on your Hadoop cluster: Spark Thrift Server, Kyuubi, or Livy.

  • Dataphin does not validate data permissions for Spark Call commands. Use them with caution.

  • The Spark SQL service configuration must be identical for both production and development compute sources. If the configurations differ, you cannot configure Spark resource settings for Spark SQL tasks.
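For reference, both Spark Thrift Server and Kyuubi implement the HiveServer2 Thrift protocol, so the standard Hive JDBC driver can be used to check that a Spark SQL endpoint is reachable. In the minimal sketch below, the endpoint spark-host:10001 and the user name are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CheckSparkSqlEndpoint {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://spark-host:10001/default";   // adjust to your Spark Thrift Server or Kyuubi endpoint
            try (Connection conn = DriverManager.getConnection(url, "spark_user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT 1")) {
                if (rs.next()) {
                    System.out.println("Spark SQL endpoint reachable: " + rs.getInt(1));
                }
            }
        }
    }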

Compute engines and supported service types

The supported service types vary based on the compute engine.

  • E-MapReduce 3.x: Spark Thrift Server and Kyuubi are supported. Livy and MapReduce (MRS) are not supported.

  • E-MapReduce 5.x: Spark Thrift Server and Kyuubi are supported. Livy and MapReduce (MRS) are not supported.

  • CDH 5.x and CDH 6.x: Only Kyuubi is supported.

  • Cloudera Data Platform: Kyuubi and Livy are supported. Spark Thrift Server and MapReduce (MRS) are not supported.

  • FusionInsight 8.x: Only MapReduce (MRS) is supported.

  • AsiaInfo DP 5.3: Spark Thrift Server and Kyuubi are supported. Livy and MapReduce (MRS) are not supported.

  • Amazon EMR: Spark Thrift Server and MapReduce (MRS) are supported. Kyuubi and Livy are not supported.

Procedure

  1. In the top menu bar on the Dataphin homepage, choose Planning > Compute Source.

  2. On the Compute Source page, click + Add compute source, and then select Hadoop compute source.

  3. On the Create Compute Source page, configure the parameters.

    You can configure the compute source by either referencing a specified cluster or using a standalone configuration. The available parameters depend on the method that you select.

    Reference a specified cluster

    • Basic compute source information

      Parameter

      Description

      Compute source type

      Default: Hadoop.

      Compute source name

      Naming rules:

      • The value can contain only English letters, digits, underscores (_), hyphens (-), and Chinese characters.

      • Maximum length: 64 characters.

      Configuration method

      Select Reference a specified cluster.

      Data lake table format

      Disabled by default. Enable it to select a data lake table format.

      • For Cloudera Data Platform 7.x, supported formats: Hudi.

      • For E-MapReduce 5.x, supported formats: Iceberg and Paimon.

      Note

      This option is supported only for Cloudera Data Platform 7.x or E-MapReduce 5.x.

      Compute source description

      A brief description. Maximum length: 128 characters.

    • Queue information

      Parameter

      Description

      Production task queue

      Enter the YARN resource queue used for manual and scheduled tasks in production environments.

      Other task queues

      Enter the YARN resource queue used for other tasks, such as ad hoc queries, data previews, and JDBC Driver access.

      Priority task queue

      Select Use production task default queue or Custom.

      If you select Custom, enter separate YARN resource queues for highest, high, medium, low, and lowest priority tasks.
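      Optionally, you can verify that a queue name exists on the cluster before you enter it. The following minimal sketch uses the YARN client API; it assumes the cluster's core-site.xml and yarn-site.xml are on the classpath, and the queue name dataphin_prod is a placeholder.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.yarn.api.records.QueueInfo;
        import org.apache.hadoop.yarn.client.api.YarnClient;

        public class CheckYarnQueue {
            public static void main(String[] args) throws Exception {
                YarnClient yarn = YarnClient.createYarnClient();
                yarn.init(new Configuration());   // reads core-site.xml and yarn-site.xml from the classpath
                yarn.start();
                QueueInfo queue = yarn.getQueueInfo("dataphin_prod");
                System.out.println(queue == null ? "Queue not found" : "Queue capacity: " + queue.getCapacity());
                yarn.stop();
            }
        }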

    • Hive compute engine configuration

      Parameter

      Description

      Connection information

      Select Reference cluster configuration or Standalone configuration.

      JDBC URL

      Supports three connection address formats:

      • HiveServer2 connection address: jdbc:hive2://{connection address}:{port}/{database name}.

      • ZooKeeper connection address. Example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • Kerberos-enabled connection address: jdbc:hive2://{connection address}:{port}/{database name};principal=hive/_HOST@xx.com.

      Note
      • If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view the JDBC URL.

      • For E-MapReduce 3.x, E-MapReduce 5.x, or Cloudera Data Platform, Kerberos-enabled JDBC URLs cannot contain multiple IP addresses.

      Database

      Note

      You can configure the database only if you select Reference cluster configuration.

      Enter the database name. Do not use periods (.). Maximum length: 256 characters.

      Authentication Type

      Note

      You can configure authentication only if you select Standalone configuration.

      Supported methods: No authentication, LDAP, and Kerberos.

      • No authentication: Enter the Hive service username.

      • LDAP: Enter the Hive service username and password.

        Note

        The users you specify for no authentication or LDAP must have task execution permissions.

      • Kerberos: If your Hadoop cluster uses Kerberos, enable Hive Kerberos and upload the Keytab File and Principal.

        • Keytab File: Upload the keytab file. Get it from the Hive Server.

        • Principal: Enter the Kerberos username for the Hive keytab file.

      Execution engine

      • Default: Tasks in projects bound to this compute source—including logical table tasks—use this execution engine by default.

      • Custom: Select another compute engine type.
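      If you use Kerberos authentication, a client typically logs in with the keytab file before opening the JDBC connection that uses the principal-style URL shown above. A minimal sketch, assuming a placeholder realm XX.COM, principal, keytab path, and HiveServer2 endpoint:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.security.UserGroupInformation;

        public class KerberosHiveConnect {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                conf.set("hadoop.security.authentication", "kerberos");
                UserGroupInformation.setConfiguration(conf);
                // Use the same keytab file and principal that you upload in this form.
                UserGroupInformation.loginUserFromKeytab("hive_user@XX.COM", "/path/to/hive_user.keytab");

                Class.forName("org.apache.hive.jdbc.HiveDriver");
                String url = "jdbc:hive2://hive-host:10000/default;principal=hive/_HOST@XX.COM";
                try (Connection conn = DriverManager.getConnection(url)) {
                    System.out.println("Kerberos Hive connection established");
                }
            }
        }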

    • Spark JAR service configuration

      Note

      You cannot configure the Spark JAR service if you select Reference cluster configuration and the referenced cluster does not have the Spark local client enabled.

      Parameter

      Description

      Spark execution machine

      If Spark is deployed on your Hadoop cluster, you can enable Spark JAR tasks.

      Spark local client

      If the referenced cluster has Spark local client enabled, this option is enabled by default.

      Click to disable it. After disabling, projects linked to this compute source cannot use Spark local client. You cannot disable it if any task in the linked project—including draft tasks—uses Spark local client.

    • Spark SQL service configuration

      Note

      You cannot configure the Spark SQL service if you select Reference cluster configuration and the referenced cluster does not have the Spark SQL service enabled.

      Parameter

      Description

      Spark SQL tasks

      If Spark is deployed on your Hadoop cluster, you can enable Spark SQL tasks.

      Note

      If you selected Paimon as the data lake table format, you cannot disable Spark SQL tasks.

      Connection information

      Select Reference cluster configuration or Standalone configuration.

      Spark version

      Only version 3.x is supported.

      Service type

      Select the server type for Spark JDBC access. Supported service types vary by compute engine. For more information, see Compute engines and supported service types.

      JDBC URL

      The Spark JDBC URL. Its database must match the database in the Hive JDBC URL.

      Note

      If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view the JDBC URL.

      Database

      Note

      You can configure the database only if you select Reference cluster configuration.

      Enter the database name. Do not use periods (.). Maximum length: 256 characters.

      Authentication Type

      Note

      You can configure authentication only if you select Standalone configuration.

      Supported methods: No authentication, LDAP, and Kerberos.

      • No authentication: Enter the Spark service username.

      • LDAP: Enter the Spark service username and password.

        Note

        The users you specify for no authentication or LDAP must have task execution permissions.

      • Kerberos: If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.

        • Keytab File: Upload the keytab file. Get it from the Spark Server.

        • Principal: Enter the Kerberos username for the Spark keytab file.

      SQL task queue settings

      Different service types use different SQL task queues. Details:

      • Spark Thrift Server: Task queues are not supported.

      • Kyuubi: Uses the priority queue configured in HDFS settings. Applies only when Kyuubi uses YARN for resource scheduling. Production tasks use shared connections.

      • Livy: Uses the priority queue configured in HDFS settings. Applies only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks use new connections.

      • MapReduce (MRS): Uses the priority queue configured in HDFS settings.

    • Impala task configuration

      Note

      You cannot configure the Impala task service if you select Reference cluster configuration and the referenced cluster does not have Impala tasks enabled.

      Parameter

      Description

      Impala tasks

      If Impala is deployed on your Hadoop cluster, you can enable Impala tasks.

      Connection information

      Select Reference cluster configuration or Standalone configuration.

      JDBC URL

      Enter the Impala JDBC connection address. Example: jdbc:impala://host:port/database. The database in this URL must match the database in the Hive JDBC URL.

      Note

      If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view the JDBC URL.

      Database

      Note

      You can configure the database only if you select Reference cluster configuration.

      Enter the database name. Do not use periods (.). Maximum length: 256 characters.

      Authentication Type

      Note

      You can configure authentication only if you select Standalone configuration.

      Supported methods: No authentication, LDAP, and Kerberos.

      • No authentication: Enter the Impala username.

      • LDAP: Enter the Impala username and password.

      • Kerberos: Upload the Keytab File and configure the Principal.

      Development task request pool

      Enter the Impala request pool name for development tasks.

      Scheduled task request pool

      Enter the Impala request pool name for scheduled tasks.

      Priority task queue

      Choose Use scheduled task default queue or Custom.

      • Dataphin routes Impala SQL tasks to queues based on priority: highest, high, medium, low, and lowest.

      • When customizing, daily logical table tasks use the medium-priority queue by default. Yearly and monthly logical table tasks use the low-priority queue by default.
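      For context, an Impala request pool is the admission-control pool that queries in a session are routed to, and Dataphin routes development and scheduled tasks to the pools you enter above. One common way to select a pool for a session is the REQUEST_POOL query option, illustrated in the minimal sketch below with a placeholder pool name and endpoint; this only illustrates the concept and is not a description of Dataphin's internal behavior.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class ImpalaRequestPoolExample {
            public static void main(String[] args) throws Exception {
                Class.forName("com.cloudera.impala.jdbc.Driver");   // adjust to your driver version
                try (Connection conn = DriverManager.getConnection("jdbc:impala://impala-host:21050/default");
                     Statement stmt = conn.createStatement()) {
                    stmt.execute("SET REQUEST_POOL=dev_pool");       // route queries in this session to the pool
                    stmt.execute("SELECT 1");
                }
            }
        }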

    Standalone configuration

    • Basic compute source information

      Parameter

      Description

      Compute source type

      Default: Hadoop.

      Compute source name

      Naming rules:

      • Use only letters, digits, underscores (_), and hyphens (-).

      • Maximum length: 64 characters.

      Configuration method

      Select Standalone configuration.

      Data lake table format

      Disabled by default. Enable it to select a data lake table format. Only Hudi is supported.

      Note

      This option is supported only for Cloudera Data Platform 7.x.

      Compute source description

      A brief description. Maximum length: 128 characters.

    • Cluster basic information

      Note

      You can configure basic cluster information only if you select Standalone configuration.

      Parameter

      Description

      Cluster storage

      This parameter applies only when the cluster storage type is OSS-HDFS. It uses the default value from the compute settings and cannot be modified.

      NameNode

      Click + Add. In the Add NameNode dialog box, configure parameters. You can add multiple NameNodes.

      Each NameNode entry specifies the host name or IP address and the ports of a NameNode in the HDFS cluster. Example:

      • NameNode: 192.168.xx.xx

      • Web UI Port: 50070

      • IPC Port: 8020

      At least one of Web UI Port or IPC Port is required. After configuration, the NameNode appears as follows: host=192.168.xx.xx,webUiPort=50070,ipcPort=8020.

      Note

      This option is supported only for HDFS clusters.

      Cluster storage root directory

      This parameter applies only when the cluster storage type is OSS-HDFS. It uses the default value from the compute settings and cannot be modified.

      AccessKey ID and AccessKey Secret

      If the cluster storage type is OSS-HDFS, you must specify the AccessKey ID and AccessKey secret used to access the cluster's OSS storage. Use an existing AccessKey pair or see Create an AccessKey pair to create a new one.

      Important
      • To reduce the risk of AccessKey exposure, the AccessKey Secret appears only once during creation and cannot be viewed later. Store it securely.

      • Settings here override those in core-site.xml.

      core-site.xml

      Upload the core-site.xml configuration file from your Hadoop cluster.

      hdfs-site.xml

      Upload the hdfs-site.xml configuration file from your Hadoop cluster’s HDFS.

      Note

      You cannot upload the hdfs-site.xml configuration file if the cluster storage type is OSS-HDFS.

      hive-site.xml

      Upload the hive-site.xml configuration file from your Hadoop cluster’s Hive.

      yarn-site.xml

      Upload the yarn-site.xml configuration file from your Hadoop cluster’s YARN.

      Other configuration files

      Upload the keytab file. Get it from the NameNode in your HDFS cluster using the ipa-getkeytab command.

      Task execution machine

      Configure the connection address for MapReduce or Spark JAR execution machines. Format: hostname:port or ip:port. Default port: 22.

      Authentication Type

      Supported methods: No authentication and Kerberos.

      Kerberos is a symmetric-key-based identity authentication protocol. It supports single sign-on (SSO), letting authenticated clients access multiple services such as HBase and HDFS.

      If your Hadoop cluster uses Kerberos, enable cluster Kerberos and upload the krb5 file or configure the KDC server address:

      Important

      For E-MapReduce 5.x, only krb5 file configuration is supported.

      • Krb5 authentication file: Upload the krb5 file for Kerberos authentication.

      • KDC server address: The KDC server address used to complete Kerberos authentication.

      Note

      You can configure multiple KDC server addresses, separated by semicolons (;).
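      For reference, these two options correspond to the standard ways a Java Kerberos client locates the KDC. The minimal sketch below uses a placeholder realm and KDC host; the properties shown are standard JDK system properties, not Dataphin-specific settings.

        public class Krb5ClientSetup {
            public static void main(String[] args) {
                // Option 1: point the JVM at a krb5 configuration file.
                System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");

                // Option 2: supply the realm and KDC address directly instead of a krb5 file.
                // System.setProperty("java.security.krb5.realm", "XX.COM");
                // System.setProperty("java.security.krb5.kdc", "kdc01.example.com");

                System.out.println("Kerberos client properties configured");
            }
        }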

    • HDFS Configuration

      Parameter

      Description

      Execution username and Password

      Username and password to log on to the compute execution machine. Used for running MapReduce tasks and reading or writing HDFS storage.

      Important

      Ensure the user has permission to submit MapReduce tasks.

      Authentication Type

      Supported methods: No authentication and Kerberos.

      Note

      HDFS authentication is not supported for OSS-HDFS clusters. The AccessKey pair configured in core-site.xml is used by default.

      If your Hadoop cluster uses Kerberos, enable HDFS Kerberos and upload the Keytab File and Principal.

      • Keytab File: Upload the keytab file. Get it from the HDFS Server.

      • Principal: Enter the Kerberos username for the HDFS keytab file.

      HDFS User

      Specify the username for file uploads. If left blank, the execution username is used. Fill this in only when Kerberos is disabled.

      Production task default queue

      Enter the YARN resource queue used for manual and scheduled tasks in production environments.

      Other task queues

      Enter the YARN resource queue used for other tasks, such as ad hoc queries, data previews, and JDBC Driver access.

      Task priority queue

      Select Use production task default queue or Custom.

      • Dataphin routes Hive SQL tasks to queues based on priority: highest, high, medium, low, and lowest.

      • If Hive uses Tez or Spark as the execution engine, you must assign different priority queues for task priorities to take effect.

        Note
        • Daily and hourly logical table tasks use the medium-priority queue by default.

        • Yearly and monthly logical table tasks use the low-priority queue by default.

    • Hive compute engine configuration

      Parameter

      Description

      JDBC URL

      Supports three connection address formats:

      • HiveServer2 connection address: jdbc:hive2://{connection address}:{port}/{database name}.

      • ZooKeeper connection address. Example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • Kerberos-enabled connection address: jdbc:hive2://{connection address}:{port}/{database name};principal=hive/_HOST@xx.com.

      Note

      For E-MapReduce 3.x, E-MapReduce 5.x, or Cloudera Data Platform, Kerberos-enabled JDBC URLs cannot contain multiple IP addresses.

      Authentication Type

      Note

      You can configure authentication only if you select Standalone configuration.

      Supported methods: No authentication, LDAP, and Kerberos.

      • No authentication: Enter the Hive service username.

      • LDAP: Enter the Hive service username and password.

        Note

        The users you specify for no authentication or LDAP must have task execution permissions.

      • Kerberos: If your Hadoop cluster uses Kerberos, enable Hive Kerberos and upload the Keytab File and Principal.

        • Keytab File: Upload the keytab file. Get it from the Hive Server.

        • Principal: Enter the Kerberos username for the Hive keytab file.

      Execution engine

      • Default: Tasks in projects bound to this compute source—including logical table tasks—use this execution engine by default.

      • Custom: Select another compute engine type.

    • Hive metadata configuration

      Metadata retrieval method: You can choose from three methods: Metadata Database, HMS, and DLF. The required parameters depend on the method that you select.

      Important
      • DLF is supported only for clusters that use E-MapReduce 5.x.

      • To use DLF, you must first upload the hive-site.xml configuration file.

      Metadata retrieval method

      Parameter

      Description

      Metadata Database

      Database type

      Select the database type used in your cluster. Dataphin supports MySQL.

      Supported MySQL versions include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.

      JDBC URL

      Enter the JDBC connection address for the target database. Example:

      MySQL: jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1=propertyValue1[&propertyName2=propertyValue2]...].

      Username and Password

      Enter the username and password to log on to the metadata database (a verification sketch follows this configuration block).

      HMS

      Authentication Type

      HMS supports No authentication, LDAP, and Kerberos. For Kerberos, upload the Keytab File and configure the Principal.

      DLF

      Endpoint

      Enter the DLF endpoint for the region where your cluster resides. For instructions, see Supported regions and endpoints.

      AccessKey ID and AccessKey Secret

      Enter the AccessKey ID and AccessKey secret of the account that owns the cluster. Use an existing AccessKey or see Create an AccessKey pair to create a new one.

      Note

      To reduce the risk of AccessKey exposure, the AccessKey Secret appears only once during creation and cannot be viewed later. Store it securely.
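      For the Metadata Database option above, you can verify connectivity and permissions with a quick query against the Hive metastore schema. A minimal sketch, assuming a placeholder MySQL endpoint, database name hivemeta, and credentials, with the MySQL Connector/J driver on the classpath; DBS is a standard table in the Hive metastore schema.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class CheckHiveMetastoreDb {
            public static void main(String[] args) throws Exception {
                String url = "jdbc:mysql://mysql-host:3306/hivemeta";   // adjust to your metadata database
                try (Connection conn = DriverManager.getConnection(url, "meta_user", "password");
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery("SELECT NAME FROM DBS LIMIT 5")) {
                    while (rs.next()) {
                        System.out.println("Hive database: " + rs.getString(1));
                    }
                }
            }
        }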

    • Spark JAR service configuration

      Parameter

      Description

      Spark Executor

      If Spark is deployed on your Hadoop cluster, you can enable Spark JAR tasks.

      Execution username and Password

      Enter the username and password to log on to the compute execution machine.

      Important

      The user must have permission to submit MapReduce tasks.

      Authentication Type

      Supported methods: No authentication or Kerberos.

      If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.

      • Keytab File: Upload the keytab file. Get it from the Spark Server.

      • Principal: Enter the Kerberos username for the Spark keytab file.

    • Spark SQL service configuration

      Parameter

      Description

      Spark SQL tasks

      If Spark is deployed on your Hadoop cluster, you can enable Spark SQL tasks.

      Spark version

      Only version 3.x is supported.

      Service type

      Select the server type for Spark JDBC access. Supported service types vary by compute engine. For more information, see Compute engines and supported service types.

      JDBC URL

      The Spark JDBC URL. Its database must match the database in the Hive JDBC URL.

      Authentication Type

      Supported methods: No authentication, LDAP, and Kerberos.

      • No authentication: Enter the Spark service username.

      • LDAP: Enter the Spark service username and password.

        Note

        The users you specify for no authentication or LDAP must have task execution permissions.

      • Kerberos: If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.

        • Keytab File: Upload the keytab file. Get it from the Spark Server.

        • Principal: Enter the Kerberos username for the Spark keytab file.

      SQL task queue settings

      Different service types use different SQL task queues. Details:

      • Spark Thrift Server: Task queues are not supported.

      • Kyuubi: Uses the priority queue configured in HDFS settings. Applies only when Kyuubi uses YARN for resource scheduling. Production tasks use shared connections.

      • Livy: Uses the priority queue configured in HDFS settings. Applies only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks use new connections.

      • MapReduce (MRS): Uses the priority queue configured in HDFS settings.

    • Impala task configuration

      Parameter

      Description

      Impala tasks

      If Impala is deployed on your Hadoop cluster, you can enable Impala tasks.

      JDBC URL

      Enter the Impala JDBC connection address. Example: jdbc:impala://host:port/database. The database in this URL must match the database in the Hive JDBC URL.

      Note

      If you select Reference cluster configuration, you can only view the JDBC URL.

      Authentication Type

      Supported methods: No authentication, LDAP, and Kerberos.

      • No authentication: Enter the Impala username.

      • LDAP: Enter the Impala username and password.

      • Kerberos: Upload the Keytab File and configure the Principal.

      Development task request pool

      Enter the Impala request pool name for development tasks.

      Scheduled task request pool

      Enter the Impala request pool name for scheduled tasks.

      Priority task queue

      Choose Use scheduled task default queue or Custom.

      • Dataphin routes Impala SQL tasks to queues based on priority: highest, high, medium, low, and lowest.

      • When customizing, daily logical table tasks use the medium-priority queue by default. Yearly and monthly logical table tasks use the low-priority queue by default.

  4. Click Test Connection to verify the connection to the compute source.

  5. After the connection test is successful, click Submit.

What to do next

After you create the Hadoop compute source, you must bind it to a project. For more information, see Create a general project.