
Dataphin: Create Hadoop Compute Source

Last Updated: Jan 23, 2025

The Hadoop compute source allows you to integrate Dataphin project spaces with Hadoop projects, providing essential computing resources for Dataphin to manage offline computing tasks. When Dataphin's compute engine is set to Hadoop, adding a Hadoop compute source to the project space is necessary to enable standard modeling, ad hoc queries, Hive tasks, and general scripting. This topic guides you through the creation of a new Hadoop compute source.

Prerequisites

Before beginning, ensure the following conditions are met:

  • The Dataphin compute engine is set to Hadoop. For instructions, see Set the compute engine to Hadoop.

  • The Hive user must have the following permissions:

    • CREATE FUNCTION permission.

      Important

      The CREATE FUNCTION permission is required to register user-defined functions in Hive through Dataphin. Without this permission, creating user-defined functions or using Dataphin's asset security feature is not possible.

    • Read, write, and execute permissions for the HDFS directory where UDFs are stored.

      The default HDFS directory for UDFs is /tmp/dataphin, but you may modify it as needed. (A minimal permission check is sketched after this list.)

  • Deploy Impala (version 2.5 or higher) on the Hadoop cluster beforehand if you plan to enable Impala tasks for saved searches and data analysis.

  • For E-MapReduce 5.x compute engine users who want to use Hive foreign tables based on OSS for offline integration, pre-configuration is required. For configuration steps, see Use Hive foreign tables based on OSS for offline integration.
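
To double-check the UDF directory prerequisite above, the following minimal sketch uses the Hadoop FileSystem API to verify that the current user can read, write, and execute the UDF directory. It is illustrative only: it assumes the cluster's core-site.xml and hdfs-site.xml are on the classpath and that the default directory /tmp/dataphin has not been changed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

public class UdfDirPermissionCheck {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Default Dataphin UDF directory; adjust if you customized it.
            Path udfDir = new Path("/tmp/dataphin");
            // access() throws AccessControlException if the current user lacks the permission.
            fs.access(udfDir, FsAction.READ_WRITE);
            fs.access(udfDir, FsAction.EXECUTE);
            System.out.println("Current user can read, write, and execute " + udfDir);
        }
    }
}
```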

Impala Task Limits

The following limitations apply to Impala tasks in Dataphin:

  • Only Impala version 2.5 and above is supported.

  • Logical tables currently do not support the Impala engine; however, Impala can be used to query logical tables.

  • Dataphin's Impala data sources and compute sources connect through the Impala JDBC client to the Impala JDBC port (21050 by default), not the Hive JDBC port. If you intend to create Impala tasks or data sources in Dataphin, confirm with your cluster provider that Impala JDBC connectivity is supported. (A connectivity check is sketched after this list.)

  • Hive cannot access Kudu tables, resulting in the following restrictions:

    • Hive SQL cannot query Kudu tables. Attempts to do so will result in execution failure and an error message: FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hadoop.hive.kudu.KuduInputFormat.

    • Modeling cannot use Kudu tables as source tables. If a source table is a Kudu table, it will not function properly.

    • Security detection scan tasks that use Impala SQL to scan Kudu tables are not supported unless Impala is enabled in the project where the scan task is located.

    • Quality rule executions for Kudu tables use Impala SQL for verification. Without Impala enabled, these tasks will fail and report an error.

    • The label platform does not support Kudu tables as offline view tables.

  • Kudu table storage volume cannot be obtained.

    • Asset details do not include storage volume information for Kudu tables.

    • Resource administration's empty table management does not support Kudu tables.

    • Kudu tables are not supported by table size and partition size quality rules.
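
As referenced in the connectivity limit above, a quick way to confirm that the Impala JDBC port is reachable is a plain JDBC round trip. This is a hedged sketch: it assumes the Cloudera Impala JDBC driver is on the classpath, and the host and database names are placeholders to replace with your own.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical host; 21050 is the default Impala JDBC port mentioned above.
        String url = "jdbc:impala://impala-host.example.com:21050/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            rs.next();
            System.out.println("Impala JDBC connection OK, SELECT 1 = " + rs.getInt(1));
        }
    }
}
```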

Spark SQL Service Limits

The following limitations apply to Spark SQL services in Dataphin:

  • Only Spark version 3.x is supported.

  • Spark Thrift Server, Kyuubi, or Livy services must be deployed and enabled on the Hadoop cluster beforehand.

  • Dataphin does not verify database permissions for Spark Call commands. Exercise caution when enabling and using this feature.

  • Spark SQL configurations in production and development compute sources must be consistent. Inconsistencies will prevent configuration of Spark resource settings for Spark SQL tasks.

Compute engine and supported service types

Different compute engines support various service types. The details are as follows:

  • E-MapReduce 3.x: Spark Thrift Server and Kyuubi are supported; Livy and MapReduce (MRS) are not supported.

  • E-MapReduce 5.x: Spark Thrift Server and Kyuubi are supported; Livy and MapReduce (MRS) are not supported.

  • CDH 5.X, CDH 6.X: Kyuubi is supported; Spark Thrift Server, Livy, and MapReduce (MRS) are not supported.

  • Cloudera Data Platform: Kyuubi and Livy are supported; Spark Thrift Server and MapReduce (MRS) are not supported.

  • FusionInsight 8.X: MapReduce (MRS) is supported; Spark Thrift Server, Kyuubi, and Livy are not supported.

  • AsiaInfo DP 5.3: Spark Thrift Server and Kyuubi are supported; Livy and MapReduce (MRS) are not supported.

Procedure

  1. On the Dataphin homepage, navigate to the top menu bar and select Planning > Compute Source.

  2. On the Compute Source page, click + Add Compute Source and choose Hadoop Compute Source.

  3. On the New Compute Source page, configure the parameters as follows.

    Configure each parameter of the compute source using either Reference Specified Cluster or Separate Configuration. The supported configuration items vary depending on the method chosen.

    Reference Specified Cluster Configuration

    • Basic Information of Compute Source

      Parameter

      Description

      Compute Source Type

      Default is Hadoop.

      Compute Source Name

      The naming convention is as follows:

      • Can only contain Chinese characters, letters, digits, underscores (_), and hyphens (-).

      • Length cannot exceed 64 characters.

      Configuration Method

      Select Reference Specified Cluster.

      Data Lake Table Format

      Default is off. After enabling, you can select the data lake table format. Currently, only Hudi is supported.

      Note

      This item is supported only when the compute engine is Cloudera Data Platform 7.x.

      Compute Source Description

      A brief description of the compute source, within 128 characters.

    • Queue Information Configuration

      Parameter

      Description

      Production Task Queue

      Fill in the Yarn resource queue. The Yarn queue is used for manual and periodic task execution in the production environment.

      Other Task Queue

      Fill in the Yarn resource queue. The Yarn queue is used for other tasks such as ad hoc queries, data preview, and JDBC Driver access.

      Priority Task Queue

      You can choose Use Production Task Default Queue or Custom.

      When selecting custom, you need to fill in the Yarn resource queue corresponding to the highest priority, high priority, medium priority, low priority, and lowest priority separately.

    • Hive Compute Engine Configuration

      Parameter

      Description

      Link Information

      You can choose Reference Cluster Configuration or Separate Configuration.

      JDBC URL

      Supports configuring the following three types of connection addresses:

      • Connection address of Hive Server, format is jdbc:hive2://<connection address>:<port>/<database name>.

      • Connection address of ZooKeeper. For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • Connection address with Kerberos enabled, format is jdbc:hive2://<connection address>:<port>/<database name>;principal=hive/_HOST@xx.com.

      Note

      When Link Information is set to Separate Configuration, the JDBC URL can be modified. When Link Information is set to Reference Cluster Configuration, the JDBC URL is read-only.

      Database

      Note

      Only when link information is selected as Reference Cluster Configuration, configuring the database is supported.

      Enter the database name. Half-width periods (.) are not supported, and the length does not exceed 256 characters.

      Authentication Type

      Note

      Only when link information is selected as Separate Configuration, configuring the authentication type is supported.

      Supports No Authentication, LDAP, and Kerberos authentication types.

      • No Authentication: Fill in the username of the Hive service.

      • LDAP: Fill in the username and password of the Hive service.

        Note

        The users filled in for No Authentication and LDAP types must ensure they have task execution permissions to ensure normal task execution.

      • Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Hive Kerberos and upload the Keytab File authentication file and configure the Principal.

        • Keytab File: Upload the keytab file, which you can obtain from the Hive Server.

        • Principal: Fill in the Kerberos authentication username corresponding to the Hive Keytab File.

      Execution Engine

      • Default: Tasks (including logical table tasks) under the project bound to this compute source use this execution engine by default.

      • Custom: Select other compute engine types.
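
The JDBC URL formats listed for the Hive compute engine can be exercised outside Dataphin with a plain HiveServer2 JDBC connection. The sketch below is an assumption-laden illustration: it requires the Apache Hive JDBC driver on the classpath, uses hypothetical host names and credentials, and for the Kerberos-style URL a valid ticket or keytab login is also needed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcUrlCheck {
    public static void main(String[] args) throws Exception {
        // Any of the three URL formats above can be used here, for example:
        //   jdbc:hive2://hive-server.example.com:10000/default
        //   jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
        //   jdbc:hive2://hive-server.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM
        String url = "jdbc:hive2://hive-server.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT current_database()")) {
            rs.next();
            System.out.println("Connected, current database = " + rs.getString(1));
        }
    }
}
```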

    • Spark Jar Service Configuration

      Note

      If you select 'Reference Cluster Configuration' as the configuration method and the referenced cluster has not enabled the Spark local client, configuring the Spark Jar service is not possible.

      Parameter

      Description

      Spark Execution Machine

      If Spark is deployed in the Hadoop cluster, enabling Spark Jar Task is supported.

      Spark Local Client

      If the referenced cluster has enabled the Spark local client, it is enabled by default here.

      After turning it off, projects bound to the current compute source can no longer use the Spark local client. If any tasks (including draft tasks) in the project bound to the current compute source use the Spark local client, it cannot be turned off.

    • Spark SQL Service Configuration

      Note

      If you select 'Reference Cluster Configuration' as the configuration method and the referenced cluster has not enabled the Spark SQL service, configuring the Spark SQL service is not possible.

      Parameter

      Description

      Spark SQL Task

      If Spark is deployed in the Hadoop cluster, enabling Spark SQL Task is supported.

      Link Information

      You can choose Reference Cluster Configuration or Separate Configuration.

      Spark Version

      Currently, only version 3.x is supported.

      Service Type

      Select the target server type for Spark JDBC access. Different compute engines support different service types. For more information, see Compute Engine and Supported Service Types.

      JDBC URL

      Spark's JDBC URL address. The database in the URL needs to be the same as the database specified in the Hive JDBC URL.

      Note

      When Link Information is set to Separate Configuration, the JDBC URL can be modified. When Link Information is set to Reference Cluster Configuration, the JDBC URL is read-only.

      Database

      Note

      Only when link information is selected as Reference Cluster Configuration, configuring the database is supported.

      Enter the database name. Half-width periods (.) are not supported, and the length does not exceed 256 characters.

      Authentication Type

      Note

      The authentication type can be configured only when Link Information is set to Separate Configuration.

      Supports No Authentication, LDAP, and Kerberos authentication types.

      • No Authentication: Fill in the username of the Spark service.

      • LDAP: Fill in the username and password of the Spark service.

        Note

        The users filled in for No Authentication and LDAP types must ensure they have task execution permissions to ensure normal task execution.

      • Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Spark Kerberos and upload the Keytab File authentication file and configure the Principal.

        • Keytab File: Upload the keytab file, which you can obtain from the Spark Server.

        • Principal: Fill in the Kerberos authentication username corresponding to the Spark Keytab File.

      SQL Task Queue Settings

      Different service types use different SQL task queues. Details are as follows:

      • Spark Thrift Server: Task queue setting is not supported.

      • Kyuubi: Uses the priority queue settings configured in the HDFS information configuration. They take effect only when Kyuubi uses Yarn for resource scheduling. Production tasks use the Connection sharing level.

      • Livy: Uses the priority queue settings configured in the HDFS information configuration. They take effect only when Livy uses Yarn for resource scheduling. Ad hoc queries and production tasks each execute in a new Connection.

      • MapReduce (MRS): Uses the priority queue setting configured with HDFS information.
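
Because the Spark SQL JDBC URL must point at the same database as the Hive JDBC URL, a quick cross-check is to ask both endpoints for their current database. The sketch below is illustrative only: host names, ports (10000 for HiveServer2, 10001 for a Spark Thrift Server), database, and user are placeholders, and it assumes the Hive JDBC driver is on the classpath, since Spark Thrift Server speaks the HiveServer2 protocol.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SameDatabaseCheck {
    private static String currentDatabase(String url) throws Exception {
        try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT current_database()")) {
            rs.next();
            return rs.getString(1);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoints; replace with your Hive and Spark SQL JDBC URLs.
        String hiveDb = currentDatabase("jdbc:hive2://hive-server.example.com:10000/dataphin_db");
        String sparkDb = currentDatabase("jdbc:hive2://spark-thrift.example.com:10001/dataphin_db");
        System.out.println(hiveDb.equals(sparkDb)
                ? "OK: both URLs point at database " + hiveDb
                : "Mismatch: Hive=" + hiveDb + ", Spark=" + sparkDb);
    }
}
```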

    • Impala Task Configuration

      Note

      If you select 'Reference Cluster Configuration' as the configuration method and the referenced cluster has not enabled Impala tasks, configuring the Impala task service is unsupported.

      Parameter

      Description

      Impala Task

      If Impala is deployed in the Hadoop cluster, enabling Impala tasks is supported.

      Link Information

      You can choose Reference Cluster Configuration or Separate Configuration.

      JDBC URL

      Enter the JDBC connection address of Impala, for example, jdbc:impala://host:port/database. The database in the JDBC URL must be consistent with the database in the Hive JDBC URL.

      Note

      When Link Information is set to Separate Configuration, the JDBC URL can be modified. When Link Information is set to Reference Cluster Configuration, the JDBC URL is read-only.

      Database

      Note

      Only when link information is selected as Reference Cluster Configuration, configuring the database is supported.

      Enter the database name. Half-width periods (.) are not supported, and the length does not exceed 256 characters.

      Authentication Type

      Note

      The authentication type can be configured only when Link Information is set to Separate Configuration.

      Supports No Authentication, LDAP, and Kerberos authentication types.

      • No Authentication: Fill in the Impala username.

      • LDAP: Fill in the username and password of Impala.

      • Kerberos: Upload the Keytab File authentication file and configure the Principal.

      Development Task Request Pool

      Enter the name of the Impala request pool used for development tasks.

      Periodic Task Request Pool

      Enter the name of the Impala request pool used for periodic tasks.

      Priority Task Queue

      Supports Use Periodic Task Default Queue and Custom.

      • When scheduling Impala SQL tasks, Dataphin will send tasks to the corresponding queue for execution based on the task's priority. Priorities include highest priority, high priority, medium priority, low priority, and lowest priority.

      • When customizing the priority task queue, daily scheduled logical table tasks use the medium priority task queue by default; yearly and monthly scheduled logical table tasks use the low priority task queue by default.
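
For context on what the request pool names above control: Impala routes a session's queries to an admission-control pool, and outside Dataphin the same effect can be observed by setting the REQUEST_POOL query option on a JDBC session. The sketch below is a hedged illustration with a hypothetical host and pool name; it assumes the Impala JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaRequestPoolExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and pool name; use your own Impala host and request pool.
        String url = "jdbc:impala://impala-host.example.com:21050/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Route subsequent queries of this session to a specific admission-control pool,
            // similar in spirit to the development/periodic task request pool settings above.
            stmt.execute("SET REQUEST_POOL=dev_pool");
            try (ResultSet rs = stmt.executeQuery("SELECT 1")) {
                rs.next();
                System.out.println("Query ran with REQUEST_POOL=dev_pool, result = " + rs.getInt(1));
            }
        }
    }
}
```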

    Separate Configuration

    • Basic Information of Compute Source

      Parameter

      Description

      Compute Source Type

      Default is Hadoop.

      Compute Source Name

      The naming convention is as follows:

      • Can only contain Chinese characters, letters, digits, underscores (_), and hyphens (-).

      • Length cannot exceed 64 characters.

      Configuration Method

      Select Separate Configuration.

      Data Lake Table Format

      Default is off. After turning on the switch, you can select the data lake table format. Currently, only Hudi is supported.

      Note

      This item is supported only when the compute engine is Cloudera Data Platform 7.x.

      Compute Source Description

      A brief description of the compute source, within 128 characters.

    • Basic Information of Cluster

      Note

      You can configure the basic information of the cluster only if the Separate Configuration method is selected.

      Parameter

      Description

      Cluster Storage

      Defaults to the value configured in the compute settings and cannot be modified here. This parameter is not available for non-OSS-HDFS cluster storage.

      NameNode

      Click + Add and configure the related parameters in the Add Namenode dialog box. Multiple NameNodes can be added.

      A NameNode entry consists of the HostName or IP address and ports of a NameNode node in the HDFS cluster. Configuration example:

      • NameNode: 192.168.xx.xx

      • Web UI Port: 50070

      • IPC Port: 8020

      At least one of the Web UI Port and IPC Port must be specified. After configuration, the NameNode is host=192.168.xx.xx,webUiPort=50070,ipcPort=8020.

      Note

      When the cluster storage is HDFS, this item is supported.

      Cluster Storage Root Directory

      Defaults to the value configured in the compute settings and cannot be modified here. This parameter is not available for non-OSS-HDFS cluster storage.

      AccessKey ID, AccessKey Secret

      When the cluster storage type is OSS-HDFS, you need to fill in the AccessKey ID and AccessKey Secret for accessing the cluster OSS. For how to view AccessKey, see View AccessKey.

      Important

      The configuration filled in here has a higher priority than the AccessKey configured in core-site.xml.

      core-site.xml

      Upload the core-site.xml configuration file of the Hadoop cluster.

      hdfs-site.xml

      Upload the hdfs-site.xml configuration file of HDFS under the Hadoop cluster.

      Note

      OSS-HDFS cluster storage type does not support uploading hdfs-site.xml configuration files.

      hive-site.xml

      Upload the hive-site.xml configuration file of Hive under the Hadoop cluster.

      yarn-site.xml

      Upload the yarn-site.xml configuration file of YARN under the Hadoop cluster.

      Other Configuration Files

      Upload the keytab file, which you can obtain using the ipa-getkeytab command on the NameNode node in the HDFS cluster.

      Task Execution Machine

      Configure the connection address of the execution machine for MapReduce or Spark Jar. The format is hostname:port or ip:port, and the default port is 22.

      Authentication Type

      Supports No Authentication and Kerberos authentication types.

      Kerberos is an identity authentication protocol based on symmetric-key cryptography that can provide identity authentication functions for other services and supports SSO (that is, after client identity authentication, multiple services such as HBase and HDFS can be accessed).

      If the Hadoop cluster has Kerberos authentication, you need to enable cluster Kerberos and upload the Krb5 authentication file or configure the KDC Server address:

      Important

      When the compute engine type is E-MapReduce 5.x, only the Krb5 File Configuration method is supported.

      • Krb5 Authentication File: You need to upload the Krb5 file for Kerberos authentication.

      • KDC Server Address: The KDC server address assists in completing Kerberos authentication.

      Note

      Multiple KDC Server addresses can be configured, separated by semicolons (;).
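
For Kerberized clusters, the keytab file and principal entered in the configuration above can be sanity-checked with the Hadoop UserGroupInformation API. This is a sketch under assumptions: the principal, keytab path, and realm below are placeholders, and the krb5 configuration must be resolvable on the machine running it (for example, via the uploaded Krb5 file or -Djava.security.krb5.conf=...).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder principal and keytab; use the values that belong to your cluster.
        String principal = "hive/namenode01.example.com@EXAMPLE.COM";
        String keytabPath = "/etc/security/keytabs/hive.keytab";

        Configuration conf = new Configuration();
        // Matches a Kerberized cluster's core-site.xml setting.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Throws an IOException if the keytab and principal do not match or the KDC is unreachable.
        UserGroupInformation.loginUserFromKeytab(principal, keytabPath);
        System.out.println("Kerberos login succeeded as " + UserGroupInformation.getLoginUser());
    }
}
```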

    • HDFS Information Configuration

      Parameter

      Description

      Execution Username, Password

      Username and password for logging into the compute execution machine, used for executing MapReduce tasks, reading and writing HDFS storage, etc.

      Important

      Please ensure you have permission to submit MapReduce tasks.

      Authentication Type

      Supports No Authentication and Kerberos.

      Note

      When the cluster storage is OSS-HDFS, configuring HDFS authentication methods is not supported. The AccessKey in the core-site.xml file will be used by default.

      If the Hadoop cluster has Kerberos authentication, you need to enable HDFS Kerberos and upload the Keytab File authentication file and configure the Principal.

      • Keytab File: Upload the keytab file, which you can obtain from the HDFS Server.

      • Principal: Fill in the Kerberos authentication username corresponding to the HDFS Keytab File.

      HDFS User

      Specify the username used for file uploads. If left blank, the execution username is used by default. This can be filled in even when Kerberos is turned off.

      Production Task Default Task Queue

      Fill in the Yarn resource queue. The Yarn queue is used for manual and periodic task execution in the production environment.

      Other Task Queue

      Fill in the Yarn resource queue. The Yarn queue is used for other tasks such as ad hoc queries, data preview, and JDBC Driver access.

      Task Priority Queue

      Supports Use Production Task Default Queue or Custom.

      • When scheduling Hive SQL tasks, Dataphin will send tasks to the corresponding queue for execution based on the task's priority. Priorities include highest priority, high priority, medium priority, low priority, and lowest priority.

      • When the execution engine of Hive is set to Tez or Spark, different priority queues must be set for the task priority settings to take effect.

        Note
        • Daily and hourly scheduled logical table tasks use the medium priority task queue by default.

        • Yearly and monthly scheduled logical table tasks use the low priority task queue by default.

    • Hive Compute Engine Configuration

      Parameter

      Description

      JDBC URL

      Supports configuring the following three types of connection addresses:

      • Connection address of Hive Server, format is jdbc:hive2://<connection address>:<port>/<database name>.

      • Connection address of ZooKeeper. For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • Connection address with Kerberos enabled, format is jdbc:hive2://<connection address>:<port>/<database name>;principal=hive/_HOST@xx.com.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication types.

      • No Authentication: Fill in the username of the Hive service.

      • LDAP: Fill in the username and password of the Hive service.

        Note

        The users filled in for No Authentication and LDAP types must ensure they have task execution permissions to ensure normal task execution.

      • Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Hive Kerberos and upload the Keytab File authentication file and configure the Principal.

        • Keytab File: Upload the keytab file, which you can obtain from the Hive Server.

        • Principal: Fill in the Kerberos authentication username corresponding to the Hive Keytab File.

      Execution Engine

      • Default: Tasks (including logical table tasks) under the project bound to this compute source use this execution engine by default.

      • Custom: Select other compute engine types.

    • Hive Metadata Configuration

      Metadata Retrieval Method: Metadata can be retrieved through one of three methods: Metadata Database, HMS, or DLF. Each method requires its own configuration details.

      Important
      • The DLF method is only compatible with clusters configured with E-MapReduce 5.x Hadoop.

      • To use the DLF method for metadata retrieval, first upload the hive-site.xml configuration file.

      Metadata Retrieval Method

      Parameter

      Description

      Metadata Database

      Database Type

      Select the database type according to the metadata database used in the cluster. Dataphin supports selecting MySQL.

      Supported versions of MySQL include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.

      JDBC URL

      Fill in the JDBC connection address of the target database. For example:

      MySQL: Format is jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1=propertyValue1[&propertyName2=propertyValue2]...].

      Username, Password

      Fill in the username and password for logging into the metadata database.

      HMS

      Authentication Type

      HMS retrieval methods support No Authentication, LDAP, and Kerberos authentication methods. The Kerberos authentication method requires you to upload the Keytab file and configure the Principal.

      DLF

      Endpoint

      Fill in the Endpoint of the cluster in the region where the DLF data center is located. For how to obtain it, see Supported regions and endpoints.

      AccessKey ID, AccessKey Secret

      Fill in the AccessKey ID and AccessKey Secret of the account where the cluster is located.

      You can obtain the AccessKey ID and AccessKey Secret of the account on the User Information Management page.
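
When the metadata retrieval method is Metadata Database, the JDBC URL, username, and password above can be verified with a direct MySQL connection. The sketch below is illustrative only: the host, schema, and credentials are placeholders, it assumes the MySQL Connector/J driver is on the classpath, and it assumes the schema follows the standard Hive metastore layout (which contains a TBLS table).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MetastoreDatabaseCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the Hive metadata database.
        String url = "jdbc:mysql://meta-db.example.com:3306/hivemeta?useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "meta_user", "meta_password");
             Statement stmt = conn.createStatement();
             // TBLS is part of the standard Hive metastore schema; counting rows is a simple sanity check.
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM TBLS")) {
            rs.next();
            System.out.println("Metadata database reachable, table entries: " + rs.getLong(1));
        }
    }
}
```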

    • Spark Jar Service Configuration

      Parameter

      Description

      Spark Execution Machine

      If Spark is deployed in the Hadoop cluster, enabling Spark Jar Task is supported.

      Execution Username, Password

      Fill in the username and password for logging into the compute execution machine.

      Important

      Ensure that the user has permission to submit MapReduce tasks.

      Authentication Type

      Supports No Authentication or Kerberos authentication types.

      If the Hadoop cluster has Kerberos authentication, you need to enable Spark Kerberos and upload the Keytab File authentication file and configure the Principal.

      • Keytab File: Upload the keytab file, which you can obtain from the Spark Server.

      • Principal: Fill in the Kerberos authentication username corresponding to the Spark Keytab File.

    • Spark SQL Service Configuration

      Parameter

      Description

      Spark SQL Task

      If Spark is deployed in the Hadoop cluster, enabling Spark SQL Task is supported.

      Spark Version

      Currently, only version 3.x is supported.

      Service Type

      Select the target server type for Spark JDBC access. Different compute engines support different service types. For more information, see Compute Engine and Supported Service Types.

      JDBC URL

      Spark's JDBC URL address. The database in the URL needs to be the same as the database specified in the Hive JDBC URL.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication types.

      • No Authentication: Fill in the username of the Spark service.

      • LDAP: Fill in the username and password of the Spark service.

        Note

        The users filled in for No Authentication and LDAP types must ensure they have task execution permissions to ensure normal task execution.

      • Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Spark Kerberos and upload the Keytab File authentication file and configure the Principal.

        • Keytab File: Upload the keytab file, which you can obtain from the Spark Server.

        • Principal: Fill in the Kerberos authentication username corresponding to the Spark Keytab File.

      SQL Task Queue Settings

      Different service types use different SQL task queues. Details are as follows:

      • Spark Thrift Server: Task queue setting is not supported.

      • Kyuubi: Uses the priority queue settings configured in the HDFS information configuration. They take effect only when Kyuubi uses Yarn for resource scheduling. Production tasks use the Connection sharing level.

      • Livy: Uses the priority queue settings configured in the HDFS information configuration. They take effect only when Livy uses Yarn for resource scheduling. Ad hoc queries and production tasks each execute in a new Connection.

      • MapReduce (MRS): Uses the priority queue setting configured with HDFS information.

    • Impala Task Configuration

      Parameter

      Description

      Impala Task

      If Impala is deployed in the Hadoop cluster, enabling Impala tasks is supported.

      JDBC URL

      Enter the JDBC connection address of Impala, for example, jdbc:impala://host:port/database. The database in the JDBC URL must be consistent with the database in the Hive JDBC URL.

      Authentication Type

      Supports No Authentication, LDAP, and Kerberos authentication types.

      • No Authentication: Fill in the Impala username.

      • LDAP: Fill in the username and password of Impala.

      • Kerberos: Upload the Keytab File authentication file and configure the Principal.

      Development Task Request Pool

      Enter the name of the Impala request pool used for development tasks.

      Periodic Task Request Pool

      Enter the name of the Impala request pool used for periodic tasks.

      Priority Task Queue

      Supports Use Periodic Task Default Queue and Custom.

      • When scheduling Impala SQL tasks, Dataphin will send tasks to the corresponding queue for execution based on the task's priority. Priorities include highest priority, high priority, medium priority, low priority, and lowest priority.

      • When customizing the priority task queue, daily scheduled logical table tasks use the medium priority task queue by default; yearly and monthly scheduled logical table tasks use the low priority task queue by default.

  4. Click Test Connection to verify the connectivity of the compute source.

  5. Once the connection test is successful, click Submit to finalize the creation of the Hadoop compute source.

What to Do Next

Once you have created the Hadoop compute source, you can associate it with your project. For more information, see Create a General Project.