A Hadoop compute source binds a Dataphin project to a Hadoop cluster. It provides the compute resources needed to run offline computing tasks in Dataphin. If you set the Dataphin compute engine to Hadoop, only projects with a Hadoop compute source can use features such as standard modeling, ad hoc queries, Hive tasks, and generic scripts. This topic describes how to create a Hadoop compute source.
Prerequisites
Before you begin, make sure that the following requirements are met:
Set the Dataphin compute engine to Hadoop. For more information, see Set the compute engine to Hadoop.
Make sure the Hive user has the following permissions:
CREATEFUNCTION permission.
Important: You need this permission to register user-defined functions (UDFs) in Hive through Dataphin. Without this permission, you cannot create UDFs in Dataphin or use Dataphin's asset security features.
Read, write, and execute permissions on the HDFS directory where UDFs are stored. The default UDF directory in HDFS is /tmp/dataphin, and you can change this directory. A minimal sketch for checking and preparing this directory appears after these prerequisites.
If you plan to run Impala tasks for fast queries and data analysis, you must first deploy Impala (version 2.5 or later) on your Hadoop cluster.
If you use E-MapReduce 5.x as the compute engine and want to use Hive external tables based on OSS for offline integration, you must first configure your environment. For more information, see Use Hive external tables based on OSS for offline integration.
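To verify the HDFS permission prerequisite above, the following minimal sketch uses the standard Hadoop FileSystem Java API to check the UDF directory and create it with read, write, and execute permissions for the owner and group if it is missing. The directory /tmp/dataphin, the permission mode, and the assumption that core-site.xml and hdfs-site.xml are on the classpath are illustrative; adjust them to match your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class UdfDirCheck {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml and hdfs-site.xml from the classpath (assumed present).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Default UDF directory; change this if you customized it in Dataphin.
        Path udfDir = new Path("/tmp/dataphin");
        if (!fs.exists(udfDir)) {
            // 0770: read, write, and execute for owner and group (illustrative mode).
            fs.mkdirs(udfDir, new FsPermission((short) 0770));
        }
        System.out.println("UDF directory permission: " + fs.getFileStatus(udfDir).getPermission());
        fs.close();
    }
}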
Impala task limits
If you enable Impala tasks for fast queries and data analysis, the following limits apply in Dataphin:
Only Impala version 2.5 or later is supported.
Logical tables do not support the Impala execution engine. However, you can query logical tables using Impala.
Dataphin connects to the Impala JDBC port (default: 21050) using the Impala JDBC client. Hive JDBC ports are not supported. Before you create an Impala task or data source in Dataphin, check with your cluster provider to confirm that Impala JDBC connections are supported.
Because Hive cannot access Kudu tables, the following limits apply:
Hive SQL cannot access Kudu tables. If you attempt to access them, the SQL execution fails with the following error:
FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hadoop.hive.kudu.KuduInputFormat
You cannot use Kudu tables as source tables for modeling. Tasks that use Kudu source tables will fail.
Security scan tasks use Impala SQL to scan Kudu tables. If Impala is not enabled in the project where the scan task runs, Kudu table scanning is not supported.
Quality rule checks use Impala SQL for Kudu tables. If Impala is not enabled, the quality check fails.
The tag platform does not support Kudu tables as offline view tables.
Dataphin does not support retrieving the storage size of Kudu tables.
Storage size information for Kudu tables is not available in asset details.
The empty-table governance feature in resource administration does not support Kudu tables.
Kudu tables do not support quality rules for table size or table partition size.
Spark SQL service limits
If you enable the Spark SQL service, the following limits apply in Dataphin:
Only Spark version 3.x is supported.
You must deploy and start one of the following services on your Hadoop cluster: Spark Thrift Server, Kyuubi, or Livy.
Dataphin does not validate data permissions for Spark Call commands. Use them with caution.
The Spark SQL service configuration must be identical for both production and development compute sources. If the configurations differ, you cannot configure Spark resource settings for Spark SQL tasks.
Compute engines and supported service types
The supported service types vary based on the compute engine.
Compute engine type | Spark Thrift Server | Kyuubi | Livy | MapReduce (MRS) |
E-MapReduce 3.x | Supported | Supported | Not supported | Not supported |
E-MapReduce 5.x | Supported | Supported | Not supported | Not supported |
CDH 5.x, CDH 6.x | Not supported | Supported | Not supported | Not supported |
Cloudera Data Platform | Not supported | Supported | Supported | Not supported |
FusionInsight 8.x | Not supported | Not supported | Not supported | Supported |
AsiaInfo DP 5.3 | Supported | Supported | Not supported | Not supported |
Amazon EMR | Supported | Not supported | Not supported | Supported |
Procedure
In the top menu bar on the Dataphin homepage, choose Planning > Compute Source.
On the Compute Source page, click + Add compute source, and then select Hadoop compute source.
On the Create Compute Source page, configure the parameters.
You can configure the compute source by either referencing a specified cluster or using a standalone configuration. The available parameters depend on the method that you select.
Reference a specified cluster
Basic compute source information
Parameter
Description
Compute source type
Default: Hadoop.
Compute source name
Naming rules:
The value can contain only English letters, digits, underscores (_), hyphens (-), and Chinese characters.
Maximum length: 64 characters.
Configuration method
Select Reference a specified cluster.
Data lake table format
Disabled by default. Enable it to select a data lake table format.
For Cloudera Data Platform 7.x, supported formats: Hudi.
For E-MapReduce 5.x, supported formats: Iceberg and Paimon.
Note: This option is supported only for Cloudera Data Platform 7.x or E-MapReduce 5.x.
Compute source description
A brief description. Maximum length: 128 characters.
Queue information
Parameter
Description
Production task queue
Enter the YARN resource queue used for manual and scheduled tasks in production environments.
Other task queues
Enter the YARN resource queue used for other tasks, such as ad hoc queries, data previews, and JDBC Driver access.
Priority task queue
Select Use production task default queue or Custom.
If you select Custom, enter separate YARN resource queues for highest, high, medium, low, and lowest priority tasks.
Hive compute engine configuration
Parameter
Description
Connection information
Select Reference cluster configuration or Standalone configuration.
JDBC URL
Supports three connection address formats:
HiveServer connection address: jdbc:hive://{connection address}:{port}/{database name}
ZooKeeper connection address, for example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Kerberos-enabled connection address: jdbc:hive2://{connection address}:{port}/{database name};principal=hive/_HOST@xx.com
A minimal connectivity sketch appears after this configuration table.
Note: If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view it.
For E-MapReduce 3.x, E-MapReduce 5.x, or Cloudera Data Platform, Kerberos-enabled JDBC URLs cannot contain multiple IP addresses.
Database
Note: You can configure the database only if you select Reference cluster configuration.
Enter the database name. Do not use periods (.). Maximum length: 256 characters.
Authentication Type
Note: You can configure authentication only if you select Standalone configuration.
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Hive service username.
LDAP: Enter the Hive service username and password.
Note: The users you specify for no authentication or LDAP must have task execution permissions.
Kerberos: If your Hadoop cluster uses Kerberos, enable Hive Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Hive Server.
Principal: Enter the Kerberos username for the Hive keytab file.
Execution engine
Default: Tasks in projects bound to this compute source—including logical table tasks—use this execution engine by default.
Custom: Select another compute engine type.
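The following minimal sketch, referenced from the JDBC URL description above, validates a Hive JDBC URL outside Dataphin before you save the compute source. It assumes the Apache Hive JDBC driver (hive-jdbc) is on the classpath and that the cluster uses no authentication or LDAP; the host, port, database, username, and password are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Load the Apache Hive JDBC driver (assumed to be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder URL; use the same value you plan to enter as the JDBC URL.
        String url = "jdbc:hive2://hive-host:10000/dataphin_db";
        try (Connection conn = DriverManager.getConnection(url, "hive_user", "hive_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

If the query returns the expected databases, the URL, credentials, and network path are valid for the Hive compute engine configuration.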
Spark JAR service configuration
Note: You cannot configure the Spark JAR service if you select Reference cluster configuration and the referenced cluster does not have the Spark local client enabled.
Parameter
Description
Spark execution machine
If Spark is deployed on your Hadoop cluster, you can enable Spark JAR tasks.
Spark local client
If the referenced cluster has Spark local client enabled, this option is enabled by default.
Click to disable it. After disabling, projects linked to this compute source cannot use Spark local client. You cannot disable it if any task in the linked project—including draft tasks—uses Spark local client.
Spark SQL service configuration
Note: You cannot configure the Spark SQL service if you select Reference cluster configuration and the referenced cluster does not have the Spark SQL service enabled.
Parameter
Description
Spark SQL tasks
If Spark is deployed on your Hadoop cluster, you can enable Spark SQL tasks.
Note: If you selected Paimon as the data lake table format, you cannot disable Spark SQL tasks.
Connection information
Select Reference cluster configuration or Standalone configuration.
Spark version
Only version 3.x is supported.
Service type
Select the server type for Spark JDBC access. Supported service types vary by compute engine. For more information, see Compute engines and supported service types.
JDBC URL
The Spark JDBC URL. Its database must match the database in the Hive JDBC URL.
Note: If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view the JDBC URL.
Database
Note: You can configure the database only if you select Reference cluster configuration.
Enter the database name. Do not use periods (.). Maximum length: 256 characters.
Authentication Type
Note: You can configure authentication only if Connection information is set to Standalone configuration.
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Spark service username.
LDAP: Enter the Spark service username and password.
Note: The users you specify for no authentication or LDAP must have task execution permissions.
Kerberos: If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Spark Server.
Principal: Enter the Kerberos username for the Spark keytab file.
SQL task queue settings
Different service types use different SQL task queues. Details:
Spark Thrift Server: Task queues are not supported.
Kyuubi: Uses the priority queue configured in HDFS settings. Applies only when Kyuubi uses YARN for resource scheduling. Production tasks use shared connections.
Livy: Uses the priority queue configured in HDFS settings. Applies only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks use new connections.
MapReduce (MRS): Uses the priority queue configured in HDFS settings.
Impala task configuration
Note: You cannot configure the Impala task service if you select Reference cluster configuration and the referenced cluster does not have Impala tasks enabled.
Parameter
Description
Impala tasks
If Impala is deployed on your Hadoop cluster, you can enable Impala tasks.
Connection information
Select Reference cluster configuration or Standalone configuration.
JDBC URL
Enter the Impala JDBC connection address. Example: jdbc:impala://host:port/database. The database in this URL must match the database in the Hive JDBC URL. A minimal connectivity sketch appears after this configuration table.
Note: If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view the JDBC URL.
Database
Note: You can configure the database only if you select Reference cluster configuration.
Enter the database name. Do not use periods (.). Maximum length: 256 characters.
Authentication Type
Note: You can configure authentication only if Connection information is set to Standalone configuration.
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Impala username.
LDAP: Enter the Impala username and password.
Kerberos: Upload the Keytab File and configure the Principal.
Development task request pool
Enter the Impala request pool name for development tasks.
Scheduled task request pool
Enter the Impala request pool name for scheduled tasks.
Priority task queue
Choose Use scheduled task default queue or Custom.
Dataphin routes Impala SQL tasks to queues based on priority: highest, high, medium, low, and lowest.
When customizing, daily logical table tasks use the medium-priority queue by default. Yearly and monthly logical table tasks use the low-priority queue by default.
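The following minimal sketch, referenced from the Impala JDBC URL description above, checks that the Impala JDBC port (default 21050) accepts connections before you enable Impala tasks. It assumes an Impala JDBC driver that registers itself for the jdbc:impala:// subprotocol is on the classpath and that no authentication is required; the host, port, and database are placeholders, and driver-specific authentication properties may be needed on your cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; the database must match the one in the Hive JDBC URL.
        String url = "jdbc:impala://impala-host:21050/dataphin_db";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // version() is a built-in Impala function; any lightweight query works here.
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            if (rs.next()) {
                System.out.println("Connected to " + rs.getString(1));
            }
        }
    }
}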
Standalone configuration
Basic compute source information
Parameter
Description
Compute source type
Default: Hadoop.
Compute source name
Naming rules:
Use only letters, digits, underscores (_), and hyphens (-).
Maximum length: 64 characters.
Configuration method
Select Standalone configuration.
Data lake table format
Disabled by default. Enable it to select a data lake table format. Only Hudi is supported.
Note: This option is supported only for Cloudera Data Platform 7.x.
Compute source description
A brief description. Maximum length: 128 characters.
Cluster basic information
Note: You can configure basic cluster information only if you select Standalone configuration.
Parameter
Description
Cluster storage
Uses default values from compute settings. Not configurable. Not applicable for non-OSS-HDFS clusters.
NameNode
Click + Add. In the Add NameNode dialog box, configure parameters. You can add multiple NameNodes.
A NameNode is the host name or IP address and port of a NameNode node in an HDFS cluster. Example:
NameNode: 192.168.xx.xx
Web UI Port: 50070
IPC Port: 8020
At least one of Web UI Port or IPC Port is required. After configuration, the NameNode appears as:
host=192.168.xx.xx,webUiPort=50070,ipcPort=8020
Note: This option is supported only for HDFS clusters.
Cluster storage root directory
Uses default values from compute settings. Not configurable. Not applicable for non-OSS-HDFS clusters.
AccessKey ID and AccessKey Secret
If the cluster storage type is OSS-HDFS, you must specify the AccessKey ID and AccessKey secret used to access the cluster's OSS. Use an existing AccessKey or refer to Create an AccessKey pair to create a new one.
Important: To reduce the risk of AccessKey exposure, the AccessKey Secret appears only once during creation and cannot be viewed later. Store it securely.
Settings here override those in core-site.xml.
core-site.xml
Upload the core-site.xml configuration file from your Hadoop cluster.
hdfs-site.xml
Upload the hdfs-site.xml configuration file from your Hadoop cluster’s HDFS.
Note: The hdfs-site.xml configuration file cannot be uploaded if the cluster storage type is OSS-HDFS.
hive-site.xml
Upload the hive-site.xml configuration file from your Hadoop cluster’s Hive.
yarn-site.xml
Upload the yarn-site.xml configuration file from your Hadoop cluster's YARN.
Other configuration files
Upload the keytab file. Get it from the NameNode in your HDFS cluster using the ipa-getkeytab command.
Task execution machine
Configure the connection address for MapReduce or Spark JAR execution machines. Format:
hostname:port or ip:port. Default port: 22.
Authentication Type
Supported methods: No authentication and Kerberos.
Kerberos is a symmetric-key-based identity authentication protocol. It supports single sign-on (SSO), letting authenticated clients access multiple services such as HBase and HDFS.
If your Hadoop cluster uses Kerberos, enable cluster Kerberos and upload the krb5 file or configure the KDC server address:
Important: For E-MapReduce 5.x, only krb5 file configuration is supported.
Krb5 authentication file: Upload the krb5 file for Kerberos authentication.
KDC server address: The KDC server address used to complete Kerberos authentication.
Note: You can configure multiple KDC server addresses, separated by semicolons (;).
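If you want to confirm that a keytab and principal pair authenticates against your KDC before uploading them to Dataphin, the following minimal sketch uses the Hadoop UserGroupInformation API. The krb5.conf path, principal, and keytab path are placeholders, and hadoop-common is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabCheck {
    public static void main(String[] args) throws Exception {
        // Path to the krb5 file used for Kerberos authentication (placeholder).
        System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Placeholder principal and keytab path; use the values you plan to upload.
        UserGroupInformation.loginUserFromKeytab(
                "hive/host.example.com@EXAMPLE.COM",
                "/path/to/hive.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}

If the login succeeds, the keytab and principal are consistent with the KDC configuration and can be uploaded in the Kerberos authentication settings below.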
HDFS Configuration
Parameter
Description
Execution username and Password
Username and password to log on to the compute execution machine. Used for running MapReduce tasks and reading or writing HDFS storage.
Important: Ensure the user has permission to submit MapReduce tasks.
Authentication Type
Supported methods: No authentication and Kerberos.
Note: HDFS authentication is not supported for OSS-HDFS clusters. The AccessKey pair in core-site.xml is used by default.
If your Hadoop cluster uses Kerberos, enable HDFS Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the HDFS Server.
Principal: Enter the Kerberos username for the HDFS keytab file.
HDFS User
Specify the username for file uploads. If left blank, the execution username is used. Fill this in only when Kerberos is disabled.
Production task default queue
Enter the YARN resource queue used for manual and scheduled tasks in production environments.
Other task queues
Enter the YARN resource queue used for other tasks, such as ad hoc queries, data previews, and JDBC Driver access.
Task priority queue
Select Use production task default queue or Custom.
Dataphin routes Hive SQL tasks to queues based on priority: highest, high, medium, low, and lowest.
If Hive uses Tez or Spark as the execution engine, you must assign different priority queues for task priorities to take effect.
Note: Daily and hourly logical table tasks use the medium-priority queue by default.
Yearly and monthly logical table tasks use the low-priority queue by default.
Hive compute engine configuration
Parameter
Description
JDBC URL
Supports three connection address formats:
HiveServer connection address: jdbc:hive://{connection address}:{port}/{database name}
ZooKeeper connection address, for example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Kerberos-enabled connection address: jdbc:hive2://{connection address}:{port}/{database name};principal=hive/_HOST@xx.com
Note: For E-MapReduce 3.x, E-MapReduce 5.x, or Cloudera Data Platform, Kerberos-enabled JDBC URLs cannot contain multiple IP addresses.
Authentication Type
Note: You can configure authentication only if you select Standalone configuration.
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Hive service username.
LDAP: Enter the Hive service username and password.
Note: The users you specify for no authentication or LDAP must have task execution permissions.
Kerberos: If your Hadoop cluster uses Kerberos, enable Hive Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Hive Server.
Principal: Enter the Kerberos username for the Hive keytab file.
Execution engine
Default: Tasks in projects bound to this compute source—including logical table tasks—use this execution engine by default.
Custom: Select another compute engine type.
Hive metadata configuration
Metadata retrieval method: You can choose from three methods: Metadata Database, HMS, and DLF. The required parameters depend on the method that you select.
Important: DLF is supported only for clusters that use E-MapReduce 5.x.
To use DLF, you must first upload the hive-site.xml configuration file.
Metadata retrieval method
Parameter
Description
Metadata Database
Database type
Select the database type used in your cluster. Dataphin supports MySQL.
Supported MySQL versions include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
JDBC URL
Enter the JDBC connection address for the target database. Example for MySQL:
jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1=propertyValue1[&propertyName2=propertyValue2]...]
A minimal metastore connectivity sketch appears after this configuration.
Username and Password
Enter the username and password to log on to the metadata database.
HMS
Authentication Type
HMS supports No authentication, LDAP, and Kerberos. For Kerberos, upload the Keytab File and configure the Principal.
DLF
Endpoint
Enter the DLF endpoint for the region where your cluster resides. For instructions, see Supported regions and endpoints.
AccessKey ID and AccessKey Secret
Enter the AccessKey ID and AccessKey secret of the account that owns the cluster. Use an existing AccessKey or see Create an AccessKey pair to create a new one.
Note: To reduce the risk of AccessKey exposure, the AccessKey Secret appears only once during creation and cannot be viewed later. Store it securely.
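The following minimal sketch, referenced from the Metadata Database configuration above, checks that the metadata database credentials work and that the schema looks like a Hive metastore (the metastore schema contains a DBS table). It assumes the MySQL Connector/J driver is on the classpath; the URL and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MetastoreDbCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder metastore database URL; use the value you plan to enter as the JDBC URL.
        String url = "jdbc:mysql://metastore-host:3306/hive_metastore";
        try (Connection conn = DriverManager.getConnection(url, "meta_user", "meta_password");
             Statement stmt = conn.createStatement();
             // DBS is a standard Hive metastore table that lists Hive databases.
             ResultSet rs = stmt.executeQuery("SELECT NAME FROM DBS LIMIT 5")) {
            while (rs.next()) {
                System.out.println("Hive database: " + rs.getString("NAME"));
            }
        }
    }
}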
Spark JAR service configuration
Parameter
Description
Spark Executor
If Spark is deployed on your Hadoop cluster, you can enable Spark JAR tasks.
Execution username and Password
Enter the username and password to log on to the compute execution machine.
Important: The user must have permission to submit MapReduce tasks.
Authentication Type
Supported methods: No authentication or Kerberos.
If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Spark Server.
Principal: Enter the Kerberos username for the Spark keytab file.
Spark SQL service configuration
Parameter
Description
Spark SQL tasks
If Spark is deployed on your Hadoop cluster, you can enable Spark SQL tasks.
Spark version
Only version 3.x is supported.
Service type
Select the server type for Spark JDBC access. Supported service types vary by compute engine. For more information, see Compute engines and supported service types.
JDBC URL
The Spark JDBC URL. Its database must match the database in the Hive JDBC URL.
Authentication Type
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Spark service username.
LDAP: Enter the Spark service username and password.
Note: The users you specify for no authentication or LDAP must have task execution permissions.
Kerberos: If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Spark Server.
Principal: Enter the Kerberos username for the Spark keytab file.
SQL task queue settings
Different service types use different SQL task queues. Details:
Spark Thrift Server: Task queues are not supported.
Kyuubi: Uses the priority queue configured in HDFS settings. Applies only when Kyuubi uses YARN for resource scheduling. Production tasks use shared connections.
Livy: Uses the priority queue configured in HDFS settings. Applies only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks use new connections.
MapReduce (MRS): Uses the priority queue configured in HDFS settings.
Impala task configuration
Parameter
Description
Impala tasks
If Impala is deployed on your Hadoop cluster, you can enable Impala tasks.
JDBC URL
Enter the Impala JDBC connection address. Example: jdbc:impala://host:port/database. The database in this URL must match the database in the Hive JDBC URL.
Authentication Type
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Impala username.
LDAP: Enter the Impala username and password.
Kerberos: Upload the Keytab File and configure the Principal.
Development task request pool
Enter the Impala request pool name for development tasks.
Scheduled task request pool
Enter the Impala request pool name for scheduled tasks.
Priority task queue
Choose Use scheduled task default queue or Custom.
Dataphin routes Impala SQL tasks to queues based on priority: highest, high, medium, low, and lowest.
When customizing, daily logical table tasks use the medium-priority queue by default. Yearly and monthly logical table tasks use the low-priority queue by default.
Click Test Connection to verify the connection to the compute source.
After the connection test is successful, click Submit.
What to do next
After you create the Hadoop compute source, you must bind it to a project. For more information, see Create a general project.