A Hadoop compute source connects a Dataphin project to a Hadoop project. It provides the compute resources that a Dataphin project requires to process offline computing tasks. If the Dataphin compute engine is set to Hadoop, the project must have a Hadoop compute source to support features such as standard modeling, ad hoc queries, Hive tasks, and general scripts. This topic describes how to create a Hadoop compute source.
Prerequisites
Before you start, make sure that the following requirements are met:
The Dataphin compute engine is set to Hadoop. For more information, see Set the compute engine to Hadoop.
The Hive user has the following permissions:
CREATEFUNCTION permission.
Important: The CREATEFUNCTION permission is required to register user-defined functions (UDFs) in Hive through Dataphin. Without this permission, you cannot create UDFs in Dataphin or use the Dataphin asset security features.
Read, write, and execute permissions for the directory where UDFs are stored in the Hadoop Distributed File System (HDFS). A permission-check sketch is shown after this list.
The default HDFS directory for UDFs is /tmp/dataphin. You can change this directory as needed.
To enable Impala tasks for saved searches and data analysis, you must deploy Impala V2.5 or later on your Hadoop cluster.
If you use the E-MapReduce 5.x compute engine and need to use Hive foreign tables based on OSS for offline integration, you must complete the required configurations. For more information, see Use Hive foreign tables based on OSS for offline integration.
Impala task limits
To enable Impala tasks for saved searches and data analysis, the following limits apply in Dataphin:
Only Impala V2.5 or later is supported.
Logical tables do not support the Impala engine. However, you can use Impala to query logical tables.
Impala data sources and compute sources in Dataphin use the Impala Java Database Connectivity (JDBC) client to connect to the Impala JDBC port, which is 21050 by default. The Hive JDBC port is not supported. If you want to create Impala tasks or data sources in Dataphin, contact your cluster provider to confirm that Impala JDBC connections are supported.
Hive cannot access Kudu tables. This leads to the following limits:
You cannot use Hive SQL to access Kudu tables. Attempting to do so causes the SQL statement to fail and returns the following error:
FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hadoop.hive.kudu.KuduInputFormat
You cannot use Kudu tables as source tables for modeling. If a source table is a Kudu table, the execution fails.
Asset security scan tasks use Impala SQL to scan Kudu tables. If Impala is not enabled for the project where the scan task is located, Kudu tables cannot be scanned.
When a quality rule is executed for a Kudu table, Impala SQL is used for verification. If Impala is not enabled, the verification task fails.
The label platform does not support using Kudu tables as offline view tables.
The storage usage of Kudu tables cannot be retrieved.
The storage usage of Kudu tables is not available in asset details.
The empty table administration feature in resource administration does not support Kudu tables.
Quality rules for table size and partition size do not support Kudu tables.
Spark SQL service limits
To enable the Spark SQL service, the following limits apply in Dataphin:
Only Spark V3.x is supported.
Spark Thrift Server, Kyuubi, or Livy services must be deployed and enabled on the Hadoop cluster.
Dataphin does not verify database permissions for Spark Call commands. Use this feature with caution.
The service configurations for Spark SQL must be the same in the development and production compute sources. If they are different, you cannot configure Spark resource settings for Spark SQL tasks.
Compute engines and supported service types
Different compute engines support different service types. The following table provides details:
Compute engine type | Spark Thrift Server | Kyuubi | Livy | MapReduce (MRS) |
E-MapReduce 3.x | Supported | Supported | Not supported | Not supported |
E-MapReduce 5.x | Supported | Supported | Not supported | Not supported |
CDH 5.x, CDH 6.x | Not supported | Supported | Not supported | Not supported |
Cloudera Data Platform | Not supported | Supported | Supported | Not supported |
FusionInsight 8.x | Not supported | Not supported | Not supported | Supported |
AsiaInfo DP 5.3 | Supported | Supported | Not supported | Not supported |
Procedure
In the top menu bar of the Dataphin homepage, choose Planning > Compute Source.
On the Compute Source page, click + Add Compute Source, and then choose Hadoop Compute Source.
On the Create Compute Source page, set the following parameters.
You can configure the compute source by selecting Reference specified cluster or Configure separately. The available configuration items vary depending on the method you select.
Reference specified cluster configuration
Basic information of the compute source
Parameter
Description
Compute Source Type
The default value is Hadoop.
Compute Source Name
Observe the following naming conventions:
The name can contain only Chinese characters, letters, digits, underscores (_), and hyphens (-).
The name can be up to 64 characters in length.
Configuration Method
Select Reference Specified Cluster.
Data Lake Table Format
This feature is disabled by default. After you enable it, you can select a data lake table format.
If the compute engine is Cloudera Data Platform 7.x, the Hudi table format is supported.
If the compute engine is E-MapReduce 5.x, the Iceberg and Paimon table formats are supported.
Note: This parameter is available only if the compute engine is Cloudera Data Platform 7.x or E-MapReduce 5.x.
Compute Source Description
A brief description of the compute source. The description can be up to 128 characters in length.
Queue information configuration
Parameter
Description
Production Task Queue
Enter the YARN resource queue. This queue is used to run manual and auto triggered tasks in the production environment.
Other Task Queue
Enter the YARN resource queue. This queue is used for other tasks, such as ad hoc queries, data previews, and JDBC driver access.
Priority Task Queue
You can select Use Production Task Default Queue or Custom.
If you select Custom, you must enter the YARN resource queues that correspond to the highest, high, medium, low, and lowest priorities.
Hive compute engine configuration
Parameter
Description
Connection Information
You can select Reference Cluster Configuration or Configure Separately.
JDBC URL
You can configure one of the following types of endpoints:
The endpoint of the Hive server. Format: jdbc:hive2://{endpoint}:{port}/{database_name}.
The endpoint of ZooKeeper. Example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.
The endpoint with Kerberos authentication enabled. Format: jdbc:hive2://{endpoint}:{port}/{database_name};principal=hive/_HOST@xx.com.
A connection sketch that uses these URL formats follows this configuration section.
Note: You can modify the JDBC URL if you select Configure separately for Connection information. If you select Reference cluster configuration, the JDBC URL is view-only.
Database
Note: This parameter is available only if you select Reference cluster configuration for Connection information.
Enter the database name. The name cannot contain periods (.) and can be up to 256 characters in length.
Authentication Type
Note: This parameter is available only if you select Configure separately for Connection information.
The supported authentication methods are No Authentication, LDAP, and Kerberos.
No authentication: Enter the username for the Hive service.
LDAP: Enter the username and password for the Hive service.
Note: For the No authentication and LDAP methods, make sure that the specified user has the permissions to execute tasks.
Kerberos: If the Hadoop cluster uses Kerberos authentication, enable Hive Kerberos, upload the keytab file, and configure the principal.
Keytab File: Upload the keytab file. You can obtain this file from the Hive server.
Principal: Enter the Kerberos authentication username that corresponds to the Hive keytab file.
Execution Engine
Default: Nodes, including logical table tasks, in the project that is attached to this compute source use this execution engine by default.
Custom: Select another type of execution engine.
Spark Jar service configuration
Note: If you select Reference cluster configuration and the referenced cluster does not have the Spark local client enabled, you cannot configure the Spark Jar service.
Parameter
Description
Spark Execution Machine
If Spark is deployed on the Hadoop cluster, you can enable Spark Jar Tasks.
Spark Local Client
If the referenced cluster has the Spark local client enabled, this option is enabled by default.
After you disable it, the project that corresponds to the current compute source cannot use the Spark local client. If the project contains nodes, including draft nodes, that use the Spark local client, you cannot disable this option.
Spark SQL service configuration
Note: If you select Reference cluster configuration and the referenced cluster does not have the Spark SQL service enabled, you cannot configure the Spark SQL service.
Parameter
Description
Spark SQL task
If Spark is deployed on the Hadoop cluster, you can enable Spark SQL Tasks.
Note: If you select Paimon for Data lake table format, you cannot disable Spark SQL tasks.
Connection information
You can select Reference Cluster Configuration or Configure Separately.
Spark version
Only Spark V3.x is supported.
Service type
Select the type of the destination server for Spark JDBC access. Different compute engines support different service types. For more information, see Compute engines and supported service types.
JDBC URL
The JDBC URL of Spark. The database in the URL must be the same as the database specified in the Hive JDBC URL. A sketch that compares the two databases follows this configuration section.
Note: You can modify the JDBC URL if you select Configure separately for Connection information. If you select Reference cluster configuration, the JDBC URL is view-only.
Database
Note: This parameter is available only if you select Reference cluster configuration for Connection information.
Enter the database name. The name cannot contain periods (.) and can be up to 256 characters in length.
Authentication method
Note: This parameter is available only if you select Configure separately for Connection information.
The supported authentication methods are No Authentication, LDAP, and Kerberos.
No authentication: Enter the username for the Spark service.
LDAP: Enter the username and password for the Spark service.
Note: For the No authentication and LDAP methods, make sure that the specified user has the permissions to execute tasks.
Kerberos: If the Hadoop cluster uses Kerberos authentication, enable Spark Kerberos, upload the keytab file, and configure the principal.
Keytab File: Upload the keytab file. You can obtain this file from the Spark server.
Principal: Enter the Kerberos authentication username that corresponds to the Spark keytab file.
SQL task queue settings
Different service types use different SQL task queues. Details are as follows:
Spark Thrift Server: You cannot set a task queue.
Kyuubi: Uses the priority queue settings from the HDFS information configuration. This takes effect only when Kyuubi uses YARN for resource scheduling. Production tasks use the connection sharing level.
Livy: Uses the priority queue settings from the HDFS information configuration. This takes effect only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks are executed using a new connection.
MapReduce (MRS): Uses the priority queue settings from the HDFS information configuration.
Impala task configuration
Note: If you select Reference cluster configuration and the referenced cluster does not have Impala tasks enabled, you cannot configure the Impala task service.
Parameter
Description
Impala Task
If Impala is deployed on the Hadoop cluster, you can enable Impala tasks.
Connection Information
You can select Reference Cluster Configuration or Configure Separately.
JDBC URL
Enter the JDBC endpoint of Impala. Example:
jdbc:impala://host:port/database. The database in the JDBC URL must be the same as the database in the Hive JDBC URL. A connection sketch that also sets a request pool follows this configuration section.
Note: You can modify the JDBC URL if you select Configure separately for Connection information. If you select Reference cluster configuration, the JDBC URL is view-only.
Database
Note: This parameter is available only if you select Reference cluster configuration for Connection information.
Enter the database name. The name cannot contain periods (.) and can be up to 256 characters in length.
Authentication Type
Note: This parameter is available only if you select Configure separately for Connection information.
The supported authentication methods are No Authentication, LDAP, and Kerberos.
No authentication: Enter the Impala username.
LDAP: Enter the username and password for Impala.
Kerberos: Upload the keytab file and configure the principal.
Development Task Request Pool
Enter the name of the Impala request pool for development tasks.
Auto Triggered Task Request Pool
Enter the name of the Impala request pool for auto triggered tasks.
Priority Task Queue
You can select Use Auto Triggered Task Default Queue or Custom.
When Dataphin schedules Impala SQL tasks, it sends the tasks to the corresponding queues for execution based on their priorities. The priorities are highest, high, medium, low, and lowest.
If you customize the priority task queue, logical table tasks that are scheduled to run daily use the medium-priority task queue by default. Logical table tasks that are scheduled to run yearly or monthly use the low-priority task queue by default.
Configure separately
Basic information of the compute source
Parameter
Description
Compute Source Type
The default value is Hadoop.
Compute Source Name
Observe the following naming conventions:
The name can contain only Chinese characters, letters, digits, underscores (_), and hyphens (-).
The name can be up to 64 characters in length.
Configuration Method
Select Configure Separately.
Data Lake Table Format
This feature is disabled by default. After you enable it, you can select a data lake table format. Currently, only Hudi is supported.
Note: This parameter is available only if the compute engine is Cloudera Data Platform 7.x.
Compute Source Description
A brief description of the compute source. The description can be up to 128 characters in length.
Basic information of the cluster
Note: You can configure the basic information of the cluster only if you select Configure separately.
Parameter
Description
Cluster Storage
This parameter is set to the value configured in compute settings and cannot be changed. This parameter is not available for clusters that do not use OSS-HDFS storage.
NameNode
Click + Add. In the Add NameNode dialog box, configure the parameters. You can add multiple NameNodes.
The NameNode is the hostname or IP address and port of the NameNode in the HDFS cluster. Example:
NameNode: 192.168.xx.xx
Web UI Port: 50070
IPC Port: 8020
You must select at least one of the Web UI Port and IPC Port. After configuration, the NameNode is host=192.168.xx.xx,webUiPort=50070,ipcPort=8020.
Note: This parameter is available only if you set Cluster storage to HDFS.
Cluster Storage Root Directory
This parameter is set to the value configured in compute settings and cannot be changed. This parameter is not available for clusters that do not use OSS-HDFS storage.
AccessKey ID, AccessKey Secret
If the cluster storage type is OSS-HDFS, enter the AccessKey ID and AccessKey secret that are used to access the OSS of the cluster. For more information about how to view an AccessKey pair, see View the AccessKey pair of a RAM user.
Important: The configuration that you enter here has a higher priority than the AccessKey pair configured in the core-site.xml file.
core-site.xml
Upload the core-site.xml configuration file of the Hadoop cluster.
hdfs-site.xml
Upload the hdfs-site.xml configuration file of HDFS in the Hadoop cluster.
Note: You cannot upload the hdfs-site.xml configuration file if the cluster storage type is OSS-HDFS.
hive-site.xml
Upload the hive-site.xml configuration file of Hive in the Hadoop cluster.
yarn-site.xml
Upload the yarn-site.xml configuration file of YARN in the Hadoop cluster.
Other Configuration Files
Upload the keytab file. You can run the ipa-getkeytab command on a NameNode in the HDFS cluster to obtain the file.
Task Execution Machine
Configure the endpoint of the machine that executes MapReduce or Spark Jar tasks. Format: hostname:port or ip:port. The default port is 22.
Authentication Type
The supported authentication methods are No authentication and Kerberos.
Kerberos is an identity authentication protocol that is based on symmetric key technology. It provides identity authentication for other services and supports single sign-on (SSO). After a client is authenticated, it can access multiple services, such as HBase and HDFS.
If the Hadoop cluster uses Kerberos authentication, enable cluster Kerberos and upload the krb5.conf file or configure the KDC server address.
Important: When the compute engine type is E-MapReduce 5.x, only the Krb5 authentication file method is supported.
Krb5 authentication file: Upload the krb5.conf file for Kerberos authentication.
KDC server address: The address of the Key Distribution Center (KDC) server, which assists with Kerberos authentication.
Note: You can configure multiple KDC server addresses. Separate them with semicolons (;). A keytab login sketch is shown after this configuration section.
HDFS information configuration
Parameter
Description
Execution Username, Password
The username and password to log on to the task execution machine. They are used to execute MapReduce tasks and read data from and write data to HDFS.
Important: Make sure that you have the permissions to submit MapReduce tasks.
Authentication Type
The supported methods are No Authentication and Kerberos.
Note: If the cluster storage is OSS-HDFS, you cannot configure an HDFS authentication method. The AccessKey pair in the core-site.xml file is used by default.
If the Hadoop cluster uses Kerberos authentication, enable HDFS Kerberos, upload the keytab file, and configure the principal.
Keytab File: Upload the keytab file. You can obtain this file from the HDFS server.
Principal: Enter the Kerberos authentication username that corresponds to the HDFS keytab file.
HDFS User
The username for file uploads. If you leave this empty, the execution username is used by default. You can set this parameter when Kerberos is disabled.
Production Task Default Queue
Enter the YARN resource queue. This queue is used to run manual and auto triggered tasks in the production environment.
Other Task Queue
Enter the YARN resource queue. This queue is used for other tasks, such as ad hoc queries, data previews, and JDBC driver access.
Task Priority Queue
You can select Use Production Task Default Queue or Custom.
When Dataphin schedules Hive SQL tasks, it sends the tasks to the corresponding queues for execution based on their priorities. The priorities are highest, high, medium, low, and lowest.
If you set the Hive execution engine to Tez or Spark, you must configure different priority queues for the task priority settings to take effect.
Note: Logical table tasks that are scheduled to run daily or hourly use the medium-priority task queue by default.
Logical table tasks that are scheduled to run yearly or monthly use the low-priority task queue by default.
Hive compute engine configuration
Parameter
Description
JDBC URL
You can configure one of the following types of endpoints:
The endpoint of the Hive server. Format: jdbc:hive2://{endpoint}:{port}/{database_name}.
The endpoint of ZooKeeper. Example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.
The endpoint with Kerberos authentication enabled. Format: jdbc:hive2://{endpoint}:{port}/{database_name};principal=hive/_HOST@xx.com.
Authentication Type
Note: This parameter is available only if you select Configure separately for Connection information.
The supported authentication methods are No Authentication, LDAP, and Kerberos.
No authentication: Enter the username for the Hive service.
LDAP: Enter the username and password for the Hive service.
Note: For the No authentication and LDAP methods, make sure that the specified user has the permissions to execute tasks.
Kerberos: If the Hadoop cluster uses Kerberos authentication, enable Hive Kerberos, upload the keytab file, and configure the principal.
Keytab File: Upload the keytab file. You can obtain this file from the Hive server.
Principal: Enter the Kerberos authentication username that corresponds to the Hive keytab file.
Execution Engine
Default: Nodes, including logical table tasks, in the project that is attached to this compute source use this execution engine by default.
Custom: Select another type of execution engine.
Hive metadata configuration
Metadata retrieval method: Three metadata retrieval methods are supported: Metadata Database, HMS, and DLF. Each method requires different configuration information.
Important: The DLF retrieval method is supported only for clusters that use E-MapReduce 5.x Hadoop as the compute engine.
To use the DLF method to retrieve metadata, you must first upload the hive-site.xml configuration file.
Metadata retrieval method
Parameter
Description
Metadata Database
Database Type
Select a database based on the metadatabase type used in the cluster. Dataphin supports MySQL.
The supported MySQL versions are MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
JDBC URL
Enter the JDBC endpoint of the destination database. Example:
MySQL: The format is jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1=propertyValue1[&propertyName2=propertyValue2]...]. A connectivity sketch follows this configuration section.
Username, Password
Enter the username and password to log on to the metadatabase.
HMS
Authentication Type
The HMS retrieval method supports No authentication, LDAP, and Kerberos. The Kerberos authentication method requires you to upload a keytab file and configure a principal.
DLF
Endpoint
Enter the endpoint of the region where the data center of DLF for the cluster is located. To obtain the endpoint, see DLF regions and endpoints.
AccessKey ID, AccessKey Secret
Enter the AccessKey ID and AccessKey secret of the account to which the cluster belongs.
You can obtain the AccessKey ID and AccessKey secret of your account on the User Information Management page.
Spark Jar service configuration
Parameter
Description
Spark Execution Machine
If Spark is deployed on the Hadoop cluster, you can enable Spark Jar Tasks.
Execution Username, Password
Enter the username and password to log on to the task execution machine.
Important: Make sure that you have the permissions to submit MapReduce tasks.
Authentication Type
The supported authentication methods are No Authentication and Kerberos.
If the Hadoop cluster uses Kerberos authentication, enable Spark Kerberos, upload the keytab file, and configure the principal.
Keytab File: Upload the keytab file. You can obtain this file from the Spark server.
Principal: Enter the Kerberos authentication username that corresponds to the Spark keytab file.
Spark SQL service configuration
Parameter
Description
Spark SQL task
If Spark is deployed on the Hadoop cluster, you can enable Spark SQL Tasks.
Spark version
Only Spark V3.x is supported.
Service type
Select the type of the destination server for Spark JDBC access. Different compute engines support different service types. For more information, see Compute engines and supported service types.
JDBC URL
The JDBC URL of Spark. The database in the URL must be the same as the database specified in the Hive JDBC URL.
Authentication method
The supported authentication methods are No Authentication, LDAP, and Kerberos.
No authentication: Enter the username for the Spark service.
LDAP: Enter the username and password for the Spark service.
Note: For the No authentication and LDAP methods, make sure that the specified user has the permissions to execute tasks.
Kerberos: If the Hadoop cluster uses Kerberos authentication, enable Spark Kerberos, upload the keytab file, and configure the principal.
Keytab File: Upload the keytab file. You can obtain this file from the Spark server.
Principal: Enter the Kerberos authentication username that corresponds to the Spark keytab file.
SQL task queue settings
Different service types use different SQL task queues. Details are as follows:
Spark Thrift Server: You cannot set a task queue.
Kyuubi: Uses the priority queue settings from the HDFS information configuration. This takes effect only when Kyuubi uses YARN for resource scheduling. Production tasks use the connection sharing level.
Livy: Uses the priority queue settings from the HDFS information configuration. This takes effect only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks are executed using a new connection.
MapReduce (MRS): Uses the priority queue settings from the HDFS information configuration.
Impala task configuration
Parameter
Description
Impala Task
If Impala is deployed on the Hadoop cluster, you can enable Impala tasks.
JDBC URL
Enter the JDBC endpoint of Impala. Example:
jdbc:impala://host:port/database. The database in the JDBC URL must be the same as the database in the Hive JDBC URL.
Note: If you select Reference cluster configuration for the connection information, the JDBC URL is view-only.
Authentication Type
The supported authentication methods are No Authentication, LDAP, and Kerberos.
No authentication: Enter the Impala username.
LDAP: Enter the username and password for Impala.
Kerberos: Upload the keytab file and configure the principal.
Development Task Request Pool
Enter the name of the Impala request pool for development tasks.
Auto Triggered Task Request Pool
Enter the name of the Impala request pool for auto triggered tasks.
Priority Task Queue
You can select Use Auto Triggered Task Default Queue or Custom.
When Dataphin schedules Impala SQL tasks, it sends the tasks to the corresponding queues for execution based on their priorities. The priorities are highest, high, medium, low, and lowest.
If you customize the priority task queue, logical table tasks that are scheduled to run daily use the medium-priority task queue by default. Logical table tasks that are scheduled to run yearly or monthly use the low-priority task queue by default.
Click Test Connection to test the connection to the compute source.
After the connection test succeeds, click Submit.
What to do next
After you create a Hadoop compute source, you can attach it to a project. For more information, see Create a general-purpose project.