The Hadoop compute source allows you to integrate Dataphin project spaces with Hadoop projects, providing essential computing resources for Dataphin to manage offline computing tasks. When Dataphin's compute engine is set to Hadoop, adding a Hadoop compute source to the project space is necessary to enable standard modeling, ad hoc queries, Hive tasks, and general scripting. This topic guides you through the creation of a new Hadoop compute source.
Prerequisites
Before beginning, ensure the following conditions are met:
The Dataphin compute engine is set to Hadoop. For instructions, see Set the compute engine to Hadoop.
The Hive user must have the following permissions:
CREATEFUNCTION permission.
Important: The CREATEFUNCTION permission is required to register user-defined functions in Hive through Dataphin. Without this permission, creating user-defined functions or using Dataphin's asset security feature is not possible.
Read, write, and execute permissions for the HDFS directory where UDFs are stored.
The default HDFS directory for UDFs is /tmp/dataphin, but you may modify it as needed (a permission check sketch follows this list).
Deploy Impala (version 2.5 or higher) on the Hadoop cluster beforehand if you plan to enable Impala tasks for saved searches and data analysis.
For E-MapReduce 5.x compute engine users who want to use Hive foreign tables based on OSS for offline integration, pre-configuration is required. For configuration steps, see Use Hive foreign tables based on OSS for offline integration.
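A quick way to confirm the HDFS prerequisite is to check the UDF directory from a Hadoop client. The following Java sketch is illustrative only; it assumes the cluster's core-site.xml and hdfs-site.xml are on the classpath and that the process runs as the Hive user whose permissions you want to verify.

```java
// Illustrative check of the UDF directory prerequisite (not part of Dataphin itself).
// Assumes core-site.xml/hdfs-site.xml are on the classpath and the process runs
// as the Hive user whose access you want to verify.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

public class UdfDirCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Default UDF directory; replace if you customized it.
        Path udfDir = new Path("/tmp/dataphin");

        // Throws AccessControlException if the current user lacks read/write/execute access.
        fs.access(udfDir, FsAction.ALL);
        System.out.println("UDF directory is accessible, permission: "
                + fs.getFileStatus(udfDir).getPermission());
    }
}
```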
Impala Task Limits
The following limitations apply to Impala tasks in Dataphin:
Only Impala version 2.5 and above is supported.
Logical tables currently do not support the Impala engine; however, Impala can be used to query logical tables.
Dataphin's Impala data sources and compute sources connect through the Impala JDBC client to the Impala JDBC port (default 21050), not the Hive JDBC port. If you intend to create Impala tasks or data sources in Dataphin, confirm with your cluster provider that Impala JDBC connectivity is supported (a minimal reachability check follows this list).
Hive cannot access Kudu tables, resulting in the following restrictions:
Hive SQL cannot query Kudu tables. Attempts to do so fail with the error message FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hadoop.hive.kudu.KuduInputFormat.
Modeling cannot use Kudu tables as source tables. If a source table is a Kudu table, it will not function properly.
Security detection scan tasks that use Impala SQL to scan Kudu tables are not supported unless Impala is enabled in the project where the scan task is located.
Quality rule executions for Kudu tables use Impala SQL for verification. Without Impala enabled, these tasks will fail and report an error.
The label platform does not support Kudu tables as offline view tables.
Kudu table storage volume cannot be obtained.
Asset details do not include storage volume information for Kudu tables.
Resource administration's empty table management does not support Kudu tables.
Kudu tables are not supported by table size and partition size quality rules.
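Because Dataphin connects to the Impala JDBC port rather than the Hive JDBC port, a simple TCP reachability test can help rule out network issues before configuration. The sketch below is illustrative; the host name is a placeholder, and 21050 is the default Impala JDBC port mentioned above.

```java
// Minimal TCP reachability check for the Impala JDBC port.
// impala-host.example.com is a placeholder for your Impala daemon or load balancer host.
import java.net.InetSocketAddress;
import java.net.Socket;

public class ImpalaPortCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "impala-host.example.com";
        int port = 21050; // default Impala JDBC port; not the Hive JDBC port
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5_000);
            System.out.println("Reachable: " + host + ":" + port);
        }
    }
}
```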
Spark SQL Service Limits
The following limitations apply to Spark SQL services in Dataphin:
Only Spark version 3.x is supported.
Spark Thrift Server, Kyuubi, or Livy services must be deployed and enabled on the Hadoop cluster beforehand.
Dataphin does not verify database permissions for Spark Call commands. Exercise caution when enabling and using this feature.
Spark SQL configurations in production and development compute sources must be consistent. Inconsistencies will prevent configuration of Spark resource settings for Spark SQL tasks.
Compute engine and supported service types
Different compute engines support various service types. The details are as follows:
| Compute Engine Type | Spark Thrift Server | Kyuubi | Livy | MapReduce (MRS) |
| --- | --- | --- | --- | --- |
| E-MapReduce 3.x | Supported | Supported | Not Supported | Not Supported |
| E-MapReduce 5.x | Supported | Supported | Not Supported | Not Supported |
| CDH 5.X, CDH 6.X | Not Supported | Supported | Not Supported | Not Supported |
| Cloudera Data Platform | Not Supported | Supported | Supported | Not Supported |
| FusionInsight 8.X | Not Supported | Not Supported | Not Supported | Supported |
| AsiaInfo DP 5.3 | Supported | Supported | Not Supported | Not Supported |
Procedure
On the Dataphin homepage, navigate to the top menu bar and select Planning > Compute Source.
On the Compute Source page, click + Add Compute Source and choose Hadoop Compute Source.
On the New Compute Source page, configure the parameters as follows.
Configure each parameter of the compute source using either Reference Specified Cluster or Separate Configuration. The supported configuration items vary depending on the method chosen.
Reference Specified Cluster Configuration
Basic Information of Compute Source
Parameter
Description
Compute Source Type
Default is Hadoop.
Compute Source Name
The naming convention is as follows:
Can only contain Chinese, English, numbers, underscores (_), and hyphens (-).
Length cannot exceed 64 characters.
Configuration Method
Select Reference Specified Cluster.
Data Lake Table Format
Default is off. After enabling, you can select the data lake table format. Currently, only Hudi is supported.
Note: This item is supported only when the compute engine is Cloudera Data Platform 7.x.
Compute Source Description
A brief description of the compute source, within 128 characters.
Queue Information Configuration
Parameter
Description
Production Task Queue
Fill in the Yarn resource queue. The Yarn queue is used for manual and periodic task execution in the production environment.
Other Task Queue
Fill in the Yarn resource queue. The Yarn queue is used for other tasks such as ad hoc queries, data preview, and JDBC Driver access.
Priority Task Queue
You can choose Use Production Task Default Queue or Custom.
When selecting Custom, fill in a Yarn resource queue for each of the highest, high, medium, low, and lowest priorities.
Hive Compute Engine Configuration
Parameter
Description
Link Information
You can choose Reference Cluster Configuration or Separate Configuration.
JDBC URL
Supports the following three types of connection addresses (a connection sketch follows this table):
HiveServer connection address, in the format jdbc:hive://<connection address>:<port>/<database name>.
ZooKeeper connection address. For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.
Kerberos-enabled connection address, in the format jdbc:hive2://<connection address>:<port>/<database name>;principal=hive/_HOST@xx.com.
Note: When Link Information is set to Separate Configuration, the JDBC URL can be modified. When Link Information is set to Reference Cluster Configuration, the JDBC URL can only be viewed, not modified.
Database
Note: The database can be configured only when Link Information is set to Reference Cluster Configuration.
Enter the database name. Half-width periods (.) are not supported, and the length cannot exceed 256 characters.
Authentication Type
Note: The authentication type can be configured only when Link Information is set to Separate Configuration.
Supports No Authentication, LDAP, and Kerberos authentication types.
No Authentication: Fill in the username of the Hive service.
LDAP: Fill in the username and password of the Hive service.
Note: The users specified for the No Authentication and LDAP types must have task execution permissions so that tasks run normally.
Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Hive Kerberos, upload the Keytab File, and configure the Principal.
Keytab File: Upload the keytab file, which you can obtain from the Hive Server.
Principal: Fill in the Kerberos authentication username corresponding to the Hive Keytab File.
Execution Engine
Default: Tasks (including logical table tasks) under the project bound to this compute source use this execution engine by default.
Custom: Select other compute engine types.
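The connection sketch below shows how a HiveServer2-style JDBC URL of the kind configured in this table can be tested outside Dataphin. It assumes the hive-jdbc driver is on the classpath; the host, port, database, and credentials are placeholders, and the ZooKeeper or Kerberos URL forms described above can be substituted unchanged.

```java
// Illustrative smoke test for the Hive JDBC URL configured in this table.
// Requires the hive-jdbc driver on the classpath; all connection details are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hive-host.example.com:10000/default"; // placeholder HiveServer2 URL
        try (Connection conn = DriverManager.getConnection(url, "hive_user", "hive_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```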
Spark Jar Service Configuration
Note: If you select Reference Cluster Configuration as the configuration method and the referenced cluster has not enabled the Spark local client, the Spark Jar service cannot be configured.
Parameter
Description
Spark Execution Machine
If Spark is deployed in the Hadoop cluster, enabling Spark Jar Task is supported.
Spark Local Client
If the referenced cluster has enabled the Spark local client, it is enabled by default here.
After it is turned off, the project bound to the current compute source cannot use the Spark local client. If any tasks (including draft tasks) in that project use the Spark local client, it cannot be turned off.
Spark SQL Service Configuration
Note: If you select Reference Cluster Configuration as the configuration method and the referenced cluster has not enabled the Spark SQL service, the Spark SQL service cannot be configured.
Parameter
Description
Spark SQL Task
If Spark is deployed in the Hadoop cluster, enabling Spark SQL Task is supported.
Link Information
You can choose Reference Cluster Configuration or Separate Configuration.
Spark Version
Currently, only version 3.x is supported.
Service Type
Select the target server type for Spark JDBC access. Different compute engines support different service types. For more information, see Compute Engine and Supported Service Types.
JDBC URL
Spark's JDBC URL address. The database in the URL must be the same as the database specified in the Hive JDBC URL (a connection sketch follows this table).
Note: When Link Information is set to Separate Configuration, the JDBC URL can be modified. When Link Information is set to Reference Cluster Configuration, the JDBC URL can only be viewed, not modified.
Database
Note: The database can be configured only when Link Information is set to Reference Cluster Configuration.
Enter the database name. Half-width periods (.) are not supported, and the length cannot exceed 256 characters.
Authentication Type
Note: The authentication type can be configured only when Link Information is set to Separate Configuration.
Supports No Authentication, LDAP, and Kerberos authentication types.
No Authentication: Fill in the username of the Spark service.
LDAP: Fill in the username and password of the Spark service.
Note: The users specified for the No Authentication and LDAP types must have task execution permissions so that tasks run normally.
Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Spark Kerberos, upload the Keytab File, and configure the Principal.
Keytab File: Upload the keytab file, which you can obtain from the Spark Server.
Principal: Fill in the Kerberos authentication username corresponding to the Spark Keytab File.
SQL Task Queue Settings
Different service types use different SQL task queues. Details are as follows:
Spark Thrift Server: Task queue setting is not supported.
Kyuubi: Uses the priority queue settings configured in the HDFS information. This takes effect only when Kyuubi uses Yarn for resource scheduling. Production tasks use the Connection sharing level.
Livy: Uses the priority queue settings configured in the HDFS information. This takes effect only when Livy uses Yarn for resource scheduling. Ad hoc queries and production tasks each execute in a new Connection.
MapReduce (MRS): Uses the priority queue setting configured with HDFS information.
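Spark Thrift Server and Kyuubi both expose a HiveServer2-compatible JDBC endpoint, so the Spark SQL JDBC URL configured above can be smoke-tested in the same way as the Hive URL. The sketch below is illustrative; the host, port, database, and credentials are placeholders, and it assumes the hive-jdbc driver is available on the classpath.

```java
// Illustrative check of the Spark SQL JDBC endpoint (Spark Thrift Server or Kyuubi).
// All connection details are placeholders; the database should match the one in the Hive JDBC URL.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkSqlJdbcCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://spark-sql-host.example.com:10009/default"; // placeholder endpoint
        try (Connection conn = DriverManager.getConnection(url, "spark_user", "spark_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            while (rs.next()) {
                System.out.println("Spark version: " + rs.getString(1));
            }
        }
    }
}
```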
Impala Task Configuration
Note: If you select Reference Cluster Configuration as the configuration method and the referenced cluster has not enabled Impala tasks, the Impala task service cannot be configured.
Parameter
Description
Impala Task
If Impala is deployed in the Hadoop cluster, enabling Impala tasks is supported.
Link Information
You can choose Reference Cluster Configuration or Separate Configuration.
JDBC URL
Enter the JDBC connection address of Impala. For example, jdbc:Impala://host:port/database. The database in the JDBC URL must be consistent with the database in the Hive JDBC URL.
Note: When Link Information is set to Separate Configuration, the JDBC URL can be modified. When Link Information is set to Reference Cluster Configuration, the JDBC URL can only be viewed, not modified.
Database
Note: The database can be configured only when Link Information is set to Reference Cluster Configuration.
Enter the database name. Half-width periods (.) are not supported, and the length cannot exceed 256 characters.
Authentication Type
Note: The authentication type can be configured only when Link Information is set to Separate Configuration.
Supports No Authentication, LDAP, and Kerberos authentication types.
No Authentication: Fill in the Impala username.
LDAP: Fill in the Impala username and password.
Kerberos: Upload the Keytab File and configure the Principal.
Development Task Request Pool
Enter the name of the Impala request pool used for development tasks.
Periodic Task Request Pool
Enter the name of the Impala request pool used for periodic tasks.
Priority Task Queue
Supports Use Periodic Task Default Queue and Custom.
When scheduling Impala SQL tasks, Dataphin will send tasks to the corresponding queue for execution based on the task's priority. Priorities include highest priority, high priority, medium priority, low priority, and lowest priority.
When customizing the priority task queue, daily scheduled logical table tasks use the medium priority task queue by default; yearly and monthly scheduled logical table tasks use the low priority task queue by default.
Separate Configuration
Basic Information of Compute Source
Parameter
Description
Compute Source Type
Default is Hadoop.
Compute Source Name
The naming convention is as follows:
Can only contain Chinese, English, numbers, underscores (_), and hyphens (-).
Length cannot exceed 64 characters.
Configuration Method
Select Separate Configuration.
Data Lake Table Format
Default is off. After turning on the switch, you can select the data lake table format. Currently, only Hudi is supported.
Note: This item is supported only when the compute engine is Cloudera Data Platform 7.x.
Compute Source Description
A brief description of the compute source, within 128 characters.
Basic Information of Cluster
Note: You can configure the basic information of the cluster only if the Separate Configuration method is selected.
Parameter
Description
Cluster Storage
Defaults to the value configured in the compute settings and cannot be modified here. Non-OSS-HDFS cluster storage does not have this parameter.
NameNode
Click + Add and configure the related parameters in the Add NameNode dialog box. Multiple NameNodes can be added.
A NameNode entry is the hostname or IP address and port of a NameNode node in the HDFS cluster. Configuration example:
NameNode: 192.168.xx.xx
Web UI Port: 50070
IPC Port: 8020
At least one of the Web UI Port and IPC Port must be filled in. After configuration, the NameNode is host=192.168.xx.xx,webUiPort=50070,ipcPort=8020.
Note: This item is supported when the cluster storage is HDFS.
Cluster Storage Root Directory
Defaults to the value configured in the compute settings and cannot be modified here. Non-OSS-HDFS cluster storage does not have this parameter.
AccessKey ID, AccessKey Secret
When the cluster storage type is OSS-HDFS, you need to fill in the AccessKey ID and AccessKey Secret for accessing the cluster OSS. For how to view AccessKey, see View AccessKey.
Important: The configuration filled in here has a higher priority than the AccessKey configured in core-site.xml.
core-site.xml
Upload the core-site.xml configuration file of the Hadoop cluster.
hdfs-site.xml
Upload the hdfs-site.xml configuration file of HDFS under the Hadoop cluster.
Note: The OSS-HDFS cluster storage type does not support uploading an hdfs-site.xml configuration file.
hive-site.xml
Upload the hive-site.xml configuration file of Hive under the Hadoop cluster.
yarn-site.xml
Upload the yarn-site.xml configuration file of Yarn under the Hadoop cluster.
Other Configuration Files
Upload the keytab file, which you can obtain using the ipa-getkeytab command on the NameNode node in the HDFS cluster.
Task Execution Machine
Configure the connection address of the execution machine for MapReduce or Spark Jar tasks. The format is hostname:port or ip:port, and the default port is 22.
Authentication Type
Supports No Authentication and Kerberos authentication types.
Kerberos is an identity authentication protocol based on symmetric-key cryptography. It provides authentication for other services and supports SSO (after a client authenticates once, it can access multiple services such as HBase and HDFS).
If the Hadoop cluster has Kerberos authentication, you need to enable cluster Kerberos and either upload the Krb5 authentication file or configure the KDC Server address (a keytab login sketch follows this table):
Important: When the compute engine type is E-MapReduce 5.x, only the Krb5 File Configuration method is supported.
Krb5 Authentication File: Upload the krb5 file used for Kerberos authentication.
KDC Server Address: The KDC server address that assists in completing Kerberos authentication.
Note: Multiple KDC Server addresses can be configured, separated by half-width semicolons (;).
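The following sketch shows, under stated assumptions, how a Kerberos keytab and principal of the kind configured in this section are typically used to authenticate a Hadoop client. The krb5.conf path, principal, and keytab path are placeholders, not values Dataphin prescribes.

```java
// Illustrative Kerberos login with a keytab, as a Hadoop client would perform it.
// The krb5.conf path, principal, and keytab path are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginCheck {
    public static void main(String[] args) throws Exception {
        // Point the JVM at the krb5 configuration (which lists the KDC servers).
        System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");

        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Principal and keytab correspond to the values configured in Dataphin.
        UserGroupInformation.loginUserFromKeytab("hive/node01.example.com@EXAMPLE.COM",
                "/path/to/hive.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
    }
}
```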
HDFS Information Configuration
Parameter
Description
Execution Username, Password
Username and password for logging into the compute execution machine, used for executing MapReduce tasks, reading and writing HDFS storage, etc.
Important: Please ensure you have permission to submit MapReduce tasks.
Authentication Type
Supports No Authentication and Kerberos.
Note: When the cluster storage is OSS-HDFS, configuring HDFS authentication methods is not supported. The AccessKey in the core-site.xml file will be used by default.
If the Hadoop cluster has Kerberos authentication, you need to enable HDFS Kerberos and upload the Keytab File authentication file and configure the Principal.
Keytab File: Upload the keytab file, which you can obtain from the HDFS Server.
Principal: Fill in the Kerberos authentication username corresponding to the HDFS Keytab File.
HDFS User
Specify the username used for file uploads. If left empty, the execution username is used by default. When a value is filled in, Kerberos can be turned off.
Production Task Default Task Queue
Fill in the Yarn resource queue. The Yarn queue is used for manual and periodic task execution in the production environment (the sketch after this table shows how a queue name is applied to a job).
Other Task Queue
Fill in the Yarn resource queue. The Yarn queue is used for other tasks such as ad hoc queries, data preview, and JDBC Driver access.
Task Priority Queue
Supports Use Production Task Default Queue or Custom.
When scheduling Hive SQL tasks, Dataphin will send tasks to the corresponding queue for execution based on the task's priority. Priorities include highest priority, high priority, medium priority, low priority, and lowest priority.
When the execution engine of Hive is set to Tez or Spark, different priority queues must be set for the task priority settings to take effect.
Note: Daily and hourly scheduled logical table tasks use the medium priority task queue by default.
Yearly and monthly scheduled logical table tasks use the low priority task queue by default.
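To illustrate what the Yarn resource queues above control, the sketch below shows how a queue name is attached to a MapReduce job submission. The queue name dataphin_prod is a hypothetical example, not a value Dataphin requires.

```java
// Illustrative only: how a Yarn resource queue name is applied when submitting
// a MapReduce job. "dataphin_prod" is a hypothetical queue name.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueRoutingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Jobs submitted with this property land in the named Yarn queue,
        // which is what the production/other/priority queue settings map to.
        conf.set("mapreduce.job.queuename", "dataphin_prod");

        Job job = Job.getInstance(conf, "queue-routing-example");
        System.out.println("Queue: " + job.getConfiguration().get("mapreduce.job.queuename"));
    }
}
```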
Hive Compute Engine Configuration
Parameter
Description
JDBC URL
Supports configuring the following three types of connection addresses:
HiveServer connection address, in the format jdbc:hive://<connection address>:<port>/<database name>.
ZooKeeper connection address. For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.
Kerberos-enabled connection address, in the format jdbc:hive2://<connection address>:<port>/<database name>;principal=hive/_HOST@xx.com.
Authentication Type
Note: The authentication type can be configured only when Link Information is set to Separate Configuration.
Supports No Authentication, LDAP, and Kerberos authentication types.
No Authentication: Fill in the username of the Hive service.
LDAP: Fill in the username and password of the Hive service.
Note: The users specified for the No Authentication and LDAP types must have task execution permissions so that tasks run normally.
Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Hive Kerberos, upload the Keytab File, and configure the Principal.
Keytab File: Upload the keytab file, which you can obtain from the Hive Server.
Principal: Fill in the Kerberos authentication username corresponding to the Hive Keytab File.
Execution Engine
Default: Tasks (including logical table tasks) under the project bound to this compute source use this execution engine by default.
Custom: Select other compute engine types.
Hive Metadata Configuration
Metadata Retrieval Method: Metadata can be sourced through one of three methods: Metadata Database, HMS, or DLF. Each method requires its own configuration details.
Important: The DLF method is compatible only with E-MapReduce 5.x Hadoop clusters.
To use the DLF method for metadata retrieval, first upload the hive-site.xml configuration file.
Metadata Retrieval Method
Parameter
Description
Metadata Database
Database Type
Select the database according to the type of metadatabase used in the cluster. Dataphin supports selecting MySQL.
Supported versions of MySQL include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
JDBC URL
Fill in the JDBC connection address of the target database (a connectivity sketch follows this table). For example:
MySQL: The format is jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1][=propertyValue1][&propertyName2][=propertyValue2]....
Username, Password
Fill in the username and password for logging into the metadatabase.
HMS
Authentication Type
HMS retrieval methods support No Authentication, LDAP, and Kerberos authentication methods. The Kerberos authentication method requires you to upload the Keytab file and configure the Principal.
DLF
Endpoint
Fill in the Endpoint of the cluster in the region where the DLF data center is located. For how to obtain it, see Supported regions and endpoints.
AccessKey ID, AccessKey Secret
Fill in the AccessKey ID and AccessKey Secret of the account where the cluster is located.
You can obtain the AccessKey ID and AccessKey Secret of the account on the User Information Management page.
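As a quick connectivity check for the Metadata Database method, the sketch below connects to the Hive metastore's MySQL database using the JDBC URL format shown above and lists a few table names from the metastore's TBLS table. The host, schema name, and credentials are placeholders, and the MySQL JDBC driver is assumed to be on the classpath.

```java
// Illustrative connectivity check for the Hive metastore database (Metadata Database method).
// Host, schema, and credentials are placeholders; requires the MySQL JDBC driver on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MetastoreDbCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://mysql-host.example.com:3306/hivemetastore"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "metastore_user", "metastore_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT TBL_NAME FROM TBLS LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("TBL_NAME"));
            }
        }
    }
}
```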
Spark Jar Service Configuration
Parameter
Description
Spark Execution Machine
If Spark is deployed in the Hadoop cluster, enabling Spark Jar Task is supported.
Execution Username, Password
Fill in the username and password for logging into the compute execution machine.
Important: Please ensure you have permission to submit MapReduce tasks.
Authentication Type
Supports No Authentication or Kerberos authentication types.
If the Hadoop cluster has Kerberos authentication, you need to enable Spark Kerberos and upload the Keytab File authentication file and configure the Principal.
Keytab File: Upload the keytab file, which you can obtain from the Spark Server.
Principal: Fill in the Kerberos authentication username corresponding to the Spark Keytab File.
Spark SQL Service Configuration
Parameter
Description
Spark SQL Task
If Spark is deployed in the Hadoop cluster, enabling Spark SQL Task is supported.
Spark Version
Currently, only version 3.x is supported.
Service Type
Select the target server type for Spark JDBC access. Different compute engines support different service types. For more information, see Compute Engine and Supported Service Types.
JDBC URL
Spark's JDBC URL address. The database in the URL needs to be the same as the database specified in the Hive JDBC URL.
Authentication Type
Supports No Authentication, LDAP, and Kerberos authentication types.
No Authentication: Fill in the username of the Spark service.
LDAP: Fill in the username and password of the Spark service.
Note: The users specified for the No Authentication and LDAP types must have task execution permissions so that tasks run normally.
Kerberos: If the Hadoop cluster has Kerberos authentication, you need to enable Spark Kerberos, upload the Keytab File, and configure the Principal.
Keytab File: Upload the keytab file, which you can obtain from the Spark Server.
Principal: Fill in the Kerberos authentication username corresponding to the Spark Keytab File.
SQL Task Queue Settings
Different service types use different SQL task queues. Details are as follows:
Spark Thrift Server: Task queue setting is not supported.
Kyuubi: Uses the priority queue settings configured in the HDFS information. This takes effect only when Kyuubi uses Yarn for resource scheduling. Production tasks use the Connection sharing level.
Livy: Uses the priority queue settings configured in the HDFS information. This takes effect only when Livy uses Yarn for resource scheduling. Ad hoc queries and production tasks each execute in a new Connection.
MapReduce (MRS): Uses the priority queue setting configured with HDFS information.
Impala Task Configuration
Parameter
Description
Impala Task
If Impala is deployed in the Hadoop cluster, enabling Impala tasks is supported.
JDBC URL
Enter the JDBC connection address of Impala. For example, jdbc:Impala://host:port/database. The database in the JDBC URL must be consistent with the database in the Hive JDBC URL.
Note: When Link Information is set to Reference Cluster Configuration, the JDBC URL can only be viewed, not modified.
Authentication Type
Supports No Authentication, LDAP, and Kerberos authentication types.
No Authentication: Fill in the Impala username.
LDAP: Fill in the Impala username and password.
Kerberos: Upload the Keytab File and configure the Principal.
Development Task Request Pool
Enter the name of the Impala request pool used for development tasks.
Periodic Task Request Pool
Enter the name of the Impala request pool used for periodic tasks.
Priority Task Queue
Supports Use Periodic Task Default Queue and Custom.
When scheduling Impala SQL tasks, Dataphin will send tasks to the corresponding queue for execution based on the task's priority. Priorities include highest priority, high priority, medium priority, low priority, and lowest priority.
When customizing the priority task queue, daily scheduled logical table tasks use the medium priority task queue by default; yearly and monthly scheduled logical table tasks use the low priority task queue by default.
Click Test Connection to verify the connectivity of the compute source.
Once the connection test is successful, click Submit to finalize the creation of the Hadoop compute source.
What to Do Next
Once you have created the Hadoop compute source, you can associate it with your project. For more information, see Create a General Project.