A Hadoop compute source binds a Dataphin project to a Hadoop cluster. It provides the compute resources needed to run offline computing tasks in Dataphin. If you set the Dataphin compute engine to Hadoop, only projects with a Hadoop compute source can use features such as standard modeling, ad hoc queries, Hive tasks, and generic scripts. This topic describes how to create a Hadoop compute source.
Prerequisites
Before you begin, make sure that the following requirements are met:
Set the Dataphin compute engine to Hadoop. For more information, see Set the compute engine to Hadoop.
Make sure the Hive user has the following permissions:
CREATEFUNCTION permission.
Important: You need this permission to register user-defined functions (UDFs) in Hive through Dataphin. Without this permission, you cannot create UDFs in Dataphin or use Dataphin's asset security features.
Read, write, and execute permissions on the HDFS directory where UDFs are stored. The default UDF directory in HDFS is /tmp/dataphin, and you can change this directory. A minimal sketch for checking and preparing this directory appears after these prerequisites.
If you plan to run Impala tasks for fast queries and data analysis, you must first deploy Impala (version 2.5 or later) on your Hadoop cluster.
If you use E-MapReduce 5.x as the compute engine and want to use Hive external tables based on OSS for offline integration, you must first configure your environment. For more information, see Use Hive external tables based on OSS for offline integration.
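To verify the HDFS permission prerequisite above, the following minimal sketch uses the standard Hadoop FileSystem Java API to check the UDF directory and create it with read, write, and execute permissions for the owner and group if it is missing. The directory /tmp/dataphin, the permission mode, and the assumption that core-site.xml and hdfs-site.xml are on the classpath are illustrative; adjust them to match your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class UdfDirCheck {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml and hdfs-site.xml from the classpath (assumed present).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Default UDF directory; change this if you customized it in Dataphin.
        Path udfDir = new Path("/tmp/dataphin");
        if (!fs.exists(udfDir)) {
            // 0770: read, write, and execute for owner and group (illustrative mode).
            fs.mkdirs(udfDir, new FsPermission((short) 0770));
        }
        System.out.println("UDF directory permission: " + fs.getFileStatus(udfDir).getPermission());
        fs.close();
    }
}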
Impala task limits
If you enable Impala tasks for fast queries and data analysis, the following limits apply in Dataphin:
Only Impala version 2.5 or later is supported.
Logical tables do not support the Impala execution engine. However, you can query logical tables using Impala.
Dataphin connects to the Impala JDBC port (default: 21050) using the Impala JDBC client. Hive JDBC ports are not supported. Before you create an Impala task or data source in Dataphin, check with your cluster provider to confirm that Impala JDBC connections are supported.
Because Hive cannot access Kudu tables, the following limits apply:
Hive SQL cannot access Kudu tables. If you attempt to access them, the SQL execution fails with the following error:
FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hadoop.hive.kudu.KuduInputFormat
You cannot use Kudu tables as source tables for modeling. Tasks that use Kudu source tables will fail.
Security scan tasks use Impala SQL to scan Kudu tables. If Impala is not enabled in the project where the scan task runs, Kudu table scanning is not supported.
Quality rule checks use Impala SQL for Kudu tables. If Impala is not enabled, the quality check fails.
The tag platform does not support Kudu tables as offline view tables.
Dataphin does not support retrieving the storage size of Kudu tables.
Storage size information for Kudu tables is not available in asset details.
The empty-table governance feature in resource administration does not support Kudu tables.
Kudu tables do not support quality rules for table size or table partition size.
Spark SQL service limits
If you enable the Spark SQL service, the following limits apply in Dataphin:
Only Spark version 3.x is supported.
You must deploy and start one of the following services on your Hadoop cluster: Spark Thrift Server, Kyuubi, or Livy.
Dataphin does not validate data permissions for Spark Call commands. Use them with caution.
The Spark SQL service configuration must be identical for both production and development compute sources. If the configurations differ, you cannot configure Spark resource settings for Spark SQL tasks.
Compute engines and supported service types
The supported service types vary based on the compute engine.
Compute engine type | Spark Thrift Server | Kyuubi | Livy | MapReduce (MRS) |
E-MapReduce 3.x | Supported | Supported | Not supported | Not supported |
E-MapReduce 5.x | Supported | Supported | Not supported | Not supported |
CDH 5.x, CDH 6.x | Not supported | Supported | Not supported | Not supported |
Cloudera Data Platform | Not supported | Supported | Supported | Not supported |
FusionInsight 8.x | Not supported | Not supported | Not supported | Supported |
AsiaInfo DP 5.3 | Supported | Supported | Not supported | Not supported |
Amazon EMR | Supported | Not supported | Not supported | Supported |
Procedure
In the top menu bar on the Dataphin homepage, choose Planning > Compute Source.
On the Compute Source page, click + Add compute source, and then select Hadoop compute source.
On the Create Compute Source page, configure the parameters.
You can configure the compute source by either referencing a specified cluster or using a standalone configuration. The available parameters depend on the method that you select.
Reference a specified cluster
Basic compute source information
Parameter
Description
Compute source type
Default: Hadoop.
Compute source name
Naming rules:
The value can contain only English letters, digits, underscores (_), hyphens (-), and Chinese characters.
Maximum length: 64 characters.
Configuration method
Select Reference a specified cluster.
Data lake table format
Disabled by default. Enable it to select a data lake table format.
For Cloudera Data Platform 7.x, supported formats: Hudi.
For E-MapReduce 5.x, supported formats: Iceberg and Paimon.
Note: This option is supported only for Cloudera Data Platform 7.x or E-MapReduce 5.x.
Compute source description
A brief description. Maximum length: 128 characters.
Queue information
Parameter
Description
Production task queue
Enter the YARN resource queue used for manual and scheduled tasks in production environments.
Other task queues
Enter the YARN resource queue used for other tasks, such as ad hoc queries, data previews, and JDBC Driver access.
Priority task queue
Select Use production task default queue or Custom.
If you select Custom, enter separate YARN resource queues for highest, high, medium, low, and lowest priority tasks.
Hive compute engine configuration
Parameter
Description
Connection information
Select Reference cluster configuration or Standalone configuration.
JDBC URL
Supports three connection address formats:
HiveServer connection address: jdbc:hive://{connection address}:{port}/{database name}
ZooKeeper connection address, for example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Kerberos-enabled connection address: jdbc:hive2://{connection address}:{port}/{database name};principal=hive/_HOST@xx.com
A minimal connectivity sketch appears after this configuration table.
Note: If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view it.
For E-MapReduce 3.x, E-MapReduce 5.x, or Cloudera Data Platform, Kerberos-enabled JDBC URLs cannot contain multiple IP addresses.
Database
Note: You can configure the database only if you select Reference cluster configuration.
Enter the database name. Do not use periods (.). Maximum length: 256 characters.
Authentication Type
Note: You can configure authentication only if you select Standalone configuration.
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Hive service username.
LDAP: Enter the Hive service username and password.
Note: The users you specify for no authentication or LDAP must have task execution permissions.
Kerberos: If your Hadoop cluster uses Kerberos, enable Hive Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Hive Server.
Principal: Enter the Kerberos username for the Hive keytab file.
Execution engine
Default: Tasks in projects bound to this compute source—including logical table tasks—use this execution engine by default.
Custom: Select another compute engine type.
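The following minimal sketch, referenced from the JDBC URL description above, validates a Hive JDBC URL outside Dataphin before you save the compute source. It assumes the Apache Hive JDBC driver (hive-jdbc) is on the classpath and that the cluster uses no authentication or LDAP; the host, port, database, username, and password are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Load the Apache Hive JDBC driver (assumed to be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder URL; use the same value you plan to enter as the JDBC URL.
        String url = "jdbc:hive2://hive-host:10000/dataphin_db";
        try (Connection conn = DriverManager.getConnection(url, "hive_user", "hive_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

If the query returns the expected databases, the URL, credentials, and network path are valid for the Hive compute engine configuration.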
Spark JAR service configuration
Note: You cannot configure the Spark JAR service if you select Reference cluster configuration and the referenced cluster does not have the Spark local client enabled.
Parameter
Description
Spark execution machine
If Spark is deployed on your Hadoop cluster, you can enable Spark JAR tasks.
Spark local client
If the referenced cluster has Spark local client enabled, this option is enabled by default.
Click to disable it. After disabling, projects linked to this compute source cannot use Spark local client. You cannot disable it if any task in the linked project—including draft tasks—uses Spark local client.
Spark SQL service configuration
Note: You cannot configure the Spark SQL service if you select Reference cluster configuration and the referenced cluster does not have the Spark SQL service enabled.
Parameter
Description
Spark SQL tasks
If Spark is deployed on your Hadoop cluster, you can enable Spark SQL tasks.
Note: If you selected Paimon as the data lake table format, you cannot disable Spark SQL tasks.
Connection information
Select Reference cluster configuration or Standalone configuration.
Spark version
Only version 3.x is supported.
Service type
Select the server type for Spark JDBC access. Supported service types vary by compute engine. For more information, see Compute engines and supported service types.
JDBC URL
The Spark JDBC URL. Its database must match the database in the Hive JDBC URL.
Note: If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view the JDBC URL.
Database
Note: You can configure the database only if you select Reference cluster configuration.
Enter the database name. Do not use periods (.). Maximum length: 256 characters.
Authentication Type
Note: You can configure authentication only if Connection information is set to Standalone configuration.
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Spark service username.
LDAP: Enter the Spark service username and password.
Note: The users you specify for no authentication or LDAP must have task execution permissions.
Kerberos: If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Spark Server.
Principal: Enter the Kerberos username for the Spark keytab file.
SQL task queue settings
Different service types use different SQL task queues. Details:
Spark Thrift Server: Task queues are not supported.
Kyuubi: Uses the priority queue configured in HDFS settings. Applies only when Kyuubi uses YARN for resource scheduling. Production tasks use shared connections.
Livy: Uses the priority queue configured in HDFS settings. Applies only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks use new connections.
MapReduce (MRS): Uses the priority queue configured in HDFS settings.
Impala task configuration
Note: You cannot configure the Impala task service if you select Reference cluster configuration and the referenced cluster does not have Impala tasks enabled.
Parameter
Description
Impala tasks
If Impala is deployed on your Hadoop cluster, you can enable Impala tasks.
Connection information
Select Reference cluster configuration or Standalone configuration.
JDBC URL
Enter the Impala JDBC connection address. Example: jdbc:impala://host:port/database. The database in this URL must match the database in the Hive JDBC URL. A minimal connectivity sketch appears after this configuration table.
Note: If you select Standalone configuration, you can edit the JDBC URL. If you select Reference cluster configuration, you can only view the JDBC URL.
Database
Note: You can configure the database only if you select Reference cluster configuration.
Enter the database name. Do not use periods (.). Maximum length: 256 characters.
Authentication Type
Note: You can configure authentication only if Connection information is set to Standalone configuration.
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Impala username.
LDAP: Enter the Impala username and password.
Kerberos: Upload the Keytab File and configure the Principal.
Development task request pool
Enter the Impala request pool name for development tasks.
Scheduled task request pool
Enter the Impala request pool name for scheduled tasks.
Priority task queue
Choose Use scheduled task default queue or Custom.
Dataphin routes Impala SQL tasks to queues based on priority: highest, high, medium, low, and lowest.
When customizing, daily logical table tasks use the medium-priority queue by default. Yearly and monthly logical table tasks use the low-priority queue by default.
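The following minimal sketch, referenced from the Impala JDBC URL description above, checks that the Impala JDBC port (default 21050) accepts connections before you enable Impala tasks. It assumes an Impala JDBC driver that registers itself for the jdbc:impala:// subprotocol is on the classpath and that no authentication is required; the host, port, and database are placeholders, and driver-specific authentication properties may be needed on your cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; the database must match the one in the Hive JDBC URL.
        String url = "jdbc:impala://impala-host:21050/dataphin_db";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // version() is a built-in Impala function; any lightweight query works here.
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            if (rs.next()) {
                System.out.println("Connected to " + rs.getString(1));
            }
        }
    }
}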
Standalone configuration
Basic compute source information
Parameter
Description
Compute source type
Default: Hadoop.
Compute source name
Naming rules:
Use only letters, digits, underscores (_), and hyphens (-).
Maximum length: 64 characters.
Configuration method
Select Standalone configuration.
Data lake table format
Disabled by default. Enable it to select a data lake table format. Only Hudi is supported.
Note: This option is supported only for Cloudera Data Platform 7.x.
Compute source description
A brief description. Maximum length: 128 characters.
Cluster basic information
Note: You can configure basic cluster information only if you select Standalone configuration.
Parameter
Description
Cluster storage
Uses default values from compute settings. Not configurable. Not applicable for non-OSS-HDFS clusters.
NameNode
Click + Add. In the Add NameNode dialog box, configure parameters. You can add multiple NameNodes.
A NameNode is the host name or IP address and port of a NameNode node in an HDFS cluster. Example:
NameNode: 192.168.xx.xx
Web UI Port: 50070
IPC Port: 8020
At least one of Web UI Port or IPC Port is required. After configuration, the NameNode appears as:
host=192.168.xx.xx,webUiPort=50070,ipcPort=8020
Note: This option is supported only for HDFS clusters.
Cluster storage root directory
Uses default values from compute settings. Not configurable. Not applicable for non-OSS-HDFS clusters.
AccessKey ID and AccessKey Secret
If the cluster storage type is OSS-HDFS, you must specify the AccessKey ID and AccessKey secret used to access the cluster's OSS. Use an existing AccessKey or refer to Create an AccessKey pair to create a new one.
Important: To reduce the risk of AccessKey exposure, the AccessKey Secret appears only once during creation and cannot be viewed later. Store it securely.
Settings here override those in core-site.xml.
core-site.xml
Upload the core-site.xml configuration file from your Hadoop cluster.
hdfs-site.xml
Upload the hdfs-site.xml configuration file from your Hadoop cluster’s HDFS.
Note: The hdfs-site.xml configuration file cannot be uploaded if the cluster storage type is OSS-HDFS.
hive-site.xml
Upload the hive-site.xml configuration file from your Hadoop cluster’s Hive.
yarn-site.xml
Upload the yarn-site.xml configuration file from your Hadoop cluster's YARN.
Other configuration files
Upload the keytab file. Get it from the NameNode in your HDFS cluster using the ipa-getkeytab command.
Task execution machine
Configure the connection address for MapReduce or Spark JAR execution machines. Format:
hostname:port or ip:port. Default port: 22.
Authentication Type
Supported methods: No authentication and Kerberos.
Kerberos is a symmetric-key-based identity authentication protocol. It supports single sign-on (SSO), letting authenticated clients access multiple services such as HBase and HDFS.
If your Hadoop cluster uses Kerberos, enable cluster Kerberos and upload the krb5 file or configure the KDC server address:
Important: For E-MapReduce 5.x, only krb5 file configuration is supported.
Krb5 authentication file: Upload the krb5 file for Kerberos authentication.
KDC server address: The KDC server address used to complete Kerberos authentication.
Note: You can configure multiple KDC server addresses, separated by semicolons (;).
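If you want to confirm that a keytab and principal pair authenticates against your KDC before uploading them to Dataphin, the following minimal sketch uses the Hadoop UserGroupInformation API. The krb5.conf path, principal, and keytab path are placeholders, and hadoop-common is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabCheck {
    public static void main(String[] args) throws Exception {
        // Path to the krb5 file used for Kerberos authentication (placeholder).
        System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Placeholder principal and keytab path; use the values you plan to upload.
        UserGroupInformation.loginUserFromKeytab(
                "hive/host.example.com@EXAMPLE.COM",
                "/path/to/hive.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}

If the login succeeds, the keytab and principal are consistent with the KDC configuration and can be uploaded in the Kerberos authentication settings below.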
HDFS Configuration
Parameter
Description
Execution username and Password
Username and password to log on to the compute execution machine. Used for running MapReduce tasks and reading or writing HDFS storage.
Important: Ensure the user has permission to submit MapReduce tasks.
Authentication Type
Supported methods: No authentication and Kerberos.
Note: HDFS authentication is not supported for OSS-HDFS clusters. The AccessKey pair in core-site.xml is used by default.
If your Hadoop cluster uses Kerberos, enable HDFS Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the HDFS Server.
Principal: Enter the Kerberos username for the HDFS keytab file.
HDFS User
Specify the username for file uploads. If left blank, the execution username is used. Fill this in only when Kerberos is disabled.
Production task default queue
Enter the YARN resource queue used for manual and scheduled tasks in production environments.
Other task queues
Enter the YARN resource queue used for other tasks, such as ad hoc queries, data previews, and JDBC Driver access.
Task priority queue
Select Use production task default queue or Custom.
Dataphin routes Hive SQL tasks to queues based on priority: highest, high, medium, low, and lowest.
If Hive uses Tez or Spark as the execution engine, you must assign different priority queues for task priorities to take effect.
Note: Daily and hourly logical table tasks use the medium-priority queue by default.
Yearly and monthly logical table tasks use the low-priority queue by default.
Hive compute engine configuration
Parameter
Description
JDBC URL
Supports three connection address formats:
HiveServer connection address: jdbc:hive://{connection address}:{port}/{database name}
ZooKeeper connection address, for example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Kerberos-enabled connection address: jdbc:hive2://{connection address}:{port}/{database name};principal=hive/_HOST@xx.com
Note: For E-MapReduce 3.x, E-MapReduce 5.x, or Cloudera Data Platform, Kerberos-enabled JDBC URLs cannot contain multiple IP addresses.
Authentication Type
Note: You can configure authentication only if you select Standalone configuration.
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Hive service username.
LDAP: Enter the Hive service username and password.
Note: The users you specify for no authentication or LDAP must have task execution permissions.
Kerberos: If your Hadoop cluster uses Kerberos, enable Hive Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Hive Server.
Principal: Enter the Kerberos username for the Hive keytab file.
Execution engine
Default: Tasks in projects bound to this compute source—including logical table tasks—use this execution engine by default.
Custom: Select another compute engine type.
Hive metadata configuration
Metadata retrieval method: You can choose from three methods: Metadata Database, HMS, and DLF. The required parameters depend on the method that you select.
Important: DLF is supported only for clusters that use E-MapReduce 5.x.
To use DLF, you must first upload the hive-site.xml configuration file.
Metadata retrieval method
Parameter
Description
Metadata Database
Database type
Select the database type used in your cluster. Dataphin supports MySQL.
Supported MySQL versions include MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
JDBC URL
Enter the JDBC connection address for the target database. Example for MySQL:
jdbc:mysql://{connection address}[,failoverhost...]:{port}/{database name}[?propertyName1=propertyValue1[&propertyName2=propertyValue2]...]
A minimal metastore connectivity sketch appears after this configuration.
Username and Password
Enter the username and password to log on to the metadata database.
HMS
Authentication Type
HMS supports No authentication, LDAP, and Kerberos. For Kerberos, upload the Keytab File and configure the Principal.
DLF
Endpoint
Enter the DLF endpoint for the region where your cluster resides. For instructions, see Supported regions and endpoints.
AccessKey ID and AccessKey Secret
Enter the AccessKey ID and AccessKey secret of the account that owns the cluster. Use an existing AccessKey or see Create an AccessKey pair to create a new one.
Note: To reduce the risk of AccessKey exposure, the AccessKey Secret appears only once during creation and cannot be viewed later. Store it securely.
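The following minimal sketch, referenced from the Metadata Database configuration above, checks that the metadata database credentials work and that the schema looks like a Hive metastore (the metastore schema contains a DBS table). It assumes the MySQL Connector/J driver is on the classpath; the URL and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MetastoreDbCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder metastore database URL; use the value you plan to enter as the JDBC URL.
        String url = "jdbc:mysql://metastore-host:3306/hive_metastore";
        try (Connection conn = DriverManager.getConnection(url, "meta_user", "meta_password");
             Statement stmt = conn.createStatement();
             // DBS is a standard Hive metastore table that lists Hive databases.
             ResultSet rs = stmt.executeQuery("SELECT NAME FROM DBS LIMIT 5")) {
            while (rs.next()) {
                System.out.println("Hive database: " + rs.getString("NAME"));
            }
        }
    }
}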
Spark JAR service configuration
Parameter
Description
Spark Executor
If Spark is deployed on your Hadoop cluster, you can enable Spark JAR tasks.
Execution username and Password
Enter the username and password to log on to the compute execution machine.
Important: The user must have permission to submit MapReduce tasks.
Authentication Type
Supported methods: No authentication or Kerberos.
If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Spark Server.
Principal: Enter the Kerberos username for the Spark keytab file.
Spark SQL service configuration
Parameter
Description
Spark SQL tasks
If Spark is deployed on your Hadoop cluster, you can enable Spark SQL tasks.
Spark version
Only version 3.x is supported.
Service type
Select the server type for Spark JDBC access. Supported service types vary by compute engine. For more information, see Compute engines and supported service types.
JDBC URL
The Spark JDBC URL. Its database must match the database in the Hive JDBC URL.
Authentication Type
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Spark service username.
LDAP: Enter the Spark service username and password.
Note: The users you specify for no authentication or LDAP must have task execution permissions.
Kerberos: If your Hadoop cluster uses Kerberos, enable Spark Kerberos and upload the Keytab File and Principal.
Keytab File: Upload the keytab file. Get it from the Spark Server.
Principal: Enter the Kerberos username for the Spark keytab file.
SQL task queue settings
Different service types use different SQL task queues. Details:
Spark Thrift Server: Task queues are not supported.
Kyuubi: Uses the priority queue configured in HDFS settings. Applies only when Kyuubi uses YARN for resource scheduling. Production tasks use shared connections.
Livy: Uses the priority queue configured in HDFS settings. Applies only when Livy uses YARN for resource scheduling. Ad hoc queries and production tasks use new connections.
MapReduce (MRS): Uses the priority queue configured in HDFS settings.
Impala task configuration
Parameter
Description
Impala tasks
If Impala is deployed on your Hadoop cluster, you can enable Impala tasks.
JDBC URL
Enter the Impala JDBC connection address. Example: jdbc:impala://host:port/database. The database in this URL must match the database in the Hive JDBC URL.
Authentication Type
Supported methods: No authentication, LDAP, and Kerberos.
No authentication: Enter the Impala username.
LDAP: Enter the Impala username and password.
Kerberos: Upload the Keytab File and configure the Principal.
Development task request pool
Enter the Impala request pool name for development tasks.
Scheduled task request pool
Enter the Impala request pool name for scheduled tasks.
Priority task queue
Choose Use scheduled task default queue or Custom.
Dataphin routes Impala SQL tasks to queues based on priority: highest, high, medium, low, and lowest.
When customizing, daily logical table tasks use the medium-priority queue by default. Yearly and monthly logical table tasks use the low-priority queue by default.
Click Test Connection to verify the connection to the compute source.
After the connection test is successful, click Submit.
What to do next
After you create the Hadoop compute source, you must bind it to a project. For more information, see Create a general project.