By creating an Impala data source, you can enable Dataphin to read business data from Impala or write data to Impala. This topic describes how to create an Impala data source.
Background information
Impala is a SQL query engine for processing large amounts of data stored in Hadoop clusters. If you use Impala and want to export Dataphin data to Impala, you must first create an Impala data source. For more information about Impala, see the Impala official website.
Permission management
Only users who are assigned a custom global role with the Create Data Source permission, or who hold the Super Administrator, Data Source Administrator, Business Unit Architect, or Project Administrator role, can create data sources.
Limits
Dataphin uses Java Database Connectivity (JDBC) for Impala data source integration, which results in lower performance compared to Hive. If the table that you are integrating is not a Kudu table, use a Hive data source and its input and output components instead.
Metadata retrieval using DLF is supported only when connecting to an Impala data source of E-MapReduce 5.x.
Procedure
On the Dataphin homepage, click Management Center > Data Source Management in the top navigation bar.
On the Datasource page, click +Create Data Source.
On the Create Data Source page, under Big Data, select Impala.
If you have recently used Impala, you can select it from the Recently Used section or enter `Impala` in the search box to find it.
On the Create Impala Data Source page, configure the connection parameters for the data source.
Configure the basic information of the data source.
Parameter
Description
Datasource Name
The name must meet the following requirements:
It can contain only Chinese characters, letters, digits, underscores (_), and hyphens (-).
It cannot exceed 64 characters in length.
Datasource Code
After you configure the data source code, you can reference tables in Flink_SQL nodes in the `datasource_code.table_name` or `datasource_code.schema.table_name` format. To automatically access the data source of the current environment, use the `${datasource_code}.table` or `${datasource_code}.schema.table` variable format. For more information, see Develop a Dataphin data source table.
Important: The data source code cannot be modified after it is configured.
You can preview data on the object details page in the asset directory and asset checklist only after the data source code is configured.
In Flink SQL, only MySQL, Hologres, MaxCompute, Oracle, StarRocks, Hive, SelectDB, and GaussDB data warehouse service (DWS) data sources are currently supported.
Version
Select the version of the Impala data source. Supported versions include the following:
CDH5: 2.11.0
CDH6: 3.2.0
CDP7.1.3: 3.4.0
E-MapReduce 3.x: 3.4.0
E-MapReduce 5.x: 3.4.0
E-MapReduce 5.x: 4.2.0
Data Source Description
A brief description of the data source. It cannot exceed 128 characters.
Data Source Configuration
Select the data source to configure:
If your business data source distinguishes between production and development data sources, select Production + Development Data Source.
If your business data source does not distinguish between production and development data sources, select Production Data Source.
Tag
You can categorize and tag data sources using tags. For information about how to create tags, see Manage data source tags.
Configure the connection parameters between the data source and Dataphin.
If you selected Production + Development Data Source for Data Source Configuration, you must configure the connection information for both the production and the development data sources. If you selected Production Data Source, you only need to configure the connection information for the production data source.
Note: Typically, production and development data sources should be configured as separate data sources to isolate the development environment from the production environment and reduce the impact of the development data source on the production data source. However, Dataphin also supports configuring them as the same data source with identical parameter values.
Parameter
Description
JDBC URL
The format of the connection address is `jdbc:impala://host:port/dbname`. For example, `jdbc:impala://192.168.*.1:5433/dataphin`. A connection sketch that uses this format appears after this table.
Kerberos
Kerberos is an identity authentication protocol based on symmetric key technology:
If the Hadoop cluster has Kerberos authentication, you need to enable Kerberos.
If the Hadoop cluster does not have Kerberos authentication, you do not need to enable Kerberos.
Krb5 File/KDC Server, Keytab File, Principal
After enabling Kerberos, you need to configure the following parameters:
Krb5 File/KDC Server: Upload a Krb5 file containing the Kerberos authentication domain name or configure the KDC server address to assist with Kerberos authentication.
Note: You can configure multiple KDC server addresses. Separate them with commas (,).
Keytab File: Upload the file that contains the account credentials for logging in to the domain name in the Krb5 file or to the KDC server address.
Principal: Configure the Kerberos authentication username corresponding to the Keytab File.
Username, Password
If you have not enabled Kerberos, you need to configure the username and password for accessing the Impala instance.
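As a concrete illustration of these connection parameters, here is a minimal Java sketch that opens a JDBC connection with username/password authentication. It is a sketch under stated assumptions, not the mechanism Dataphin itself uses: it assumes the Cloudera Impala JDBC driver is on the classpath, and the host, port, database, and credentials are placeholders. The `AuthMech` URL property is specific to that driver (3 = username/password, 1 = Kerberos).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, database, and credentials; replace with your own.
        // AuthMech=3 selects username/password authentication in the Cloudera driver;
        // Kerberos-enabled clusters would instead use AuthMech=1 together with a
        // Kerberos login (krb5.conf or KDC address, keytab file, and principal).
        String url = "jdbc:impala://192.168.0.1:21050/dataphin;AuthMech=3";
        try (Connection conn = DriverManager.getConnection(url, "impala_user", "impala_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {  // lightweight probe query
            if (rs.next()) {
                System.out.println("Connected, probe returned: " + rs.getInt(1));
            }
        }
    }
}
```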
Configure the data source metadata database parameters.
Metadata Retrieval Method: Three methods are supported: Metadata Database, HMS, and DLF. Different retrieval methods require different configuration information.
Metadata Database Retrieval Method
Parameter
Description
Database Type
Select the database type according to the metadata database type used in your cluster. Dataphin supports MySQL and PostgreSQL. For MySQL, versions 5.1.43, 5.6/5.7, and 8 are supported.
JDBC URL
Enter the JDBC connection address of the target database. The format is `jdbc:mysql://host:port/dbname`.
Username, Password
Enter the username and password for logging in to the metadata database.
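To make the Metadata Database method concrete, the following sketch connects to a hypothetical MySQL-backed Hive metastore and lists its databases. The `DBS` table and its `NAME` column are part of the standard Hive metastore schema; the URL, database name, and credentials are placeholders, and MySQL Connector/J is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MetastoreCheckSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the metadata database.
        String url = "jdbc:mysql://192.168.0.2:3306/hive_metastore";
        try (Connection conn = DriverManager.getConnection(url, "meta_user", "meta_password");
             Statement stmt = conn.createStatement();
             // DBS lists every database registered in the Hive metastore.
             ResultSet rs = stmt.executeQuery("SELECT NAME FROM DBS")) {
            while (rs.next()) {
                System.out.println("Database: " + rs.getString("NAME"));
            }
        }
    }
}
```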
HMS Retrieval Method
Parameter
Description
hive-site.xml
Upload the hive-site.xml configuration file for Hive.
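For reference, the uploaded hive-site.xml must at least identify the metastore service. A minimal sketch, assuming a single Hive Metastore at a placeholder address on the default Thrift port, might look like this (your cluster's actual file typically contains many more properties):

```xml
<configuration>
  <!-- Thrift address of the Hive Metastore service; replace with your cluster's value. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.0.3:9083</value>
  </property>
</configuration>
```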
DLF Retrieval Method
Note: Metadata retrieval using DLF is supported only when connecting to an Impala data source of E-MapReduce 5.x.
Parameter
Description
Endpoint (Optional)
Enter the DLF endpoint of the region where the cluster is located. If not specified, the configuration in hive-site.xml is used. For information about how to obtain the endpoint, see DLF Region and Endpoint Reference Table.
AccessKey ID, AccessKey Secret
Enter the AccessKey ID and AccessKey Secret of the account where the cluster is located.
You can obtain the account's AccessKey ID and AccessKey Secret on the User Information Management page.

hive-site.xml
Upload the hive-site.xml configuration file for Hive.
Configure advanced settings for the connection between the data source and Dataphin.
Parameter
Description
Connection Retry Count
If the database connection times out, the system automatically retries the connection until the specified number of retries is reached. If the connection is still unsuccessful after the maximum number of retries, the connection fails.
Note: The default retry count is 1. You can configure a value from 0 to 10.
The connection retry count will be applied by default to offline integration tasks and global quality (requires the asset quality function module to be enabled). You can configure task-level retry counts separately in offline integration tasks.
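Conceptually, the retry behavior resembles the following minimal Java sketch. It is illustrative only (Dataphin's internal retry logic is not exposed); the method name and parameters are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionRetrySketch {
    // Makes one initial attempt plus up to maxRetries retries, mirroring the
    // "Connection Retry Count" setting (default 1, configurable from 0 to 10).
    static Connection connectWithRetry(String url, String user, String password,
                                       int maxRetries) throws SQLException {
        SQLException lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return DriverManager.getConnection(url, user, password);
            } catch (SQLException e) {
                lastFailure = e; // remember the failure and try again
            }
        }
        throw lastFailure; // all attempts failed
    }
}
```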
Select a Default Resource Group, which will be used to run tasks related to the current data source, including database SQL, offline database migration, data preview, and more.
Click Test Connection or OK to save the configuration and create the Impala data source.
When you click Test Connection, the system tests whether the data source can connect to Dataphin. If you click OK directly, the system automatically tests the connections for all configured data sources. However, even if all connection tests fail, the data source can still be created.
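The connection test is conceptually just a connectivity probe. A hypothetical Java equivalent, using JDBC's standard `Connection.isValid()` check with the same placeholder URL and credentials as the earlier sketch:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class TestConnectionSketch {
    public static void main(String[] args) {
        // Placeholder URL and credentials; mirrors what "Test Connection" verifies.
        String url = "jdbc:impala://192.168.0.1:21050/dataphin;AuthMech=3";
        try (Connection conn = DriverManager.getConnection(url, "impala_user", "impala_password")) {
            // isValid() sends a lightweight probe with a 5-second timeout.
            System.out.println(conn.isValid(5) ? "Connection OK" : "Connection invalid");
        } catch (Exception e) {
            // A failed test does not block creating the data source in Dataphin,
            // but tasks that use the data source will fail until connectivity is fixed.
            System.out.println("Connection failed: " + e.getMessage());
        }
    }
}
```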