
Dataphin: Create an Impala data source

Last Updated: Feb 12, 2026

By creating an Impala data source, you can enable Dataphin to read business data from Impala or write data to Impala. This topic describes how to create an Impala data source.

Background information

Impala is a SQL query engine for processing large amounts of data stored in Hadoop clusters. If you use Impala and want to export Dataphin data to Impala, you must first create an Impala data source. For more information about Impala, see the Impala official website.

Permission management

Data sources can be created only by users who have a custom global role with the Create Data Source permission, or the Super Administrator, Data Source Administrator, Business Unit Architect, or Project Administrator role.

Limits

Dataphin uses Java Database Connectivity (JDBC) for Impala data source integration, which results in lower performance compared to Hive. If the table that you are integrating is not a Kudu table, use a Hive data source and its input and output components instead.

Metadata retrieval using DLF is supported only when connecting to an Impala data source of E-MapReduce 5.x.

Procedure

  1. On the Dataphin homepage, choose Management Center > Data Source Management in the top navigation bar.

  2. On the Datasource page, click +Create Data Source.

  3. On the Create Data Source page, under Big Data, select Impala.

    If you have recently used Impala, you can select it from the Recently Used section or enter `Impala` in the search box to find it.

  4. On the Create Impala Data Source page, configure the connection parameters for the data source.

    1. Configure the basic information of the data source.


      Datasource Name

      The name must meet the following requirements (an illustrative validation sketch appears after this parameter list):

      • It can contain only Chinese characters, letters, digits, underscores (_), and hyphens (-).

      • It cannot exceed 64 characters in length.

      Datasource Code

      After you configure the data source code, you can reference tables in Flink SQL nodes in the datasource_code.table_name or datasource_code.schema.table_name format. To automatically access the data source of the current environment, use the ${datasource_code}.table or ${datasource_code}.schema.table variable format. For more information, see Develop a Dataphin data source table.

      Important
      • The data source code cannot be modified after it is configured.

      • You can preview data on the object details page in the asset directory and asset checklist only after the data source code is configured.

      • In Flink SQL, only MySQL, Hologres, MaxCompute, Oracle, StarRocks, Hive, SelectDB, and GaussDB data warehouse service (DWS) data sources are currently supported.

      Version

      Select the version of the Impala data source. Supported versions include the following:

      • CDH5: 2.11.0

      • CDH6: 3.2.0

      • CDP7.1.3: 3.4.0

      • E-MapReduce 3.x: 3.4.0

      • E-MapReduce 5.x: 3.4.0

      • E-MapReduce 5.x: 4.2.0

      Data Source Description

      A brief description of the data source. It cannot exceed 128 characters.

      Data Source Configuration

      Select which data sources to configure:

      • If your business data source distinguishes between production and development data sources, select Production + Development Data Source.

      • If your business data source does not distinguish between production and development data sources, select Production Data Source.

      Tag

      You can categorize and tag data sources using tags. For information about how to create tags, see Manage data source tags.
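      The naming rules above can be expressed approximately as a regular expression. The following is an illustrative sketch only: the exact character ranges that Dataphin accepts are not documented here, and the class and method names are hypothetical.

      ```java
      import java.util.regex.Pattern;

      public class DatasourceNameRule {
          // Approximation of the rule above: Chinese characters (CJK Unified
          // Ideographs), letters, digits, underscores, and hyphens, with a
          // length of 1 to 64 characters.
          private static final Pattern NAME =
                  Pattern.compile("[\\u4e00-\\u9fa5A-Za-z0-9_-]{1,64}");

          static boolean isValidName(String name) {
              return name != null && NAME.matcher(name).matches();
          }
      }
      ```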

    2. Configure the connection parameters between the data source and Dataphin.

      If you selected Production + Development Data Source for Data Source Configuration, you must configure connection information for both the production and development data sources. If you selected Production Data Source, you only need to configure connection information for the production data source.

      Note

      Typically, production and development data sources should be configured as separate data sources to isolate the development environment from the production environment and reduce the impact of the development data source on the production data source. However, Dataphin also supports configuring them as the same data source with identical parameter values.


      JDBC URL

      The connection address must be in the jdbc:impala://host:port/dbname format. For example, jdbc:impala://192.168.*.1:5433/dataphin. A minimal connection sketch appears after this parameter list.

      Kerberos

      Kerberos is an identity authentication protocol based on symmetric-key cryptography:

      • If the Hadoop cluster uses Kerberos authentication, you must enable Kerberos.

      • If the Hadoop cluster does not use Kerberos authentication, you do not need to enable Kerberos.

      Krb5 File/KDC Server, Keytab File, Principal

      After enabling Kerberos, you need to configure the following parameters:

      • Krb5 File/KDC Server: Upload a Krb5 file that contains the Kerberos realm configuration, or specify the KDC server address, to support Kerberos authentication.

        Note

        You can configure multiple KDC server addresses. Separate them with commas (,).

      • Keytab File: Upload the Keytab file that contains the account credentials used to authenticate to the realm in the Krb5 file or to the KDC server.

      • Principal: Enter the Kerberos principal that corresponds to the Keytab file.

      Username, Password

      If you have not enabled Kerberos, you need to configure the username and password for accessing the Impala instance.
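      To make the connection parameters above concrete, the following is a minimal sketch of a JDBC connection to Impala. It assumes the Cloudera Impala JDBC driver is on the classpath; the host, port, database, and credentials are hypothetical, and the AuthMech, KrbRealm, KrbHostFQDN, and KrbServiceName properties are Cloudera driver conventions that may differ for other drivers.

      ```java
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class ImpalaConnectionSketch {
          public static void main(String[] args) throws Exception {
              // Hypothetical address and database; AuthMech=3 means
              // username/password authentication in the Cloudera driver.
              String url = "jdbc:impala://192.168.0.1:21050/dataphin;AuthMech=3";

              try (Connection conn = DriverManager.getConnection(url, "impala_user", "***");
                   Statement stmt = conn.createStatement();
                   ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                  while (rs.next()) {
                      System.out.println(rs.getString(1));
                  }
              }
              // For a Kerberos-secured cluster, the Cloudera driver instead
              // expects AuthMech=1 plus KrbRealm, KrbHostFQDN, and
              // KrbServiceName in the URL, together with a valid ticket or
              // keytab on the client side.
          }
      }
      ```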

    3. Configure the data source metadata database parameters.

      Metadata Retrieval Method: Three methods are supported: Metadata Database, HMS, and DLF. Different retrieval methods require different configuration information.

      • Metadata Database Retrieval Method


        Database Type

        Select the database type according to the metadata database type used in your cluster. Dataphin supports MySQL and PostgreSQL. For MySQL, versions 5.1.43, 5.6/5.7, and 8 are supported.

        JDBC URL

        Enter the JDBC connection address of the target database, in the jdbc:mysql://host:port/dbname format. A sketch of this kind of metastore lookup appears after the list of retrieval methods.

        Username, Password

        Enter the username and password for logging in to the metadata database.

      • HMS Retrieval Method


        hive-site.xml

        Upload the hive-site.xml configuration file for Hive.

      • DLF Retrieval Method

        Note

        Metadata retrieval using DLF is supported only when connecting to an Impala data source of E-MapReduce 5.x.


        Endpoint (Optional)

        Enter the endpoint of the region where the cluster is located in the DLF data center. If not specified, the configuration in hive-site.xml will be used. For information about how to obtain the endpoint, see DLF Region and Endpoint Reference Table.

        AccessKey ID, AccessKey Secret

        Enter the AccessKey ID and AccessKey Secret of the account where the cluster is located.

        You can obtain the account's AccessKey ID and AccessKey Secret on the User Information Management page.


        hive-site.xml

        Upload the hive-site.xml configuration file for Hive.
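      To illustrate what metadata database retrieval involves, the following is a minimal sketch that queries a MySQL-backed Hive metastore for database and table names. It assumes the MySQL JDBC driver is on the classpath; the address and credentials are hypothetical, while DBS and TBLS are standard tables in the Hive metastore schema.

      ```java
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;

      public class MetastoreLookupSketch {
          public static void main(String[] args) throws Exception {
              // Hypothetical metastore address and credentials.
              String url = "jdbc:mysql://192.168.0.2:3306/hive_meta";

              // DBS and TBLS are standard Hive metastore tables; this is the
              // kind of lookup that metadata database retrieval performs.
              String sql = "SELECT d.NAME, t.TBL_NAME FROM TBLS t "
                         + "JOIN DBS d ON t.DB_ID = d.DB_ID LIMIT 10";

              try (Connection conn = DriverManager.getConnection(url, "meta_user", "***");
                   PreparedStatement ps = conn.prepareStatement(sql);
                   ResultSet rs = ps.executeQuery()) {
                  while (rs.next()) {
                      System.out.println(rs.getString(1) + "." + rs.getString(2));
                  }
              }
          }
      }
      ```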

    4. Configure advanced settings for the connection between the data source and Dataphin.


      Connection Retry Count

      If the database connection times out, the system automatically retries until the specified retry count is reached. If the connection still cannot be established after the maximum number of retries, the connection fails. A sketch of these retry semantics appears after this parameter list.

      Note
      • The default retry count is 1, and you can configure a value between 0 and 10.

      • The connection retry count will be applied by default to offline integration tasks and global quality (requires the asset quality function module to be enabled). You can configure task-level retry counts separately in offline integration tasks.
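      The retry behavior described above can be sketched as follows. This is not Dataphin's implementation, only an illustration of the semantics: one initial attempt plus up to maxRetries re-attempts, failing if none succeeds.

      ```java
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.SQLException;

      public class RetrySketch {
          // maxRetries corresponds to the configurable retry count
          // (0 to 10, default 1) described above.
          static Connection connectWithRetry(String url, String user, String pwd,
                                             int maxRetries) throws SQLException {
              SQLException last = null;
              for (int attempt = 0; attempt <= maxRetries; attempt++) {
                  try {
                      return DriverManager.getConnection(url, user, pwd);
                  } catch (SQLException e) {
                      last = e; // connection failed or timed out; retry if allowed
                  }
              }
              throw last; // still unsuccessful after the maximum number of retries
          }
      }
      ```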

  5. Select a Default Resource Group, which will be used to run tasks related to the current data source, including database SQL, offline database migration, data preview, and more.

  6. Click Test Connection to verify the configuration, and then click OK to save the configuration and create the Impala data source.

    When you click Test Connection, the system tests whether the data source can connect to Dataphin. If you click OK directly, the system automatically tests the connectivity of all configured data sources. However, the data source can still be created even if the connectivity test fails. A minimal connectivity-check sketch follows.
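    The connectivity test is, in spirit, similar to the following sketch. Dataphin's actual check is internal; this only illustrates the idea of opening a connection with a login timeout and probing its liveness.

    ```java
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class ConnectivitySketch {
        // Returns true if a connection can be opened within the login timeout
        // and answers a liveness probe within 5 seconds.
        static boolean canConnect(String url, String user, String pwd) {
            DriverManager.setLoginTimeout(10); // seconds to wait for login
            try (Connection conn = DriverManager.getConnection(url, user, pwd)) {
                return conn.isValid(5);
            } catch (Exception e) {
                return false;
            }
        }
    }
    ```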