
Dataphin: Real-time integration of data to Hive

Last Updated: Mar 05, 2025

By creating a real-time integration task, you can synchronize MySQL, Oracle, and PostgreSQL data to Hive in real time. This topic describes how to create a real-time integration task.

Prerequisites

The required data sources are configured. You need to configure the data sources you want to integrate before configuring the real-time integration task, so that you can select the corresponding source data and target data during the configuration process. For more information, see real-time integration supported data sources.

Step 1: create a real-time integration task

  1. In the top menu bar on the Dataphin home page, select Development > Data Integration.

  2. In the top menu bar, select Project (Dev-Prod mode requires selecting Environment).

  3. In the left-side navigation pane, select Integration > Stream Pipeline.

  4. Click the icon in the real-time integration list and select Real-time Integration Task to open the Create Real-time Integration Task dialog box.

  5. In the Create Real-time Integration Task dialog box, configure the following parameters.

    Parameter descriptions:

    • Task Name: Enter the name of the real-time task. The name must start with a letter, contain only lowercase English letters, numbers, and underscores (_), and be 4 to 63 characters in length.

    • Production/Development Environment Queue Resource: You can select any resource group configured for real-time tasks.

      Note

      This configuration item is supported only when the computing source used by the project is a Flink computing source in Kubernetes deployment mode.

    • Description: Enter a brief description of the task, not exceeding 1,000 characters.

    • Select Directory: Select the directory in which to store the real-time task. If no directory has been created, you can Create Folder as follows:

      1. Click the icon above the real-time task list on the left side of the page to open the Create Folder dialog box.

      2. In the Create Folder dialog box, enter the folder Name and select the Directory location as needed.

      3. Click OK.
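The task-naming rule above can be expressed as a regular expression. The following sketch is an inference from the stated rule, not Dataphin's own validator:

```python
import re

# Inferred from the documented rule: starts with a (lowercase) letter,
# followed by lowercase letters, digits, or underscores; 4-63 chars total.
TASK_NAME = re.compile(r"[a-z][a-z0-9_]{3,62}")

def is_valid_task_name(name: str) -> bool:
    # fullmatch() requires the whole string to satisfy the pattern.
    return TASK_NAME.fullmatch(name) is not None

print(is_valid_task_name("rt_mysql_to_hive"))  # True
print(is_valid_task_name("1task"))             # False: must start with a letter
print(is_valid_task_name("abc"))               # False: shorter than 4 characters
```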

  6. After the configuration is complete, click OK.

  7. In the newly created real-time integration task, configure Source Data and Target Data.

    When Hive is selected as the target data source type, only MySQL, Oracle, PostgreSQL, and Kafka are supported as source data source types. The required parameters vary with the source data source.

    1. Source Data Configuration

      Source data source is MySQL, Oracle, or PostgreSQL

      Datasource Config

      • Datasource Type: Select MySQL, Oracle, or PostgreSQL as the datasource type.

      • Datasource: Select a data source. The system provides an entry to create a new data source: click Create to create one on the Datasource page. For details, see Create MySQL Data Source, Create Oracle Data Source, or Create PostgreSQL Data Source.

        Important

        You must enable logging on the data source side and ensure that the configured account has permission to read the logs. Otherwise, the system cannot synchronize the data source in real time.

      Sync Rule Configuration

      • Sync Scheme: The default is Real-time Incremental.

        Note

        Incremental changes are collected from the source database in the order in which they occur and written to the downstream destination database.

      • Selection Method: Three methods are supported: Entire Database, Selected Tables, and Excluded Tables.

        • Entire Database: Synchronizes all data in the database.

          Only when MySQL is selected as the source data source type are all tables in all databases under the selected data source synchronized.

        • Selected Tables/Excluded Tables: Select some tables in the current database for real-time synchronization.

          • Batch Select/Batch Exclude: With Batch Select, multiple tables in the current database are synchronized in real time. With Batch Exclude, multiple tables in the current database are excluded from real-time synchronization.

            When MySQL is selected as the source data source type, you can select tables from all databases under the selected data source. The list displays each table in the format DBname.Tablename.

            When PostgreSQL is selected as the source data source type, Excluded Tables is not supported.

          • Regular Expression Matching: Enter a regular expression for the table names in the Regular Expression input box. Java regular expressions apply, for example schemaA.*|schemaB.*.

            Only when MySQL is selected as the source data source type can you batch-match tables across all databases under the selected data source; both the database name (DBname) and table name (Tablename) can be used in the regular expression.

            When PostgreSQL, Microsoft SQL Server, or IBM DB2 is selected as the source data source type, Regular Expression Matching is not supported.
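Java's String.matches() requires the regular expression to match the entire DBname.Tablename string. For a simple pattern such as the example above, Python's re module behaves the same way, so the matching can be sketched as follows (the table names are invented for illustration):

```python
import re

# The example pattern from above. Note that "." is a regex metacharacter:
# "schemaA.*" matches "schemaA" followed by any characters, not only a
# literal dot; escape it as "schemaA\..*" to require the dot.
pattern = re.compile(r"schemaA.*|schemaB.*")

tables = ["schemaA.users", "schemaB.orders", "schemaC.users"]
# fullmatch() mirrors Java's whole-string matching semantics.
matched = [t for t in tables if pattern.fullmatch(t)]
print(matched)  # ['schemaA.users', 'schemaB.orders']
```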

      Source data source is Kafka

      Datasource Config

      • Datasource Type: Select Kafka as the datasource type.

      • Datasource: Select a data source. The system provides an entry to create a new data source: click Create to create one on the Datasource page. For details, see Create Kafka Data Source.

        Important

        You must enable logging on the data source side and ensure that the configured account has permission to read the logs. Otherwise, the system cannot synchronize the data source in real time.

      • Source Topic: Select the topic of the source data. You can enter a keyword of the topic name for a fuzzy search.

      • Data Format: Currently, only the Canal JSON format is supported, that is, data stored in the Canal-compatible Canal JSON format.

      • Key Type: The Kafka key type determines the key.deserializer configuration used when the KafkaConsumer is initialized. Currently, only STRING is supported.

      • Value Type: The Kafka value type determines the value.deserializer configuration used when the KafkaConsumer is initialized. Currently, only STRING is supported.

      • Consumer Group ID (optional): Enter the ID of the consumer group. The consumer group ID is used to report the consumption offset.

      Sync Rule Configuration

      • Table List: Enter the names of the tables to be synchronized, one table name per line, not exceeding 1,024 characters in total. Table names support three formats: tablename, db.tablename, and schema.tablename.
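The three accepted table-name formats can be illustrated with a small parser. This is a sketch of the format, not Dataphin's own parser, and the example names are invented:

```python
def parse_table_list(raw: str):
    """Split a newline-separated table list into (qualifier, table) pairs.

    Accepts the three documented formats: tablename, db.tablename, and
    schema.tablename. A qualifier of None means an unqualified name.
    """
    if len(raw) > 1024:
        raise ValueError("table list must not exceed 1024 characters")
    entries = []
    for line in raw.splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        qualifier, _, table = name.rpartition(".")
        entries.append((qualifier or None, table))
    return entries

print(parse_table_list("orders\nsales_db.orders\npublic.orders"))
# [(None, 'orders'), ('sales_db', 'orders'), ('public', 'orders')]
```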

    2. Target Data Configuration

      Datasource Config

      • Datasource Type: Select Hive as the datasource type.

      • Datasource: Select the target data source. The system provides an entry to create a new data source: click Create to create one on the Datasource page. For more information, see Create Hive Data Source.

      • Data Lake Table Format: You can choose None or Hudi.

        • None: Data is written to, and new tables are created as, ordinary Hive tables.

        • Hudi: Data is written to, and new tables are created in, Hudi format.

        Note

        This item is supported only when the selected Hive data source has Data Lake Table Format Configuration enabled.

      New Table Configuration

      • Table Name Transformation: Click Configure Table Name Transformation to configure transformation rules in the Configure Table Name Transformation Rules dialog box.

        Note
        • Click Create Rule to add a row; up to 5 rows can be configured.

        • Rules are matched and replaced from top to bottom.

        • English characters in the replacement text and in table name prefixes and suffixes are automatically converted to lowercase.

        • The target table name prefix and suffix cannot be empty and support English letters, numbers, and underscores, within 32 characters.
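The rule behavior described in the note (rules applied from top to bottom, with replacement text and prefixes/suffixes lowercased) might be modeled as follows. The rules and table names here are hypothetical; this is an illustrative sketch, not Dataphin's implementation:

```python
def transform_table_name(source, rules, prefix="", suffix=""):
    """Apply find/replace rules top to bottom, then add a prefix/suffix.

    English characters in replacement text, prefixes, and suffixes are
    lowercased, mirroring the documented behavior.
    """
    name = source
    for find, replace in rules:  # rules are matched from top to bottom
        name = name.replace(find, replace.lower())
    return prefix.lower() + name + suffix.lower()

# Hypothetical rules: strip a legacy prefix, then normalize a separator.
rules = [("Legacy_", ""), ("-", "_")]
print(transform_table_name("Legacy_Order-Items", rules, prefix="ods_"))
# ods_Order_Items
```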

      Partition Settings

      • Partition Interval: The default is Hour; you can also select Day.

        • Hour: Uses four partition levels: YYYY, MM, DD, and HH.

        • Day: Uses three partition levels: YYYY, MM, and DD.

      • Format: Currently, only Multi-partition is supported.
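Given the Hour and Day intervals above, the partition values for a record timestamp can be derived as follows. This is a sketch of the partition layout; the actual partition field names used by Dataphin may differ:

```python
from datetime import datetime

def partition_values(ts: datetime, interval: str = "Hour"):
    """Return partition values for a timestamp.

    Hour yields four levels (YYYY, MM, DD, HH); Day yields three.
    """
    parts = {"YYYY": f"{ts.year:04d}", "MM": f"{ts.month:02d}",
             "DD": f"{ts.day:02d}"}
    if interval == "Hour":
        parts["HH"] = f"{ts.hour:02d}"
    return parts

print(partition_values(datetime(2025, 3, 5, 14, 30)))
# {'YYYY': '2025', 'MM': '03', 'DD': '05', 'HH': '14'}
print(partition_values(datetime(2025, 3, 5, 14, 30), interval="Day"))
# {'YYYY': '2025', 'MM': '03', 'DD': '05'}
```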

    3. Target Table Configuration

      Note

      When the sync rule configuration is not complete, the target table list is empty and the Refresh Table And Mapping Relationship button is grayed out.


      Area

      Description

      View Additional Fields Button

      During real-time incremental synchronization, additional fields are automatically added by default to newly created tables to support data usage. Click View Additional Fields to view the information of the currently added fields in the Additional Fields dialog box.

      Important

      If you select an existing table as the target table and the table does not contain the additional fields, we recommend that you add the additional fields to the existing target table yourself; otherwise, data usage will be affected.

      Click View DDL Of Added Fields to view the DDL statement of the added additional fields.

      Note

      When the source data source type is selected as Kafka, viewing additional fields is not supported.

      Search And Filter Area

      Supports searching by Source Table and Target Table Name. To quickly filter the target table list, click the icon at the top and filter by Mapping Status and Table Creation Method.

      Add Global Field, Refresh Table And Mapping Relationship

      • Add Global Field

        Click Add Global Field to add a global field in the Add Global Field dialog box.

        • Name: The name of the global field.

        • Type: Supports five data types: String, Long, Double, Date, Boolean.

        • Value: The value of the global field.

        • Description: Description of the field.

        Note
        • When both global and single table fields are added, only the single table field takes effect.

        • Currently, only constants can be added.

        • Global fields only take effect for target tables with Automatic Table Creation method.

        • When the source data source type is selected as Kafka, adding global fields is not supported.

      • Refresh Table and Mapping Relationship

        To refresh the target table configuration list, click Refresh Table And Mapping Relationship.

        Important
        • When there is content in the target table configuration, reselecting the data source type and data source will reset the target table list and mapping relationship. Please operate with caution.

        • You can click to refresh again at any time while a refresh is in progress. Each time you click Refresh Table And Mapping Relationship, only globally added fields are retained; other information, including the target table creation method, target table names, and deletion records, is not retained.

        • When the source data source type is selected as Kafka, after clicking Refresh Table And Mapping Relationship, the system will map according to the table list in Sync Rule - Table Configuration. If the table does not exist, an error will be reported.

      Target Database List

      The target database list includes Serial Number, Source Table, Mapping Status, Target Table Creation Method, Target Table Name. You can also perform Add Field, View Field, Refresh, and Delete operations on the target table.

      • Target Table Creation Method is divided into the following three types:

        • If a table with the same name as the source table exists in the target database, the target table creation method defaults to using the existing table, and that table is used as the target table. To switch to automatic table creation, add table name transformation rules or prefixes and suffixes, and then remap.

        • If no table with the same name is found in the target database, the target table creation method defaults to automatic table creation. You can also change the method to use an existing table and select an existing table for synchronization.

        • Only tables with automatic table creation support adding fields or custom DDL table creation. Global fields also only take effect for tables with automatic table creation.

        Note
        • When automatic table creation is used, if the data lake table format is None, an Ordinary Hive Table will be created. Otherwise, a table corresponding to the selected table format will be created. Currently, only Hudi is supported.

        • When custom table creation is used, if the data lake table format is None, the Ordinary Hive Table DDL is used. Otherwise, the DDL corresponding to the selected table format is required. Currently, only Hudi is supported.

        • When the source data source type is selected as Kafka, the target table creation method only supports using existing tables.

      • Mapping Status: Different mapping statuses display different operation items. They are:

        • Completed: Mapping completed normally.

        • Incomplete: The configuration was modified, but the mapping has not been refreshed.

        • Mapping: Waiting for mapping or mapping in progress.

        • Abnormal: There is a data source or system internal error.

        • Failed: The target partitioned table is inconsistent with the partition set by the real-time task.

        • Alert: There may be incompatible data types between the source table and the target table.

      • Add Field: You can customize table creation through Add Field or DDL. After enabling custom table creation, globally added fields will no longer take effect.

        Note
        • After adding fields, they are only displayed in the operation column of automatic table creation.

        • Modifying an existing target table (that is, a target table whose creation method is to use an existing table) is not supported.

      • View Field: You can view the fields and types of the source table and target table.

      • Refresh: Remap the source table and target table.

      • Delete: Deleting a source table from the list cannot be undone.

      Note

      The target table name only supports English letters, numbers, and underscores. If the source table name contains other characters, please configure the table name transformation rules.

      Batch Operation

      You can perform batch Delete operations on the target table.

    4. DDL Processing Policy

      • Normal Processing: This DDL information will continue to be sent to the target data source for processing. Different target data sources will have different processing policies.

      • Ignore: Discard this DDL information and do not send it to the target data source.

      • Error: Directly terminate the real-time synchronization task with an error status.

      Note
      • New columns cannot be synchronized into existing partitions of Hive tables: the values of the new columns are NULL in existing partitions, and the columns take effect normally starting from the next new partition.

      • When the PostgreSQL data source type is selected, DDL processing policies are not supported.

      • When Hudi is selected as the data lake table format, only the Ignore policy is supported.

      • When Kafka is selected as the source data source type, only the Ignore policy is supported.
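The three policies can be summarized as a small dispatcher. This is an illustrative model of the documented behavior, not Dataphin's implementation:

```python
class DdlError(RuntimeError):
    """Raised when the policy is Error: the sync task stops with an error."""

def handle_ddl(policy: str, ddl: str):
    """Model the three DDL processing policies."""
    if policy == "Normal Processing":
        return ddl   # forwarded to the target data source for its own handling
    if policy == "Ignore":
        return None  # discarded; never sent to the target data source
    if policy == "Error":
        raise DdlError(f"DDL received while policy is Error: {ddl}")
    raise ValueError(f"unknown policy: {policy}")

print(handle_ddl("Ignore", "ALTER TABLE t ADD COLUMNS (c STRING)"))  # None
```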

  8. After the configuration is complete, click Save.

Step 2: configure real-time integration task attributes

  1. Click Resource Configuration in the top menu bar of the current real-time integration task tab, or click Attribute in the right sidebar to open the Attribute panel.

  2. Configure the Basic Information and Resource Configuration of the current real-time integration task.

    • Basic Information: Select the Development Owner and Operation Owner of the current real-time integration task, and enter the corresponding Description of the current task, not exceeding 1000 characters.

    • Resource Configuration: For details, see Real-time Integration Resource Configuration.

Step 3: submit the real-time integration task

  1. Click Submit to submit the current real-time integration task.

  2. In the Submit dialog box, enter the Submission Remarks information and click OK And Submit.

  3. After submission is complete, you can view the submission details in the Submit dialog box.

    In Dev-Prod project mode, you must publish the real-time integration task to the production environment. For more information, see Manage Published Tasks.

What to do next

You can view and manage real-time integration tasks in the Operation Center to ensure their normal functioning. For more information, see View and Manage Real-time Tasks.