You can synchronize data from MySQL, Oracle, PostgreSQL, and Kafka to Hive in real time by creating a real-time integration task. This topic describes how to create a real-time integration task.
Prerequisites
The required data sources are configured. You need to configure the data sources you want to integrate before configuring the real-time integration task, so that you can select the corresponding source data and target data during the configuration process. For more information, see real-time integration supported data sources.
Step 1: create a real-time integration task
In the top menu bar on the Dataphin home page, select Development > Data Integration.
In the top menu bar, select Project (Dev-Prod mode requires selecting Environment).
In the left-side navigation pane, select Integration > Stream Pipeline.
Click the creation icon in the real-time integration list, select Real-time Integration Task, and open the Create Real-time Integration Task dialog box.
In the Create Real-time Integration Task dialog box, configure the following parameters.
Parameter
Description
Task Name
Enter the name of the real-time task.
The name must start with a letter, may contain only lowercase English letters, digits, and underscores (_), and must be 4 to 63 characters in length.
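As an illustration, the naming constraints above can be expressed as a single regular expression. The helper below is hypothetical and not part of Dataphin; it only mirrors the documented rules.

```python
import re

# Mirrors the documented task-name rules: starts with a letter, then only
# lowercase letters, digits, and underscores; 4-63 characters in total.
NAME_RE = re.compile(r"^[a-z][a-z0-9_]{3,62}$")

def is_valid_task_name(name: str) -> bool:
    """Return True if `name` satisfies the documented constraints."""
    return NAME_RE.fullmatch(name) is not None
```

Checking a candidate name locally before creating the task can save a round trip through the dialog's validation.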
Production/development Environment Queue Resource
You can select all resource groups configured for real-time tasks.
Note: This configuration item is supported only when the computing source used by the project is a Flink computing source with the Kubernetes deployment mode.
Description
Enter a brief description of the task, not exceeding 1000 characters.
Select Directory
Select the directory where the real-time task is stored.
If no directory is created, you can Create Folder as follows:
Click the creation icon above the real-time task list on the left side of the page to open the Create Folder dialog box.
In the Create Folder dialog box, enter the folder Name and select the Directory location as needed.
Click OK.
After the configuration is complete, click OK.
In the newly created real-time integration task, configure Source Data and Target Data.
When the target data source type is selected as Hive, the source data source type only supports MySQL, Oracle, PostgreSQL, and Kafka. The parameters required for different source data sources vary.
Source Data Configuration
Source data source is MySQL, Oracle, PostgreSQL
Partition
Parameter
Description
Datasource Config
Datasource Type
Select the datasource type as MySQL, Oracle, or PostgreSQL.
Datasource
Select a data source.
The system provides an entry to create a new data source. You can click Create to create a new data source on the Datasource page. For details, see Create MySQL Data Source, Create Oracle Data Source, Create PostgreSQL Data Source.
Important: You must enable logging on the data source side and ensure that the configured account has permission to read the logs. Otherwise, the system cannot synchronize data from the source in real time.
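For a MySQL source, the logging prerequisite above typically means binary logging is on and row-based. The sketch below is a hypothetical helper (not part of Dataphin) that interprets the output of MySQL's `SHOW VARIABLES` as a readiness check.

```python
# Hedged sketch: for MySQL change-data capture, the server usually needs
# binary logging enabled (log_bin = ON) and row-based logging
# (binlog_format = ROW). The dict stands in for `SHOW VARIABLES` results
# fetched with the configured account.
def binlog_ready(variables: dict) -> bool:
    """Check whether MySQL server variables indicate CDC-readable logs."""
    return (
        variables.get("log_bin", "OFF").upper() == "ON"
        and variables.get("binlog_format", "").upper() == "ROW"
    )
```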
Sync Rule Configuration
Sync Scheme
Default is Real-time Incremental.
Note: Collects incremental changes from the source database in the order they occur and writes them to the downstream destination database.
Selection Method
Supports three methods: entire database, selected tables, and excluded tables.
Entire Database: Synchronizes all data in the database. If the source data source type is MySQL, all tables in all databases under the selected data source are synchronized.
Selected Tables/Excluded Tables: Select some tables in the current database for real-time synchronization.
Batch Select/Batch Exclude: With Batch Select, the selected tables in the current database are synchronized in real time. With Batch Exclude, the selected tables are excluded from real-time synchronization.
When the source data source type is MySQL, you can select tables from all databases under the selected data source. The list displays each table in the DBname.Tablename format.
When the source data source type is PostgreSQL, Excluded Tables is not supported.
Regular Expression Matching: Enter a regular expression for the table name in the Regular Expression input box. Java regular expressions are supported, for example schemaA.*|schemaB.*.
Only when the source data source type is MySQL can you batch match all tables in all databases under the selected data source, using the database name (DBname) and table name (Tablename) in the regular expression.
When the source data source type is PostgreSQL, Microsoft SQL Server, or IBM DB2, Regular Expression Matching is not supported.
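To illustrate how a pattern such as schemaA.*|schemaB.* selects tables: the product evaluates Java regular expressions, but this simple alternation behaves the same under Python's `re`, so the sketch below (assuming the whole DBname.Tablename string is matched) conveys the idea.

```python
import re

# Illustration only: the product uses Java regular expressions; this
# alternation behaves identically in Python. We assume the full
# DBname.Tablename string is matched against the pattern.
pattern = re.compile(r"schemaA.*|schemaB.*")

tables = ["schemaA.orders", "schemaB.users", "schemaC.logs"]
matched = [t for t in tables if pattern.fullmatch(t)]
```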
Source data source is Kafka
Partition
Parameter
Description
Datasource Config
Datasource Type
Select the datasource type as Kafka.
Datasource
Select a data source.
The system provides an entry to create a new data source. You can click Create to create a new data source on the Datasource page. For details, see Create Kafka Data Source.
Important: You must enable logging on the data source side and ensure that the configured account has permission to read the logs. Otherwise, the system cannot synchronize data from the source in real time.
Source Topic
Select the Topic of the source data. You can enter the Topic name keyword for fuzzy search.
Data Format
Currently, only the Canal JSON format is supported. Canal JSON is a Canal-compatible format in which change data is stored as JSON.
Key Type
The Key type of Kafka determines the key.deserializer configuration when initializing KafkaConsumer. Currently, only STRING is supported.
Value Type
The Value type of Kafka determines the value.deserializer configuration when initializing KafkaConsumer. Currently, only STRING is supported.
Consumer Group ID (optional)
Enter the ID of the consumer group. The consumer group ID is used to report consumption offsets.
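For orientation, a Canal JSON change record carried in a Kafka message looks roughly like the sample below. The field set is illustrative and can vary by Canal version; treat it as an assumption, not a schema guarantee.

```python
import json

# A minimal Canal JSON change record (field names are typical of Canal's
# JSON output, but the exact set may vary by Canal version).
message = json.dumps({
    "database": "shop",
    "table": "orders",
    "type": "INSERT",              # e.g. INSERT / UPDATE / DELETE
    "ts": 1700000000000,           # processing timestamp in milliseconds
    "isDdl": False,
    "data": [{"id": "1", "amount": "9.99"}],
    "old": None,                   # pre-image of changed columns for UPDATEs
})

record = json.loads(message)
changed_rows = record["data"] if not record["isDdl"] else []
```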
Sync Rule Configuration
Table List
Enter the names of the tables to be synchronized. Separate multiple table names with line breaks; the total length cannot exceed 1,024 characters.
Table names support three formats: tablename, db.tablename, and schema.tablename.
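The table-list rules above can be sketched as a small validator. The helper below is hypothetical; it only encodes the documented constraints (one name per line, optional db or schema qualifier, 1,024-character limit).

```python
import re

# One name per line, each `tablename`, `db.tablename`, or
# `schema.tablename` (the latter two are syntactically identical),
# total input at most 1024 characters.
ENTRY_RE = re.compile(r"^\w+(\.\w+)?$")

def parse_table_list(text: str) -> list[str]:
    if len(text) > 1024:
        raise ValueError("table list exceeds 1024 characters")
    names = [line.strip() for line in text.splitlines() if line.strip()]
    for name in names:
        if not ENTRY_RE.fullmatch(name):
            raise ValueError(f"invalid table name: {name}")
    return names
```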
Target Data Configuration
Partition
Parameter
Description
Datasource Config
Datasource Type
Select the datasource type as Hive.
Datasource
Select the target data source.
The system provides an entry to create a new data source. You can click Create to create a new data source on the Datasource page. For more information, see Create Hive Data Source.
Data Lake Table Format
You can choose None or Hudi.
None: Write and create tables as ordinary Hive tables.
Hudi: Write and create tables in Hudi format.
Note: This item is supported only when Data Lake Table Format Configuration is enabled for the selected Hive data source.
New Table Configuration
Table Name Transformation
Click Configure Table Name Transformation to configure transformation rules in the Configure Table Name Transformation Rules dialog box.
Note: Click Create Rule to add a row; up to 5 rows can be configured.
Rules are matched and replaced from top to bottom.
English characters in replacement strings and in table name prefixes and suffixes are automatically converted to lowercase.
The target table name prefix and suffix cannot be empty and support English letters, digits, and underscores, up to 32 characters.
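A minimal sketch of how such rules might behave, assuming rules are applied sequentially from top to bottom and that replacements, prefix, and suffix are lowercased (the function and rule representation are illustrative, not Dataphin's implementation):

```python
# Hedged sketch of table name transformation: each (find, replace) rule
# is applied in order, and replacement text plus prefix/suffix are
# lowercased, per the documented behavior.
def transform_table_name(name, rules, prefix="", suffix=""):
    for find, replace in rules:          # top-to-bottom application
        name = name.replace(find, replace.lower())
    return f"{prefix.lower()}{name}{suffix.lower()}"
```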
Partition Settings
Partition Interval
The default selection is Hour, and you can select Day.
Hour: Displays four-level partitions of YYYY, MM, DD, and HH.
Day: Displays three-level partitions of YYYY, MM, and DD.
Format
Currently, only Multi-partition is supported.
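To make the partition levels concrete, the sketch below derives the multi-level partition values described above from an event timestamp (the helper name and return shape are illustrative assumptions):

```python
from datetime import datetime

# Hour interval -> four levels (YYYY, MM, DD, HH);
# Day interval  -> three levels (YYYY, MM, DD).
def partition_values(ts: datetime, interval: str = "Hour") -> dict:
    parts = {"YYYY": f"{ts.year:04d}", "MM": f"{ts.month:02d}", "DD": f"{ts.day:02d}"}
    if interval == "Hour":
        parts["HH"] = f"{ts.hour:02d}"
    return parts
```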
Target Table Configuration
Note: Until the sync rule configuration is complete, the target table list is empty and the Refresh Table And Mapping Relationship button is dimmed.
Area
Description
①View Additional Fields Button
During real-time incremental synchronization, additional fields are automatically added by default when tables are created, for use with the data. Click View Additional Fields to open the Additional Fields dialog box and view the currently added fields.
Important: If you select an existing table as the target table and the table lacks the additional fields, you should add them to the existing target table yourself. Otherwise, data usage is affected.
Click View DDL Of Added Fields to view the DDL statement of the added additional fields.
Note: When the source data source type is Kafka, viewing additional fields is not supported.
②Search And Filter Area
Supports searching by Source Table and Target Table Name. To quickly filter the target table, click the filter icon at the top; you can filter by Mapping Status and Table Creation Method.
③Add Global Field, Refresh Table And Mapping Relationship
Add Global Field
Click Add Global Field to add a global field in the Add Global Field dialog box.
Name: The global name.
Type: Supports five data types: String, Long, Double, Date, Boolean.
Value: The value of the global field.
Description: Description of the field.
Note: When both global and single-table fields are added, only the single-table field takes effect.
Currently, only constants can be added.
Global fields take effect only for target tables whose creation method is Automatic Table Creation.
When the source data source type is Kafka, adding global fields is not supported.
Refresh Table and Mapping Relationship
To refresh the target table configuration list, click Refresh Table And Mapping Relationship.
Important: When the target table configuration already has content, reselecting the data source type or the data source resets the target table list and mapping relationships. Proceed with caution.
You can click Refresh Table And Mapping Relationship again at any time during the refresh. Each refresh preserves only the globally added fields; other information, including the target table creation method, target table name, and deletion records, is not preserved.
When the source data source type is Kafka, clicking Refresh Table And Mapping Relationship maps tables according to the table list in the sync rule configuration. If a table does not exist, an error is reported.
④Target Database List
The target database list includes Serial Number, Source Table, Mapping Status, Target Table Creation Method, Target Table Name. You can also perform Add Field, View Field, Refresh, and Delete operations on the target table.
Target Table Creation Method is divided into the following three types:
If a table with the same name as the source table exists in the target database, the creation method defaults to using that existing table as the target table. To switch to automatic table creation, add table name transformation rules or prefixes and suffixes, and then remap.
If no table with the same name is found in the target database, the target table creation method defaults to automatic table creation. You can also change the method to use an existing table and select an existing table for synchronization.
Only tables with automatic table creation support adding fields or custom DDL table creation. Global fields also only take effect for tables with automatic table creation.
Note: With automatic table creation, if the data lake table format is None, an ordinary Hive table is created; otherwise, a table in the selected format is created (currently only Hudi is supported).
With custom table creation, if the data lake table format is None, ordinary Hive table DDL is used; otherwise, DDL for the selected format is required (currently only Hudi is supported).
When the source data source type is Kafka, the target table creation method supports only existing tables.
Mapping Status: Different mapping statuses display different operation items. They are:
Completed: Mapping completed normally.
Incomplete: The configuration was modified and the mapping has not been refreshed.
Mapping: Waiting for mapping or mapping in progress.
Abnormal: There is a data source or system internal error.
Failed: The target partitioned table is inconsistent with the partition set by the real-time task.
Alert: There may be incompatible data types between the source table and the target table.
Add Field: You can customize table creation through Add Field or DDL. After custom table creation is enabled, globally added fields no longer take effect.
Note: Added fields are displayed only in the operation column for automatic table creation.
Modifying an existing target table (that is, a target table whose creation method is using an existing table) is not supported.
View Field: You can view the fields and types of the source table and target table.
Refresh: Remap the source table and target table.
Delete: Deleting a source table from the list cannot be undone.
Note: The target table name supports only English letters, digits, and underscores. If the source table name contains other characters, configure table name transformation rules.
⑤Batch Operation
You can perform batch Delete operations on the target table.
DDL Processing Policy
Normal Processing: This DDL information will continue to be sent to the target data source for processing. Different target data sources will have different processing policies.
Ignore: Discard this DDL information and do not send it to the target data source.
Error: Directly terminate the real-time synchronization task with an error status.
Note: New columns added to existing partitions of Hive tables cannot be synchronized; the new columns' values in existing partitions are all NULL, and the columns take effect normally starting from the next new partition.
When the data source type is PostgreSQL, DDL processing policies are not supported.
When the data lake table format is Hudi, the only supported DDL processing policy is Ignore.
When the source data source type is Kafka, the only supported DDL processing policy is Ignore.
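The policy constraints above can be summarized in one small lookup. The function below is a hypothetical sketch of the documented rules, not Dataphin's implementation:

```python
# Hedged sketch of the documented DDL-policy constraints: PostgreSQL
# sources support no DDL policies, Kafka sources and Hudi-format targets
# allow only Ignore, and everything else allows all three policies.
def allowed_ddl_policies(source_type: str, lake_format: str) -> list[str]:
    if source_type == "PostgreSQL":
        return []                  # DDL policies not supported
    if source_type == "Kafka" or lake_format == "Hudi":
        return ["Ignore"]
    return ["Normal Processing", "Ignore", "Error"]
```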
After the configuration is complete, click Save.
Step 2: configure real-time integration task attributes
Click Resource Configuration in the top menu bar of the current real-time integration task tab, or click Attribute in the right sidebar to open the Attribute panel.
Configure the Basic Information and Resource Configuration of the current real-time integration task.
Basic Information: Select the Development Owner and Operation Owner of the current real-time integration task, and enter the corresponding Description of the current task, not exceeding 1000 characters.
Resource Configuration: For details, see Real-time Integration Resource Configuration.
Step 3: submit the real-time integration task
Click Submit to submit the current real-time integration task.
In the Submit dialog box, enter the Submission Remarks information and click OK And Submit.
After submission is complete, you can view the submission details in the Submit dialog box.
In Dev-Prod project mode, you must publish the real-time integration task to the production environment. For more information, see Manage Published Tasks.
What to do next
You can view and manage real-time integration tasks in the Operation Center to ensure their normal functioning. For more information, see View and Manage Real-time Tasks.