Synchronize full data at a time and periodically synchronize incremental data from a MySQL database to Hive - DataWorks

This topic describes how to synchronize data from a MySQL database to Hive in offline mode.

Limits

In this example, only exclusive resource groups for Data Integration are supported.

Prerequisites

The data sources that you want to use as the source and destination are configured.
In this example, you need to add a MySQL data source and a Hive data source on the Data Source page in the DataWorks console. For information about the synchronization capabilities of the data sources you added, see MySQL data source and Hive data source.
An exclusive resource group for Data Integration is purchased and associated with a DataWorks workspace, and network connections between the resource group and data sources are established.
In this example, only exclusive resource groups for Data Integration are supported. You must purchase and configure an exclusive resource group for Data Integration and establish network connections between the resource group and the MySQL and Hive data sources. For information about how to purchase and configure an exclusive resource group for Data Integration, see Create and use an exclusive resource group for Data Integration.

Create a synchronization task

Log on to the DataWorks console and go to the Data Integration page. In the left-side navigation pane, click Data Synchronization Node. On the Nodes page, click Create Node to create and configure a synchronization task.
Configure basic information for the synchronization task.
1. Task name: In this section, specify a name for the synchronization task based on your business requirements.
2. Synchronization Type: In this section, select MySQL as the source type and Hive as the destination type. After you select the source and destination types, the Synchronization Method and Synchronization Mode parameters are displayed. You can configure the two parameters based on your business requirements.
  - If you want to perform one-time or periodic synchronization of full data, select full database offline synchronization for the Synchronization Mode parameter.
  - If you want to perform one-time or periodic synchronization of incremental data, select inc offline synchronization for the Synchronization Mode parameter.
  - If you want to perform one-time synchronization of full data and periodic synchronization of incremental data, select full database offline synchronization and inc offline synchronization for the Synchronization Mode parameter.
3. Network and Resource Configuration: In this section, select the MySQL data source you added, the exclusive resource group for Data Integration you purchased, and the Hive data source you added from the drop-down lists. Then, click Test Connectivity for All Resource Groups and Data Sources to confirm that network connections are established between the resource group and data sources.
Configure the information that is specific to your synchronization task.
1. Select the MySQL table from which you want to read data by following the on-screen instructions. You can use a regular expression to search for the desired table by name.
2. Configure properties for the destination Hive database. The properties of the destination Hive database affect the schemas of destination Hive tables that will be automatically created, but do not affect the schemas of existing tables.
  - Storage Type Of Newly Created Table: The storage method of destination Hive tables that are automatically created. Valid values: Internal Table and External Table.
  - Format Of Newly Created Table: The format of destination Hive tables that are automatically created. Valid values: parquet, orc, and txt.
  - Write Mode: The write mode. Valid values: Load New Data with Existing Data Deleted and Load New Data with Existing Data Retained.
  - Partition Initialization Settings: The initialization result for partitions in the destination Hive tables that are automatically created. By default, only level-1 partitions are available. You can click the Configuration icon to modify the initialization result.
3. Configure the parameters in the Configurations for Full and Incremental Synchronization section:
  1. Configure the synchronization mode for the synchronization task.
    - By default, if you select full database offline synchronization and inc offline synchronization for the Synchronization Mode parameter in the Synchronization Type section of the Create Data Synchronization Solution page, One-time Synchronization is selected for the Full Sync parameter and Periodical Synchronization is selected for the Method of Incremental Synchronization parameter in this section. You cannot change the settings.
    - If you select full database offline synchronization for the Synchronization Mode parameter in the Synchronization Type section of the Create Data Synchronization Solution page, you can select One-time Full Synchronization or Periodic Full Synchronization for Task Type in this section.
    - If you select inc offline synchronization for the Synchronization Mode parameter in the Synchronization Type section of the Create Data Synchronization Solution page, you can select One-time Incremental Synchronization or Periodic Incremental Synchronization for Task Type in this section.
      Note
      In this example, the synchronization mode of one-time synchronization of full data and periodic synchronization of incremental data is used.
  2. Configure the parameters for periodic scheduling for the synchronization task.
    If you want the synchronization task to be periodically scheduled, you can click Periodic scheduling parameter configuration to configure the parameters for periodic scheduling.
Configure the parameters in the Mapping Rules for Destination Tables section.
1. Select all tables in the Mapping Rules for Destination Tables section and click Batch Refresh Mapping Results to refresh the table mappings.
2. Configure the parameters such as Condition for Incremental Synchronization and Scheduling Configuration based on on-screen instructions.
  - Configurations of table-level periodic scheduling parameters and conditions for incremental synchronization: You can modify the settings of multiple tables at the same time and the settings of a single table based on your business requirements.
    - Condition for Incremental Synchronization: specifies a WHERE clause based on which you want to filter data in the source. You do not need to include the WHERE keyword in the clause. If the settings for periodic scheduling are enabled, you can use the system parameter variables. Example:
```
STR_TO_DATE('${bizdate}', '%Y%m%d') <= columnName AND columnName < DATE_ADD(STR_TO_DATE('${bizdate}', '%Y%m%d'), interval 1 day)
```
      ${bizdate} is the scheduling parameter you configured.
  - Assign values to partition fields: You can select Value assignment from the Batch Modify drop-down list to assign values to partition fields.
  - Configure the Customize Mapping Rules for Destination Table Names and Custom Advanced Parameters parameters.
    - Customize Mapping Rules for Destination Table Names: You can use built-in variables and manually entered strings to concatenate the final destination table name.
    - Custom Advanced Parameters: You can configure advanced parameters for each table, such as the maximum number of parallel threads that can be used to read data from each table.
Click Complete.

Perform O&M operations on the synchronization task

Start the synchronization task

After you complete the configurations, the Nodes page appears. You can find the synchronization task and click Start in the Actions column to start the synchronization task.

View the details of the synchronization task

You can click the name of the synchronization task to view the task details. The task details include the following parts:

Basic Information: includes the data source information of the synchronization task, the resource group used to run the synchronization task, and the synchronization task type.
Running Status: the details of the synchronization steps after the synchronization task is subdivided.
- If you select full database offline synchronization for the Synchronization Mode parameter in the Synchronization Type section of the Create Data Synchronization Solution page, the Schema Migration and Full Data Synchronization sections are displayed.
- If you select inc offline synchronization for the Synchronization Mode parameter in the Synchronization Type section of the Create Data Synchronization Solution page, the Schema Migration and Incremental Synchronization sections are displayed.
- If you select full database offline synchronization and inc offline synchronization for the Synchronization Mode parameter in the Synchronization Type section of the Create Data Synchronization Solution page, the Schema Migration, Full Data Synchronization, and Incremental Synchronization sections are displayed.
Details: the steps that are carried out based on the parameter configurations in the Schema Migration, Full Data Synchronization, and Incremental Synchronization sections.

Rerun the task

Rerun the task directly: Find the synchronization task on the Nodes page, click More in the Actions column, and select Rerun from the drop-down list to rerun the task without modifying the task configurations.
Effect: The one-time subtask is rerun and the properties of the periodic subtask are updated.
Modify the task and rerun the task: Find the synchronization task on the Nodes page, add tables to or remove tables from the task, and then click Complete. In this case, Rerun is changed to Apply updates in the Actions column on the Nodes page. Click Apply updates to rerun the modified task.
Effect: Data in the tables that you add will be synchronized and data in the tables whose data has been synchronized will not be synchronized again.
Modify the task and rerun the task: Modify the names of the destination Hive tables or use other Hive tables for data synchronization and click Complete. In this case, Rerun is changed to Apply updates in the Actions column on the Nodes page. Click Apply updates to rerun the modified task.
Effect: Data will be synchronized to the destination Hive tables that you modified. Data will not be synchronized to destination Hive tables that you do not modify.

Data development scenarios

You can refer to Scheduling dependency configuration guide to configure scheduling dependencies for your synchronization task to meet data development requirements. You can view the information about periodically scheduled subtasks in the Scheduling Configuration section.