Synchronize data from MySQL to MaxCompute in real time - DataWorks

This topic describes how to use Data Integration to synchronize full data and incremental data from a MySQL data source to a MaxCompute data source.

Prerequisites

A MaxCompute data source and a MySQL data source are added to DataWorks. You must add a MaxCompute data source and a MySQL data source to DataWorks. This way, you can select the data sources to synchronize data between the data sources when you configure a data synchronization solution. In this example, the MySQL data source doc_mysql1 is used. For more information about how to add a MySQL data source and a MaxCompute data source to DataWorks, see Add a MySQL data source and Add a MaxCompute data source.
Note
For information about the items that you must understand before you add a data source, see Overview.
An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.
The data source environments are prepared.
- MaxCompute: The MaxCompute V2.0 data type edition supports the DECIMAL data type. Before you run a data synchronization solution that uses the MaxCompute data source, you must enable the MaxCompute V2.0 data type edition for the data source. For more information, see MaxCompute V2.0 data type edition.
- MySQL: You must enable the binary logging feature for the MySQL data source. For more information, see Prepare a MySQL environment.

Sample scenario

Data: Synchronize full data from the MySQL data source to MaxCompute base tables at a time and incremental data from the MySQL data source to a MaxCompute log table in real time, and write the incremental data to the MaxCompute base tables on a regular basis to merge the full data and incremental data.
Table: Synchronize data from the tables whose names are prefixed with doc_ in the MySQL data source to the tables whose names are prefixed with ods_ in the MaxCompute data source.
Field: Add the additional field execute_time to the destination tables to record the time when a data record in a source table is changed.

The following table describes the mappings between the source tables and destination tables.

Source database	Source table	Destination table	Additional field
doc_demo1	doc_tb1	ods_tb1	execute_time
doc_demo1	doc_tb2_nopk	ods_tb2_nopk	execute_time

Procedure

Create and configure a data synchronization solution
Run the data synchronization solution
Perform O&M operations for the data synchronization solution
- Configure monitoring and alerting settings for the real-time synchronization node generated by the data synchronization solution
- Configure monitoring and alerting settings for the merge node generated by the data synchronization solution

Create and configure a data synchronization solution

Select a synchronization solution type.
Log on to the DataWorks console and go to the homepage of or Tasks page in Data Integration to create a data synchronization solution. Select the source type MySQL and destination type MaxCompute and select One-click Real-time Synchronization to MaxCompute.
Establish network connections.
Select doc_mysql1 as the source, use the default MaxCompute data source as the destination, and establish network connections between the exclusive resource group for Data Integration and the source and destination. The default MaxCompute data source is generated after a MaxCompute project is associated with a DataWorks workspace.
Select source tables and configure a mapping rule.
1. In the Basic Configuration section, configure the parameters such as Solution Name and Location.
2. In the Data Source section, view information about the source.
3. In the Source Table section, select the tables from which you want to read data from the Source Table list. Then, click the icon to move the tables to the Selected Tables list.
  In this example, the tables doc_tb1 and doc_tb2_nopk in the database doc_demo1 are selected.
4. In the Set Mapping Rules for Table/Database Names section, click Add Rule, select a rule type, and then configure a mapping rule of the selected type.
  By default, data in a source table is written to the destination table that is named the same as the source table. You can configure a mapping rule to specify the name of a destination table. In this example, a mapping rule of the Conversion Rule for Table Name type is configured. The mapping rule defines that data in the source tables whose names are prefixed with doc_ is written to the destination tables whose names are prefixed with ods_.

Configure the destination tables.

Configure the Write Mode parameter.
You can set this parameter only to Write log table directly in real time. This way, incremental data in the source tables is written to a MaxCompute log table. Then, the system writes the incremental data in the MaxCompute log table to the MaxCompute base tables on a regular basis to merge the full data and incremental data.
Configure the Automatic Partitioning by Time parameter.
You can configure the Automatic Partitioning by Time parameter to write data to partitioned tables or non-partitioned tables. If you set this parameter to Partition Table, you can specify the partition key column. In this example, this parameter is set to Partition Table, and the partition key column is ds.
Note
If you set the Automatic Partitioning by Time parameter to Partition Table, you can click the icon to specify the partition key column for the destination tables.

Map source tables to destination tables.

Click Refresh source table and MaxCompute Table mapping. The system maps the source tables to destination tables based on the mapping rule that you configured in the Set Mapping Rules for Table/Database Names section. By default, if no mapping rule is configured, data in the source tables is synchronized to the destination tables that are named the same as the source tables. If no such destination tables exist, the system automatically creates the tables in the destination. You can change the table generation method and add additional fields to a destination table.

Note

The system maps the source tables to destination tables based on the mapping rule that you configured in the Set Mapping Rules for Table/Database Names section.

Operation	Description
Synchronize data from a source table that does not have a primary key	The current data synchronization solution cannot be used synchronize data from source tables that do not have primary keys. If you want to synchronize data from a source table that does not have a primary key, you must click the icon in the Synchronized Primary Key column of the source table to specify a primary key for the source table. You can use a field or a combination of several fields in the source table as the primary key for the source table. The system removes duplicate data based on the primary key during data synchronization. In this example, the source table `doc_tb2_nopk` does not have a primary key, and the field `id` is used as the primary key for the source table.
Select a table generation method	You can select Create Table and Use Existing Table from the drop-down list in the Table creation method column. If you select Use Existing Table, all existing MaxCompute tables are displayed in the drop-down list in the MaxComputeBase Table name column. You can select a table name that you need to use from the drop-down list. If you select Create Table, the name of the destination table that is automatically created appears in the MaxComputeBase Table name column. You can click the table name to view and modify the table creation statement. In this example, Create Table is selected.
Specify whether to synchronize full data	In the Full Synchronization column, you can specify whether to synchronize full data from a source table at a time before incremental data in the source table is synchronized in real time. If you turn off the switch in the Full Synchronization column of a source table, the data synchronization solution does not synchronize full data in the source table before real-time incremental synchronization is performed. If full data in a source table is already synchronized to a destination table at a time, you can turn off the switch for the source table. In this example, full synchronization is performed for all source tables before real-time incremental synchronization is performed.
Add additional fields to a destination table and assign values to the fields	You can click Edit additional fields in the Actions column of a destination table to add additional fields to the table and assign values to the fields. You can assign constants and variables to additional fields as values. Note You can add additional fields to a destination table only if you select Create Table from the drop-down list in the Table creation method column of the destination table. In this example, the additional field `execute_time` is added to the destination tables, and the variable `_execute_time_` is assigned to the additional field as the value. The field execute_time is used to record the time when a data record in a source table is changed.

Check the information about the destination tables, fields in the destination tables, and mappings between the source tables and destination tables. If the configurations are correct, click Next.

Configure rules for processing DDL messages.
DDL operations are performed on a source. Data Integration provides default rules to process DDL messages. You can also configure processing rules for different DDL messages based on your business requirements. For more information, see Rules for processing DDL messages.
In this example, the default processing rules are used.
Configure resources required to run the data synchronization solution.
After you create a synchronization solution, the synchronization solution generates batch synchronization nodes for full data synchronization and real-time synchronization nodes for incremental data synchronization. You must configure the parameters in the Configure Resources step.
You can configure the exclusive resource groups for Data Integration that you want to use to run real-time synchronization nodes and batch synchronization nodes, and the resource groups for scheduling that you want to use to run batch synchronization nodes. You can also click Advanced Configuration to configure the Number of concurrent writes on the target side and Allow Dirty Data Records parameters.
Note
- DataWorks uses resource groups for scheduling to issue the generated batch synchronization subtasks to resource groups for Data Integration and runs the subtasks on the resource groups for Data Integration. Therefore, a batch synchronization subtask also occupies the resources of a resource group for scheduling. You are charged fees for using the exclusive resource group for scheduling to schedule the batch synchronization subtasks. For information about the task issuing mechanism, see Mechanism for issuing nodes.
- We recommend that you use different resource groups to run the generated batch and real-time synchronization subtasks. If you use the same resource group to run the subtasks, the subtasks compete for resources and affect each other. For example, CPU resources, memory resources, and networks used by the two types of subtasks may affect each other. In this case, the batch synchronization subtasks may slow down, or the real-time synchronization subtasks may be delayed. The batch or real-time synchronization subtasks may even be terminated by the out of memory (OOM) killer due to insufficient resources.

Run the data synchronization solution

Go to the Nodes page in Data Integration and find the created synchronization task.
Click Start or Commit and Run in the Actions column to run the synchronization task.
Click Running Details in the Actions column to view the running details of the synchronization task.

Note

You need to take note of the table names and node names that are described in the following section to facilitate subsequent O&M operations. For example, you need to take note of the name of the MaxCompute log table, the names of the MaxCompute base tables, the name of the real-time synchronization node that is generated by the data synchronization solution, and the name of the merge node that is used to merge full data and incremental data.

Perform O&M operations for the data synchronization solution

The merge of full data and incremental data that are synchronized to MaxCompute consists of the following stages:

On the day on which the data synchronization solution is configured, the system runs the batch synchronization nodes that are generated by the solution to synchronize full data from the source tables to the MaxCompute base tables.
On the day on which the data synchronization solution is configured, the system starts to run the real-time synchronization node that is generated by the solution to synchronize incremental data from the source tables to the MaxCompute log table in real time after the full synchronization is complete.
On the next day, the system runs the merge node to write the incremental data to the MaxCompute base tables to merge the full data and incremental data. One day is required to merge the full data and incremental data.

To prevent errors from occurring during the stages and ensure that data can be normally generated, you must configure monitoring and alerting settings for the real-time synchronization node and the merge node.

Configure monitoring and alerting settings for the real-time synchronization node.
1. Go to the Real Time DI page in Operation Center. For more information, see Go to the Real Time DI page.
2. Find the real-time synchronization node that is generated by the data synchronization solution, click Alarm settings in the Operation column.
3. On the page that appears, click the Alarm Rules tab. On the Alarm Rules tab, click New rule.
  Set the Indicators parameter to Status and Alarm interval parameter to 5. This way, the system sends an alert notification for the node if the heartbeat of the node is lost for 5 minutes. Configure the other parameters based on your business requirements and click Confirm. For more information about O&M operations that you can perform on a real-time synchronization node, see Perform operations on a real-time synchronization node.
Configure monitoring and alerting settings for the merge node.
To prevent errors from occurring during the merge of the full data and incremental data and ensure that data in the MaxCompute base tables can be normally generated, we recommend that you configure a custom alert rule for the merge node. For more information about how to configure a custom alert rule, see Create a custom alert rule.
1. Obtain the name of the merge node on the execution details page of the data synchronization solution. For more information, see View the running details of a synchronization task.
2. Go to the Rule Management page in Operation center. On the Rule Management page, click Create Custom Rule to create a custom alert rule. For more information, see Create a custom alert rule.
  In the Create Custom Rule dialog box, set the Object Type parameter to Node and the Object parameter to the name of the merge node, and set the Trigger Condition parameter to Error. Configure the other parameters based on your business requirements and click OK.

Appendix: Query the data that is synchronized to MaxCompute

One day is required to merge the full data and incremental data that are synchronized to MaxCompute. On the day on which the data synchronization solution is configured, you can query only the full data in the MaxCompute base tables. After the full data and incremental data are merged on the next day, you can query the merged data in the MaxCompute base tables.

Query the full data that is synchronized to MaxCompute

Create an ODPS SQL node on the DataStudio page and run a command to query the full data that is synchronized to MaxCompute. Before you run the command, you must check the status of the stage named "Start batch synchronization nodes to synchronization full data" on the execution details page of the data synchronization solution in Data Integration and make sure that the stage is complete. For more information about how to create an ODPS SQL node, see Create an ad hoc query.

Query the incremental data that is synchronized to the MaxCompute log table in real time

View the name of the MaxCompute log table that is created in the stage named "Create a MaxCompute log table" on the execution details page of the data synchronization solution in Data Integration.
Find the created ODPS SQL node on the DataStudio page and run a command to query the incremental data that is synchronized to MaxCompute. For more information about how to create an ODPS SQL node, see Create an ad hoc query.
Note
For information about the format of data in a MaxCompute log table that is used to store incremental data, see Fields used for real-time synchronization.