Synchronize data from tables in sharded MySQL databases to MaxCompute - DataWorks

This topic describes how to use Data Integration to synchronize data from tables in sharded databases to MaxCompute in real time. In this example, data is synchronized from MySQL.

Prerequisites

A MaxCompute data source and a MySQL data source are prepared. You must prepare the required data sources in advance. This way, when you configure a data synchronization solution, you can select the data sources to determine the source from which you want to read data and the destination to which you want to write data. For more information, see Add a MySQL data source and Add a MaxCompute data source.
Note
For information about the items that you need to understand before you configure a data source, see Overview.
An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
Network connections are established between the exclusive resource group for Data Integration and the data sources. For more information, see Establish a network connection between a resource group and a data source.
The data source environments are prepared.
- MaxCompute: Before you read data from or write data to a MaxCompute table, you can determine whether to enable related properties for the MaxCompute data source based on your business requirements. For information about how to prepare a MaxCompute environment, see Prepare a MaxCompute environment.
- MySQL: If the MySQL data source is used as the source, you must enable the binary logging feature for the MySQL data source. For more information, see Prepare a MySQL environment.

Background information

In actual business scenarios, data cannot be synchronized by using a few simple batch or real-time synchronization nodes. Instead, multiple batch synchronization nodes, real-time synchronization nodes, and data processing nodes are required to synchronize data. In this case, complex configurations are required. The synchronization of data from tables in sharded MySQL databases to a single MaxCompute table requires more complex configurations that involve multiple nodes. O&M operations on the nodes are also complex.

To resolve this issue, DataWorks Data Integration provides data synchronization solutions that can be used to synchronize data between various data sources in different business scenarios. These solutions improve the efficiency of data synchronization.

Synchronization of data from tables in sharded MySQL databases to MaxCompute is a real-time synchronization solution. This solution is implemented based on the real-time synchronization of data from binary logs and the merge of full and incremental data in T+1 mode. When you run such a data synchronization solution, full data in the source is synchronized to the daily partitions of the MaxCompute base tables, and incremental data in the source is synchronized to a MaxCompute log table in real time. Then, a merge node is run every day to merge the incremental data in the MaxCompute log table on the previous day into the MaxCompute base tables to update the full data in the MaxCompute base tables. If you want to query data in the MaxCompute base tables, you may need to query the latest partitions in the tables in most cases.

Limits

Resource group used for data synchronization
You must use an exclusive resource group for Data Integration to run a data synchronization solution that is used to synchronize data from tables in sharded databases to MaxCompute.
Data synchronization across time zones
You cannot use a data synchronization solution to synchronize data across time zones. If a data source of a data synchronization solution resides in a different time zone from the resource group that is used to run the solution, errors may occur during data synchronization.

Procedure

Step 1: Select a synchronization solution type
Step 2: Configure an exclusive resource group for Data Integration and establish network connections
Step 3: Configure tables in sharded MySQL databases from which you want to synchronize data
Step 4: Configure destination tables
Step 5: Configure table-level synchronization rules
Step 6: Configure rules to process DDL messages
Step 7: Configure the resources required by the data synchronization solution
Step 8: Run the data synchronization solution

Step 1: Select a synchronization solution type

Log on to the DataWorks console and go to the homepage of or Tasks page in Data Integration to create a data synchronization solution. Set the source type to MySQL and destination type to MaxCompute and select synchronize data from tables in sharded databases to MaxCompute.

Step 2: Configure an exclusive resource group for Data Integration and establish network connections

You must select the MySQL data source and the MaxCompute data source that are prepared in advance. Test network connectivity between the exclusive resource group for Data Integration and the data sources.

Note

You must specify the encoding format of the source based on your business requirements, and the time zone in which the source resides.

Step 3: Configure tables in sharded MySQL databases from which you want to synchronize data

The minimum unit for synchronizing data from tables in sharded databases is a logical table. A logical table corresponds to a set of physical tables at the source and a base table at the destination. DataWorks provides the automatic generation feature for logical tables. You can use this feature to scan physical tables at the source based on specific rules. This simplifies the procedure that is required to configure logical tables. If the scan result does not meet your business requirements, you need to only modify the scan result. If the rules based on which physical tables at the source are scanned are complex, you can manually add logical tables to the data synchronization solution.

Automatically generate logical tables

Click Automatically Generate Logical Table. In the data source list, select the data sources that you want to use.
In the Specify Generation Rule for Logical Table section, click Create Rule. Then, click Preset Rule or Manual Input to add the required rule.
In this example, the following preset rules are used:
- Merge multiple source tables whose names differ only in numeric suffixes into a logical table.
- Merge multiple source tables with the same name into a logical table.

Manually add logical tables

If the scan result returned after you use the automatic generation feature for logical tables does not meet your business requirements, you can modify the configured rules or manually add the logical tables that you want to synchronize.

Click Add Logical Table. Then, add conditions to filter source databases and tables. Data in the same logical table is synchronized to the same destination table.
Add conditions to filter source databases and tables.
You can add a database-based filter condition and a table-based filter condition for each logical table. You can click Add Condition in the Select Source Database section and in the Select Source Table section to add the related filter conditions. When the data synchronization solution is run, the solution searches for source databases and tables based on the filter conditions and aggregates the obtained databases and tables. Data will be synchronized from the source tables that are finally obtained.

Edit a rule for multiple sharded databases

Rules for sharded databases impose limits on source databases. The limits take effect for all source tables that belong to the source databases. For this example, if you want to synchronize data from xiaobo* databases, you can add the rule that is shown in the following figure. Add a rule

Step 4: Configure destination tables

Configure the Write Mode parameter.
The value of this parameter is fixed to Write log table directly in real time. In this mode, incremental data in the source tables is written to a MaxCompute log table in real time. Then, the system runs a merge node to write the incremental data in the MaxCompute log table to the MaxCompute base tables on a regular basis to merge the full data and incremental data.
Configure the Automatic Partitioning by Time parameter.
You can configure the Automatic Partitioning by Time parameter to write data to partitioned or non-partitioned MaxCompute tables. If you set this parameter to Partitioned Table, you can specify the partition key column. In this example, this parameter is set to Partitioned Table, and the partition key column is ds.
Note
If you set the Automatic Partitioning by Time parameter to Partitioned Table, you can click the icon to modify the partition settings for the destination tables.

Configure mappings between the source tables and destination MaxCompute tables.

Click Refresh source table and MaxCompute Table mapping to configure the mappings between the source tables and destination MaxCompute tables.

Operation	Description
Select a primary key for a source table	Source tables that do not have primary keys cannot be synchronized. If a source table does not have a primary key, you can click the icon in the Synchronized Primary Key column to specify one or more fields in the table as the primary key of the table. The system removes duplicate data based on the primary key during data synchronization. In this example, `id` is used as the primary key of the source table that does not have a primary key.
Edit the schema of a destination table	You can click the name of a destination MaxCompute table to preview the schema of the table. If you select Create Table from the drop-down list in the Table creation method column, the system automatically creates a destination MaxCompute table. During the table creation process, the system automatically adds the`_src_info_` field to the table. The _src_info_ field is used to mark the child table to which a row of data belongs. You can use the `_src_info_` field together with a primary key to locate a row of data. Note By default, the lifecycle of a destination MaxCompute table that is automatically created is 30 days, and automatic data type mapping may exist during the creation of a destination MaxCompute table. If the field data types in a destination MaxCompute table are different from the field data types in the mapped source table, the data synchronization solution maps the field data types in the destination MaxCompute table that are compatible with the field data types in the source table for the source table when the system creates the destination MaxCompute table. You can click the name of a destination MaxCompute table in the MaxComputeTable name column to change the data types of fields and the lifecycle of the table.
Select a table generation method	You can select Create Table or Use Existing Table from the drop-down list in the Table creation method column. If you select Use Existing Table, all existing MaxCompute tables are displayed in the drop-down list in the MaxComputeTable name column. You can select a table name that you need to use from the drop-down list. Create Table: If you select this method, the name of the destination MaxCompute table that is automatically created appears in the MaxComputeTable name column. You can click the table name to view and modify the table creation statement. In this example, Create Table is selected.
Specify whether to synchronize full data	You can determine whether to turn on the switch in the Full Synchronization column of a source table to synchronize full data in the table to the mapped destination table. If you turn off the switch in the Full Synchronization column of a source table, the data synchronization solution does not synchronize full data in the source table before real-time incremental synchronization is performed. If full data in a source table is already synchronized to a destination table at a time, you can turn off the switch for the source table. In this example, full synchronization is performed for all source tables before real-time incremental synchronization is performed.

After you complete and confirm the configurations in the Set Destination Table step, click Next.

Step 5: Configure table-level synchronization rules

You can configure rules to process DML messages for the data synchronization solution. For example, if you want to retain data that is deleted from the source in MaxCompute base tables, you can configure an Ignore processing rule for the deletion operation performed on the source, as shown in the following figure.You can configure rules to process DML messages generated for insertion, update, and deletion operations that are performed on the source.

Normal: The system sends the DML messages to the destination, and the system performs the same operation on the destination as the operation that is performed on the source.
Ignore: The system discards the DML messages and no longer sends the messages to the destination, and the related data is not modified.
Conditionally Normal Processing: The system first filters data based on the filter expression. Data that meets the filter expression is normally processed. Data that does not meet the filter expression is ignored.

Note

By default, if no processing rule is configured in this step, a Normal processing rule is used.

Step 6: Configure rules to process DDL messages

DDL operations are performed on a source. Data Integration provides default rules to process DDL messages. You can also configure processing rules for different DDL messages based on your business requirements. For more information, see Rules for processing DDL messages.

Step 7: Configure the resources required by the data synchronization solution

After the data synchronization solution is created and run, the solution generates batch synchronization nodes for full data synchronization and a real-time synchronization node for incremental data synchronization. You must configure the parameters in the Configure Resources step.

You must configure the exclusive resource groups for Data Integration that you want to use to run the real-time synchronization node and batch synchronization nodes, and the resource groups for scheduling that you want to use to run batch synchronization nodes. You can also click Advanced Configuration to configure parameters such as Number of concurrent writes on the target side, Maximum number of connections supported by source read, and Allow Dirty Data Records.

Note

DataWorks uses resource groups for scheduling to issue the generated batch synchronization subtasks to resource groups for Data Integration and runs the subtasks on the resource groups for Data Integration. Therefore, a batch synchronization subtask also occupies the resources of a resource group for scheduling. You are charged fees for using the exclusive resource group for scheduling to schedule the batch synchronization subtasks. For information about the task issuing mechanism, see Mechanism for issuing nodes.
We recommend that you use different resource groups to run the generated batch and real-time synchronization subtasks. If you use the same resource group to run the subtasks, the subtasks compete for resources and affect each other. For example, CPU resources, memory resources, and networks used by the two types of subtasks may affect each other. In this case, the batch synchronization subtasks may slow down, or the real-time synchronization subtasks may be delayed. The batch or real-time synchronization subtasks may even be terminated by the out of memory (OOM) killer due to insufficient resources.

Step 8: Run the data synchronization solution

Go to the Nodes page in Data Integration and find the created synchronization task.
Click Start or Commit and Run in the Actions column to run the synchronization task.
Click Running Details in the Actions column to view the running details of the synchronization task.

What to do next

After a data synchronization solution is configured, you can manage the solution. For example, you can add tables to or remove tables from the solution, configure alerting and monitoring settings for the nodes that are generated by the solution, and view information about the running of the nodes. For more information, see Perform O&M on a full and incremental synchronization task.