Synchronize an entire MySQL database to MaxCompute in real time - DataWorks

Data Integration lets you synchronize entire databases in real time from sources such as ApsaraDB for OceanBase, MySQL, Oracle, PolarDB, and PolarDB-X 2.0 to MaxCompute. This topic uses MySQL as the example source and describes how to perform full and incremental synchronization of an entire MySQL database to a MaxCompute Delta table.

Background

A Data Integration task that synchronizes an entire MySQL database to MaxCompute in real time first synchronizes the full data from the source MySQL database to a MaxCompute Delta table and then synchronizes incremental data changes in real time. The destination can be a partitioned or non-partitioned table. The synchronized data is visible in near real-time, with minute-level latency. You can query incremental data synchronized to the Delta table in as little as five minutes.

For more information about MaxCompute Delta tables, see Near real-time data warehouse.

Prerequisites

You have purchased a Serverless resource group or an exclusive resource group for Data Integration.
You have created a MySQL data source and a MaxCompute data source. For more information, see Data Source Configuration.
- MySQL: For a MySQL data source, you must enable binary logging. For more information, see Prepare a MySQL data source.
- MaxCompute: For a MaxCompute data source, you must enable the MaxCompute V2.0 data type edition before you run a synchronization task. The MaxCompute V2.0 data type edition supports the DECIMAL data type. For more information, see MaxCompute V2.0 data types.
The resource group and the data source must be connected over a network. For more information, see Overview of network connection solutions.
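
The binary logging prerequisite can be checked before you create the task. The following is a minimal sketch, not part of DataWorks: `check_binlog_settings` is a hypothetical helper that inspects the output of `SHOW GLOBAL VARIABLES` after you have loaded it into a dictionary. This topic only states that binary logging must be enabled. The `binlog_format=ROW` and `binlog_row_image=FULL` expectations are assumptions that are typical for change data capture, so confirm them against Prepare a MySQL data source.

```python
# Hypothetical pre-flight check. The expected values below are assumptions;
# only "binary logging must be enabled" is stated in this topic.
EXPECTED_VARIABLES = {
    "log_bin": "ON",             # binary logging enabled
    "binlog_format": "ROW",      # assumption: row-based logging
    "binlog_row_image": "FULL",  # assumption: full row images
}


def check_binlog_settings(variables):
    """Return a list of problems found; an empty list means all checks passed.

    `variables` maps MySQL variable names to values, for example the rows
    returned by SHOW GLOBAL VARIABLES converted into a dict.
    """
    problems = []
    for name, expected in EXPECTED_VARIABLES.items():
        actual = str(variables.get(name, "")).upper()
        if actual != expected:
            shown = actual if actual else "unset"
            problems.append(f"{name} is {shown}, expected {expected}")
    return problems
```

An empty result means the three variables match the expected values. Any other result lists what to fix before you start the sync task.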

Limitations

This feature does not support MaxCompute data sources that use tenant-level schema syntax.
Synchronizing source data to MaxCompute external tables is not supported.

Procedure

Step 1: Select a synchronization task type

Go to the Data Integration page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
In the navigation pane on the left, click Sync Task. On the page that appears, click Create Sync Task and configure the following parameters.
- Source and Destination: MySQL→MaxCompute
- Task Name: Enter a custom name for the sync task.
- Synchronization Type: Real-time Synchronization of the Entire Database.
- Sync Steps: Select Full Synchronization and Incremental Synchronization.

Step 2: Configure network and resource settings

In the Network and Resource Settings section, select a Resource Group for the sync task. You can also allocate CUs to the task in the Task Resource Usage section.
For Source Data Source, select the MySQL data source. For Destination Data Source, select the MaxCompute data source. Then, click Test Connectivity.
After the connectivity test is successful for both data sources, click Next.

Step 3: Select the databases and tables to synchronize

Select the source tables to synchronize from the Source Databases and Tables area and click the icon to move them to the Selected Databases and Tables area on the right.

Select specific tables:
- In the Source Databases and Tables area, enter keywords in the Database Filter and Table Filter fields to find the databases and tables that you want to synchronize. Select the databases and tables and click the icon to move them to the Selected Databases and Tables area.
- In the Selected Databases and Tables area, enter keywords in the Database Filter and Table Filter fields to find databases and tables that you do not want to synchronize. Select these databases and tables and click the icon to move them back to the Source Databases and Tables area.
Select tables using regular expressions (supports adding or removing tables while the task is running):
Enter regular expressions in the Database Filter and Table Filter fields to filter tables. Click Confirm Selection to confirm the databases and tables that you want to synchronize.
Note
For example, to filter for databases with the prefix a and tables with the prefix order, enter a.* in the Database Filter box and order.* in the Table Filter box.
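
To preview which tables a pair of filter expressions would select, you can test the expressions locally. The sketch below is illustrative only: `select_tables` and the sample catalog are hypothetical, and it assumes that an expression must match the whole database or table name. The exact matching semantics of the console are not described in this topic.

```python
import re


def select_tables(catalog, database_filter, table_filter):
    """Return (database, table) pairs whose names match both expressions.

    Assumption: each expression must match the full name (re.fullmatch).
    """
    db_re = re.compile(database_filter)
    table_re = re.compile(table_filter)
    selected = []
    for database, tables in catalog.items():
        if not db_re.fullmatch(database):
            continue
        for table in tables:
            if table_re.fullmatch(table):
                selected.append((database, table))
    return selected


# Hypothetical source catalog used only for this illustration.
catalog = {
    "app": ["orders", "order_items", "users"],
    "billing": ["order_log"],
    "a1": ["preorder"],
}
matches = select_tables(catalog, "a.*", "order.*")
```

With the filters from the example above, only tables whose names start with order in databases whose names start with a are selected. Here that is `orders` and `order_items` in `app`.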

Step 4: Map destination tables

After you select the source tables, they are automatically displayed in the Mapping Rules for Destination Tables section, and the properties of the destination tables are waiting to be mapped. You must define mappings between the source tables and destination tables to determine the data reading and writing relationships. To do so, click Refresh in the Actions column. You can refresh the mappings directly, or configure the settings of the destination tables first and then refresh the mappings.

Note

You can select the tables to synchronize and click Batch Refresh Mappings. If no mapping rules are configured, the default naming rule for destination tables is ${source_database_name}_${table_name}. If a destination table with the generated name does not exist, it is automatically created.
In the Custom Target Table Name Mapping column, you can click Edit to customize the naming rules for destination tables.
You can concatenate built-in variables with custom strings to create destination table names. You can also edit the built-in variables. For example, you can create a naming rule that adds a suffix to the source table name.
To synchronize data to a MaxCompute Delta table, you must specify a primary key for the destination table. By default, the primary key of the source table is used. If the source table has no primary key, you can customize the primary key columns. The sync task cannot be saved if you do not specify a primary key.
The default number of buckets for an automatically created Delta table is 16. You can change this number in the Number of Table Buckets setting for the destination table mapping.
You cannot change the number of buckets for an existing table. The number of buckets determines how table data is divided. Operations such as queries, writes, and data merges can be executed concurrently at the bucket level. However, an excessive number of buckets can create many small files. Therefore, you must set this value as needed. For more information, see Table operations and Data storage and sharding.
The default queryable period for historical data in an automatically created Delta table is 0 hours. You can change this period in the Queryable Period For Historical Data setting of the destination table mapping.
You cannot modify the queryable period for historical data (the time travel retention period) of an existing table. This setting determines the time range during which you can query historical versions of data. You cannot query historical data that is older than the specified period. A longer retention period retains more historical data, which increases storage costs, so choose a value that balances your query requirements against storage costs. For more information, see Table operations and Time travel.
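
The default naming rule in the note above uses `${...}` placeholders, which can be previewed with Python's `string.Template`. This is a sketch for checking names offline, not the mechanism that DataWorks uses. `destination_table_name` and the `_delta` suffix rule are hypothetical.

```python
from string import Template

# Default rule quoted in this topic.
DEFAULT_RULE = "${source_database_name}_${table_name}"


def destination_table_name(rule, source_database_name, table_name):
    """Expand a naming rule for one source table."""
    return Template(rule).substitute(
        source_database_name=source_database_name,
        table_name=table_name,
    )


default_name = destination_table_name(DEFAULT_RULE, "shop", "orders")
# Hypothetical custom rule that appends a suffix to the source table name.
suffixed_name = destination_table_name("${table_name}_delta", "shop", "orders")
```

Previewing names this way helps you spot collisions, for example two source databases that map to the same destination table name, before you refresh the mappings.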

a. Modify data type mappings for fields

Default mappings exist between data types of source fields and data types of destination fields. You can click Edit Mapping of Field Data Types in the upper-right corner of the Mapping Rules for Destination Tables section to configure data type mappings between source fields and destination fields based on your business requirements. After the configuration is complete, click Apply and Refresh Mapping.

b. Modify the schema of a destination table to add fields to the table and assign values to the fields

If a destination table is in the To Be Created state, you can perform the following steps to add fields to the table and assign values to the fields:

Add fields to one or more destination tables.
- Add fields to a single destination table: Find the destination table to which you want to add fields and click the icon in the Destination Table Name column. In the dialog box that appears, add fields.
- Add fields to multiple destination tables at a time: Select the destination tables to which you want to add fields at a time, click Batch Modify in the lower part of the page, and then click Destination Table Schema - Batch Modify and Add Field.
Assign values to the fields. You can perform one of the following operations to assign values to the fields:
- Assign values to the fields that are added to a single destination table: Find the destination table in which you want to assign values to newly added fields and click Configure in the Value assignment column. In the Additional Field dialog box, assign values to the fields.
- Assign values to the fields that are added to multiple destination tables at a time: Select the destination tables in which you want to assign values to newly added fields, click Batch Modify in the lower part of the page, and then click Value assignment to assign values to the same fields in the selected destination tables at a time.
Note
You can click the icon to switch the value assignment method and assign constants and variables to the fields that are added to a destination table.

c. Configure DML processing rules

Data Integration provides default DML processing rules. You can also configure DML processing rules for destination tables based on your business requirements.

Configure DML processing rules for a single destination table: Find the destination table for which you want to configure DML processing rules and click Configure in the Configure DML Rule column to configure DML processing rules for the table.
Configure DML processing rules for multiple destination tables at a time: Select the destination tables for which you want to configure DML processing rules, click Batch Modify in the lower part of the page, and then click Configure DML Rule.
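
As a conceptual illustration of why a primary key is required on the destination, the sketch below applies a stream of DML events to an in-memory table keyed by primary key. It models one plausible rule set (INSERT and UPDATE are written as upserts, DELETE removes the row) and is not a description of the default rules in Data Integration. `apply_dml` and the event format are hypothetical.

```python
def apply_dml(table, events, key="id"):
    """Apply (operation, row) events to `table`, a dict keyed by primary key.

    INSERT and UPDATE are treated as upserts, DELETE removes the row.
    This is a conceptual model, not the DataWorks implementation.
    """
    for operation, row in events:
        primary_key = row[key]
        if operation in ("INSERT", "UPDATE"):
            merged = dict(table.get(primary_key, {}))
            merged.update(row)  # later values overwrite earlier ones
            table[primary_key] = merged
        elif operation == "DELETE":
            table.pop(primary_key, None)
        else:
            raise ValueError(f"unsupported operation: {operation}")
    return table


events = [
    ("INSERT", {"id": 1, "status": "new"}),
    ("INSERT", {"id": 2, "status": "new"}),
    ("UPDATE", {"id": 1, "status": "paid"}),
    ("DELETE", {"id": 2}),
]
final_state = apply_dml({}, events)
```

After the four events, only row 1 remains, with its latest status. A rule that ignores deletes, for instance, would keep row 2 as well.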

d. Set the source shard key

In the **Source Sharding** column, you can select a field from the source table or select No Sharding.
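
This topic does not describe how the selected field is used. A common approach for sharded reads is to split the value range of the field into contiguous chunks that can be read in parallel. The sketch below shows that general idea under this assumption. `split_key_range` is hypothetical and does not reflect the internal behavior of Data Integration.

```python
def split_key_range(min_key, max_key, shards):
    """Split the inclusive integer range [min_key, max_key] into up to
    `shards` contiguous, non-overlapping ranges of nearly equal size."""
    if shards < 1 or max_key < min_key:
        raise ValueError("need shards >= 1 and max_key >= min_key")
    total = max_key - min_key + 1
    shards = min(shards, total)  # never create empty ranges
    base, extra = divmod(total, shards)
    ranges = []
    start = min_key
    for index in range(shards):
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


ranges = split_key_range(1, 10, 3)
```

A field with evenly distributed values, such as an auto-increment primary key, tends to produce balanced chunks. A skewed field can leave one reader with most of the rows.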

e. Perform full synchronization

If you selected Full Synchronization in the Sync Steps section when you selected the sync task type, you can disable full synchronization for a specific table on this page.

Step 5: Configure alert rules

To keep task failures from delaying the synchronization of business data, you can configure alert rules for the synchronization task.

In the upper-right corner of the page, click Configure Alert Rule to go to the Configure Alert Rule panel.
In the Configure Alert Rule panel, click Add Alert Rule. In the Add Alert Rule dialog box, set the parameters for the alert rule.
Note
The alert rules that you configure in this step take effect for the real-time synchronization subtask that will be generated by the synchronization task. After the configuration of the synchronization task is complete, you can refer to Run and manage real-time synchronization tasks to go to the Real-time Synchronization Task page and modify alert rules configured for the real-time synchronization subtask.
Manage alert rules.
You can enable or disable alert rules that are created. You can also specify different alert recipients based on the severity levels of alerts.

Step 6: Configure advanced parameters

You can change the values of specific parameters configured for the synchronization task based on your business requirements. For example, you can specify an appropriate value for the Maximum read connections parameter so that the synchronization task does not put excessive pressure on the source database and affect production workloads.

Note

To prevent unexpected errors or data quality issues, we recommend that you understand the meanings of the parameters before you change the values of the parameters.

In the upper-right corner of the configuration page, click Configure Advanced Parameters.
In the Configure Advanced Parameters panel, change the values of the desired parameters.

Step 7: Configure DDL processing rules

DDL operations may be performed on the source. You can click Configure DDL Capability in the upper-right corner of the page to configure rules to process DDL messages from the source based on your business requirements.

Note

For more information, see Configure rules to process DDL messages.

Step 8: View and change resource groups

You can click Configure Resource Group in the upper-right corner of the page to view and change the resource groups that are used to run the current synchronization task.

Step 9: Run the synchronization task

After the configuration of the synchronization task is complete, click Complete in the lower part of the page.
In the Synchronization Task section of the Data Integration page, find the created synchronization task and click Start in the Actions column.
Click the Name/ID of the synchronization task in the Tasks section and view the detailed running process of the synchronization task.

Sync task O&M

View task running status

After you create a sync task, you can view a list of your sync tasks and their basic information on the sync task page.

In the **Actions** column, you can Start or Stop a sync task. You can also perform other operations, such as Edit and View, from the **More** menu.
For running tasks, you can view their basic operational status in the Execution Overview section. You can also click an area in the overview to view execution details.
A real-time sync task that synchronizes an entire MySQL database to MaxCompute consists of the following three steps:
- Schema migration: This step includes the creation method for the destination table, which can be an existing table or an automatically created table. If a table is automatically created, the DDL statement is displayed.
- Full initialization: This step includes information about the tables for offline synchronization, the synchronization progress, and the number of written records.
- Real-time data synchronization: This step includes real-time synchronization statistics, such as real-time progress, DDL records, DML records, and alert information.

Rerun the synchronization task

If you add tables to or remove tables from the source, or change the schema or name of a destination table, you can rerun the synchronization task after the change. To do so, click More in the Actions column of the synchronization task and then click Rerun. During the rerun, the task synchronizes data only from the newly added tables, or only from the source table that is mapped to the destination table whose schema or name changed.

If you want to rerun the synchronization task without modifying the configuration of the task, click More in the Actions column and then click Rerun to rerun the task to perform full synchronization and incremental synchronization again.
If you want to rerun the synchronization task after you add tables to or remove tables from the task, click Complete after the change. In this case, Apply Updates is displayed in the Actions column of the synchronization task. Click Apply Updates to trigger the system to rerun the synchronization task. During the rerun process, the synchronization task synchronizes data from the newly added tables to the destination. Data in the original tables is not synchronized again.
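
The effect of Apply Updates described above can be reasoned about as a set difference between the old and new table selections. The following sketch is only a mental model. `plan_apply_updates` and its result keys are hypothetical names, not part of DataWorks.

```python
def plan_apply_updates(previous_tables, current_tables):
    """Classify tables after the task configuration changes.

    Per this topic, only newly added tables are synchronized during the
    rerun; data in the original tables is not synchronized again.
    """
    previous = set(previous_tables)
    current = set(current_tables)
    return {
        "newly_added": sorted(current - previous),  # synchronized on rerun
        "unchanged": sorted(current & previous),    # not synchronized again
        "removed": sorted(previous - current),      # no longer in the task
    }


plan = plan_apply_updates(
    ["shop.orders", "shop.users"],
    ["shop.orders", "shop.payments"],
)
```

In this example, `shop.payments` is the only table whose data is synchronized when the updates are applied.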