Synchronize an entire MySQL database to MaxCompute in real time - MaxCompute

Data Integration lets you synchronize entire databases in real time from sources such as ApsaraDB for OceanBase, MySQL, Oracle, PolarDB, and PolarDB-X 2.0 to MaxCompute. This topic walks you through full and incremental data synchronization from an entire MySQL database to a MaxCompute Delta table.

How it works

A real-time synchronization task that targets MaxCompute runs in two steps:

Full synchronization: Data Integration reads all existing data from the source MySQL database and writes it to a MaxCompute Delta table.
Incremental synchronization: After the full load completes, Data Integration continuously reads the MySQL binary log and applies data changes to the Delta table in near real-time.

The destination can be a partitioned or non-partitioned table. Synchronized data is visible with minute-level latency, and you can query incremental data in the Delta table in as little as five minutes.

For more information about MaxCompute Delta tables, see Near real-time data warehouse.

Limitations

This feature does not support MaxCompute data sources that use tenant-level schema syntax.
Synchronizing source data to MaxCompute external tables is not supported.

Before you begin

Make sure you have completed the following tasks before you create a synchronization task:

Prerequisite	Details
Resource group	Purchase a Serverless resource group or an exclusive resource group for Data Integration.
MySQL data source	Create a MySQL data source with binary logging enabled. For more information, see Prepare a MySQL data source.
MaxCompute data source	Create a MaxCompute data source and enable the MaxCompute V2.0 data type edition before you run a synchronization task. The V2.0 data type edition supports the DECIMAL data type. For more information, see MaxCompute V2.0 data types.
Network connectivity	Make sure the resource group and both data sources can communicate over the network. For more information, see Overview of network connection solutions.

For general instructions on creating data sources, see Data Source Configuration.

Procedure

Step 1: Select a synchronization task type

Go to the Data Integration page. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

In the left-side navigation pane, click Sync Task. On the page that appears, click Create Sync Task and configure the following parameters:

Parameter	Value
Source and Destination	`MySQL` -> `MaxCompute`
Task Name	Enter a custom name for the synchronization task.
Synchronization Type	`Real-time Synchronization of the Entire Database`
Sync Steps	Select Full Synchronization and Incremental Synchronization.

Step 2: Configure network and resource settings

In the Network and Resource Settings section, select a Resource Group for the synchronization task. You can also allocate CUs (Compute Units) to the task in the Task Resource Usage section.
For Source Data Source, select the MySQL data source. For Destination Data Source, select the MaxCompute data source. Then, click Test Connectivity.
After the connectivity test succeeds for both data sources, click Next.

Step 3: Select the databases and tables to synchronize

Select the source tables to synchronize from the Source Databases and Tables area and click the move icon icon to move them to the Selected Databases and Tables area on the right.

You can select tables in the following ways:

Select specific tables: In the Source Databases and Tables area, enter keywords in the Database Filter and Table Filter fields to find the databases and tables you want. Select them and click the icon to move them to the Selected Databases and Tables area. To remove tables, go to the Selected Databases and Tables area, use the Database Filter and Table Filter fields to find the tables you want to remove, select them, and click the icon to move them back to the Source Databases and Tables area.
Select tables with regular expressions (supports adding or removing tables while the task is running): Enter regular expressions in the Database Filter and Table Filter fields to filter tables, then click Confirm Selection.
For example, to match databases with the prefix a and tables with the prefix order, enter a.* in the Database Filter field and order.* in the Table Filter field.

Step 4: Map destination tables

After you select the source tables, they appear in the Mapping Rules for Destination Tables section. Define the mappings between source and destination tables to establish the data reading and writing relationships. You can directly refresh mappings between source tables and destination tables. You can also refresh mappings between source tables and destination tables after you configure settings related to destination tables.

You can select the tables to synchronize and click Batch Refresh Mappings. If no mapping rules are configured, destination tables are named using the default rule: ${source_database_name}_${table_name}. If a destination table with the generated name does not exist, it is automatically created.

In the Custom Target Table Name Mapping column, click Edit to customize the naming rules for destination tables. You can concatenate built-in variables with custom strings to create destination table names, and you can edit the built-in variables. For example, you can create a naming rule that appends a suffix to the source table name.

Primary key configuration

To synchronize data to a MaxCompute Delta table, you must specify a primary key for the destination table. By default, the primary key of the source table is used. If the source table has no primary key, you can set custom primary key columns. The synchronization task cannot be saved if you do not specify a primary key.

Bucket configuration

The default number of buckets for an automatically created Delta table is 16. You can change this value in the Number of Table Buckets setting for the destination table mapping.

Important

You cannot change the number of buckets for an existing table.

The number of buckets determines how table data is divided. Operations such as queries, writes, and data merges can run concurrently at the bucket level. However, too many buckets can create a large number of small files. Set this value based on your workload. For more information, see Table operations and Data storage and sharding.

Historical data queryable period

The default queryable period for historical data in an automatically created Delta table is 0 hours. You can change this value in the Queryable Period For Historical Data setting of the destination table mapping.

Important

You cannot modify the time travel retention period for an existing table.

This setting determines the time range during which you can query historical versions of data. You cannot query historical data that is older than the specified period. A longer retention period retains more historical data, which increases storage costs. Configure this setting based on your needs. For more information, see Table operations and Time travel.

Step 4a: Modify data type mappings for fields

Default mappings exist between source and destination field data types. To customize these mappings, click Edit Mapping of Field Data Types in the upper-right corner of the Mapping Rules for Destination Tables section. Configure the data type mappings based on your business requirements, then click Apply and Refresh Mapping.

Step 4b: Modify the schema of a destination table

If a destination table is in the to-be-created state, you can add fields to it and assign values to the fields.

Add fields to destination tables:
- Single table: Find the destination table and click the icon in the Destination Table Name column. In the dialog box that appears, add the fields.
- Multiple tables: Select the destination tables, click Batch Modify at the bottom of the page, and then click Destination Table Schema - Batch Modify and Add Field.
Assign values to the added fields:
- Single table: Find the destination table and click Configure in the Value assignment column. In the Additional Field dialog box, assign values to the fields.
- Multiple tables: Select the destination tables, click Batch Modify at the bottom of the page, and then click Value assignment to assign values to the same fields across all selected tables.
You can click the icon to switch the value assignment method and assign constants or variables to the added fields.

Step 4c: Configure DML processing rules

Data Integration provides default DML processing rules. You can customize these rules based on your business requirements.

Single table: Find the destination table and click Configure in the Configure DML Rule column.
Multiple tables: Select the destination tables, click Batch Modify at the bottom of the page, and then click Configure DML Rule.

Step 4d: Set the source shard key

In the Source Sharding column, select a field from the source table as the shard key, or select No Sharding.

Step 4e: Configure full synchronization per table

If you selected Full Synchronization in the Sync Steps section in Step 1, you can disable full synchronization for individual tables on this page.

Step 5: Configure alert rules

To avoid delays in business data synchronization caused by task failures, you can configure alert rules for the synchronization task.

In the upper-right corner of the page, click Configure Alert Rule to open the Configure Alert Rule panel.
In the panel, click Add Alert Rule. In the dialog box that appears, configure the alert rule parameters.
The alert rules you configure here take effect for the real-time synchronization subtask generated by the synchronization task. After you finish configuring the task, you can go to the Real-time Synchronization Task page to modify these alert rules. For more information, see Manage real-time synchronization tasks.
Manage alert rules. You can enable or disable alert rules, and you can specify different alert recipients based on alert severity levels.

Step 6: Configure advanced parameters

You can adjust specific parameters for the synchronization task based on your business requirements. For example, you can set the Maximum read connections parameter to limit the number of concurrent connections to the source database and reduce the load on your production environment.

To prevent unexpected errors or data quality issues, make sure you understand what each parameter does before you change its value.

In the upper-right corner of the configuration page, click Configure Advanced Parameters.
In the Configure Advanced Parameters panel, adjust the parameter values as needed.

Step 7: Configure DDL processing rules

DDL operations may be performed on the source database during synchronization. Click Configure DDL Capability in the upper-right corner of the page to define how the synchronization task handles DDL messages from the source.

For more information, see Configure rules to process DDL messages.

Step 8: View and change resource groups

Click Configure Resource Group in the upper-right corner of the page to view and change the resource groups used to run the synchronization task.

Step 9: Run the synchronization task

After you finish configuring the synchronization task, click Complete at the bottom of the page.
In the Nodes section of the Data Integration page, find the synchronization task you created and click Start in the Actions column.
Click the name or ID of the synchronization task in the Tasks section to view the detailed running process.

Synchronization task operations and maintenance

View task running status

After you create a synchronization task, you can view a list of your tasks and their basic information on the Sync Task page.

In the Actions column, you can Start or Stop a synchronization task. From the More menu, you can also perform other operations such as Edit and View.

For running tasks, you can view the operational status in the Execution Overview section. Click an area in the overview to view execution details.

A real-time synchronization task that synchronizes an entire MySQL database to MaxCompute consists of three steps:

Step	Description
Schema migration	Creates the destination tables. For each table, shows the creation method (existing table or automatically created table). If a table is automatically created, the DDL statement is displayed.
Full initialization	Lists the tables for offline synchronization, the synchronization progress, and the number of written records.
Real-time data synchronization	Displays real-time synchronization statistics, including real-time progress, DDL records, DML records, and alert information.

Rerun the synchronization task

If you add or remove tables from the source, or change the schema or name of a destination table, you can rerun the synchronization task. During a rerun, the task only synchronizes data from the newly added tables or from the mapped source table whose destination table schema or name has changed.

Rerun without modifying the task configuration: Click More in the Actions column of the synchronization task and then click Rerun. This triggers full synchronization and incremental synchronization for the task again.
Rerun after adding or removing tables: After you modify the task, click Complete. The Actions column then displays Apply Updates. Click Apply Updates to trigger a rerun. During this rerun, only the newly added tables are synchronized. Data in the original tables is not synchronized again.