How to synchronize data from RDS MySQL or RDS MySQL Serverless to MaxCompute - Data Transmission Service

MaxCompute, formerly known as ODPS, is a fast, fully managed, exabyte-scale data warehousing solution. Use DTS to synchronize data from an ApsaraDB RDS for MySQL instance to MaxCompute for real-time analytics.

Prerequisites

Complete the following tasks:

Important notes

During initial full data synchronization, DTS consumes read and write resources from the source and destination databases, which increases the database load. If database performance is poor, instance specifications are low, or business traffic is heavy (for example, the source database has many slow SQL queries or tables without primary keys, or the destination database experiences deadlocks), the database load increases and may even cause the service to become unavailable. Before you synchronize data, evaluate the performance of your source and destination instances. We recommend performing data synchronization during off-peak hours, for example, when the CPU utilization of both instances is below 30%.
Only table-level data synchronization is supported.
Do not use tools such as gh-ost or pt-online-schema-change to perform online DDL changes on synchronization objects in the source database during synchronization. Otherwise, synchronization fails.
MaxCompute does not support primary key constraints. If DTS retransmits data due to network issues, duplicate records may appear in MaxCompute.
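
Because MaxCompute cannot reject duplicates on its own, a downstream job may need to drop repeated rows before it uses the data. The following Python sketch shows one way to do this for rows that have already been read from a table. It assumes that a retransmitted row is an exact copy of the original in every column, which this topic does not confirm. The function name `drop_exact_duplicates` and the sample rows are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def drop_exact_duplicates(rows: Iterable[Dict[str, Hashable]]) -> List[Dict[str, Hashable]]:
    """Return rows in their original order, keeping only the first copy of each.

    Assumption: a retransmitted row is identical to the original in every column.
    """
    seen = set()
    unique = []
    for row in rows:
        # A sorted tuple of (column, value) pairs is hashable and
        # does not depend on the order in which columns were added.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


sample_rows = [
    {"id": 1, "address": "Hangzhou"},
    {"id": 2, "address": "Beijing"},
    {"id": 1, "address": "Hangzhou"},  # simulated retransmission
]
print(drop_exact_duplicates(sample_rows))
```

If retransmitted rows can differ in any column, for example in a metadata field, the key would need to be restricted to the columns that identify a change.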

Billing

Synchronization type	Pricing
Schema synchronization and full data synchronization	Free of charge.
Incremental data synchronization	Charged. For more information, see Billing overview.

Supported source instance types

The source MySQL database supports the following instance types:

Self-managed database on an ECS instance
Self-managed database connected over Express Connect, VPN Gateway, or Smart Access Gateway
Self-managed database connected over Database Gateway
ApsaraDB RDS for MySQL instances in the same or different Alibaba Cloud accounts

This topic uses an RDS Instance as an example. The procedure is similar for other instance types.

Note

If the source database is a self-managed MySQL database, you must also complete the preparatory steps. For more information, see Overview of preparations.

Supported SQL operations

DDL operations: ALTER TABLE, ADD COLUMN
DML operations: INSERT, UPDATE, DELETE

Synchronization process

Initial schema synchronization.

DTS synchronizes table schemas from the source database to MaxCompute, adding the _base suffix to table names. For example, the source table customer becomes customer_base.
Initial full data synchronization.

DTS synchronizes existing data from the source tables to the _base tables in MaxCompute. For example, data from customer is synchronized to customer_base. This data serves as the baseline for incremental synchronization.

Note
This table is also known as a full baseline table.
Incremental data synchronization.

DTS creates an incremental data table in MaxCompute with the _log suffix, such as customer_log, and synchronizes incremental data in real time.

Note
For the incremental data table schema, see Incremental log table schema.
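
The naming convention described in the steps above can be sketched as a small Python helper, which may be handy when scripts need to locate the destination tables. The function name `destination_table_names` and the dictionary keys are hypothetical. Only the `_base` and `_log` suffixes come from this topic.

```python
from typing import Dict


def destination_table_names(source_table: str) -> Dict[str, str]:
    """Map a source table name to the MaxCompute table names described above."""
    return {
        # Holds the schema and the existing data (full baseline table).
        "full_baseline": f"{source_table}_base",
        # Holds the changes synchronized in real time (incremental data table).
        "incremental_log": f"{source_table}_log",
    }


print(destination_table_names("customer"))
```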

Procedure

Warning

Perform the following steps with your Alibaba Cloud account to ensure the DTS synchronization account is properly authorized.

Purchase a data synchronization task. For more information, see Purchase a DTS instance.

Note
When purchasing, set the source instance to MySQL, the destination instance to MaxCompute, and the synchronization topology to One-way Synchronization.
Log on to the DTS console.

Note
If you are automatically redirected to the Data Management (DMS) console, you can use the icon in the lower-right corner to return to the classic DTS console.
In the left-side navigation pane, click Data Synchronization.
At the top of the Synchronization Tasks page, select the region where your destination instance is located.
Find the data synchronization task that you purchased and click Configure Task.

Configure the source and destination instances.

Section	Parameter	Description
N/A	Synchronization task name	DTS automatically generates a task name. Specify a descriptive name for easy identification. The name does not need to be unique.
Source instance details	Instance type	Select RDS Instance.
	Instance region	The source instance region selected during purchase. This value cannot be changed.
	Instance ID	Select the source RDS instance ID.
	Database account	Enter the database account for the source RDS instance. Note If the source RDS instance runs MySQL 5.5 or MySQL 5.6, you do not need to configure Database Account and Database Password.
	Database password	Enter the password for the database account.
	Connection method	Select Non-encrypted or SSL-encrypted. If you select SSL-encrypted, enable SSL encryption on the RDS instance first. For more information, see Configure SSL encryption. Important Currently, the Encryption parameter is available only for regions in the Chinese mainland and the China (Hong Kong) region.
Destination instance details	Instance type	Set to MaxCompute. This value cannot be changed.
	Instance region	The destination instance region selected during purchase. This value cannot be changed.
	Project	Enter the name of the MaxCompute Project. You can find it on the MaxCompute Project List page.

In the lower-right corner, click Set Whitelist and Next.

If the source or destination database is an Alibaba Cloud database instance, such as ApsaraDB RDS for MySQL or ApsaraDB for MongoDB, DTS automatically adds the IP addresses of the DTS service in the corresponding region to the IP whitelist of the database instance.

If the source or destination database is a self-managed database on an ECS instance, DTS automatically adds the IP addresses of the DTS service in the corresponding region to the security group rules of the ECS instance. You must also ensure that the self-managed database does not restrict access from the ECS instance. If the database is deployed as a cluster on multiple ECS instances, you must manually add the IP addresses of the DTS service in the corresponding region to the security group rules of each of the other ECS instances.

If the source or destination database is a self-managed database in an on-premises data center or a database from another cloud provider, you must manually add the IP addresses of the DTS service in the corresponding region to allow access from the DTS servers.

For a list of DTS service IP addresses, see IP address ranges of DTS servers.

Warning
Adding the public IP address blocks of the DTS service, either automatically or manually, may pose security risks. By using this product, you acknowledge that you understand and accept the potential security risks and that you must implement basic security measures. These measures include, but are not limited to, strengthening password security, limiting the ports open to each CIDR block, using authentication for internal API calls, and regularly checking and restricting unnecessary CIDR blocks. Alternatively, you can connect through a private network using a leased line, VPN Gateway, or Smart Access Gateway.
In the lower-right corner, click Next to grant the following permissions on the MaxCompute project to the DTS synchronization account.

Configure synchronization policies and objects.

Parameter	Description
Incremental data table partition definition	Select a partition name based on your business needs. For partition details, see Partitions.
Synchronization initialization	Includes Initial Schema Synchronization and Initial Full Data Synchronization. Select both Initial Schema Synchronization and Initial Full Data Synchronization. DTS synchronizes the schema and existing data to the destination before starting incremental synchronization.
Conflict resolution for existing tables	Precheck and Report Errors: Checks for tables with the same name in the destination database. If a conflict exists, the precheck fails and the task does not start. Note To avoid name conflicts, you can set object names in the destination instance. Ignore Errors and Proceed: Skips the table name conflict check. Warning Selecting Ignore Errors and Proceed can cause data inconsistencies. For example: If schemas are identical, records with the same primary key are retained during initialization but overwritten during incremental synchronization. If schemas differ, initialization may fail, data may be partially synchronized, or the task may fail entirely.
Select synchronization objects	In the Source Objects box, click the table you want to synchronize, and then click the icon to add it to the selected objects. Note Only tables can be selected as synchronization objects. You can select tables from multiple databases. By default, synchronization object names remain unchanged. To rename objects in the destination, use the object name mapping feature. For more information, see Set object names in the destination instance.
Select additional column rules	DTS adds additional columns to destination tables in MaxCompute. If these column names conflict with existing columns, synchronization fails. Select Yes or No for Enable new additional column rules. Warning Verify that additional column names do not conflict with existing columns before selecting a rule. Otherwise, the task may fail or data may be lost. For more information, see Additional column names and definitions.
Table and column mapping	Change the names of synchronized objects in the destination instance. For more information, see Map databases, tables, and columns.
Replicate temporary tables during DMS online DDL	If you use Data Management (DMS) to perform online DDL changes on the source database, you can choose whether to synchronize the temporary tables generated by the DDL changes. Yes: Synchronizes the temporary tables generated by online DDL changes. Note If a large amount of temporary table data is generated by online DDL changes, the data synchronization task may be delayed. No: Does not synchronize the temporary tables generated by online DDL changes. Only the original DDL operations from the source database are synchronized. Note This option causes tables in the destination database to be locked.
Connection retry duration	If DTS cannot connect to the source or destination instance, it retries for 720 minutes (12 hours) by default. You can also specify a custom retry duration. If DTS reconnects to the source or destination instance within the specified duration, the synchronization task automatically resumes. Otherwise, the task fails. Note You are billed for task run time during connection retries. Customize the retry duration based on your business needs, or release the DTS instance as soon as the source and destination instances are released.

After completing the preceding configurations, click Precheck and Start in the lower-right corner of the page.
Note
- A precheck runs before the synchronization task starts, and you can only start the task after it passes.
- If the precheck fails, click the icon next to the failed item to view the details.
  
  You can fix the issues based on the cause and run the precheck again.
  
  If you do not need to fix the items that triggered warnings, you can click Ignore or Ignore Warnings and Rerun Precheck to skip the warnings and run the precheck again.
After the Precheck dialog box displays Precheck Passed, close the Precheck dialog box. The synchronization task starts automatically.
Wait for the task to finish initialization and enter the Synchronizing state.

You can view the status of the data synchronization task on the Data Synchronization page.

Incremental log table schema

Note

You must run the set odps.sql.allow.fullscan=true; command in MaxCompute to set the project property that allows full table scans.

DTS stores both incremental data and metadata in the incremental data table. The following is an example.

Note

In this example, modifytime_year, modifytime_month, modifytime_day, modifytime_hour, and modifytime_minute are partition fields specified in the Configure synchronization policies and objects step.

Schema definition

Field	Description
record_id	The ID of the incremental log record. Note The ID is auto-incrementing and unique to each change. An UPDATE operation is split into two records (before-image and after-image) that share the same record_id.
operation_flag	The operation type. Valid values: I: INSERT operation. D: DELETE operation. U: UPDATE operation.
utc_timestamp	The operation timestamp (in UTC), which is the binlog timestamp.
before_flag	Indicates whether all column values are pre-update values (the before-image). Valid values: Y or N.
after_flag	Indicates whether all column values are post-update values (the after-image). Valid values: Y or N.

The before_flag and after_flag fields

The before_flag and after_flag values depend on the operation type:

INSERT

All column values are newly inserted values (post-update). before_flag=N, after_flag=Y.
UPDATE

DTS splits an UPDATE into two log records with the same record_id, operation_flag, and utc_timestamp.

The first record contains pre-update values (before_flag=Y, after_flag=N). The second contains post-update values (before_flag=N, after_flag=Y).
DELETE

All column values are deleted values (pre-update). before_flag=Y, after_flag=N.
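
The flag rules above can be summarized in a short Python sketch that builds the log records for one change. This only illustrates the rules in this section and is not how DTS itself is implemented. The function name `to_log_records` and the sample column values are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def to_log_records(record_id: int, operation_flag: str, utc_timestamp: int,
                   before: Optional[Row] = None,
                   after: Optional[Row] = None) -> List[Row]:
    """Build the incremental log records for one INSERT, UPDATE, or DELETE."""
    meta = {"record_id": record_id, "operation_flag": operation_flag,
            "utc_timestamp": utc_timestamp}
    before_image = {"before_flag": "Y", "after_flag": "N"}
    after_image = {"before_flag": "N", "after_flag": "Y"}

    if operation_flag == "I" and after is not None:
        # INSERT: one record that holds the newly inserted values.
        return [{**meta, **after_image, **after}]
    if operation_flag == "D" and before is not None:
        # DELETE: one record that holds the deleted values.
        return [{**meta, **before_image, **before}]
    if operation_flag == "U" and before is not None and after is not None:
        # UPDATE: two records that share record_id, operation_flag,
        # and utc_timestamp; the before-image comes first.
        return [{**meta, **before_image, **before},
                {**meta, **after_image, **after}]
    raise ValueError("operation_flag and row images do not match")


update_records = to_log_records(
    7, "U", 1565944878,
    before={"id": 1, "address": "Hangzhou"},
    after={"id": 1, "address": "Beijing"},
)
for record in update_records:
    print(record["record_id"], record["before_flag"], record["after_flag"], record["address"])
```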

Example: Merge full and incremental data

After data synchronization completes, DTS creates a full baseline table and an incremental data table in MaxCompute. Run SQL commands to merge these tables and obtain the full data at a specific point in time.

This example uses a customer table with the following schema.

In MaxCompute, create a table to store the merged result. This table's schema must match that of the source table.
For example, to get the full data of the customer table at timestamp 1565944878, create a result table:
```
CREATE TABLE `customer_1565944878` (
    `id` bigint NULL,
    `register_time` datetime NULL,
    `address` string);
```
Note
- You can run SQL commands in the ad hoc query feature of MaxCompute.
- For MaxCompute data types, see Data types.

In MaxCompute, run the following SQL command to merge the full baseline table with the incremental data table to obtain the full data at a specific point in time.

```
-- Allow full table scans so that the query can read the whole table.
set odps.sql.allow.fullscan=true;
insert overwrite table <result_storage_table>
select <col1>,
       <col2>,
       <colN>
  from (
        -- Rank the rows for each primary key; the most recent change comes first.
        select row_number() over (partition by t.<primary_key_column>
                                  order by record_id desc, after_flag desc) as row_number,
               record_id, operation_flag, after_flag, <col1>, <col2>, <colN>
          from (
                -- Incremental changes recorded before the chosen point in time.
                select incr.record_id, incr.operation_flag, incr.after_flag,
                       incr.<col1>, incr.<col2>, incr.<colN>
                  from <table_log> incr
                 where utc_timestamp < <timestamp>
                 union all
                -- Baseline rows, treated as the oldest inserts (record_id 0).
                select 0 as record_id, 'I' as operation_flag, 'Y' as after_flag,
                       base.<col1>, base.<col2>, base.<colN>
                  from <table_base> base
               ) t
       ) gt
 where gt.row_number = 1
   and gt.after_flag = 'Y';
```

Note

<result_storage_table>: The name of the table that stores the merged result set.
<col1>/<col2>/<colN>: The column names in the synchronized table.
<primary_key_column>: The name of the primary key column in the synchronized table.
<table_log>: The name of the incremental data table.
<table_base>: The name of the full baseline table.
<timestamp>: The point in time for which to retrieve the full data.
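
The value 1565944878 used in this topic looks like a Unix timestamp in seconds. Assuming that `utc_timestamp` is stored in that unit, which this topic does not state explicitly, the following Python sketch converts a point in time to a value for the `<timestamp>` placeholder. The function name `to_epoch_seconds` is hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_seconds(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    if moment.tzinfo is None:
        # A naive datetime would be interpreted in the local time zone,
        # which can silently shift the cutoff.
        raise ValueError("pass a timezone-aware datetime")
    return int(moment.timestamp())


cutoff = to_epoch_seconds(datetime(2019, 8, 16, 8, 41, 18, tzinfo=timezone.utc))
print(cutoff)  # 1565944878
```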

Example: merge tables to get the full customer data at timestamp 1565944878:

```
set odps.sql.allow.fullscan=true;
insert overwrite table customer_1565944878
select id,
       register_time,
       address
  from (
        select row_number() over (partition by t.id
                                  order by record_id desc, after_flag desc) as row_number,
               record_id, operation_flag, after_flag, id, register_time, address
          from (
                select incr.record_id, incr.operation_flag, incr.after_flag,
                       incr.id, incr.register_time, incr.address
                  from customer_log incr
                 where utc_timestamp < 1565944878
                 union all
                select 0 as record_id, 'I' as operation_flag, 'Y' as after_flag,
                       base.id, base.register_time, base.address
                  from customer_base base
               ) t
       ) gt
 where gt.row_number = 1
   and gt.after_flag = 'Y';
```

Query the merged data in the customer_1565944878 table.
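
To see why the merge query returns the state of the table at the chosen time, it can help to trace the same logic on a few rows. The following Python sketch mirrors the query on in-memory data: it keeps the log records before the cutoff, adds the baseline rows as record_id 0, picks the highest-ranked record for each primary key, and keeps it only if it is an after-image. The function name `merge_at`, the helper `log_row`, and the sample rows are hypothetical, and the sketch is not a substitute for running the SQL in MaxCompute.

```python
from typing import Dict, List

Row = Dict[str, object]
META_COLUMNS = {"record_id", "operation_flag", "utc_timestamp", "before_flag", "after_flag"}


def merge_at(base_rows: List[Row], log_rows: List[Row], key: str, cutoff: int) -> List[Row]:
    """Return the table contents as of `cutoff`, following the merge query."""
    # Baseline rows count as the oldest inserts, as in the "union all" branch.
    candidates = [{**row, "record_id": 0, "operation_flag": "I", "after_flag": "Y"}
                  for row in base_rows]
    candidates += [row for row in log_rows if row["utc_timestamp"] < cutoff]

    latest: Dict[object, Row] = {}
    for row in candidates:
        # Same ordering as "record_id desc, after_flag desc": 'Y' ranks above 'N'.
        rank = (row["record_id"], row["after_flag"])
        current = latest.get(row[key])
        if current is None or rank > (current["record_id"], current["after_flag"]):
            latest[row[key]] = row

    # A top-ranked before-image means the row was deleted, so it is dropped.
    return [{col: val for col, val in latest[pk].items() if col not in META_COLUMNS}
            for pk in sorted(latest) if latest[pk]["after_flag"] == "Y"]


def log_row(record_id, op, ts, before_flag, after_flag, **columns) -> Row:
    return {"record_id": record_id, "operation_flag": op, "utc_timestamp": ts,
            "before_flag": before_flag, "after_flag": after_flag, **columns}


customer_base = [{"id": 1, "address": "A"}, {"id": 2, "address": "B"}]
customer_log = [
    log_row(1, "I", 100, "N", "Y", id=3, address="C"),
    log_row(2, "U", 200, "Y", "N", id=1, address="A"),
    log_row(2, "U", 200, "N", "Y", id=1, address="A2"),
    log_row(3, "D", 300, "Y", "N", id=2, address="B"),
    log_row(4, "I", 500, "N", "Y", id=4, address="D"),
]
print(merge_at(customer_base, customer_log, "id", 400))
```

With a cutoff of 400, the update to id 1 and the insert of id 3 are applied, id 2 is removed by the delete, and the insert of id 4 is ignored because it happened later.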