MaxCompute (formerly known as ODPS) is a fast and fully managed computing platform for large-scale data warehousing. MaxCompute can process exabytes of data. This topic describes how to synchronize data from an ApsaraDB RDS for MySQL instance to a MaxCompute project by using Data Transmission Service (DTS).

Prerequisites

The source ApsaraDB RDS for MySQL instance and the destination MaxCompute project are created.

Limits

  • DTS uses read and write resources of the source and destination databases during initial full data synchronization. This may increase the loads of the database servers. If the database performance or specifications are unfavorable, or the data volume is large, database services may become unavailable. For example, DTS occupies a large amount of read and write resources in the following cases: a large number of slow SQL queries are performed on the source database, the tables have no primary keys, or a deadlock occurs in the destination database. Before you synchronize data, evaluate the impact of data synchronization on the performance of the source and destination databases. We recommend that you synchronize data during off-peak hours. For example, you can synchronize data when the CPU utilization of the source and destination databases is less than 30%.
  • Only tables can be selected as the objects to be synchronized.
  • We recommend that you do not use a tool such as gh-ost or pt-online-schema-change to perform DDL operations on objects during data synchronization. Otherwise, data synchronization may fail.
  • MaxCompute does not support PRIMARY KEY constraints. If network errors occur, DTS may synchronize duplicate data records to MaxCompute.
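Because MaxCompute does not enforce primary key constraints, duplicate records are not rejected at write time. If you suspect that duplicates were written, you can check for them with a query sketch such as the following, which assumes the customer_base full baseline table and the id primary key column used as an example later in this topic:

    -- Find primary key values that occur more than once in the full baseline table.
    SELECT id, COUNT(*) AS cnt
      FROM customer_base
     GROUP BY id
    HAVING COUNT(*) > 1;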

Billing

  • Schema synchronization and full data synchronization: free of charge.
  • Incremental data synchronization: charged. For more information, see Billing overview.

Supported source database types

You can use DTS to synchronize data from the following types of MySQL databases:

  • Self-managed database hosted on Elastic Compute Service (ECS)
  • Self-managed database connected over Express Connect, VPN Gateway, or Smart Access Gateway
  • Self-managed database connected over Database Gateway
  • ApsaraDB RDS for MySQL instance that belongs to the same or a different Alibaba Cloud account than the MaxCompute project
In this example, an ApsaraDB RDS for MySQL instance is used to describe how to configure a data synchronization task. You can also follow the procedure to configure data synchronization tasks for other types of MySQL databases.
Note If your source database is a self-managed MySQL database, you must deploy the network environment for the source database. For more information, see Preparation overview.

SQL operations that can be synchronized

  • DDL operations: ALTER TABLE ... ADD COLUMN.
  • DML operations: INSERT, UPDATE, and DELETE.
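For example, the following statements, executed on the source MySQL database, fall within the supported operation types. The customer table and its columns are hypothetical and used only for illustration:

    -- Supported DDL: add a column to an existing table.
    ALTER TABLE customer ADD COLUMN phone varchar(20);
    -- Supported DML: insert, update, and delete rows.
    INSERT INTO customer (id, register_time, address) VALUES (1, NOW(), 'Hangzhou');
    UPDATE customer SET address = 'Shanghai' WHERE id = 1;
    DELETE FROM customer WHERE id = 1;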

Synchronization process

  1. Initial schema synchronization.
    DTS synchronizes the schemas of the required objects from the source database to MaxCompute. During initial schema synchronization, DTS adds the _base suffix to the end of the source table name. For example, if the name of the source table is customer, the name of the table in MaxCompute is customer_base.
  2. Initial full data synchronization.
    DTS synchronizes the historical data of the table from the source database to the destination table in MaxCompute. For example, the customer table in the source database is synchronized to the customer_base table in MaxCompute. The data is the basis for subsequent incremental synchronization.
    Note The destination table that is suffixed with _base is known as a full baseline table.
  3. Incremental data synchronization.
    DTS creates an incremental data table in MaxCompute. The name of the incremental data table is suffixed with _log, such as customer_log. Then, DTS synchronizes the incremental data that was generated in the source database to the incremental data table.
    Note For more information, see Schema of an incremental data table.
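For example, after the customer table is selected for synchronization, the MaxCompute project contains the customer_base full baseline table and the customer_log incremental data table. A minimal sketch such as the following, run in the MaxCompute client or DataWorks, can be used to confirm that both tables receive data. The table names follow the customer example above:

    -- Allow full table scans so that the partitioned incremental data table can be
    -- queried without a partition filter.
    set odps.sql.allow.fullscan=true;
    -- Historical data synchronized during initial full data synchronization.
    SELECT COUNT(*) FROM customer_base;
    -- Incremental data synchronized after the initial synchronization.
    SELECT COUNT(*) FROM customer_log;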

Procedure

Warning To ensure that the synchronization account can be authorized, we recommend that you perform the following steps by using your Alibaba Cloud account.
  1. Purchase a data synchronization instance. For more information, see Purchase a DTS instance.
    Note On the buy page, set the Source Instance parameter to MySQL, set the Destination Instance parameter to MaxCompute, and then set the Synchronization Topology parameter to One-way Synchronization.
  2. Log on to the DTS console.
    Note If you are redirected to the Data Management (DMS) console, you can click the old icon in the lower-right corner to go to the previous version of the DTS console.
  3. In the left-side navigation pane, click Data Synchronization.
  4. In the upper part of the Data Synchronization Tasks page, select the region in which the destination instance resides.
  5. Find the data synchronization instance and click Configure Task in the Actions column.
  6. Configure the source and destination instances.
    Synchronization Task Name: The task name that DTS generates. We recommend that you specify a descriptive name that makes the task easy to identify. The task name does not need to be unique.
    Source Instance Details
    • Instance Type: The instance type of the source instance. Select RDS Instance.
    • Instance Region: The source region that you selected on the buy page. You cannot change the value of this parameter.
    • Instance ID: The ID of the source ApsaraDB RDS for MySQL instance.
    • Database Account: The database account of the source ApsaraDB RDS for MySQL instance.
      Note If the database engine of the source ApsaraDB RDS for MySQL instance is MySQL 5.5 or MySQL 5.6, you do not need to configure the Database Account and Database Password parameters.
    • Database Password: The password of the database account.
    • Encryption: Specifies whether to encrypt the connection to the source instance. Select Non-encrypted or SSL-encrypted based on your business and security requirements. If you select SSL-encrypted, you must enable SSL encryption for the ApsaraDB RDS for MySQL instance before you configure the data synchronization task. For more information, see Configure SSL encryption for an ApsaraDB RDS for MySQL instance.
      Important The Encryption parameter is available only in regions in the Chinese mainland and the China (Hong Kong) region.
    Destination Instance Details
    • Instance Type: The instance type of the destination instance. This parameter is set to MaxCompute and cannot be changed.
    • Instance Region: The destination region that you selected on the buy page. You cannot change the value of this parameter.
    • Project: The name of the MaxCompute project. To view the name of a project, log on to the DataWorks console and choose Compute Engines > MaxCompute in the left-side navigation pane. You can view project names on the Projects page.
  7. In the lower-right corner of the page, click Set Whitelist and Next.
    Note You do not need to modify the security settings for ApsaraDB instances (such as ApsaraDB RDS for MySQL and ApsaraDB for MongoDB instances) and self-managed databases hosted on ECS. DTS automatically adds the CIDR blocks of DTS servers to the IP whitelists of ApsaraDB instances or the security rules of ECS instances. For more information, see Add the CIDR blocks of DTS servers to the security settings of on-premises databases.
  8. In the lower-right corner of the page, click Next. In this step, the permissions on the MaxCompute project are granted to the synchronization account.
    Grant permissions to an account
  9. Select the synchronization policy and the objects to be synchronized.
    Select objects to be synchronized
    Partition Definition of Incremental Data Table

    Select the partition names based on your business requirements. For more information about partitions, see Partition.

    Initialize Synchronization

    Initial synchronization includes initial schema synchronization and initial full data synchronization.

    Select both Initial Schema Synchronization and Initial Full Data Synchronization. In this case, DTS synchronizes the schemas and historical data of the required objects and then synchronizes incremental data.

    Processing Mode in Existed Target Table
    • Pre-check and Intercept: checks whether the destination database contains tables that have the same names as tables in the source database. If the source and destination databases do not contain identical table names, the precheck is passed. Otherwise, an error is returned during precheck and the data synchronization task cannot be started.
      Note You can use the object name mapping feature to rename the tables that are synchronized to the destination database. You can use this feature if the source and destination databases contain identical table names and the tables in the destination database cannot be deleted or renamed. For more information, see Rename an object to be synchronized.
    • Ignore: skips the precheck for identical table names in the source and destination databases.
      Warning If you select Ignore, data consistency is not ensured, and your business may be exposed to potential risks.
      • If the source and destination databases have the same schema, DTS does not synchronize data records that have the same primary keys as existing records in the destination database during initial data synchronization. However, DTS synchronizes these data records during incremental data synchronization.
      • If the source and destination databases have different schemas, only some columns may be synchronized or the data synchronization task may fail.
    Select the objects to be synchronized

    Select one or more tables from the Available section and click the Rightwards arrow icon to move the tables to the Selected section.

    Note
    • You can select tables from multiple databases as the objects to be synchronized.
    • By default, after an object is synchronized to the destination database, the name of the object remains unchanged. You can use the object name mapping feature to rename the objects that are synchronized to the destination database. For more information, see Rename an object to be synchronized.
    Whether to enable the new naming rules for additional columns
    After DTS synchronizes data to MaxCompute, DTS adds additional columns to the destination table. If the names of additional columns are the same as the names of existing columns in the destination table, data synchronization fails. Select Yes or No to specify whether to enable the new naming rules for additional columns.
    Warning Before you specify this parameter, check whether additional columns and existing columns in the destination table have name conflicts. For more information, see Naming rules for additional columns.
    Rename Databases and Tables

    You can use the object name mapping feature to rename the objects that are synchronized to the destination instance. For more information, see Object name mapping.

    Replicate Temporary Tables When DMS Performs DDL Operations
    If you use Data Management (DMS) to perform online DDL operations on the source database, you can specify whether to synchronize temporary tables generated by online DDL operations.
    • Yes: DTS synchronizes the data of temporary tables generated by online DDL operations.
      Note If online DDL operations generate a large amount of data, the data synchronization task may be delayed.
    • No: DTS does not synchronize the data of temporary tables generated by online DDL operations. Only the original DDL data of the source database is synchronized.
      Note If you select No, the tables in the destination database may be locked.
    Retry Time for Failed Connections
    By default, if DTS fails to connect to the source or destination database, DTS retries within the next 720 minutes (12 hours). You can specify the retry time based on your needs. If DTS reconnects to the source and destination databases within the specified time, DTS resumes the data synchronization task. Otherwise, the data synchronization task fails.
    Note When DTS retries a connection, you are charged for the DTS instance. We recommend that you specify the retry time based on your business needs. You can also release the DTS instance at your earliest opportunity after the source and destination instances are released.
  10. In the lower-right corner of the page, click Precheck.
    Note
    • Before you can start the data synchronization task, DTS performs a precheck. You can start the data synchronization task only after the task passes the precheck.
    • If the task fails to pass the precheck, click the info icon next to each failed item to view details.
      • After you troubleshoot the issues based on the causes, run a precheck again.
      • If you do not need to troubleshoot the issues, ignore failed items and run a precheck again.
  11. Close the Precheck dialog box after the following message is displayed: Precheck Passed. Then, the data synchronization task starts.
  12. Wait until initial synchronization is complete and the data synchronization task enters the Synchronizing state.
    You can view the status of the data synchronization task on the Synchronization Tasks page.

Schema of an incremental data table

Note You must run the set odps.sql.allow.fullscan=true; command in MaxCompute to allow full table scan for the MaxCompute project.

DTS synchronizes incremental data that is generated in the source MySQL database to the incremental data table in MaxCompute. The incremental data table stores incremental data and specific metadata. The following figure shows the schema of an incremental data table.

Schema of an incremental data table
Note In the example, the modifytime_year, modifytime_month, modifytime_day, modifytime_hour, and modifytime_minute fields form the partition key. These fields are specified in the Select the synchronization policy and the objects to be synchronized step.

An incremental data table contains the following fields:

  • record_id: The unique ID of the incremental log entry.
    Note
    • The ID auto-increments for each new log entry.
    • If an UPDATE operation is performed, DTS generates two incremental log entries to record the pre-update and post-update values. The two log entries have the same record ID.
  • operation_flag: The type of the operation. Valid values:
    • I: an INSERT operation
    • D: a DELETE operation
    • U: an UPDATE operation
  • utc_timestamp: The operation timestamp, in UTC. It is also the timestamp of the binary log file.
  • before_flag: Indicates whether the column values are pre-update values. Valid values: Y and N.
  • after_flag: Indicates whether the column values are post-update values. Valid values: Y and N.
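For example, the following query sketch summarizes the entries of the hypothetical customer_log incremental data table by operation type. The partition values in the WHERE clause are assumptions for illustration; replace them with modifytime_* partition values that exist in your project.

    -- Allow full table scans for the MaxCompute project.
    set odps.sql.allow.fullscan=true;
    -- Count incremental log entries by operation type (I, U, or D) in one example partition.
    SELECT operation_flag, COUNT(*) AS cnt
      FROM customer_log
     WHERE modifytime_year = '2019' AND modifytime_month = '08'
     GROUP BY operation_flag;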

Additional information about the before_flag and after_flag fields

The values of the before_flag and after_flag fields in an incremental log entry depend on the operation type.

  • INSERT

    For an INSERT operation, the column values are the newly inserted record values (post-update values). The value of the before_flag field is N and the value of the after_flag field is Y.

    Example of an INSERT operation
  • UPDATE

    DTS generates two incremental log entries for an UPDATE operation. The two incremental log entries have the same values for the record_id, operation_flag, and utc_timestamp fields.

    The first log entry records the pre-update values, so the value of the before_flag field is Y and the value of the after_flag field is N. The second log entry records the post-update values, so the value of the before_flag field is N and the value of the after_flag field is Y.

    Example of an UPDATE operation
  • DELETE

    For a DELETE operation, the column values are the deleted record values (pre-update values). The value of the before_flag field is Y and the value of the after_flag field is N.

    Example of a DELETE operation
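As an illustration, the following sketch retrieves both incremental log entries that DTS generates for each UPDATE operation from the hypothetical customer_log table, so that pre-update and post-update values can be compared side by side. The id and address columns come from the customer example used later in this topic.

    set odps.sql.allow.fullscan=true;
    -- Each UPDATE produces two entries with the same record_id: one with before_flag='Y'
    -- (pre-update values) and one with after_flag='Y' (post-update values).
    SELECT record_id, operation_flag, before_flag, after_flag, id, address
      FROM customer_log
     WHERE operation_flag = 'U'
     ORDER BY record_id, after_flag
     LIMIT 100;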

Merge a full baseline table and an incremental data table

After a data synchronization task is started, DTS creates a full baseline table and an incremental data table in MaxCompute. You can use SQL statements to merge the two tables. This allows you to obtain the full data at a specific time point.

This section describes how to merge data for a table named customer. The following figure shows the schema of the customer table.

Schema of the customer table
  1. Create a table in MaxCompute based on the schema of the source table. The table is used to store the merged data.

    For example, to obtain the full data of the customer table as of the timestamp 1565944878, execute the following SQL statement to create the table that stores the merged data:

    CREATE TABLE `customer_1565944878` (
        `id` bigint NULL,
        `register_time` datetime NULL,
        `address` string);
    Note
  2. Run the following SQL statements in MaxCompute to merge the full baseline table and the incremental data table and obtain full data at a specific time point:
    set odps.sql.allow.fullscan=true;
    insert overwrite table <result_storage_table>
    select <col1>,
           <col2>,
           <colN>
      from(
    select row_number() over(partition by t.<primary_key_column>
     order by record_id desc, after_flag desc) as row_number, record_id, operation_flag, after_flag, <col1>, <col2>, <colN>
      from(
    select incr.record_id, incr.operation_flag, incr.after_flag, incr.<col1>, incr.<col2>, incr.<colN>
      from <table_log> incr
     where utc_timestamp < <timestamp>
     union all
    select 0 as record_id, 'I' as operation_flag, 'Y' as after_flag, base.<col1>, base.<col2>, base.<colN>
      from <table_base> base) t) gt
     where gt.row_number = 1
       and gt.after_flag = 'Y';
    Note
    • <result_storage_table>: the name of the table that stores the merged data.
    • <col1>/<col2>/<colN>: the names of the columns in the table to be merged.
    • <primary_key_column>: the name of the primary key column in the table to be merged.
    • <table_log>: the name of the incremental data table.
    • <table_base>: the name of the full baseline table.
    • <timestamp>: the UNIX timestamp of the point in time as of which you want to obtain the full data.

    Run the following SQL statements to obtain the full data of the customer table as of the timestamp 1565944878:

    set odps.sql.allow.fullscan=true;
    insert overwrite table customer_1565944878
    select id,
           register_time,
           address
      from(
    select row_number() over(partition by t.id
     order by record_id desc, after_flag desc) as row_number, record_id, operation_flag, after_flag, id, register_time, address
      from(
    select incr.record_id, incr.operation_flag, incr.after_flag, incr.id, incr.register_time, incr.address
      from customer_log incr
     where utc_timestamp < 1565944878
     union all
    select 0 as record_id, 'I' as operation_flag, 'Y' as after_flag, base.id, base.register_time, base.address
      from customer_base base) t) gt
     where gt.row_number = 1
       and gt.after_flag = 'Y';
  3. Query the merged data from the customer_1565944878 table.
    Query the merged data
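    For example, a query sketch such as the following previews the merged result after the insert overwrite statement in the previous step completes. The column names come from the customer example:

    -- Preview the merged full data of the customer table as of the timestamp 1565944878.
    SELECT id, register_time, address
      FROM customer_1565944878
     ORDER BY id
     LIMIT 20;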