edit-icon download-icon

Data increase synchronization

Last Updated: Apr 04, 2018

The two types of data to be synchronized

Based on whether the data is changed after being written, the data to be synchronized is classified as unchanged data (generally log data) and changed data (such as the personnel table where the personnel status may change).

Example

You must specify different synchronization policies for each data. The following example shows how to synchronize the data of the RDS database to MaxCompute, which also applies to other data sources.

According to the idempotence (multiple operations of tasks produce the same result. In this way, the task supports re-running scheduling and can easily clear dirty data when an error occurs), data is imported to a separate table or partition, or directly overwrites the historical data in the existing table or partition.

In the example, the task test date is 11/14/2016, full synchronization is performed on the same day, and historical data is synchronized to the partition where ds=20161113. For the incremental synchronization scenario in this example, automatic scheduling is configured to synchronize the incremental data to the partition where ds=20161114 on November 15, 2016. The time field optime indicating the modified time of the data is available, which is used to determine whether the data is incremental or not.

Incremental synchronization of unchanged data

This scenario allows you to partition easily based on the data generation pattern because the data remains unchanged after being generated. Typically, you can partition by date, such as creating one partition on a daily basis.

Data preparation

  1. drop table if exists oplog;
  2. create table if not exists oplog(
  3. optime DATETIME,
  4. uname varchar(50),
  5. action varchar(50),
  6. status varchar(10)
  7. );
  8. Insert into oplog values(str_to_date('2016-11-11','%Y-%m-%d'),'LiLei','SELECT','SUCCESS');
  9. Insert into oplog values(str_to_date('2016-11-12','%Y-%m-%d'),'HanMM','DESC','SUCCESS');

The two data entries as the historical data are available. Perform full data synchronization first to synchronize the historical data to the partition created yesterday.

Procedure

  1. Create a MaxCompute table.

    1. create table if not exists ods_oplog(
    2. optime datetime,
    3. uname string,
    4. action string,
    5. status string
    6. ) partitioned by (ds string);
  2. Configure a task to synchronize the historical data.

    Given that the task is performed only once, only one test is required. After the test is complete, change the status of the task to Paused (in the rightmost scheduling configuration) and submit and release the task again in the “Data Development” module to prevent the task from being scheduled automatically.

  3. Write more data to the RDS source table as the incremental data.

    1. insert into oplog values(CURRENT_DATE,'Jim','Update','SUCCESS');
    2. insert into oplog values(CURRENT_DATE,'Kate','Delete','Failed');
    3. insert into oplog values(CURRENT_DATE,'Lily','Drop','Failed');
  4. Configure a task to synchronize the incremental data.

    Note: If you configure the “Data Filtering”, all the data added to the source table on November 14 is retrieved and synchronized to the incremental partition in the target table during the synchronization on the early morning the next day, which is November 15.

    If you set the task scheduling cycle as daily scheduling, the task is scheduled automatically the next day after the task is submitted and released, and the data in the MaxCompute target table is changed as follows once the task runs successfully.

Incremental synchronization of changed data

For data in personnel or order tables that is subject to changes, full data synchronization on a daily basis is recommended based on the time variant collection feature of the data warehouse. In other words, you store full data on a daily basis. In this way, both historical and current data can be retrieved easily.

In actual scenarios, daily incremental synchronization may be required. Because MaxCompute does not support changing data with the Update statement, you must take other measures to implement the synchronization. The following describes how to implement full and incremental synchronization.

Data preparation

  1. drop table if exists user ;
  2. create table if not exists user(
  3. uid int,
  4. uname varchar(50),
  5. deptno int,
  6. gender VARCHAR(1),
  7. optime DATETIME
  8. );
  9. -- Historical data
  10. insert into user values (1,'LiLei',100,'M',str_to_date('2016-11-13','%Y-%m-%d'));
  11. insert into user values (2,'HanMM',null,'F',str_to_date('2016-11-13','%Y-%m-%d'));
  12. insert into user values (3,'Jim',102,'M',str_to_date('2016-11-12','%Y-%m-%d'));
  13. insert into user values (4,'Kate',103,'F',str_to_date('2016-11-12','%Y-%m-%d'));
  14. insert into user values (5,'Lily',104,'F',str_to_date('2016-11-11','%Y-%m-%d'));
  15. -- Incremental data
  16. update user set deptno=101,optime=CURRENT_TIME where uid = 2; -- Change null to non-null
  17. update user set deptno=104,optime=CURRENT_TIME where uid = 3; -- Change non-null to non-null
  18. update user set deptno=104,optime=CURRENT_TIME where uid = 4; -- Change non-null to null
  19. delete from user where uid = 5;
  20. insert into user(uid,uname,deptno,gender,optime) values (6,'Lucy',105,'F',CURRENT_TIME);

Daily full synchronization

Daily full synchronization is relatively simple.

Procedure

  1. Create a MaxCompute target table.

    1. -- Full synchronization
    2. create table ods_user_full(
    3. uid bigint,
    4. uname string,
    5. deptno bigint,
    6. gender string,
    7. optime DATETIME
    8. ) partitioned by (ds string);
  2. Configure a full synchronization task.

    Note: Set the scheduling cycle of the task as daily scheduling because daily full synchronization is required.

  3. Test the task and view the synchronized MaxCompute target table.

    Because full synchronization is performed on a daily basis and no incremental synchronization is performed in this case, you can see the following data results after the task is automatically scheduled on the next day:

    To query the data results, set where ds = 20161114 to retrieve the full data.

Daily incremental synchronization

This mode is not recommended except in specific scenarios. Because the delete statement is not supported in specific scenarios, deleted data cannot be retrieved by filtering conditions of SQL statements. Generally, enterprises’ codes are deleted logically, in which case the update statement is applied instead of the delete statement. Now that there are some inapplicable scenarios, using this sync method might cause data inconsistency when some special condition is encountered. Another drawback is that you must merge new data and historical data after the synchronization. See the following for specific steps.

Data preparation

Create two tables, one of which is for writing latest data and the other is for writing incremental data.

  1. -- Result table
  2. create table dw_user_inc(
  3. uid bigint,
  4. uname string,
  5. deptno bigint,
  6. gender string,
  7. optime DATETIME
  8. );
  9. -- Incremental record
  10. create table ods_user_inc(
  11. uid bigint,
  12. uname string,
  13. deptno bigint,
  14. gender string,
  15. optime DATETIME
  16. )

Procedure

  1. Configure a task to write full data directly to the result table.

    Note: Run this task only once and set the task as Paused in the Data Development module after the task runs successfully.

  2. Configure a task to write incremental data to the incremental record.

  3. Merge the data.

    1. insert overwrite table dw_user_inc
    2. select
    3. case when b.uid is not null then b.uid else a.uid end as uid,
    4. case when b.uid is not null then b.uname else a.uname end as uname,
    5. case when b.uid is not null then b.deptno else a.deptno end as deptno,
    6. case when b.uid is not null then b.gender else a.gender end as gender,
    7. case when b.uid is not null then b.optime else a.optime end as optime
    8. from
    9. dw_user_inc a
    10. full outer join ods_user_inc b
    11. on a.uid = b.uid ;

    The daily incremental synchronization is different from the daily full synchronization in that the daily incremental synchronization synchronize only a small amount of incremental data, but with the risk of data inconsistency, and requires extra computing workload for data merging. Unless necessary, use the daily full synchronization for changed data. In addition, you can set a Lifecycle for the historical data, which is deleted automatically after a certain period.

Thank you! We've received your feedback.