edit-icon download-icon

Data synchronization task

Last Updated: Apr 25, 2018

Currently, data synchronization task supports the following data source types: MaxCompute, RDS (MySQL, SQL Server, PostgreSQL), Oracle, FTP, AnalyticDB, OSS, and DRDS.

DataSyncTask

This article describes how to perform a data synchronization task by taking RDS data synchronization to MaxCompute as an example.

Procedure

Create a data table

For more information about how to create a MaxCompute table, see Table creation.

Create a data source

Note:
Only the project administrator can create data sources.

If the RDS data source fails the connectivity test, add a whitelist of data synchronization server IP addresses onto your own RDS.

11.192.97.82,11.192.98.76,10.152.69.0/24,10.153.136.0/24,10.143.32.0/24,120.27.160.26,10.46.67.156,120.27.160.81,10.46.64.81,121.43.110.160,10.117.39.238,121.43.112.137,10.117.28.203,118.178.84.74,10.27.63.41,118.178.56.228,10.27.63.60,118.178.59.233,10.27.63.38,118.178.142.154,10.27.63.15,100.64.0.0/8

Note:
To use a custom resource group to schedule the RDS data synchronization task, add the server IP address of the custom resource group to the RDS whitelist.

  1. Log on to Alibaba Cloud account.

  2. Go to DTplus platform.

  3. Select DataWorks > Management Console as a developer. Click Enter Project to enter project management page.

  4. Click Data Integration from the upper menu and navigate to the Data Sources page.

  5. Click New Source in the upper-right corner.

  6. Complete the following configurations in the New Source dialog box.

    MySQLSource

    • Type: The currently selected data source type (RDS > MySQL > RDS instance type).

    • Name: A data source name consists of letters, numbers, and underscores () and must start with a letter or an underscore (). The length of the name cannot exceed 30 characters.

    • Description: A brief description of the data source, which cannot exceed 1024 characters.

    • Examples of RDS ID: The ID of the RDS instance of the MySQL data source.

    • Examples of RDS buyers ID: The purchaser ID of the RDS instance of the MySQL data source.

      If you select the JDBC format to configure the data source, the JDBC connection information is in the following format: jdbc:mysql://IP:Port/database.

    • Database name: The database name of the data source.

    • Username/Password: The username and password of the database.

  7. Click Test connectivity.

  8. If the connectivity test is successful, click Complete.

    For the configurations of other types of data sources (MaxCompute, RDS, Oracle, FTP, AnalyticDB, OSS, and DRDS), see the articles under Configure Data Sources.

Create a task

  1. On the Data Development page, select New > Create Task from the Tool bar.

  2. Complete the configurations in the Create Task dialog box.

    Here, we take the creation of a node task as an example. If the node requires daily automatic scheduling, select Periodic Scheduling as the scheduling type and configure the scheduling period in node properties.

    CreateTask

  3. Click Create.

Configure the data synchronization task

The configurations of the task synchronization node include Choose Source, Select Target, Field Mapping, and Channel Control.

  1. Select a source

    Select a Data Source and a Table.

    EnterSourceInfo

    • Data filtering: Enter a WHERE filtering statement (no need to enter the WHERE keywords) based on the corresponding SQL syntax. This filtering condition can be used as an incremental synchronization criterion.

Note:
The WHERE condition is used for source data filtering. The specified column, table, and WHERE conditions are concatenated to create an SQL command for data extraction. The WHERE condition is applicable to both full and incremental synchronization. Specific description is as follows:

  • Full synchronization: The first data import is normally a full import and no WHERE condition is required.

  • Incremental synchronization: In actual business scenarios, incremental import typically synchronizes the data generated on the current day. Generally, before writing a WHERE condition statement, you must determine the field that describes the increment (timestamp) in the table. For example, if the field that describes the increment is create_time in Table A, write create_time>${yesterday} in the WHERE condition and assign a value to the parameter in parameter configurations. For the usage of more built-in parameters, see System scheduling parameters.

If the data synchronization task is an RDS, Oracle, or MaxCompute task, the splitting key configuration is displayed on the page.

  • Splitting key: Supports only integer fields. When data is being read, the data is split based on the configured fields to achieve concurrent reading, which improves the data synchronization efficiency. The splitting key configurations are displayed only when the synchronization task is importing RDS or Oracle data to MaxCompute.

    Note:
    If the source is a MySQL data source, the data synchronization task also supports database- and table-based data import (on the premise that the table structure is consistent, whether the data is stored in the same database or not).

    Database- and table-based data import can be used in the following scenarios:

    • Multiple tables in the same database: Click Search Table and add the tables you want to synchronize.

    • Multiple tables in different databases: Click Add to select a source database, and then click Search Table to add tables.

  1. Select a target, and then click Next.

    • Zoning information: Partition helps you easily search for the special columns introduced by some data. By specifying the partition, you can quickly locate the expected data. In this case, the custom parameter uses the date of the previous day as the value of the partition. Partition values support both constants and variables. For more custom parameters, see Parameter configurations.

    • Clearing rules:

      • Clear existing data before writing: Before the data import, all the data in the table or partition must be cleared, which is equivalent to Insert Overwrite.

      • Keep existing data before writing: No data is cleared before the data import and new data is always appended with each run, which is equivalent to Insert Into.

    Assign values to parameters in the parameter configurations.

  2. Field mapping

    Configure a one-to-one mapping between fields. The Source Table Field on the left maps to the Target Table Field on the right.

    FieldMapping

    Add/Delete: Hover the mouse over a line and click Delete to delete the line. Click To Add a Line to add a field. If the database type is MaxCompute, use the partition column name as the value of the added line. By doing this, the partition column is also synchronized during data synchronization.

    Note:

    • The writing method for custom variables and constants is as follows.

    • To import a constant or variable to a field of a MaxCompute table, click Insert and enter the value of the constant or variable embraced with a pair of single quotation marks. For example, the value assigned to variable ${yesterday} in the parameter configurations is yesterday=$[yyyymmdd]. For more information about time parameters, see System scheduling parameters.

  3. Channel control

    Channel control is used to configure the maximum job rate and dirty data check rules.

    ChannelControl

    • DMU: Configure the current data synchronization task rate. The maximum supported rate is 20 MB/s. The channel traffic measured value is the measured value of the data synchronization task and does not represent the actual NIC traffic.

    • Number of concurrent jobs: The number of concurrent jobs is valid only after the splitting key is configured. The maximum number of concurrent jobs is subject to the maximum job rate. For example, you can select 10 at maximum as the number of concurrent jobs if the maximum job rate is 10 MB.

    • Incorrectly recorded more than: This number is the amount of dirty data and can be used when writing data to RDS and Oracle. The data synchronization task will be terminated when the number of error records exceeds the number configured here.

    Note:
    A large number of concurrent jobs is not recommended to be set. The larger the number of concurrent jobs is set, the more resources are consumed. This is likely to cause other tasks to wait for resources and affect the proper running of other tasks.

  4. Preview and save

    After the preceding configurations, click Next to preview. Then, click Save if the configurations are correct.

Submit the data synchronization task and test the workflow

  1. Click Submit from the upper menu.

  2. Click Test Run after the task is submitted.

For example, if a date 2017-01-04 exists as the createtime in the source table. The scheduling time parameters used in the configurations are $[yyyy-mm-dd-1] and ${bdp.system.bizdate}. Therefore, to assign 2017-01-04 to the createtime value and 20170104 to the partition value of the target table in the test, you must select 2017-01-04 as the business date.

  1. After the test task is triggered, click Go to O&M Center to view the task progress.

  2. You can view the synchronized data by using the following command: read <my_region>.

Thank you! We've received your feedback.