Create a data sync job

Last Updated: Oct 13, 2017

Currently data source types supported by the data synchronization jobs include: MaxCompute, RDS (MySQL, SQL Server, PostgreSQL), Oracle, FTP, ADS, OSS, OCS, and DRDS.

Taking data synchronization from RDS to MaxCompute as an example, the detailed steps are described below:

Step 1: Create a data table

For details on how to create a MaxCompute table, see Create a Table.

Step 2: Create a data source

Note: Only users with the project administrator role can create a new data source.

Preparation

Currently, only the China East 1 (Hangzhou) region is supported as an RDS data source; the Beijing region is not yet supported. In addition, if an RDS data source in the Hangzhou region cannot be connected to during the connectivity test, add the following data synchronization server IP addresses to your RDS whitelist:

10.152.69.0/24,10.153.136.0/24,10.143.32.0/24,120.27.160.26,10.46.67.156,120.27.160.81,10.46.64.81,121.43.110.160,10.117.39.238,121.43.112.137,10.117.28.203,118.178.84.74,10.27.63.41,118.178.56.228,10.27.63.60,118.178.59.233,10.27.63.38,118.178.142.154,10.27.63.15
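To check whether a given address is already covered by the whitelist entries above, you can use Python's standard ipaddress module. This is a quick local sanity check, not part of the RDS configuration itself; the sample list below copies only a few entries from the whitelist above.

```python
import ipaddress

# A few of the whitelist entries listed above. Entries may be either
# single addresses or CIDR blocks, so normalize everything to networks.
whitelist = ["10.152.69.0/24", "10.153.136.0/24", "120.27.160.26"]
networks = [ipaddress.ip_network(entry) for entry in whitelist]

def is_whitelisted(ip: str) -> bool:
    """Return True if the address falls inside any whitelist entry."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

print(is_whitelisted("10.152.69.42"))   # inside the 10.152.69.0/24 block
print(is_whitelisted("192.168.0.1"))    # not in the whitelist
```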

The specific steps are as follows:

  1. Go to Alibaba Cloud Dataplus platform > Data IDE Kit > Console as a developer, and click Enter Work Zone in the action bar of the corresponding project.

  2. Click Manage Projects in the top menu bar, and then click Manage Data Sources in the left navigation bar.

  3. Click New Data Source.

  4. Fill in the configuration items in the “New Data Source” pop-up box.

    Specific descriptions of the configuration items are as follows:

    • Data source name: A data source name may consist of letters, numbers, and underscores. It must begin with a letter or an underscore and cannot exceed 30 characters in length.
    • Data source description: A brief description of the data source, not exceeding 1,024 characters in length.
    • Data source type: The type of data source currently selected (in this example, RDS > MySQL).
    • RDS instance ID: The ID of the MySQL data source RDS instance.
    • RDS instance purchaser ID: The purchaser ID of the MySQL data source RDS instance.

    Note: If you select the JDBC form to configure the data source, the JDBC connection information is in the format: jdbc:mysql://IP:Port/database.

    • Database name: The database name of the data source.
    • User name/password: The user name and password of the database.
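The JDBC connection string format mentioned in the note above can be sketched as a small helper. The host, port, and database name used here are placeholders for illustration, not real endpoints.

```python
def mysql_jdbc_url(host: str, port: int, database: str) -> str:
    """Build a JDBC connection string in the jdbc:mysql://IP:Port/database form."""
    return f"jdbc:mysql://{host}:{port}/{database}"

# Placeholder values; substitute your own RDS endpoint and database name.
url = mysql_jdbc_url("192.168.0.10", 3306, "sales_db")
print(url)  # jdbc:mysql://192.168.0.10:3306/sales_db
```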
  5. Click Test Connectivity.

  6. If the connectivity test succeeds, click Save to save the configuration information.

For detailed configurations of other types of data sources (MaxCompute, RDS, Oracle, FTP, ADS, OSS, OCS, and DRDS), see Data Source Configuration.

Step 3: Create a new job

Take creating a job in “wizard mode” as an example.

  1. In the “Data Integration” interface, click Sync Tasks in the left navigation bar.
  2. Click “Wizard Mode” in the interface to go to the job configuration page.

Step 4: Configure the data synchronization job

The synchronization job node includes five configuration steps: “Select Data Source”, “Select Data Target”, “Field Mapping”, “Channel Control”, and “Preview & Save”.

  1. Select the data source

    Select the data source (the one created in Step 2), and then select the data table.

    • Extraction Filtering: You can specify a WHERE filter using the corresponding SQL syntax (do not include the WHERE keyword itself). The WHERE filter can serve as the condition for incremental synchronization.

      The WHERE filter is used to filter the source data. The specified columns, table, and WHERE filter are concatenated into the SQL statement used for data extraction. The WHERE filter can be used for both full synchronization and incremental synchronization. Specific descriptions are as follows:

      • Full synchronization:
        Full synchronization is usually performed when data is imported for the first time, so you do not need to configure a WHERE filter. During tests, you can restrict the filter to a small data range (for example, 10 records) to avoid extracting a large volume of data.
      • Incremental synchronization:
        In actual service scenarios, incremental synchronization usually synchronizes only the data generated on the current day. Before writing the WHERE filter, first identify the field in the table that describes the increment, usually a timestamp. For example, if the incremental field in Table A is “create_time”, write “create_time>$yesterday” in the WHERE filter and assign a value to the parameter in the parameter configuration.
    • Splitting key: If the data source is RDS, Oracle, or MaxCompute, the splitting key configuration is displayed on the page. Only integer fields are supported. During data reading, the data is split based on the configured field to achieve concurrent reads, improving data synchronization efficiency. The splitting key configuration item is displayed only when the synchronization job imports RDS or Oracle data into MaxCompute.

    If the source is a MySQL data source, the data synchronization job also supports importing data from multiple databases and tables (provided the table structures are consistent, whether the tables are in the same database or in different databases).

    Database- and table-based data importing supports the following scenarios:

    • Multiple tables in the same database: Click “Search Table” to search for the tables and add the tables you want to synchronize.

    • Multiple tables in different databases: First click “Add” to select the source database, and then click “Search Table” to add the tables.
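The way the columns, table, and WHERE filter combine into one extraction statement can be sketched as follows. The table and field names here are illustrative placeholders, not values from a real configuration.

```python
def build_extraction_sql(columns, table, where_filter=None):
    """Concatenate the columns, table, and an optional WHERE filter
    (written without the WHERE keyword, as in the job configuration)
    into the SQL statement used for data extraction."""
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where_filter:
        sql += f" WHERE {where_filter}"
    return sql

# Full synchronization: no WHERE filter is configured.
print(build_extraction_sql(["id", "name"], "table_a"))

# Incremental synchronization: filter on the timestamp-like field;
# ${yesterday} would be resolved by the parameter configuration at run time.
print(build_extraction_sql(["id", "name"], "table_a",
                           "create_time > '${yesterday}'"))
```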

  2. Select the data target

    Click “Create Table Quickly” to convert the table creation statement of the source table into a DDL statement conforming to MaxCompute SQL syntax and create the target table. After making the necessary selections, click “Next”.

    • Partition information: Partitions are special columns that make it easier to locate specific data. By specifying a partition, you can quickly find the desired data. Both constant partitions and variable partitions are supported.

    • Clearing rules:

      • Clear existing data before writing: All existing data in the table or partition is cleared before the import, which is equivalent to “INSERT OVERWRITE”.

      • Keep existing data before writing: No data is cleared before the import; new data is appended on each run, which is equivalent to “INSERT INTO”.
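The mapping between the two clearing rules and their equivalent MaxCompute write statements can be sketched as below. The table name and partition spec are illustrative, and the statements are shown only as template strings.

```python
def write_statement(clear_before_write: bool, table: str, partition: str) -> str:
    """Map the clearing rule to the equivalent MaxCompute write statement."""
    if clear_before_write:
        # "Clear existing data before writing" replaces the partition's data.
        return f"INSERT OVERWRITE TABLE {table} PARTITION ({partition}) ..."
    # "Keep existing data before writing" appends to the partition on each run.
    return f"INSERT INTO TABLE {table} PARTITION ({partition}) ..."

print(write_statement(True, "target_table", "pt='20170104'"))
print(write_statement(False, "target_table", "pt='20170104'"))
```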

    Assign values to the parameters in the parameter configuration.

  3. Field Mapping

    You need to configure the field mapping relationships. The “Source Table Fields” on the left correspond one to one with the “Target Table Fields” on the right.

    • Add/Delete: Click “Add a Line” to add a single field. To delete a field, hover the cursor over its line and click the “Delete” icon.

    Writing method for user-defined variables and constants: To import a constant or variable into a field of the MaxCompute table, click the “Insert” button and enter the value of the constant or variable enclosed in single quotation marks, for example ‘${yesterday}’. You can then assign a value to the variable in the parameter configuration component, such as yesterday=$[yyyymmdd].
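Assuming the common convention that $[yyyymmdd] resolves to the day before the scheduled run date, the value a ${yesterday} variable receives can be sketched like this:

```python
from datetime import date, timedelta

def resolve_yesterday(run_date: date) -> str:
    """Resolve a ${yesterday} variable assigned yesterday=$[yyyymmdd]:
    assumed here to be the day before the run date, formatted as yyyymmdd."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")

# A job scheduled on 2017-01-05 would see yesterday resolved to 20170104.
print(resolve_yesterday(date(2017, 1, 5)))
```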

  4. Channel Control

    The Channel Control is used to configure the maximum speed of the job and the dirty data check rules:

    • The maximum job speed is the upper speed limit of the current data synchronization job, with a maximum supported value of 10 MB/s. (The channel traffic is a value measured by the data synchronization job itself and does not represent the actual traffic of the network interface card.)

    Dirty data check rules (available for writing data to RDS and Oracle):

    • When the number of error records (that is, the volume of dirty data) exceeds the configured limit, the data synchronization job ends.
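The dirty-data check rule described above can be sketched as a simple counting loop: failed records increment a dirty counter, and once the counter exceeds the configured limit the job stops. The record format and exception name here are hypothetical, for illustration only.

```python
class DirtyDataLimitExceeded(Exception):
    """Raised when the dirty-record count exceeds the configured limit."""

def write_records(records, max_errors):
    """Write records, counting failures as dirty data; stop the job once
    the dirty-record count exceeds max_errors (the configured limit)."""
    written, dirty = 0, 0
    for rec in records:
        if rec.get("valid", True):
            written += 1
        else:
            dirty += 1
            if dirty > max_errors:
                raise DirtyDataLimitExceeded(
                    f"{dirty} dirty records exceed the limit of {max_errors}")
    return written, dirty

# One dirty record is tolerated when the limit is 1.
ok, bad = write_records([{"valid": True}, {"valid": False}], max_errors=1)
print(ok, bad)  # 1 1
```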
  5. Preview & Save

    After completing the above configuration, click “Next” to preview the job. If it is correct, click “Save”.

Step 5: Submit the data synchronization job and test the workflow

  1. Click Submit in the top menu bar to submit the job.

  2. After the job is submitted successfully, click Test Run.

    Because some createtime values in the source table in this example are 2017-01-04, and the scheduling time parameters used in the configuration are $[yyyy-mm-dd-1] and ${bdp.system.bizdate}, set the partition value of the target table to 20170104 so that the createtime value 2017-01-04 is covered by the test. Select 2017-01-04 as the business date for the test run.

    After the test task is triggered successfully, you can click “Go to O&M Center” to view the task progress.

  3. View the synchronized data.
