Create a data sync job

Last Updated: Dec 12, 2017

Generally in DataWorks, the data integration function is used to periodically import business data generated in your system to the workspace and periodically export the flow computing results to the data source you specify for further display or operation.

DataWorsStep3

Currently, data from the following data sources can be imported to or exported from the workspace through the data integration function: RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, DRDS, OSS, Oracle, FTP, dm, HDFS, and MongoDB. For details about supported data source types, see Supported data source types.

This section uses MySQL as an example to show how to export data in Data IDE to MySQL through the data integration function.

Prerequisites

If your database is a self-built database on ECS or a RDS/MongoDB data source, you must add the data synchronization machine IP address whitelist to your ECS security group or RDS/MongoDB whitelist. For details, see Add whitelist and security group.

NOTE:

If you use a custom resource group to schedule RDS data synchronization tasks, you must add the machine IP address of the custom resource group to the RDS whitelist.

Instruction

Add a data source

NOTE:

Only the project administrator can create a data source. Other roles can only view the data source.

  1. Log on to the Data IDE console as an administrator and click Enter Workspace in the operations column of the relevant project in the Project List.

  2. Click Data Integration in the top menu, and click Data Source in the left-side navigation pane.

  3. Click New Data Source in the upper-right corner, as shown in the following figure:

    AddDataSource

  4. Enter the configuration items in the create data source dialog box, as shown in the following figure:

    18

    • Data source name: The name must contain letters, numbers, and underlines, but cannot begin with a number or underline, for example, abc_123.
    • Data source description: The description cannot exceed 80 characters.

    • Date source type: You can select the data source type based on actual needs. Make sure that the data source you select contains tables.

      • Classic network: The cloud products are centrally deployed in Alibaba Cloud’s public infrastructure network. Alibaba Cloud is responsible for network planning and management. The classic network is more suitable for customers that require a high level of ease-of-use.

      • VPC: Virtual Private Cloud (VPC) is an isolated network environment based on Alibaba Cloud. You have full control over your own VPC, including choosing your preferred IP address range, network segment, route table, and gateway. For public network access, select the classic network type. Note the public network bandwidth speed and related network data traffic charges. Public network is recommended only for special situations.

    • Network type: You can set the network type according to actual needs.

    • JDBC URL:jdbc:mysql://host:port/database

    • Username/Password: These are the username and password used to connect to the database.

For configurations of different types of data sources, see Data source configuration.

  1. Click Test Connectivity.

  2. If the connectivity test is successful, click Save.

If the connectivity test fails, see Connection to self-built database on ECS fails or Connection to ApsaraDB for RDS data source fails based on the actual situation.

Ensure that the target MySQL database contains tables.

Create the table odps_result in the MySQL database. The statements used for table creation are as follows:

  1. CREATE TABLE `ODPS_RESULT` (
  2. `education` varchar(255) NULL ,
  3. `num` int(10) NULL
  4. )

After the table is created, you can run ‘desc odps_result;’ to view the table details.

Create and configure a synchronization node

This section shows how to create and configure the synchronization node “write_result”, and write data from result_table to the MySQL database. The specific steps are as follows:

  1. Create the node write_result, as shown in the following figure:

    19

    20

  2. Select the source.

    Select the MaxCompute data source and the source table “result_table” and click Next, as shown in the following figure:

    21

  3. Select the target.

    Select the MySQL data source and the target table “odps_result” and click Next, as shown in the following figure:

    22

  4. Map the fields.

    Select the mapping between fields. You must configure the field mapping relationships. The Source Table Fields on the left correspond one to one with the Target Table Fields on the right.

    FieldMapping

  5. Control the channel.

    Click Next to configure the maximum job rate and dirty data check rules, as shown in the following figure:

    24

  6. Preview and store.

    After configuration, you can scroll up or down to view the task configurations. If there are no errors, click Save, as shown in the following figure:

    25

Submit a data synchronization task

After saving a synchronization task, click Submit on the right to submit the synchronization task to the scheduling system. The scheduling system automatically and periodically runs the task from the second day according to the configuration attributes.

Subsequent steps

Now, you know how to create a synchronization task and export data to data sources of different types. Continue to the next tutorial for further study. This tutorial shows you how to set the scheduling attribute and dependency for a synchronization task. For details, see Set task scheduling attribute and dependency.

Thank you! We've received your feedback.