DataWorks:Create a synchronization task

Last Updated:Feb 28, 2024

This topic describes how to create a synchronization task to export data from MaxCompute to a MySQL data source.

Prerequisites

  • An ApsaraDB RDS for MySQL instance is created, and the ID of the instance is obtained. A whitelist is configured for the instance in the ApsaraDB RDS console. For more information, see Create an ApsaraDB RDS for MySQL instance.

    Note

    If you use a custom resource group to run the synchronization task, you must add the IP addresses of the servers in the custom resource group to the whitelist of the ApsaraDB RDS for MySQL instance.

  • A table named odps_result is created in the ApsaraDB RDS for MySQL database to which you want to synchronize data. You can create the table by executing the following statement:

    CREATE TABLE `odps_result` (
      `education` varchar(255) NULL,
      `num` int(10) NULL
    );

    After the table is created, you can execute the desc odps_result; statement to view the table details.

  • A result table named result_table is prepared. For more information, see Create tables and upload data.

  • A zero load node named start and an ODPS SQL node named insert_data are created. For more information, see Create a workflow.
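The DDL in the prerequisites can be sanity-checked locally before you add the data source. The following sketch uses SQLite as a stand-in for MySQL (an assumption for illustration only: `int(10)` is written as plain `INT`, and SQLite reports declared types rather than running a real `desc`), so it validates only the table shape, not MySQL-specific behavior.

```python
import sqlite3

# In-memory stand-in for the odps_result table. SQLite accepts MySQL-style
# DDL loosely; VARCHAR(255)/INT map onto TEXT/INTEGER column affinities.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE odps_result (
        education VARCHAR(255) NULL,
        num       INT NULL
    )
    """
)

# Rough equivalent of `desc odps_result;`: list column names and declared types.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(odps_result)")]
print(columns)  # [('education', 'VARCHAR(255)'), ('num', 'INT')]
```

Against the real ApsaraDB RDS for MySQL database, `desc odps_result;` is the authoritative check.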

Background information

You can use Data Integration to periodically synchronize the business data that is generated in a business system to a DataWorks workspace. You can create SQL tasks to compute the data and use Data Integration to periodically synchronize the computing results to a specified data source for further display or use.

Data Integration allows you to import data to and export data from various data sources, such as ApsaraDB RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache (OCS), PolarDB-X, Object Storage Service (OSS), Oracle, FTP, Dameng (DM), Hadoop Distributed File System (HDFS), and MongoDB. For more information about data source types, see Supported data source types, Reader plug-ins, and Writer plug-ins.

Step 1: Add a data source

Note

Only the workspace administrator can add data sources. Members that are assigned other roles can only view data sources.

  1. Go to the Management Center page.

    Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

  2. Add a MySQL data source.

    1. In the left-side navigation pane of the Management Center page, click Data Source. On the Data Source page, click Add Data Source.

    2. In the Add data source dialog box, click MySQL.

  3. Configure information about the MySQL data source.

    In the Add MySQL data source dialog box, configure the parameters. In this example, set the Data Source Type parameter to Alibaba Cloud instance mode.

    The following list describes the parameters.

    • Data Source Type: The mode in which you want to add the data source to DataWorks. Set this parameter to Alibaba Cloud instance mode.

    • Data Source Name: The name of the data source. You can specify a custom name.

    • Data Source Description: The description of the data source. For example, you can enter the purpose of the data source.

    • Applicable environment: The environment in which the data source is used. Valid values: Development and Production.

      Note: This parameter is displayed only if the workspace is in standard mode.

    • Region: The region in which the data source resides.

    • Instance Owner: Select Current cloud account.

    • RDS instance ID: The ID of the ApsaraDB RDS for MySQL instance. You can log on to the ApsaraDB RDS console to obtain the ID.

    • Default Database Name, Username, and Password: The name of the default ApsaraDB RDS for MySQL database, and the username and password that are used to log on to the database. Do not use at signs (@) in the password.

      Note the following when you configure a synchronization task that uses a MySQL data source:

      • When you configure a database-level real-time or batch synchronization task that uses a MySQL data source, you can select one or more databases on which you have access permissions in the ApsaraDB RDS for MySQL instance.

      • If you select multiple databases when you configure a batch synchronization task, you must add a data source for each database.

    • Special Authentication Method: Select None Auth.

    • Standby Library Settings: If the ApsaraDB RDS for MySQL instance is a primary instance that has a read-only secondary instance, you can turn on Standby Library Settings and select the ID of the secondary instance. Data is then read from the secondary instance, which protects the performance of the primary instance. If the primary instance has multiple read-only secondary instances, data is read from only one of them.

      Note: If you turn on Standby Library Settings, only an exclusive resource group can run tasks that use this data source.

  4. Test network connectivity between the data source and resource groups that you select.

    Click the Data Integration tab next to the Resource Group connectivity parameter, find the resource group for Data Integration that you want to use, and then click Test Connectivity in the Operation column. Click the Schedule tab and perform the same operations to test network connectivity between the data source and the resource group for scheduling that you want to use. If the connectivity status is Connected, the resource groups are connected to the data source.

    Note
    • A synchronization task can use only one resource group of a specific type.

    • To ensure that your synchronization task can be run as expected, you must test network connectivity between the data source and all types of resource groups on which your synchronization task will be run.

  5. If the data source passes the network connectivity test, click Complete.
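The connectivity test in step 4 verifies, among other things, that the resource group can reach the database endpoint over the network. If the test fails, a quick local analogue is a plain TCP check against the instance endpoint. This is a minimal sketch, not a DataWorks API; the endpoint shown in the comment is hypothetical, so substitute your instance's host and port.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical ApsaraDB RDS endpoint; replace with your instance's endpoint.
# can_connect("rm-xxxxxx.mysql.rds.aliyuncs.com", 3306)
```

A False result usually points at the instance whitelist or a missing network connection rather than at the credentials, which are checked only after the TCP connection succeeds.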

Step 2: Create and configure a synchronization task

This section describes how to create a synchronization node named write_result to generate a synchronization task that synchronizes data from the result_table table to your MySQL data source. Procedure:

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. Create a batch synchronization node.

    Move the pointer over the Create icon and choose Create Node > Data Integration > Offline synchronization to create a batch synchronization node named write_result.

  3. Establish network connections between a specific resource group and the data sources.

    1. In the Configure Network Connections and Resource Group step on the configuration tab of the synchronization node, select MaxCompute(ODPS) for Source and MySQL for Destination, select a resource group that you want to use to run the synchronization task, and then test network connectivity between the resource group and data sources. For more information about how to establish network connections, see Establish a network connection between a resource group and a data source.

    2. Click Next to go to the Configure tasks step.

  4. Configure information for data synchronization.

    1. Configure the source and destination.

      Select the result_table table as the source and select the odps_result table as the destination. Configure other parameters based on your business requirements. For more information, see Configure a batch synchronization task by using the codeless UI.

    2. Configure mappings between source fields and destination fields.

      After the source and destination are configured, configure mappings between source fields and destination fields. The batch synchronization task then writes the values of the source fields to the destination fields of the same data type based on the mappings. Make sure that the fields in the source table map one-to-one to the fields in the destination table.

    3. Configure channel control policies.

      The following list describes the parameters.

      • Task Expected Maximum Concurrency: The maximum number of parallel threads that the batch synchronization task uses to read data from the source or write data to the destination.

        Note:

        • The actual number of parallel threads that are used during data synchronization may be less than or equal to the specified value, depending on the specifications of the exclusive resource group for Data Integration. You are charged for the exclusive resource group for Data Integration based on the number of parallel threads that are used. For more information, see Performance metrics.

        • DataWorks uses resource groups for scheduling to issue batch synchronization tasks to resource groups for Data Integration, which then run the tasks. You are charged for using resource groups for scheduling to schedule batch synchronization tasks based on the number of tasks. For more information about the task issuing mechanism, see Mechanism for issuing tasks.

      • Synchronization rate: Specifies whether to enable throttling.

        • If you enable throttling, you can specify a maximum transmission rate to prevent heavy read workloads on the source. The minimum value of this parameter is 1 MB/s.

        • If you do not enable throttling, data is transmitted at the maximum rate that the hardware allows, based on the specified maximum number of parallel threads.

        Note: The bandwidth is a metric provided by Data Integration and does not represent the actual traffic of an elastic network interface (ENI). In most cases, the ENI traffic is one to two times the channel traffic. The actual ENI traffic depends on the serialization format of the data storage system.

      • Policy for Dirty Data Records: Specifies whether dirty data records are allowed during data synchronization.

        Important: If a large amount of dirty data is generated during data synchronization, the overall data synchronization speed is affected.

        • If this parameter is not configured, dirty data records are allowed, and the batch synchronization task continues to run when they are generated.

        • If you set this parameter to 0, no dirty data records are allowed. If dirty data records are generated during data synchronization, the batch synchronization task fails.

        • If you set this parameter to a value greater than 0, the following situations occur:

          • If the number of dirty data records that are generated is less than or equal to the specified value, the dirty data records are ignored, are not written to the destination, and the batch synchronization task continues to run.

          • If the number of dirty data records that are generated is greater than the specified value, the batch synchronization task fails.

        Note: Dirty data is data that is meaningless to the business, does not match the specified data type, or causes an exception during data synchronization. If an exception occurs when a single data record is written to the destination, the record is considered dirty data; records that fail to be written to the destination are considered dirty data. For example, if a batch synchronization task attempts to write VARCHAR-type data from a source to an INT-type field in a destination, a data conversion error occurs and the record cannot be written. In this case, the record is dirty data. When you configure a batch synchronization task, you can control whether dirty data is allowed and specify the maximum number of dirty data records that are allowed. If the number of generated dirty data records exceeds the specified limit, the batch synchronization task fails and exits.

      • Distributed Execution: Specifies whether to enable the distributed execution mode for the batch synchronization task.

        • If you enable the distributed execution mode, the system splits the task into slices and distributes them to multiple Elastic Compute Service (ECS) instances that run the slices in parallel. The more ECS instances, the higher the data synchronization speed.

        • If you do not enable the distributed execution mode, the specified maximum number of parallel threads is used on only a single ECS instance.

        If you require high data synchronization performance, you can run your batch synchronization task in distributed execution mode. This mode also utilizes fragment resources of ECS instances, which improves resource utilization.

        Important:

        • If your exclusive resource group contains only one ECS instance, we recommend that you do not run your batch synchronization task in distributed execution mode.

        • If one ECS instance meets your requirements for data transmission speed, you do not need to enable the distributed execution mode. This simplifies the execution mode of your task.

        • The distributed execution mode can be enabled only if the specified maximum number of parallel threads is greater than or equal to 8.

        • Whether a batch synchronization task supports the distributed execution mode varies based on the data source type. For more information, see the topics for the related Readers and Writers. For information about the data source types, Readers, and Writers that batch synchronization supports, see Supported data source types, Readers, and Writers.

    4. Configure scheduling dependencies for the synchronization task.

      Double-click the name of the desired workflow. On the configuration tab of the workflow, configure the insert_data node as an ancestor node of the write_result node.

  5. After the configuration is complete, click the Save icon in the top toolbar to save the task.
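The dirty-data policy in the channel control settings reduces to a simple threshold rule. The sketch below is an illustrative model of that rule, not DataWorks code: it treats a record as dirty when its value cannot be converted to the destination INT type (the VARCHAR-to-INT example from the settings), and applies the limit semantics (unset allows unlimited dirty records, 0 fails on the first, N > 0 fails once the count exceeds N).

```python
from typing import Iterable, Optional

def synchronize(records: Iterable[str], dirty_limit: Optional[int]) -> tuple[list[int], int]:
    """Model of the dirty-data policy while writing VARCHAR source values to INT.

    dirty_limit=None  -> dirty records are allowed; the task never fails.
    dirty_limit=0     -> the task fails on the first dirty record.
    dirty_limit=N > 0 -> the task fails once more than N dirty records occur.
    Returns (written_rows, dirty_count); raises RuntimeError when the task fails.
    """
    written, dirty = [], 0
    for value in records:
        try:
            written.append(int(value))  # write succeeds
        except ValueError:
            dirty += 1                  # conversion error -> dirty record, skipped
            if dirty_limit is not None and dirty > dirty_limit:
                raise RuntimeError(
                    f"task failed: {dirty} dirty records exceed limit {dirty_limit}"
                )
    return written, dirty

rows = ["1", "2", "oops", "4"]
print(synchronize(rows, dirty_limit=None))  # ([1, 2, 4], 1): dirty record skipped
```

With dirty_limit=0 the same input raises immediately at "oops", which mirrors a task that fails as soon as any dirty data is generated.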

Step 3: Commit and deploy the synchronization task

After you save the synchronization task, return to the configuration tab of the workflow to which the synchronization task belongs. Click the Submit icon in the top toolbar of the configuration tab of the workflow to commit the synchronization task to the scheduling system. The scheduling system runs the task at the scheduled time starting from the next day, based on the scheduling settings.

What to do next

Now that you have learned how to create a synchronization task that exports data to a specific data source, you can proceed to the next tutorial, which describes how to configure scheduling properties and scheduling dependencies for a synchronization task. For more information, see Configure scheduling properties and scheduling dependencies for a task.

References

For more information about how to configure a batch synchronization task, see Configure a batch synchronization task by using the codeless UI.