DataWorks provides solutions for various data synchronization scenarios, such as real-time synchronization, full batch synchronization, and incremental batch synchronization. These solutions help you migrate your business data to the cloud in a more efficient and convenient way.

Background information

In actual business scenarios, data often cannot be synchronized by using only one or a few simple batch or real-time synchronization nodes. Instead, multiple batch synchronization nodes, real-time synchronization nodes, and data processing nodes must work together, which requires complex configurations. To resolve this issue, DataWorks provides scenario-based synchronization solutions that allow you to synchronize data between different data sources with simple configurations. For example, you can synchronize data to Elasticsearch, Hologres, or MaxCompute by using the relevant solutions provided by DataWorks.

For example, a large amount of data is stored in your database system, and you want to synchronize full and incremental data from the database to MaxCompute for analysis. The traditional data synchronization method performs either full synchronization or incremental synchronization based on fields such as modify_time in database tables. However, such fields may not exist in your database tables. In that case, you cannot use a Java Database Connectivity (JDBC) driver to extract data for incremental synchronization. The One-click real-time synchronization to MaxCompute solution synchronizes full and incremental data in your database to MaxCompute in real time. After the synchronization, the full and incremental data is automatically merged in MaxCompute.

Synchronization solutions provide the following benefits:
  • Initializes full data.
  • Writes incremental data in real time.
  • Automatically merges full and incremental data at a scheduled time and writes the data to the new partition of a full table.
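
The merge step in the last item can be pictured with a small in-memory sketch. It is not how DataWorks or MaxCompute implements merging. It only illustrates the idea of applying an ordered log of incremental changes to the previous full data to produce the next full partition. The `merge_full_and_incremental` function, the change tuple layout, and the sample rows are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
# Hypothetical change record: (operation, primary key, row).
# The row is ignored for "delete" operations.
Change = Tuple[str, int, Row]


def merge_full_and_incremental(
    full: Dict[int, Row], changes: Iterable[Change]
) -> Dict[int, Row]:
    """Apply an ordered change log to a full snapshot and return a new snapshot."""
    merged = dict(full)  # copy, so the previous full partition is left untouched
    for op, key, row in changes:
        if op in ("insert", "update"):
            merged[key] = row
        elif op == "delete":
            merged.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Illustrative data only.
previous_partition = {1: {"name": "a"}, 2: {"name": "b"}}
change_log = [
    ("update", 2, {"name": "b2"}),
    ("insert", 3, {"name": "c"}),
    ("delete", 1, {}),
]
new_partition = merge_full_and_incremental(previous_partition, change_log)
```

Because the function returns a new mapping, the previous full partition stays available, which mirrors the statement that merged data is written to a new partition of the full table.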

Limits

Synchronization solutions provided by DataWorks do not support data synchronization across time zones. If the time zone where data sources in a synchronization solution reside is different from the time zone of the resource group that is used to run the solution, errors may occur during data synchronization.
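
Before you create a solution, a quick pre-check along the following lines can flag a time zone mismatch. This is a minimal sketch that compares fixed UTC offsets by using the Python standard library. The `same_utc_offset` function and the sample offsets are assumptions for illustration, and the check does not account for daylight saving time.

```python
from datetime import timedelta, timezone


def same_utc_offset(source_tz: timezone, resource_group_tz: timezone) -> bool:
    """Return True if both fixed-offset time zones have the same UTC offset."""
    return source_tz.utcoffset(None) == resource_group_tz.utcoffset(None)


# Hypothetical offsets: a data source at UTC+8 and resource groups at UTC+8 and UTC.
source = timezone(timedelta(hours=8))
group_in_same_zone = timezone(timedelta(hours=8))
group_in_utc = timezone.utc

ok = same_utc_offset(source, group_in_same_zone)
mismatch = not same_utc_offset(source, group_in_utc)
```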

Supported data sources

The following table describes the data sources supported by the synchronization solutions of DataWorks.
Destination: Elasticsearch
  Source:
    • MySQL
    • PolarDB for MySQL
      Note Among PolarDB data sources, only PolarDB for MySQL data sources are supported.
  References for configuring data sources: Configure data sources for data synchronization from MySQL
  References for configuring synchronization nodes: Configure and view a batch synchronization solution used to synchronize all data in a database

Destination: Hologres
  Source:
    • PolarDB for MySQL
    • Oracle
    • MySQL
    • PolarDB-X
  References for configuring synchronization nodes: Create and configure a sync solution

Destination: MaxCompute
  Source:
    • PolarDB for MySQL
    • Oracle
    • MySQL
    • PolarDB-X
  References for configuring synchronization nodes: Synchronize data to MaxCompute in real time

Resource usage and billing

When you synchronize data, Data Integration nodes run on resources in resource groups for Data Integration and resource groups for scheduling. For Data Integration, only exclusive resource groups are supported. Before you synchronize data, you must purchase an exclusive resource group for Data Integration and add it to your DataWorks workspace.

The following table describes the performance metrics of exclusive resource groups for Data Integration.
Specifications | Maximum number of parallel threads for a batch synchronization node | Maximum number of parallel real-time synchronization nodes for a single table in a source | Maximum number of parallel real-time synchronization nodes for multiple tables in a source | Maximum number of parallel real-time synchronization nodes for table shards
4c8g | 8 | 3 | 3 | Not supported
8c16g | 16 | 6 | 6 | 1
12c24g | 24 | 9 | 9 | 1
16c32g | 32 | 12 | 12 | 2
24c48g | 48 | 18 | 18 | 3
For information about the pricing of exclusive resource groups for Data Integration in different regions, see Pricing. The actual prices on the buy page prevail.

You can estimate the required resources and purchase an exclusive resource group for Data Integration based on the amount of data that you want to synchronize. For more information about exclusive resource groups for Data Integration, see Overview of exclusive resource groups for Data Integration. You can use exclusive resource groups for scheduling or the shared resource group for scheduling to run nodes.
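
As a rough aid for this estimate, the sketch below encodes the performance metrics listed above and returns the smallest specification that meets a set of parallelism requirements. It assumes that a single resource group of one specification must satisfy all requirements. The `Spec` type and the `smallest_spec` function are hypothetical helpers and not part of DataWorks.

```python
from typing import List, NamedTuple, Optional


class Spec(NamedTuple):
    name: str
    batch_threads: int           # max parallel threads for a batch synchronization node
    realtime_single_table: int   # max parallel real-time nodes, single table in a source
    realtime_multi_table: int    # max parallel real-time nodes, multiple tables in a source
    realtime_table_shards: int   # max parallel real-time nodes for table shards; 0 = not supported


# Values copied from the performance metrics listed above, ordered from smallest to largest.
SPECS: List[Spec] = [
    Spec("4c8g", 8, 3, 3, 0),
    Spec("8c16g", 16, 6, 6, 1),
    Spec("12c24g", 24, 9, 9, 1),
    Spec("16c32g", 32, 12, 12, 2),
    Spec("24c48g", 48, 18, 18, 3),
]


def smallest_spec(
    batch_threads: int = 0,
    realtime_single_table: int = 0,
    realtime_multi_table: int = 0,
    realtime_table_shards: int = 0,
) -> Optional[Spec]:
    """Return the smallest specification that meets every requirement, or None."""
    for spec in SPECS:
        if (
            spec.batch_threads >= batch_threads
            and spec.realtime_single_table >= realtime_single_table
            and spec.realtime_multi_table >= realtime_multi_table
            and spec.realtime_table_shards >= realtime_table_shards
        ):
            return spec
    return None


choice = smallest_spec(batch_threads=20)
```

For example, a requirement of 20 parallel batch threads selects 12c24g, and any table shard requirement rules out 4c8g. A result of `None` means that no single specification in the list is large enough.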

Note
  • You are not charged for synchronization solutions. However, a synchronization solution consists of multiple nodes, and you are charged for the resources used to run the nodes. For example, exclusive resource groups for Data Integration and resource groups for scheduling are used to run the real-time and batch synchronization nodes in a synchronization solution. In this case, you are charged for the resource groups.
  • Specific nodes in a synchronization solution may consume MaxCompute computing resources. For example, the One-click real-time synchronization to MaxCompute solution requires periodic merging of full and incremental data. The fees for the MaxCompute computing resources are included in your MaxCompute bill and are positively correlated with the size of the full data and the merging cycle. For more information, see Overview in MaxCompute documentation.

Network connectivity solutions

This section describes the solutions that can be used to connect a data source to an exclusive resource group. For more information about network connectivity solutions, see Overview of network connectivity solutions.

An exclusive resource group for Data Integration is essentially a group of ECS instances. After you purchase such an exclusive resource group, it is isolated from other services. You must associate the resource group with a virtual private cloud (VPC) to ensure network connectivity between the resource group and data sources during subsequent data synchronization.

The network connectivity solutions vary based on the network environments of the source and destination.
  • The data source is deployed on the Internet.

    Connect the data source to the VPC that is associated with the exclusive resource group.

  • The data source is deployed in a VPC that is in the same region as the exclusive resource group.
    • Same zone: Associate the exclusive resource group with the VPC in which the data source resides.
    • Different zones: Associate the exclusive resource group with a VPC. Then, configure a route between the associated VPC and the VPC in which the data source resides.
  • The data source is deployed in a VPC that is in a different region from the exclusive resource group.
    • Associate the exclusive resource group with a VPC. Then, configure a route between the associated VPC and the VPC in which the data source resides.
    • Associate the exclusive resource group with a VPC. Then, use Express Connect or VPN Gateway to connect the associated VPC to the VPC in which the data source resides.
  • The data source is deployed in a data center.
    • Associate the exclusive resource group with a VPC. Then, configure a route between the associated VPC and the network to which the data center is connected.
    • Associate the exclusive resource group with a VPC. Then, use Express Connect or VPN Gateway to connect the network to which the data center is connected to the associated VPC.
  • The data source is deployed on the Alibaba Cloud classic network.

    The classic network and VPCs cannot be connected. Therefore, we recommend that you migrate the data source to a VPC.
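
The cases above amount to a lookup from the network environment of the data source to the available connectivity options. The following sketch restates that lookup in code so that it can be used in a planning script. The `DataSourceNetwork` enum and the `connectivity_options` function are hypothetical names, and the option texts are shortened from the list above.

```python
from enum import Enum
from typing import Dict, List


class DataSourceNetwork(Enum):
    INTERNET = "internet"
    VPC_SAME_REGION_SAME_ZONE = "vpc_same_region_same_zone"
    VPC_SAME_REGION_DIFFERENT_ZONE = "vpc_same_region_different_zone"
    VPC_DIFFERENT_REGION = "vpc_different_region"
    DATA_CENTER = "data_center"
    CLASSIC_NETWORK = "classic_network"


ASSOCIATE = "Associate the exclusive resource group with a VPC, then "

# Options summarized from the list above; the wording is abbreviated.
OPTIONS: Dict[DataSourceNetwork, List[str]] = {
    DataSourceNetwork.INTERNET: [
        "Connect the data source to the VPC associated with the exclusive resource group.",
    ],
    DataSourceNetwork.VPC_SAME_REGION_SAME_ZONE: [
        "Associate the exclusive resource group with the VPC in which the data source resides.",
    ],
    DataSourceNetwork.VPC_SAME_REGION_DIFFERENT_ZONE: [
        ASSOCIATE + "configure a route to the VPC in which the data source resides.",
    ],
    DataSourceNetwork.VPC_DIFFERENT_REGION: [
        ASSOCIATE + "configure a route to the VPC in which the data source resides.",
        ASSOCIATE + "use Express Connect or VPN Gateway to reach the data source VPC.",
    ],
    DataSourceNetwork.DATA_CENTER: [
        ASSOCIATE + "configure a route to the network of the data center.",
        ASSOCIATE + "use Express Connect or VPN Gateway to reach the data center network.",
    ],
    DataSourceNetwork.CLASSIC_NETWORK: [
        "Not connectable to VPCs; migrate the data source to a VPC.",
    ],
}


def connectivity_options(network: DataSourceNetwork) -> List[str]:
    """Return the connectivity options for the given data source environment."""
    return OPTIONS[network]


data_center_options = connectivity_options(DataSourceNetwork.DATA_CENTER)
```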

Procedure

To use a synchronization solution of DataWorks, perform the following steps:
  1. Plan and configure resources.

    Estimate the required resources and purchase an exclusive resource group for Data Integration and an exclusive resource group for scheduling based on your network conditions and the amount of data that you want to synchronize. Then, configure resources to ensure network connectivity.

  2. Configure data sources.

    After you establish network connections for data sources between which you want to synchronize data, configure the data sources to ensure accessibility. For example, make sure that the IP addresses of the exclusive resource groups are added to the IP address whitelists of the data sources. Otherwise, the synchronization fails.

  3. Add data sources.

    Add the data sources to DataWorks as the source and destination. This way, you can associate the data sources when you create a synchronization solution.

  4. Create and configure a synchronization solution.

    Create a synchronization solution and set the parameters based on the synchronization scenario.
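
The whitelist check mentioned in step 2 can be scripted before you run a solution. The following is a minimal sketch that uses the Python standard library `ipaddress` module. It assumes that you have already collected the IP addresses of the exclusive resource group and the whitelist entries of the data source as strings. The `missing_from_whitelist` function and the sample addresses are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def missing_from_whitelist(
    resource_group_ips: Iterable[str], whitelist: Iterable[str]
) -> List[str]:
    """Return the resource group IP addresses that no whitelist entry covers.

    Whitelist entries may be single addresses ("10.0.0.5") or CIDR blocks
    ("192.168.1.0/24").
    """
    # strict=False accepts CIDR entries whose host bits are set, such as 10.0.0.5/24.
    networks = [ipaddress.ip_network(entry, strict=False) for entry in whitelist]
    missing = []
    for ip in resource_group_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            missing.append(ip)
    return missing


# Illustrative addresses only.
not_covered = missing_from_whitelist(
    ["192.168.1.10", "10.0.0.7"],
    ["192.168.1.0/24", "10.0.0.5"],
)
```

An empty result means that every resource group address is covered. Any address that is returned must be added to the whitelist of the data source before synchronization can succeed.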

Note
  • You can add or remove source tables to or from a created synchronization solution. If a real-time synchronization node is running, you must terminate the node before you add or remove the source tables. After you add or remove the tables, click Submit and Run to run the solution. DataWorks automatically creates batch synchronization nodes and updates real-time synchronization nodes. For more information about how to add or remove source tables to or from a synchronization solution that is running, see Add or remove source tables to or from a synchronization solution that is running.
  • When you configure a destination table for a synchronization solution, if you select Create Table for the Table creation method parameter, you can click the name of the table to modify the table creation statements or configurations of the table as needed. Check whether the table creation statements or configurations meet your requirements.
For more information about the synchronization between data sources, see the following topics: