DataWorks provides solutions for various data synchronization scenarios, such as real-time synchronization, offline full synchronization, and offline incremental synchronization. These solutions help enterprises migrate data to the cloud in a more efficient and convenient manner.
In real business scenarios, one or a few simple batch or real-time sync nodes are rarely enough to complete data synchronization. Multiple batch sync nodes, real-time sync nodes, and data processing nodes usually have to work together, which requires complex configuration. To resolve this issue, DataWorks provides scenario-based synchronization solutions that let you synchronize data between different data sources with simple configurations. For example, you can use the related solution to synchronize data to Elasticsearch, Hologres, or MaxCompute.
For example, a large amount of data is stored in your database system, and you want to synchronize full and incremental data from your database to MaxCompute for analysis. The traditional synchronization method lets you perform full synchronization, or incremental synchronization based on fields such as modify_time in the database table. However, such fields may not exist in the database table in an actual scenario, in which case you cannot use the Java Database Connectivity (JDBC) driver to extract data for incremental synchronization. The solution for synchronizing data to MaxCompute lets you synchronize full and incremental data from your database to MaxCompute in real time, and the full and incremental data is then automatically merged in MaxCompute. The solution performs the following operations:
- Initialize full data.
- Write incremental data in real time.
- Automatically merge the full and incremental data at a scheduled time and write the data to the partitions of a new table.
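The merge step in the list above can be pictured with a small sketch. The following Python example is illustrative only and does not reflect how DataWorks or MaxCompute performs the merge internally. The function name `merge_full_and_incremental`, the `id` key, and the `(operation, row)` change format are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
# Hypothetical change record: (operation, row), where operation is
# "insert", "update", or "delete".
Change = Tuple[str, Row]


def merge_full_and_incremental(
    full_rows: Iterable[Row], changes: Iterable[Change], key: str = "id"
) -> List[Row]:
    """Apply an ordered change log to a full snapshot and return the merged rows."""
    # Index the full snapshot by primary key.
    merged = {row[key]: dict(row) for row in full_rows}
    # Replay the incremental changes in the order they were captured.
    for operation, row in changes:
        if operation == "delete":
            merged.pop(row[key], None)
        else:
            # Inserts and updates both overwrite the row for that key.
            merged[row[key]] = dict(row)
    # The result stands in for the contents of a new table partition.
    return sorted(merged.values(), key=lambda r: r[key])


full = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
changes = [
    ("update", {"id": 2, "name": "b2"}),
    ("insert", {"id": 3, "name": "c"}),
    ("delete", {"id": 1}),
]
new_partition = merge_full_and_incremental(full, changes)
print(new_partition)
```

In this sketch the full snapshot is never modified in place. Each merge produces a new result, which mirrors the idea of writing merged data to the partitions of a new table.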
Synchronization solutions provided by DataWorks do not support data synchronization across time zones. If the data sources in a synchronization solution reside in a different time zone from the resource group that is used to run the solution, errors may occur during data synchronization.
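One way a time zone mismatch can cause trouble is that the same instant maps to different calendar dates in different time zones, so a date-based partition or filter computed on the resource group may not match the one the data source expects. This is offered as an assumption, because the specific errors are not listed here. The sketch below uses fixed UTC offsets, and `partition_date` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def partition_date(instant: datetime, utc_offset_hours: int) -> str:
    """Return the yyyymmdd date of an instant as seen at the given UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return instant.astimezone(tz).strftime("%Y%m%d")


# A single event recorded at 20:30 UTC.
event = datetime(2024, 1, 1, 20, 30, tzinfo=timezone.utc)

source_partition = partition_date(event, 0)  # data source assumed to run in UTC
group_partition = partition_date(event, 8)   # resource group assumed to run in UTC+8

print(source_partition, group_partition)
```

The two values differ by one day for the same event, which is the kind of disagreement that can arise when the data source and the resource group are in different time zones.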
Supported data sources
|Destination||Source||Configuration guide of the data source||Configuration guide of the synchronization solution|
||Configure a data source (MySQL)||Configure and view a batch synchronization solution used to synchronize all data in a database|
||Configure and view a data synchronization solution|
||Synchronize data to MaxCompute in real time|
Resource usage and billing
When you synchronize data, Data Integration nodes run on resources in resource groups for Data Integration and resource groups for scheduling. For Data Integration, only exclusive resource groups are supported. Before you synchronize data, you must purchase an exclusive resource group for Data Integration and add it to your DataWorks workspace.
|Specification||Maximum concurrent threads for batch synchronization||Maximum real-time sync nodes|
You can estimate the required resources and purchase an exclusive resource group for Data Integration based on the amount of data that you want to synchronize. For more information about exclusive resource groups for Data Integration, see Exclusive resource groups for Data Integration. You can use exclusive resource groups for scheduling or the shared resource group for scheduling to run nodes.
Network connectivity solutions
For more information about network connectivity solutions, see Overview. This section describes the solutions that can be used to connect a data source to an exclusive resource group.
An exclusive resource group for Data Integration is essentially a group of ECS instances. After you purchase such an exclusive resource group, it is isolated from other services. You must associate the resource group with a virtual private cloud (VPC) to ensure network connectivity between the resource group and data sources during subsequent data synchronization.
- The data source is deployed on the Internet.
  Connect the data source to the virtual private cloud (VPC) that is associated with the exclusive resource group.
- The data source is deployed in a VPC that is in the same region as the exclusive resource group.
  - Same zone: Associate the VPC where the data source resides with the exclusive resource group.
  - Different zones: Associate a VPC with the exclusive resource group. Then, configure a route between the associated VPC and the VPC where the data source resides.
- The data source is deployed in a VPC that is in a different region from the region where the exclusive resource group resides.
  - Associate a VPC with the exclusive resource group. Then, configure a route between the associated VPC and the VPC where the data source resides.
  - Associate a VPC with the exclusive resource group. Then, use Express Connect or VPN Gateway to connect the associated VPC to the VPC where the data source resides.
- The data source is deployed in a data center.
  - Associate a VPC with the exclusive resource group. Then, configure a route between the associated VPC and the network where the data center resides.
  - Associate a VPC with the exclusive resource group. Then, use Express Connect or VPN Gateway to connect the network where the data center resides to the associated VPC.
- The data source is deployed in the Alibaba Cloud classic network.
  The classic network and VPCs cannot be connected. Therefore, we recommend that you migrate the data source to a VPC.
- Plan and configure resources.
  Estimate the required resources and purchase an exclusive resource group for Data Integration and an exclusive resource group for scheduling based on your network conditions and the amount of data that you want to synchronize. Then, configure resources to ensure network connectivity.
- Configure data sources.
  After you establish network connections for data sources between which you want to synchronize data, configure the data sources to ensure accessibility. For example, make sure that the IP addresses of the exclusive resource groups are added to the IP address whitelists of the data sources. Otherwise, the synchronization fails.
- Add data sources.
  Add the data sources to DataWorks as the source and destination. This way, you can associate the data sources when you create a synchronization solution.
- Create and configure a synchronization solution.
  Create a synchronization solution and configure the parameters based on the synchronization scenario.
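The whitelist requirement in the "Configure data sources" step can be checked before a solution is run. The following is a minimal sketch, assuming you already know the IP addresses of the exclusive resource group and the whitelist entries of the data source. The addresses below are made-up documentation addresses, and `missing_from_whitelist` is a hypothetical helper, not a DataWorks API.

```python
import ipaddress
from typing import List


def missing_from_whitelist(
    resource_group_ips: List[str], whitelist_entries: List[str]
) -> List[str]:
    """Return the resource group IPs that no whitelist entry covers."""
    # Whitelist entries may be single addresses or CIDR blocks.
    networks = [ipaddress.ip_network(entry, strict=False) for entry in whitelist_entries]
    missing = []
    for ip in resource_group_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            missing.append(ip)
    return missing


# Example values only; replace them with your own addresses.
group_ips = ["192.0.2.10", "192.0.2.77"]
whitelist = ["192.0.2.0/26", "198.51.100.5"]

missing = missing_from_whitelist(group_ips, whitelist)
print(missing)
```

An empty result means every listed address is covered. Any address that is returned would need to be added to the data source whitelist before synchronization.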