DataWorks provides a real-time data synchronization feature. You can use it to synchronize data changes from a single table or from all tables in a source to a destination in real time, so that data in the destination stays consistent with data in the source.

Architecture

Note The Groovy plug-in and the feature to synchronize data to multiple destinations are in development and will be supported soon.
The real-time data synchronization feature has the following benefits:
  • Diverse data sources

    Multiple types of data sources are supported. You can synchronize data between different types of data sources.

  • Synchronization solutions

    You can configure a synchronization solution to synchronize the full data and then incremental data from a common source.

  • Diverse synchronization methods

    You can synchronize data from table shards, a single table in a source, or multiple tables in a source, and configure different processing rules for messages about different DDL operations.

  • Data processing

    You can perform data filtering, string replacement, and data masking on the data from a source based on your business requirements, and then synchronize the processed data to a destination.

  • Monitoring and alerting

    The system can send you alert notifications about service latency, failover, dirty data, heartbeat, and failure by email, text message, or DingTalk message, so that you can identify and handle issues at the earliest opportunity.

    Note Alert notifications can be sent by text message only in the Singapore (Singapore), Malaysia (Kuala Lumpur), and Germany (Frankfurt) regions. To use this notification method in other regions, submit a ticket to contact Alibaba Cloud DataWorks technical support.

  • Graphical development

    You can develop real-time synchronization nodes by performing drag-and-drop operations instead of writing code, which makes the feature easy for beginners to use.
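
The three processing operations named under "Data processing" (data filtering, string replacement, and data masking) can be pictured with a minimal Python sketch. This is only an illustration of the concepts, not how DataWorks implements them: the record layout, the field names (`status`, `city`, `phone`), and the `process_record` helper are all hypothetical.

```python
def process_record(record):
    """Return the processed record, or None if the record is filtered out.

    Hypothetical helper: field names and rules are assumptions for illustration.
    """
    # Data filtering: keep only records that match a condition.
    if record.get("status") != "active":
        return None

    processed = dict(record)  # copy so that the source record is not modified

    # String replacement: replace a placeholder value with a readable one.
    processed["city"] = processed["city"].replace("N/A", "unknown")

    # Data masking: hide every character of the phone number except the last four.
    phone = processed["phone"]
    processed["phone"] = "*" * max(len(phone) - 4, 0) + phone[-4:]

    return processed


source_records = [
    {"status": "active", "city": "N/A", "phone": "13812345678"},
    {"status": "deleted", "city": "Hangzhou", "phone": "13900000000"},
]

# Only records that pass the filter reach the (hypothetical) destination.
destination_records = [
    processed
    for processed in map(process_record, source_records)
    if processed is not None
]
print(destination_records)
```

In a real-time synchronization node, equivalent rules would be configured graphically rather than written as code.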

Supported synchronization methods, sources, and destinations

The following table describes the sources and destinations that are supported by real-time synchronization nodes.
Note Real-time synchronization nodes cannot synchronize views.
Synchronization method: Synchronize data from a single table in a source
  Source:
    • MySQL Binlog
    • DataHub
    • LogHub (SLS)
    • Kafka
    • PolarDB
    • SQL Server
  Destination:
    • MaxCompute
    • Hologres
    • AnalyticDB MySQL
    • Elasticsearch
    • DataHub
    • Kafka
  References for configuring synchronization nodes: Configure and manage a real-time data sync node

Synchronization method: Synchronize data from all tables in a source
  Source:
    • PolarDB MySQL
      Note Only PolarDB for MySQL is supported.
    • Oracle
    • MySQL
  Destination: MaxCompute
  References for configuring synchronization nodes: Configure and manage a real-time data sync node

  Source:
    • PolarDB MySQL
      Note Only PolarDB for MySQL is supported.
    • Oracle
    • MySQL
    • SQL Server
  Destination: Hologres
  References for configuring synchronization nodes: Configure and manage a real-time sync node

  Source:
    • PolarDB MySQL
      Note Only PolarDB for MySQL is supported.
    • OceanBase
    • MySQL
    • Oracle
  Destination: DataHub
  References for configuring synchronization nodes: Configure and manage a real-time data sync node

  Source: MySQL
  Destination: Kafka
  References for configuring data sources: Configure data sources for data synchronization from MySQL
  References for configuring synchronization nodes: Configure and manage a real-time sync node

Resource usage and pricing

Before you use a data synchronization node to synchronize data, you must purchase an exclusive resource group for Data Integration and add the resource group to DataWorks.

The following table describes the performance metrics of exclusive resource groups for Data Integration.
Specifications | Maximum number of parallel threads for a batch synchronization node | Maximum number of parallel real-time synchronization nodes for a single table in a source | Maximum number of parallel real-time synchronization nodes for multiple tables in a source | Maximum number of parallel real-time synchronization nodes for table shards
4c8g           | 8  | 3  | 3  | Not supported
8c16g          | 16 | 6  | 6  | 1
12c24g         | 24 | 9  | 9  | 1
16c32g         | 32 | 12 | 12 | 2
24c48g         | 48 | 18 | 18 | 3
For information about the pricing of exclusive resource groups for Data Integration in different regions, see Pricing. The actual prices on the buy page prevail.

You can estimate the required resources and purchase an exclusive resource group for Data Integration based on the amount of data that you want to synchronize. For more information about exclusive resource groups for Data Integration, see Exclusive resources for Data Integration.
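
As a rough aid for this estimate, the following Python sketch looks up the smallest specification in the performance table that can run a given number of parallel nodes of one type. It is a minimal sketch under stated assumptions: the limits are copied from the table above, the `smallest_spec` helper and the workload names are hypothetical, and the lookup considers one workload type at a time because the table does not state how mixed workloads share a resource group.

```python
# Limits copied from the performance table above.
# Tuple order matches WORKLOADS. 0 means "Not supported".
WORKLOADS = ("batch_threads", "single_table", "multi_table", "table_shards")

SPECS = {  # ordered from the smallest to the largest specification
    "4c8g": (8, 3, 3, 0),
    "8c16g": (16, 6, 6, 1),
    "12c24g": (24, 9, 9, 1),
    "16c32g": (32, 12, 12, 2),
    "24c48g": (48, 18, 18, 3),
}


def smallest_spec(workload, needed):
    """Return the smallest specification whose limit covers `needed`, or None.

    Hypothetical helper: it checks a single workload type in isolation.
    """
    if needed < 1:
        raise ValueError("needed must be at least 1")
    column = WORKLOADS.index(workload)
    for name, limits in SPECS.items():
        if limits[column] >= needed:
            return name
    return None  # no single resource group in the table is large enough


print(smallest_spec("single_table", 7))   # 12c24g
print(smallest_spec("table_shards", 1))   # 8c16g
print(smallest_spec("table_shards", 4))   # None
```

A `None` result means that no single specification in the table covers the requested parallelism. Actual sizing also depends on data volume, which the table does not capture.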

Best practice: We recommend that you use different resource groups for batch synchronization nodes and real-time synchronization nodes. This isolates the resources used by the two types of nodes and prevents issues such as resource preemption and runtime exceptions. If the two types of nodes share a resource group, they compete for CPU, memory, and network resources. As a result, batch synchronization nodes may slow down, real-time synchronization nodes may be delayed, and in the worst case out of memory (OOM) errors may occur due to insufficient resources.

Network connectivity solutions

For more information about network connectivity solutions, see Overview of network connectivity solutions. This section describes the solutions that can be used to connect a data source to an exclusive resource group.

An exclusive resource group for Data Integration is essentially a group of ECS instances. After you purchase such an exclusive resource group, it is isolated from other services. You must associate the resource group with a virtual private cloud (VPC) to ensure network connectivity between the resource group and data sources during subsequent data synchronization.

The network connectivity solution varies based on the network environments of the source and the destination.
  • The data source is deployed on the Internet.

    Connect the data source to the VPC that is associated with the exclusive resource group.

  • The data source is deployed in a VPC that is in the same region as the exclusive resource group.
    • Same zone: Associate the exclusive resource group with the VPC in which the data source resides.
    • Different zones: Associate the exclusive resource group with a VPC. Then, configure a route between the associated VPC and the VPC in which the data source resides.
  • The data source is deployed in a VPC that is in a different region from the exclusive resource group.
    • Associate the exclusive resource group with a VPC. Then, configure a route between the associated VPC and the VPC in which the data source resides.
    • Associate the exclusive resource group with a VPC. Then, use Express Connect or VPN Gateway to connect the associated VPC to the VPC in which the data source resides.
  • The data source is deployed in a data center.
    • Associate the exclusive resource group with a VPC. Then, configure a route between the associated VPC and the network to which the data center is connected.
    • Associate the exclusive resource group with a VPC. Then, use Express Connect or VPN Gateway to connect the network to which the data center is connected to the associated VPC.
  • The data source is deployed on the Alibaba Cloud classic network.

    The classic network and VPCs cannot be connected. Therefore, we recommend that you migrate the data source to a VPC.
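
The scenarios above amount to a lookup from where the data source is deployed to the connectivity options that apply. The following Python sketch encodes that lookup for quick reference. The `SourceLocation` enum and the `connectivity_options` function are hypothetical names introduced for illustration, and the option texts paraphrase the list above. They are not a DataWorks API.

```python
from enum import Enum


class SourceLocation(Enum):
    """Where the data source is deployed (hypothetical names)."""
    INTERNET = "internet"
    VPC_SAME_REGION_SAME_ZONE = "vpc_same_region_same_zone"
    VPC_SAME_REGION_DIFFERENT_ZONE = "vpc_same_region_different_zone"
    VPC_DIFFERENT_REGION = "vpc_different_region"
    DATA_CENTER = "data_center"
    CLASSIC_NETWORK = "classic_network"


ASSOCIATE = "Associate the exclusive resource group with a VPC"

# Each location maps to one or more alternative solutions from the list above.
SOLUTIONS = {
    SourceLocation.INTERNET: [
        "Connect the data source to the VPC associated with the exclusive resource group",
    ],
    SourceLocation.VPC_SAME_REGION_SAME_ZONE: [
        "Associate the exclusive resource group with the VPC in which the data source resides",
    ],
    SourceLocation.VPC_SAME_REGION_DIFFERENT_ZONE: [
        ASSOCIATE + ", then configure a route to the VPC in which the data source resides",
    ],
    SourceLocation.VPC_DIFFERENT_REGION: [
        ASSOCIATE + ", then configure a route to the VPC in which the data source resides",
        ASSOCIATE + ", then use Express Connect or VPN Gateway to reach the VPC of the data source",
    ],
    SourceLocation.DATA_CENTER: [
        ASSOCIATE + ", then configure a route to the network of the data center",
        ASSOCIATE + ", then use Express Connect or VPN Gateway to reach the network of the data center",
    ],
    SourceLocation.CLASSIC_NETWORK: [
        "Migrate the data source to a VPC (the classic network and VPCs cannot be connected)",
    ],
}


def connectivity_options(location):
    """Return the list of alternative solutions for the given location."""
    return SOLUTIONS[location]


for option in connectivity_options(SourceLocation.DATA_CENTER):
    print("-", option)
```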

Procedure

To use a synchronization solution of DataWorks, perform the following steps:
  1. Plan and configure resources.

    Estimate the required resources and purchase an exclusive resource group for Data Integration based on the amount of data that you want to synchronize and the network environment. Configure the resources to ensure network connectivity.

  2. Configure data sources.

    After you establish network connections for data sources between which you want to synchronize data, configure the data sources to ensure accessibility. For example, make sure that the IP addresses of the exclusive resource groups are added to the IP address whitelists of the data sources. Otherwise, the synchronization fails.

  3. Add data sources.

    Add the data sources to DataWorks as the source and destination. This way, you can associate the data sources when you create a synchronization solution.

  4. Create and configure a synchronization solution.

    Create a synchronization solution and set the parameters based on the synchronization scenario.

Note If you set the Table creation method parameter to Create Table when you configure a destination table for the synchronization node, you can click the table name to view and modify the table creation statements. Check whether the table creation statements meet your requirements.
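
The whitelist check described in step 2 can be verified offline before you create a synchronization solution. The following Python sketch uses the standard `ipaddress` module to report which resource group IP addresses are not covered by a whitelist. It is a sketch under stated assumptions: the `missing_from_whitelist` helper is hypothetical, the addresses are reserved documentation values rather than real ones, and you must obtain the actual IP addresses of the exclusive resource group and the whitelist entries of the data source yourself.

```python
import ipaddress


def missing_from_whitelist(resource_group_ips, whitelist_entries):
    """Return the resource group IP addresses that no whitelist entry covers.

    Whitelist entries may be single addresses or CIDR blocks.
    """
    # A single address such as "203.0.113.5" is parsed as a /32 network.
    networks = [ipaddress.ip_network(entry, strict=False) for entry in whitelist_entries]
    return [
        ip
        for ip in resource_group_ips
        if not any(ipaddress.ip_address(ip) in network for network in networks)
    ]


# Placeholder values from reserved documentation ranges (assumptions, not real addresses).
resource_group_ips = ["192.0.2.10", "198.51.100.7"]
whitelist_entries = ["192.0.2.0/24", "203.0.113.5"]

print(missing_from_whitelist(resource_group_ips, whitelist_entries))  # ['198.51.100.7']
```

An empty result means every resource group address is covered. Any address that is returned must be added to the whitelist of the data source, otherwise the synchronization fails as described in step 2.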
For more information about the synchronization between sources and destinations, see the following topics: