DataWorks: Overview of the real-time synchronization feature

Last Updated: Jul 28, 2023

DataWorks provides the real-time synchronization feature, which synchronizes data changes from a single table or from all tables in a source to a destination in real time. This keeps data in the destination consistent with data in the source.

Limits

  • You cannot run a real-time synchronization node on the DataStudio page. Instead, you must run a real-time synchronization node in Operation Center in the production environment after you save and commit the node.

  • Real-time synchronization nodes can be run only on exclusive resource groups for Data Integration. For more information, see Exclusive resource groups for Data Integration.

  • Real-time synchronization nodes cannot be used to synchronize views.

Overview

The following figure shows the capabilities of the real-time synchronization feature.

Figure: Capabilities provided by real-time synchronization

Capability

Description

Data synchronization between various data sources

The real-time synchronization feature allows you to combine multiple types of data sources to form a star-shaped data synchronization link. You can synchronize data between different types of data sources. For more information, see Data source types that support real-time synchronization.

Data synchronization from or to data sources that are deployed in complex network environments

The real-time synchronization feature supports data synchronization from or to Alibaba Cloud data sources, data centers, data sources that are hosted on Elastic Compute Service (ECS) instances, and data sources that do not belong to Alibaba Cloud. You can select appropriate network connectivity solutions to establish network connections between your resource group and data sources based on the network environments in which the data sources are deployed. Before you configure a data synchronization node, you must make sure that network connections are established between your resource group for Data Integration and data sources. For more information about how to establish a network connection between a resource group and a data source, see Establish a network connection between a resource group and a data source.

Data synchronization scenarios

The real-time synchronization feature allows you to synchronize incremental data from a single table to another single table in real time, synchronize incremental data from tables in sharded databases to a single table in real time, and synchronize incremental data from multiple tables in a database to multiple tables in real time.

  • Real-time synchronization of incremental data from a single table: supports real-time extract, transform, and load (ETL) of incremental data from a single table.

  • Real-time synchronization of incremental data from tables in one or more databases:

    • Supports synchronization of logs for changes from all tables in a source database to a destination. In most cases, this synchronization mode is used to collect real-time logs.

    • Supports synchronization of data from multiple tables in multiple databases of the same source at a time. You can specify a maximum of 3,000 source tables in a data synchronization node.

Note

The real-time synchronization feature can be used to synchronize only incremental data in real time. If you want to synchronize full data from a source at one time and then synchronize incremental data from the source in real time, you can use the solution-based synchronization feature. This feature continuously synchronizes data from a source to a destination, which helps keep data in the destination consistent with data in the source in real time. For more information about how to select a data synchronization feature, see Overview.

Configurations for real-time synchronization nodes

The real-time synchronization feature provides the following capabilities to allow you to configure a real-time synchronization node. You do not need to write code to configure the node. You only need to make simple configurations for the node to perform real-time ETL of incremental data from a single table or real-time synchronization of incremental data from multiple tables in a database. For more information, see Configure a real-time synchronization node to synchronize incremental data from a single table and Create a real-time synchronization node to synchronize all incremental data from a database.

  • Real-time synchronization of incremental data from a single table:

    • Graphical development is supported. You do not need to write code to develop real-time synchronization nodes. Instead, you only need to perform drag-and-drop operations.

    • Real-time ETL of incremental data from a single table is supported. You can use a Data Filtering, String Replace, or Data Masking node to process data in a source and synchronize the processed data to a destination.

      • Data filtering: You can use a Data Filtering node to filter data in a source based on specific rules, such as the field size. Only data that meets the rules is retained.

      • String replacement: You can use a String Replace node to replace values in fields of the STRING type.

      • Data masking: You can use a Data Masking node to mask data from a single table and then synchronize the masked data to a destination in real time.

  • Real-time synchronization of incremental data from multiple tables in a database:

    • Specify a custom name for a destination schema or table

      If you run a real-time synchronization node to synchronize incremental data from multiple tables in a database to a destination, the incremental data is by default written to the destination schemas or tables that have the same names as the source schemas or tables. If the destination does not contain schemas or tables with those names, the system creates them in the destination. You can also specify custom names for the destination schemas or tables.

    • Add fields to a destination and assign values to the fields

      If you run a real-time synchronization node to synchronize incremental data from multiple tables in a database to a destination, the system establishes mappings between source fields and destination fields that have the same names. The value of each source field is written to the destination field that has the same name. Values of source fields that have no mapped destination fields are not synchronized. You can also add fields to a destination table and assign constants or variables to the fields as values.

      Note

      If you synchronize data from a MySQL, Oracle, LogHub, or PolarDB data source to a DataHub or Kafka data source in real time, Data Integration adds five fields to the destination. These fields are used for operations such as metadata management, sorting, and deduplication. For more information, see Fields used for real-time synchronization.

    • Configure rules to process DDL or DML messages

      DDL operations may be performed on the source. Before you synchronize data in real time, you can configure rules to process different DDL messages based on your business requirements.

      Note

      For more information about the support of different destinations for DDL and DML operations on sources, see Supported DML and DDL operations.
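The name-based field mapping described above for multi-table synchronization can be illustrated with a small Python sketch. This is not DataWorks code: the function `map_record`, its parameters, and the sample field names are hypothetical and only model the described behavior. Source values are copied to destination fields with the same names, source fields without a mapped destination field are skipped, and fields added to the destination receive their assigned values. The sketch assumes that added fields do not share names with source fields.

```python
from typing import Any, Dict, List


def map_record(
    source_record: Dict[str, Any],
    destination_fields: List[str],
    added_fields: Dict[str, Any],
) -> Dict[str, Any]:
    """Build one destination row from one source row (illustrative only)."""
    row: Dict[str, Any] = {}
    for name, value in source_record.items():
        # Same-name mapping: copy the value only if the destination
        # has a field with the same name; otherwise it is not synchronized.
        if name in destination_fields:
            row[name] = value
    for name, value in added_fields.items():
        # Fields added to the destination get their assigned constant
        # or variable value (assumed not to collide with source names).
        row[name] = value
    return row


source = {"id": 7, "name": "alice", "internal_note": "x"}
result = map_record(source, ["id", "name", "region"], {"region": "cn-hangzhou"})
print(result)
```

In this example, `internal_note` has no field with the same name in the destination, so its value is dropped, while `region` is filled from the assigned constant.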

O&M for real-time synchronization nodes

  • Configure alerting and monitoring settings for a real-time synchronization node

    • Resumable uploads are supported.

    • You can configure alerting and monitoring settings for a real-time synchronization node based on one of the following conditions: business delay, failover, support for DDL statements, or heartbeat check. For more information, see O&M for real-time synchronization nodes.

    • You can configure DataWorks to send alert notifications by email, text message, or DingTalk message to the specified alert recipient. This helps the alert recipient identify and troubleshoot exceptions at the earliest opportunity.

    • You can control alerting frequency. To prevent a large number of alerts from being generated within a short period of time, DataWorks allows you to control alerting frequency for real-time synchronization nodes. You can configure the related settings to enable DataWorks to send only one alert notification based on the alert rule within a specified period of time.

  • Specify the maximum number of dirty data records allowed and the impacts of dirty data records on a real-time synchronization node

    • If you do not allow the generation of dirty data and dirty data records are generated during data synchronization, the real-time synchronization node fails.

    • If you allow the generation of dirty data and specify the maximum number of dirty data records that are allowed, the number of generated dirty data records determines whether the node fails. If the number of generated dirty data records does not exceed the specified limit, the dirty data is ignored and the node continues to run. If the number of generated dirty data records exceeds the specified limit, the node fails.

    Note

    For more information about dirty data records, see Terms.
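The dirty data rules above reduce to a simple threshold check. The following Python sketch models that decision under stated assumptions: the function name `node_should_fail` and the use of `None` to represent "dirty data is not allowed" are hypothetical conventions chosen for illustration, not DataWorks settings.

```python
from typing import Optional


def node_should_fail(dirty_count: int, max_dirty_records: Optional[int]) -> bool:
    """Return True if the node fails under the rules described above.

    max_dirty_records=None models "dirty data is not allowed"
    (a hypothetical encoding used only in this sketch).
    """
    if max_dirty_records is None:
        # Dirty data not allowed: any dirty record fails the node.
        return dirty_count > 0
    # Dirty data allowed: the node fails only when the limit is exceeded.
    return dirty_count > max_dirty_records


print(node_should_fail(1, None))
print(node_should_fail(10, 10))
print(node_should_fail(11, 10))
```

With a limit of 10, a node that has produced exactly 10 dirty data records keeps running because the limit is not exceeded. The 11th record causes the node to fail.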