Overview - DataWorks - Alibaba Cloud Documentation Center

Data Integration is a stable, efficient, and scalable data synchronization service. It is designed to migrate and synchronize data between various heterogeneous data sources in complex network environments at a high speed and in a stable manner.

Limits

Data synchronization
Data Integration can synchronize structured, semi-structured, and unstructured data. Structured data sources include ApsaraDB RDS and PolarDB-X 1.0. Unstructured data, such as data in Object Storage Service (OSS) objects and text files, must be converted to structured data. Data Integration can synchronize only the data that can be abstracted to two-dimensional logical tables to MaxCompute. Data Integration cannot synchronize unstructured data that cannot be converted to structured data, such as data in MP3 files that are stored in OSS, to MaxCompute.
Network connectivity
Data Integration supports data synchronization and exchange in the same region or across specific regions. Data can be transmitted between specific regions over the classic network, but network connectivity cannot be ensured. If the network connectivity test for the classic network fails, we recommend that you establish network connections over the Internet.
Data transmission
Data Integration supports data synchronization only. It does not support data consumption.
Data consistency
Data Integration supports only at-least-once delivery. It does not support exactly-once delivery, which means that data synchronized to a destination may be duplicated. To keep the synchronized data unique, you can use a primary key together with the capabilities of the destination.
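The following minimal sketch illustrates how a primary key on the destination can absorb duplicates caused by at-least-once delivery. It uses an in-memory SQLite database as a stand-in destination. The table name orders, its columns, and the helper write_idempotent are assumptions made for illustration and are not part of Data Integration.

```python
import sqlite3


def write_idempotent(conn, records):
    """Write records keyed by primary key so that a redelivered batch
    overwrites existing rows instead of adding duplicates."""
    conn.executemany(
        "INSERT OR REPLACE INTO orders (id, amount) VALUES (?, ?)",
        records,
    )
    conn.commit()


# Hypothetical destination table with a primary key on id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

batch = [(1, 10.0), (2, 20.0)]
write_idempotent(conn, batch)
write_idempotent(conn, batch)  # simulated redelivery of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

Because each write is keyed by id, delivering the same batch twice leaves the table with one row per key. A destination without a primary key or an equivalent merge capability would store both copies.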

Note

When you configure a synchronization task, pay attention to the value range and precision of the data types of the source and destination fields. Data may fail to be written to the destination, or field values may be truncated, in any of the following situations: the maximum value of the data type of a destination field is less than the maximum value of the data type of the source field, the minimum value of the data type of a destination field is greater than the minimum value of the data type of the source field, or the value precision of the data type of a destination field is lower than the value precision of the data type of the source field.
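The three situations in the preceding note can be expressed as a simple pre-flight check. The sketch below is illustrative only: the NumericType class, its fields, and the compatibility_issues function are hypothetical names and do not correspond to a DataWorks API. The INT and BIGINT ranges assume common 32-bit and 64-bit signed integer types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericType:
    """Hypothetical description of a numeric column type."""
    name: str
    min_value: int
    max_value: int
    scale: int  # number of digits kept after the decimal point


def compatibility_issues(source, destination):
    """Return the reasons why writing source values into the
    destination type may fail or truncate values."""
    issues = []
    if destination.max_value < source.max_value:
        issues.append("destination maximum is less than source maximum")
    if destination.min_value > source.min_value:
        issues.append("destination minimum is greater than source minimum")
    if destination.scale < source.scale:
        issues.append("destination precision is lower than source precision")
    return issues


BIGINT = NumericType("BIGINT", -2**63, 2**63 - 1, 0)
INT = NumericType("INT", -2**31, 2**31 - 1, 0)

narrowing = compatibility_issues(BIGINT, INT)  # two range issues
widening = compatibility_issues(INT, BIGINT)   # no issues
print(narrowing, widening)
```

Running a check of this kind on each field mapping before a task starts can surface risky mappings early, instead of discovering them as write failures at run time.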

Batch synchronization

Note

The batch synchronization feature provided by DataWorks does not support data synchronization across time zones. If the data sources of a batch synchronization task reside in a different time zone from the resource group that is used to run the task, errors may occur during data synchronization.

Data Integration can be used to synchronize large amounts of offline data. Data Integration facilitates data transmission between diverse structured and semi-structured data sources. It provides readers and writers for the supported data sources and defines a data transmission channel between sources and destinations based on simplified data types. For more information, see Batch synchronization.

Real-time synchronization

Note

The real-time synchronization feature provided by DataWorks does not support data synchronization across time zones. If the data sources of a real-time synchronization task reside in a different time zone from the resource group that is used to run the task, errors may occur during data synchronization.

A real-time synchronization task uses three basic plug-ins to read, convert, and write data. These plug-ins interact with each other based on an intermediate data format that is defined by the plug-ins.

You can use multiple conversion plug-ins to cleanse data in a source and use multiple write plug-ins to write data to a destination for a real-time synchronization task. In some business scenarios, you can use a full and incremental synchronization task to synchronize data from multiple tables in a database to a destination in real time. For more information, see Synchronize data in real time.
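The read, convert, and write stages described in this section can be pictured as a small pipeline that passes records in a shared intermediate format. The sketch below is a conceptual illustration and not the plug-in interface of Data Integration: the Record type (one dictionary per record), the function names, and the use of plain lists as destinations are all assumptions.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Assumed intermediate format: one dictionary per record.
Record = Dict[str, object]
Transform = Callable[[Iterator[Record]], Iterator[Record]]


def read(source_rows: Iterable[Record]) -> Iterator[Record]:
    """Read stage: emit each source row in the intermediate format."""
    for row in source_rows:
        yield dict(row)


def drop_missing_name(records: Iterator[Record]) -> Iterator[Record]:
    """Conversion stage: filter out records without a name."""
    for record in records:
        if record.get("name"):
            yield record


def normalize_name(records: Iterator[Record]) -> Iterator[Record]:
    """Conversion stage: trim and lowercase the name field."""
    for record in records:
        yield {**record, "name": str(record["name"]).strip().lower()}


def run(source_rows: Iterable[Record],
        transforms: List[Transform],
        sinks: List[List[Record]]) -> None:
    """Chain the conversion stages, then write each record to every sink."""
    stream = read(source_rows)
    for transform in transforms:
        stream = transform(stream)
    for record in stream:
        for sink in sinks:
            sink.append(record)


source = [{"id": 1, "name": " Alice "}, {"id": 2, "name": None}]
sink_a: List[Record] = []
sink_b: List[Record] = []
run(source, [drop_missing_name, normalize_name], [sink_a, sink_b])
print(sink_a, sink_b)
```

Because every stage consumes and produces the same record shape, conversion steps can be added, removed, or reordered, and additional destinations can be attached, without changing the read stage.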

Full and incremental synchronization

In actual business scenarios, a single batch or real-time synchronization task is often not enough to synchronize data. Instead, a combination of batch synchronization tasks, real-time synchronization tasks, and data processing tasks is required, and configuring this combination is complex.

To resolve this issue, DataWorks provides a scenario-specific synchronization solution that allows you to synchronize data between different types of data sources by using simple configurations. For example, you can create a one-click real-time synchronization task to easily synchronize data to Elasticsearch, Hologres, or MaxCompute. This simplifies data synchronization.

Note

For example, a large amount of data is stored in your database, and you want to synchronize full and incremental data from your database to MaxCompute for analysis. You can use the traditional data synchronization method to perform full synchronization or perform incremental synchronization based on fields such as modify_time in tables in your database. However, these fields may not exist in database tables in an actual scenario. Therefore, you cannot use the Java Database Connectivity (JDBC) driver to extract data for incremental synchronization. To synchronize the full and incremental data to MaxCompute, you can create a one-click real-time synchronization task. After the synchronization is complete, the full and incremental data is automatically merged in MaxCompute. This simplifies data synchronization.

A full and incremental synchronization task has the following benefits:

Synchronizes full data in a single run.
Synchronizes incremental data in real time.
Automatically merges incremental and full data on a regular basis and writes the merged data to the related partition in a table that is used to store full data.
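The merge step in the preceding list can be sketched as follows. This is a simplified illustration under stated assumptions and not the merge logic that DataWorks runs: the change record layout (an op field plus a row), the key column id, and the function merge_snapshot are hypothetical, and changes are assumed to arrive ordered from oldest to newest.

```python
def merge_snapshot(full, changes, key="id"):
    """Apply ordered change records to a full snapshot and return
    the new full snapshot, sorted by key."""
    merged = {row[key]: row for row in full}
    for change in changes:
        row = change["row"]
        if change["op"] == "delete":
            merged.pop(row[key], None)
        else:
            # Inserts and updates both replace the row for that key.
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])


full = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [
    {"op": "update", "row": {"id": 2, "v": "b2"}},
    {"op": "insert", "row": {"id": 3, "v": "c"}},
    {"op": "delete", "row": {"id": 1}},
]

merged = merge_snapshot(full, changes)
print(merged)
```

In this sketch the merged result would be written to the next partition of the full-data table, so that each partition holds a complete, up-to-date copy of the source table.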

For information about the capabilities that you can use when you configure a full and incremental synchronization task, see Overview of the solution-based synchronization feature.

Data synchronization in complex network environments

Data Integration allows you to synchronize data between heterogeneous data sources in complex network environments. The following types of relationships may exist between a data source and a DataWorks workspace:

The data source and DataWorks workspace belong to the same Alibaba Cloud account and reside in the same region.
The data source and DataWorks workspace belong to different Alibaba Cloud accounts.
The data source and DataWorks workspace reside in different regions.
The data source does not belong to Alibaba Cloud.

Before you use a synchronization task to synchronize data, you must make sure that network connections are established between the exclusive resource group for Data Integration and the data sources. You can select network connectivity solutions based on the network environments in which the data sources are deployed to ensure the network connectivity between the resource group for Data Integration and the data sources. For more information, see Establish a network connection between a resource group and a data source.

Terms

parallelism
Parallelism indicates the maximum number of parallel threads that a synchronization task uses to read data from a source or write data to a destination.
throttling
Throttling limits the maximum rate at which a synchronization task can transmit data.
dirty data
Dirty data is data that is meaningless, does not match the specified data type, or causes an exception during data synchronization. Any data record that fails to be written to the destination is considered dirty data. For example, when a synchronization task attempts to write VARCHAR-type data in a source to an INT-type field in a destination, a data conversion error occurs, the record fails to be written, and the record is counted as dirty data. When you configure a synchronization task, you can control whether dirty data is allowed and specify the maximum number of dirty data records that are allowed during data synchronization. If the number of dirty data records exceeds the limit that you specified, the synchronization task fails and exits.
- If dirty data is generated when a batch synchronization task or real-time synchronization task is run, the task may fail. The data that is synchronized to the destination before the task fails is not rolled back.
- To improve data synchronization efficiency, Data Integration allows you to write multiple data records to a destination at a time during data synchronization. If an exception occurs when you write a batch of data records to a destination, support for the rollback of the data records varies based on whether the destination supports the transaction mechanism. Data Integration does not support the transaction mechanism.
data source
A data source is the source of data that is processed by DataWorks. A data source can be a database or a data warehouse. DataWorks supports various types of data sources and data type conversion during data synchronization.
Before you create a synchronization task, you can add the data sources that you need to use on the Data Source page of the DataWorks console. When you create a synchronization task, you must select the added data sources to use the data sources as the source and destination of the task.
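The dirty data limit described under the dirty data term above can be sketched as follows. This is an illustrative model and not the Data Integration implementation: the sync function, the DirtyDataLimitExceeded exception, and the dirty_limit parameter are hypothetical names, and the destination is modeled as a list of integers that stands in for an INT-type field.

```python
class DirtyDataLimitExceeded(Exception):
    """Raised when more dirty records are seen than the limit allows."""


def sync(records, destination, dirty_limit):
    """Write each record to the destination as an integer.

    Records that cannot be converted are counted as dirty data. The task
    fails once the dirty count exceeds dirty_limit. Records that were
    already written stay in the destination (no rollback).
    """
    dirty = []
    for value in records:
        try:
            destination.append(int(value))
        except (TypeError, ValueError):
            dirty.append(value)
            if len(dirty) > dirty_limit:
                raise DirtyDataLimitExceeded(
                    f"{len(dirty)} dirty records exceed the limit of {dirty_limit}"
                )
    return dirty


# One dirty record is tolerated, so the task completes.
destination = []
dirty = sync(["1", "x", "3"], destination, dirty_limit=1)

# No dirty records are tolerated, so the task fails on "x".
strict_destination = []
try:
    sync(["1", "x", "3"], strict_destination, dirty_limit=0)
    failed = False
except DirtyDataLimitExceeded:
    failed = True

print(destination, dirty, strict_destination, failed)
```

In the strict run, the record written before the failure remains in the destination, which mirrors the statement above that data synchronized before a task fails is not rolled back.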