Data Integration is a stable, efficient, and scalable data synchronization service. It is designed to migrate and synchronize data quickly and reliably between heterogeneous data sources in complex network environments.
- Data synchronization
Data Integration can synchronize structured, semi-structured, and unstructured data. Structured data sources include ApsaraDB RDS and PolarDB-X 1.0. Unstructured data, such as data in Object Storage Service (OSS) objects and text files, must be converted to structured data before it can be synchronized. Data Integration can synchronize to MaxCompute only data that can be abstracted into two-dimensional logical tables. It cannot synchronize to MaxCompute unstructured data that cannot be converted to structured data, such as MP3 files that are stored in OSS.
- Network connectivity
Data Integration supports data synchronization and exchange in the same region or across specific regions. Data can be transmitted between specific regions over the classic network, but network connectivity cannot be ensured. If the network connectivity test for the classic network fails, we recommend that you establish network connections over the Internet.
- Data transmission
Data Integration supports only data synchronization but not data consumption.
- Data consistency
Data Integration supports only the at-least-once delivery mechanism for data synchronization. It does not support the exactly-once delivery mechanism. This means that data synchronized to a destination may be duplicated. You can use a primary key together with the capabilities of the destination to ensure the uniqueness of the synchronized data.
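The following minimal Python sketch illustrates why a primary key makes at-least-once delivery safe. It is not Data Integration code: the in-memory dictionary stands in for a destination table, and the `upsert` function and the `id` key are hypothetical names chosen for illustration.

```python
def upsert(destination, records, key="id"):
    """Write records keyed by primary key, so a redelivered record overwrites itself."""
    for record in records:
        destination[record[key]] = record
    return destination


# Assumption: the dictionary plays the role of a destination table with a primary key.
table = {}
upsert(table, [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
# The second batch redelivers id=2, as can happen with at-least-once delivery.
upsert(table, [{"id": 2, "name": "b"}, {"id": 3, "name": "c"}])
print(sorted(table))  # [1, 2, 3]
```

Because each write is keyed by the primary key, a duplicate delivery replaces the existing row instead of adding a second one.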
A real-time synchronization node uses three basic plug-ins to read, convert, and write data. These plug-ins interact with each other based on an intermediate data format that is defined by the plug-ins.
You can use multiple conversion plug-ins to cleanse data in a source and use multiple write plug-ins to write data to a destination for a real-time synchronization node. In some business scenarios, you can use a real-time synchronization solution to synchronize data from multiple source tables in a database to a destination in real time. For more information, see Synchronize data in real time.
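The read, convert, and write flow described above can be sketched in Python as follows. This is an illustration only: the `Record` dictionary stands in for the intermediate data format, and `run_node`, `trim_name`, and `drop_empty_name` are hypothetical names rather than actual plug-in interfaces.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]  # hypothetical intermediate format shared by all plug-ins
Transform = Callable[[Record], Optional[Record]]  # returns None to drop a record
Writer = Callable[[Record], None]


def run_node(source_rows: Iterable[Record],
             transforms: List[Transform],
             writers: List[Writer]) -> int:
    """Read each row, apply every transform in order, then fan out to every writer."""
    delivered = 0
    for row in source_rows:              # read step
        record: Optional[Record] = dict(row)
        for transform in transforms:     # convert step, applied in order
            record = transform(record)
            if record is None:           # a transform filtered the record out
                break
        if record is None:
            continue
        for write in writers:            # write step, one call per destination
            write(record)
        delivered += 1
    return delivered


def trim_name(record: Record) -> Record:
    return {**record, "name": str(record["name"]).strip()}


def drop_empty_name(record: Record) -> Optional[Record]:
    return record if record["name"] else None


primary, audit = [], []
count = run_node(
    [{"id": 1, "name": " ann "}, {"id": 2, "name": "  "}],
    [trim_name, drop_empty_name],
    [primary.append, audit.append],
)
print(count, primary)
```

Chaining several transforms models data cleansing, and passing several writers models writing the same record to more than one destination.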
Solution-based synchronization
In actual business scenarios, data often cannot be synchronized by using only one or a few simple batch or real-time synchronization nodes. Instead, multiple batch synchronization nodes, real-time synchronization nodes, and data processing nodes are required, and configuring them individually is complex.
For example, a large amount of data is stored in your database, and you want to synchronize full and incremental data from your database to MaxCompute for analysis. You can use the traditional data synchronization method to perform full synchronization or perform incremental synchronization based on fields such as modify_time in tables in your database. However, in an actual business scenario, the fields may not exist in tables in your database. In this case, you cannot use the Java Database Connectivity (JDBC) driver to extract data for incremental synchronization. You can use a one-click real-time synchronization to MaxCompute solution to synchronize full and incremental data from your database to MaxCompute in real time. After the synchronization, the full and incremental data is automatically merged in MaxCompute. This simplifies data synchronization.
- Synchronizes full data at a time.
- Synchronizes incremental data in real time.
- Automatically merges incremental and full data on a regular basis and writes the merged data to the related partition in a table that is used to store full data.
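The merge of full and incremental data can be pictured with the following Python sketch. It is a simplified model, not the logic that runs in MaxCompute: the change-record layout (`op` and `row` fields), the `id` key, and the function name `merge_full_and_incremental` are all assumptions made for illustration.

```python
def merge_full_and_incremental(full_rows, changes, key="id"):
    """Apply ordered change records on top of a full snapshot and return the new snapshot."""
    merged = {row[key]: row for row in full_rows}
    for change in changes:  # changes are assumed to arrive in commit order
        row = change["row"]
        if change["op"] == "delete":
            merged.pop(row[key], None)
        else:  # "insert" and "update" both set the latest version of the row
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])


full = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
incremental = [
    {"op": "update", "row": {"id": 1, "qty": 6}},
    {"op": "insert", "row": {"id": 3, "qty": 1}},
    {"op": "delete", "row": {"id": 2}},
]
snapshot = merge_full_and_incremental(full, incremental)
print(snapshot)  # [{'id': 1, 'qty': 6}, {'id': 3, 'qty': 1}]
```

In this model, the returned snapshot corresponds to the merged data that is written to a partition of the table that stores full data.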
Data synchronization in complex network environments
- The data source and DataWorks workspace belong to the same Alibaba Cloud account and reside in the same region.
- The data source and DataWorks workspace belong to different Alibaba Cloud accounts.
- The data source and DataWorks workspace reside in different regions.
- The data source does not belong to Alibaba Cloud.
Parallelism indicates the maximum number of parallel threads that a data synchronization node uses to read data from a source or write data to a destination.
Throttling indicates the maximum transmission rate at which a data synchronization node can transmit data.
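To make the two settings concrete, the following Python sketch caps the number of worker threads (parallelism) and paces writes to a maximum record rate (throttling). It is a generic illustration that assumes a records-per-second limit; the `Throttle` class and the `sync` function are hypothetical and do not reflect how Data Integration implements either setting.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Caps the aggregate rate by handing out evenly spaced time slots."""

    def __init__(self, max_records_per_sec):
        self._interval = 1.0 / max_records_per_sec
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def acquire(self):
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)  # wait outside the lock so other threads can reserve slots


def sync(records, write, parallelism, throttle):
    """Write records with at most `parallelism` threads, paced by `throttle`."""
    def task(record):
        throttle.acquire()
        write(record)

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(task, records))  # consume the iterator to surface any exception


written = []
start = time.monotonic()
sync(range(10), written.append, parallelism=3, throttle=Throttle(100))
elapsed = time.monotonic() - start
print(len(written), round(elapsed, 2))
```

Raising the parallelism lets more records be in flight at once, but the throttle still bounds the overall rate, so the two settings are tuned together.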
- dirty data
Dirty data is data that is meaningless, does not match the specified data type, or causes an exception during data synchronization. If an exception occurs when a single data record is written to the destination, that record is considered dirty data. For example, when a data synchronization node attempts to write VARCHAR-type data in a source to an INT-type field in a destination, a data conversion error occurs, and the data fails to be written to the destination. In this case, the data is dirty data. When you configure a data synchronization node, you can control whether dirty data is allowed during data synchronization. You can also specify the maximum number of dirty data records that are allowed. If the number of generated dirty data records exceeds the upper limit that you specified, the data synchronization node fails and exits.
- If dirty data is generated when a batch synchronization node or real-time synchronization node is run, the node may fail. The data that is synchronized to the destination before the node fails is not rolled back.
- To improve data synchronization efficiency, Data Integration allows you to write multiple data records to a destination at a time during data synchronization. If an exception occurs when you write a batch of data records to a destination, support for the rollback of the data records varies based on whether the destination supports the transaction mechanism. Data Integration does not support the transaction mechanism.
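The dirty data limit and the absence of rollback can be sketched in Python as follows. This is an illustrative model only: `sync_to_int_column`, `DirtyDataLimitExceeded`, and the `max_dirty` parameter are hypothetical names, and a Python list stands in for an INT-type column in the destination.

```python
class DirtyDataLimitExceeded(Exception):
    """Raised when more dirty records are seen than the configured limit allows."""


def sync_to_int_column(values, destination, max_dirty):
    """Write string values to an INT column, tolerating up to max_dirty bad records."""
    dirty = []
    for value in values:
        try:
            destination.append(int(value))  # conversion error marks the record as dirty
        except ValueError:
            dirty.append(value)
            if len(dirty) > max_dirty:
                raise DirtyDataLimitExceeded(
                    f"{len(dirty)} dirty records exceed the limit of {max_dirty}"
                )
    return dirty


column = []
try:
    sync_to_int_column(["1", "x", "2", "y", "3"], column, max_dirty=1)
except DirtyDataLimitExceeded as error:
    print("node failed:", error)
print(column)  # [1, 2]
```

The rows written before the failure remain in `column`, which mirrors the behavior described above: data that reached the destination before the node fails is not rolled back.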
- data source
A data source is the source of data that is processed by DataWorks. A data source can be a database or a data warehouse. DataWorks supports various types of data sources and data type conversion during data synchronization.
Before you create a data synchronization node, you can add the data sources that you need to use on the Data Source page of the DataWorks console. When you create a data synchronization node, you must select the added data sources to use the data sources as the source and destination of the node.