Data Integration is a stable, efficient, and scalable data synchronization service. It is designed to migrate and synchronize data between heterogeneous data sources quickly and reliably in complex network environments.

Billing

When you run nodes in Data Integration, you are charged the following fees:
  • Fees that are included in your DataWorks bill
    • Fees for using exclusive resource groups for Data Integration or the shared resource group for Data Integration. The shared resource group for Data Integration is used only for debugging.
    • Fees for using exclusive resource groups for scheduling or the shared resource group for scheduling.
    • Fees for the Internet traffic that is generated if data is transmitted over the Internet.
    • Fees for the DataWorks edition that you use.
  • Fees that are not included in your DataWorks bill

    You may be charged other fees depending on how your data synchronization nodes are configured. For example, you may be charged for using data sources, for the computing and storage features of compute engine instances, and for network services such as Express Connect, Elastic IP Address (EIP), and EIP Bandwidth Plan. These fees are not charged by DataWorks and do not appear in your DataWorks bill. After you configure and run a data synchronization node, check the tasks and fees that are generated in the other services whose resources the node uses.

Note For more information about the billable items of DataWorks, see Billing overview.

Limits

  • Data synchronization

    Data Integration can synchronize structured, semi-structured, and unstructured data. Structured data sources include ApsaraDB RDS and PolarDB-X 1.0. Unstructured data, such as data in Object Storage Service (OSS) objects and text files, must be converted to structured data before it can be synchronized. Data Integration can synchronize to MaxCompute only data that can be abstracted to two-dimensional logical tables. It cannot synchronize to MaxCompute unstructured data that cannot be converted to structured data, such as MP3 files that are stored in OSS.

  • Network connectivity

    Data Integration supports data synchronization and exchange in the same region or across specific regions.

    Data can be transmitted between specific regions over the classic network, but network connectivity cannot be ensured. If the transmission over the classic network fails, we recommend that you transmit data over the Internet.

  • Data transmission

    Data Integration supports only data synchronization. It does not support data consumption.

  • Data consistency

    Data Integration supports only the at-least-once delivery mechanism. It does not support the exactly-once delivery mechanism. This means that data synchronized to a destination may be duplicated. You can use a primary key and the capabilities of the destination to ensure the uniqueness of the synchronized data.
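
The primary-key approach to duplicates described under Data consistency can be sketched as follows. This is a minimal illustration, not Data Integration code: it uses an in-memory SQLite database as a stand-in destination, and the orders table and write_idempotent function are hypothetical names introduced for this example. Because each write replaces any existing row with the same primary key, redelivering a batch leaves the destination unchanged.

```python
import sqlite3


def write_idempotent(conn, records):
    """Write records keyed by primary key so that redelivery is harmless."""
    # INSERT OR REPLACE overwrites an existing row that has the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO orders (id, amount) VALUES (?, ?)",
        records,
    )
    conn.commit()


# Stand-in destination (assumption): an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

batch = [(1, 10.0), (2, 25.5)]
write_idempotent(conn, batch)
write_idempotent(conn, batch)  # simulated redelivery of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

Whether a given destination offers an equivalent overwrite-by-key capability depends on that destination.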

Batch synchronization

Note The batch synchronization feature provided by DataWorks does not support data synchronization across time zones. If the data sources of a batch synchronization node reside in a different time zone from the resource group that is used to run the node, errors may occur during data synchronization.

Data Integration can be used to synchronize large amounts of offline data. Data Integration facilitates data transmission between diverse structured and semi-structured data sources. It provides Reader and Writer plug-ins for the supported data sources and defines a transmission channel between the sources and destinations based on simplified data types.
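
The Reader, channel, and Writer arrangement can be pictured with the following sketch. It is a conceptual illustration only and does not reflect the actual plug-in interfaces: the reader, writer, and simplify functions are hypothetical, and the set of simplified types (int, float, str, and None) is an assumption made for this example.

```python
import queue
import threading
from datetime import date
from decimal import Decimal

SENTINEL = object()  # marks the end of the stream


def simplify(value):
    """Map a source-specific value to a simplified type (assumed set: int, float, str, None)."""
    if value is None or isinstance(value, (int, float, str)):
        return value
    return str(value)  # for example, Decimal and date become strings


def reader(rows, channel):
    """Hypothetical Reader: reads source rows and puts simplified records on the channel."""
    for row in rows:
        channel.put(tuple(simplify(v) for v in row))
    channel.put(SENTINEL)


def writer(channel, sink):
    """Hypothetical Writer: takes records from the channel and writes them to the sink."""
    while True:
        record = channel.get()
        if record is SENTINEL:
            break
        sink.append(record)


source_rows = [
    (1, Decimal("9.90"), date(2024, 1, 1)),
    (2, None, date(2024, 1, 2)),
]
channel = queue.Queue(maxsize=2)  # bounded channel between Reader and Writer
sink = []

reader_thread = threading.Thread(target=reader, args=(source_rows, channel))
reader_thread.start()
writer(channel, sink)
reader_thread.join()
print(sink)
```

Because both sides agree only on the simplified types, a Reader for one data source can be paired with a Writer for another without either knowing the other's native types.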

Real-time synchronization

Note The real-time synchronization feature provided by DataWorks does not support data synchronization across time zones. If the data sources of a real-time synchronization node reside in a different time zone from the resource group that is used to run the node, errors may occur during data synchronization.

A real-time synchronization node uses three basic plug-ins to read, convert, and write data. These plug-ins interact with each other based on an intermediate data format that is defined by the plug-ins.

You can use multiple conversion plug-ins to cleanse data in a source and use multiple write plug-ins to write data to a destination for a real-time synchronization node. In some business scenarios, you can use a real-time synchronization solution to synchronize data from multiple source tables in a database to a destination in real time. For more information, see Synchronize data in real time.
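
The read, convert, and write flow with several conversion steps and several destinations can be sketched as follows. This is a simplified illustration under stated assumptions: the intermediate data format is modeled as a plain dictionary, and run_realtime, drop_test_rows, and mask_email are hypothetical names that do not correspond to actual plug-ins.

```python
def run_realtime(events, converters, writers):
    """Pass each event through every converter, then hand the result to every writer."""
    for event in events:
        record = dict(event)  # intermediate record format (assumption: a plain dict)
        for convert in converters:
            record = convert(record)
            if record is None:  # a converter may filter the record out
                break
        else:
            # Runs only if no converter dropped the record.
            for write in writers:
                write(record)


def drop_test_rows(record):
    """Hypothetical cleansing step: discard records created by a test user."""
    return None if record.get("user") == "test" else record


def mask_email(record):
    """Hypothetical cleansing step: hide the email address."""
    return {**record, "email": "***"}


sink_a, sink_b = [], []  # two stand-in destinations
events = [
    {"user": "alice", "email": "a@example.com"},
    {"user": "test", "email": "t@example.com"},
]
run_realtime(events, [drop_test_rows, mask_email], [sink_a.append, sink_b.append])
print(sink_a)
```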

Solution-based synchronization

In actual business scenarios, data often cannot be synchronized by using only one or a few simple batch or real-time synchronization nodes. Instead, multiple batch synchronization nodes, real-time synchronization nodes, and data processing nodes are required, and their configurations are complex.

To resolve this issue, DataWorks provides scenario-oriented data synchronization solutions that allow you to synchronize data between different types of data sources by using simple configurations. For example, you can synchronize data to Elasticsearch, Hologres, or MaxCompute in real time by using the related solution.
Note For example, a large amount of data is stored in your database, and you want to synchronize full and incremental data from your database to MaxCompute for analysis. With the traditional data synchronization method, you can perform full synchronization, or perform incremental synchronization based on fields such as modify_time in the tables in your database. However, in an actual business scenario, such fields may not exist in your tables. In this case, you cannot use the Java Database Connectivity (JDBC) driver to extract data for incremental synchronization. Instead, you can use the one-click real-time synchronization solution for MaxCompute to synchronize full and incremental data from your database to MaxCompute in real time. After the synchronization, the full and incremental data is automatically merged in MaxCompute. This simplifies data synchronization.

A data synchronization solution has the following benefits:
  • Synchronizes full data at a time.
  • Synchronizes incremental data in real time.
  • Automatically merges incremental and full data on a regular basis and writes the merged data to the related partition in a table that is used to store full data.
For information about the capabilities provided by the solution-based synchronization feature, see Overview of the solution-based synchronization feature.
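
The merge of full and incremental data listed among the benefits can be illustrated with a small sketch. It shows only the general idea of applying change records to a full snapshot by primary key. It does not show how the merge is implemented in MaxCompute, and the merge function and the (operation, key, row) change format are assumptions made for this example.

```python
def merge(full, changes):
    """Apply incremental change records to a full snapshot keyed by primary key."""
    merged = dict(full)  # {primary_key: row}; the input snapshot is left unchanged
    for op, key, row in changes:  # changes are assumed to be in commit order
        if op == "delete":
            merged.pop(key, None)
        else:  # "insert" or "update": the latest row wins
            merged[key] = row
    return merged


full_snapshot = {1: {"name": "a"}, 2: {"name": "b"}}
incremental = [
    ("update", 2, {"name": "b2"}),
    ("insert", 3, {"name": "c"}),
    ("delete", 1, None),
]
merged_snapshot = merge(full_snapshot, incremental)
print(merged_snapshot)
```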

Data synchronization in complex network environments

Data Integration allows you to synchronize data between heterogeneous data sources in complex network environments. The following types of relationships may exist between a data source and a DataWorks workspace:
  • The data source and DataWorks workspace belong to the same Alibaba Cloud account and reside in the same region.
  • The data source and DataWorks workspace belong to different Alibaba Cloud accounts.
  • The data source and DataWorks workspace reside in different regions.
  • The data source does not belong to Alibaba Cloud.
Before you use a data synchronization node or solution to synchronize data, you must make sure that network connections are established between the exclusive resource group for Data Integration and the data sources. Select a network connectivity solution based on the network environment in which each data source is deployed. For more information, see Establish a network connection between a resource group and a data source.

Terms

  • parallelism

    Parallelism indicates the maximum number of parallel threads that a synchronization node uses to read data from a source or write data to a destination.

  • bandwidth throttling

    Bandwidth throttling indicates that a maximum transmission rate is specified for a synchronization node in Data Integration.

  • dirty data

    Dirty data is data that is meaningless to the business, does not match the specified data type, or causes an exception during data synchronization. A data record that fails to be written to the destination because of an exception is considered dirty data. For example, when a data synchronization node attempts to write VARCHAR-type data in a source to an INT-type field in a destination, a data conversion error occurs, and the data fails to be written to the destination. In this case, the data is dirty data. When you configure a data synchronization node, you can specify whether dirty data is allowed and the maximum number of dirty data records that are allowed during data synchronization. If the number of generated dirty data records exceeds the upper limit that you specified, the data synchronization node fails.

  • data source

    A data source is a source from which data is processed by DataWorks. A data source can be a database or a data warehouse. DataWorks supports various types of data sources and data type conversion during data synchronization.

    Before you create a data synchronization node, you can add a source and a destination that you need to use on the Data Source page of the DataWorks console. When you create a synchronization node, you must select the source and the destination that you added.
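
The terms parallelism, bandwidth throttling, and dirty data can be illustrated together with the following sketch. It is a simplified model and not the behavior of an actual synchronization node: the sync function, the Throttle class, and the DirtyDataLimitExceeded exception are hypothetical names, throttling is expressed in records per second as an assumption, and the dirty data limit is checked only after all records are processed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class DirtyDataLimitExceeded(Exception):
    """Raised when more dirty records are produced than the configured limit allows."""


class Throttle:
    """Allows at most `rate` records per second across all threads (assumed unit)."""

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_time = time.monotonic()

    def wait(self):
        with self.lock:
            now = time.monotonic()
            slot = max(self.next_time, now)  # reserve the next free time slot
            self.next_time = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def sync(records, parallelism, rate, max_dirty):
    """Write string records to a stand-in INT destination column."""
    throttle = Throttle(rate)
    lock = threading.Lock()
    written, dirty = [], []

    def write_one(value):
        throttle.wait()
        try:
            converted = int(value)  # conversion to the INT destination type
        except (TypeError, ValueError):
            with lock:
                dirty.append(value)  # the record cannot be written: dirty data
            return
        with lock:
            written.append(converted)

    # Parallelism: the maximum number of threads that write concurrently.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(write_one, records))

    # Simplification: the limit is checked after all records are processed.
    if len(dirty) > max_dirty:
        raise DirtyDataLimitExceeded(f"{len(dirty)} dirty records, limit is {max_dirty}")
    return written, dirty


written, dirty = sync(["1", "2", "abc", "4"], parallelism=2, rate=1000, max_dirty=1)
print(sorted(written), dirty)
```

In this model, setting max_dirty to 0 means that a single record that fails conversion causes the run to fail.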