Data Integration is a stable, efficient, and scalable data synchronization service. It is designed to migrate and synchronize data between heterogeneous data sources in complex network environments in a fast and stable manner.

Billing

When you run nodes in Data Integration, you are charged the following fees:
  • Fees that are included in your DataWorks bill
    • Fees for using exclusive resource groups for Data Integration or the shared resource group for Data Integration. The shared resource group for Data Integration is used only for debugging.
    • Fees for using exclusive resource groups for scheduling or the shared resource group for scheduling.
    • Fees for the Internet traffic that is generated if data is transmitted over the Internet.
    • Fees for the DataWorks edition that you use.
  • Fees that are not included in your DataWorks bill

    You may be charged other fees for the resources that your data synchronization nodes use. For example, you may be charged for the data sources, for the computing and storage features of compute engine instances, and for network services such as Express Connect, Elastic IP Address (EIP), and EIP Bandwidth Plan. These fees are charged by the respective services, not by DataWorks, and the bills for these fees are not generated in DataWorks. After you configure and run a data synchronization node, take note of the tasks and fees that are generated when the node uses the resources of other services.

Note For more information about the billable items of DataWorks, see Overview.

Limits

  • Data Integration can synchronize structured, semi-structured, and unstructured data. Structured data sources include ApsaraDB RDS and PolarDB-X 1.0. Unstructured data, such as data in Object Storage Service (OSS) objects and text files, must be convertible to structured data. Data Integration can synchronize to MaxCompute only data that can be abstracted into two-dimensional logical tables. Unstructured data that cannot be converted to structured data, such as MP3 files stored in OSS, cannot be synchronized to MaxCompute.
  • Data Integration supports data synchronization and exchange in the same region or across specific regions.

    Data can be transmitted between specific regions over the classic network, but network connectivity cannot be ensured. If the transmission over the classic network fails, we recommend that you transmit data over the Internet.

  • Data Integration supports only data synchronization but not data consumption.
  • Data synchronization by using Data Integration supports only the at-least-once delivery mechanism, not the exactly-once delivery mechanism. This means that data may be written to the destination more than once and duplicates can appear. You can use a primary key together with the capabilities of the destination to ensure the uniqueness of the synchronized data, as illustrated in the sketch after this list.
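
    The following snippet is a minimal sketch of one way to absorb re-delivered records at the destination. It assumes a MySQL destination table that has a primary key and uses the writeMode parameter supported by the MySQL writer (insert, replace, or update in the open source DataX MySQL writer). The data source and table names are placeholders; verify the exact parameter names in the documentation for your destination type.

      {
        "stepType": "mysql",
        "category": "writer",
        "name": "Writer",
        "parameter": {
          "datasource": "your_mysql_datasource",
          "table": "your_destination_table",
          "column": ["order_id", "amount", "gmt_modified"],
          "writeMode": "replace"
        }
      }

    With writeMode set to replace or update, a record that is delivered more than once overwrites the existing row that has the same primary key instead of producing a duplicate.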

Batch synchronization

Data Integration can be used to synchronize large amounts of offline data. It facilitates data transmission between diverse structured and semi-structured data sources, provides readers and writers for the supported data sources, and defines a transmission channel between sources and destinations based on simplified data types.

Development modes of synchronization nodes

You can develop synchronization nodes in one of the following modes:
  • Codeless user interface (UI): Data Integration provides step-by-step instructions to help you configure a synchronization node. This mode is easy to use but provides only limited features. For more information, see Configure a synchronization node by using the codeless UI.
  • Code editor: You can write a JSON script to create a synchronization node. This mode supports advanced features that allow flexible and fine-grained configuration. It is suitable for experienced users but has a steeper learning curve. For more information, see Create a synchronization node by using the code editor. A simplified example script is shown after the following note.
Note
  • The code that is generated for a synchronization node on the codeless UI can be converted to a script. This conversion is irreversible. After the conversion is complete, you cannot switch back to the codeless UI mode.
  • Before you write code, you must add data sources to DataWorks and create tables in the destination that are used to store the synchronized data.
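
The following is a simplified sketch of the kind of JSON script that the code editor works with, assuming a MySQL source and a MaxCompute (odps) destination. The data source names, tables, columns, and partition value are placeholders, and the exact fields can vary by reader, writer, and DataWorks version, so treat this as an illustration rather than a ready-to-run template.

  {
    "type": "job",
    "version": "2.0",
    "steps": [
      {
        "stepType": "mysql",
        "category": "reader",
        "name": "Reader",
        "parameter": {
          "datasource": "your_mysql_datasource",
          "table": "your_source_table",
          "column": ["id", "name", "gmt_modified"]
        }
      },
      {
        "stepType": "odps",
        "category": "writer",
        "name": "Writer",
        "parameter": {
          "datasource": "your_odps_datasource",
          "table": "your_destination_table",
          "partition": "ds=${bizdate}",
          "column": ["id", "name", "gmt_modified"]
        }
      }
    ],
    "order": {
      "hops": [
        {"from": "Reader", "to": "Writer"}
      ]
    },
    "setting": {
      "speed": {"concurrent": 2},
      "errorLimit": {"record": 0}
    }
  }

The steps array declares the reader and the writer, the order.hops entry wires them into a transmission channel, and the setting section controls runtime behavior such as parallelism and the dirty data limit (see the Terms section below).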

Network connectivity

You can run a synchronization node on a resource group for Data Integration to synchronize data from a source to a destination. Before you run the synchronization node, make sure that the resource group for Data Integration is connected to the data sources.

Data synchronization

Data Integration allows you to synchronize data between heterogeneous data sources in various network environments. You can select a network connectivity solution based on the network environment in which the data sources reside to ensure connectivity between the resource group for Data Integration and the data sources. For more information, see Select a network connectivity solution.

Data Integration supports data sources that reside on the classic network, in virtual private clouds (VPCs), or in data centers.
  • Classic network: a network that is deployed and managed by Alibaba Cloud. The classic network is shared by Alibaba Cloud accounts.
  • VPC: a network that is created on Alibaba Cloud and provides an isolated network environment. You have full control over your VPC. For example, you can customize the IP address range, divide the VPC into multiple subnets, and configure route tables and gateways.

    Because VPCs are widely used, Data Integration can automatically detect the reverse proxy for the following data sources: ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, PolarDB, PolarDB-X 1.0, HybridDB for MySQL, AnalyticDB for PostgreSQL, and AnalyticDB for MySQL 3.0. This feature frees you from purchasing an Elastic Compute Service (ECS) instance in your VPC to configure synchronization nodes for these data sources. Data Integration uses this feature to automatically detect and establish network connections to these data sources.

    When you configure synchronization nodes for other Alibaba Cloud data sources in a VPC, such as ApsaraDB RDS for PPAS, ApsaraDB for OceanBase, ApsaraDB for Redis, ApsaraDB for MongoDB, ApsaraDB for Memcache, Tablestore, and ApsaraDB for HBase data sources, you must purchase an ECS instance in the same VPC. This ECS instance is used to connect to the data sources.

  • Data center: a network that you deploy and maintain yourself. This type of network is isolated from Alibaba Cloud networks.
For more information about the classic network and VPCs, see VPC FAQ.
Note You can connect to data sources over the Internet. However, the connection speed depends on the Internet bandwidth, and additional Internet traffic fees are incurred. We recommend that you do not connect to data sources over the Internet. For more information about the billing rules for Internet traffic that is generated during data synchronization, see Internet traffic generated by Data Integration.

Terms

  • parallelism

    Parallelism indicates the maximum number of parallel threads that a synchronization node uses to read data from a source or write data to a destination.

  • bandwidth throttling

    Bandwidth throttling limits the maximum transmission rate of a synchronization node in Data Integration.

  • dirty data

    Dirty data is data that is meaningless to the business, does not match the specified data type, or causes an exception during data synchronization. If an exception occurs when a single data record is written to the destination, the record is considered dirty data. Therefore, all data records that fail to be written to the destination are considered dirty data. In most cases, dirty data is data that does not match the specified data type. For example, if you attempt to write VARCHAR-type data from a source to an INT-type field in a destination, a data conversion error occurs and the data cannot be written to the destination. In this case, the data is dirty data.

    Dirty data cannot be written to a destination. When you configure a data synchronization node, you can specify whether dirty data is allowed and the maximum number of dirty data records that can be generated during data synchronization. If the number of generated dirty data records exceeds the upper limit that you specify, the synchronization node fails. The configuration sketch after this list shows how parallelism, bandwidth throttling, and the dirty data limit can be specified.

  • data source

    A data source is the source or destination of the data that DataWorks processes, such as a database or a data warehouse. DataWorks supports various types of data sources and supports data type conversion during data synchronization.

    Before you create a data synchronization node, you can add a source and a destination that you need to use on the Data Source page of the DataWorks console. When you create a synchronization node, you must select the source and the destination that you added.
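
As a rough illustration of how these terms map to a synchronization node configuration, the following setting fragment of a script-mode JSON script caps parallelism at four threads, enables bandwidth throttling at an example rate, and tolerates at most 100 dirty data records. The field names (speed.concurrent, speed.throttle, speed.mbps, and errorLimit.record) follow the commonly used script-mode format, but they can differ by DataWorks version, so verify them in the code editor before you rely on them.

  "setting": {
    "speed": {
      "concurrent": 4,
      "throttle": true,
      "mbps": 10
    },
    "errorLimit": {
      "record": 100
    }
  }

In this sketch, the node fails if more than 100 dirty data records are generated. Setting record to 0 causes the node to fail on the first dirty data record, and omitting the errorLimit setting typically allows dirty data to be generated without causing the node to fail, which matches the behavior described in the dirty data term above.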

References